tensorflow-deeplearning
Deep learning with TensorFlow and Keras
data
  • .gitkeep — 373 B
  • README.md — 3.4 KB
  • classification_X.npy — 78.3 KB
  • classification_metadata.json — 175 B
  • classification_y.npy — 7.9 KB
  • images_X.npy — 612.6 KB
  • images_metadata.json — 173 B
  • images_y.npy — 1.7 KB
  • regression_X.npy — 39.2 KB
  • regression_metadata.json — 173 B
  • regression_y.npy — 4 KB
  • sequences_X.npy — 976.7 KB
  • sequences_metadata.json — 176 B
  • sequences_y.npy — 4 KB
  • tabular.csv — 78.9 KB
src/data_preprocessing.py
"""
Data Preprocessing Pipeline for TensorFlow
Author: RSK World
Website: https://rskworld.in
Email: help@rskworld.in
Phone: +91 93305 39277

This module provides data preprocessing utilities for image, text, and tabular data.
"""

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

class ImagePreprocessor:
    """
    Image preprocessing utilities.
    Author: RSK World - https://rskworld.in
    """
    
    @staticmethod
    def load_and_preprocess_image(image_path, target_size=(224, 224)):
        """
        Load and preprocess a single image.
        
        Args:
            image_path: Path to image file
            target_size: Target image size
        
        Returns:
            Preprocessed image tensor
        """
        img = tf.io.read_file(image_path)
        # expand_animations=False guarantees a rank-3 tensor, which tf.image.resize requires
        img = tf.image.decode_image(img, channels=3, expand_animations=False)
        img = tf.image.resize(img, target_size)
        img = tf.cast(img, tf.float32) / 255.0
        return img
    
    @staticmethod
    def create_image_dataset(image_dir, batch_size=32, target_size=(224, 224), validation_split=0.2):
        """
        Create image dataset from directory.
        
        Args:
            image_dir: Directory containing images
            batch_size: Batch size
            target_size: Target image size
            validation_split: Validation split ratio
        
        Returns:
            Training and validation datasets
        """
        train_ds = keras.utils.image_dataset_from_directory(
            image_dir,
            validation_split=validation_split,
            subset='training',
            seed=123,
            image_size=target_size,
            batch_size=batch_size
        )
        
        val_ds = keras.utils.image_dataset_from_directory(
            image_dir,
            validation_split=validation_split,
            subset='validation',
            seed=123,
            image_size=target_size,
            batch_size=batch_size
        )
        
        # Normalize pixel values
        normalization_layer = layers.Rescaling(1./255)
        train_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
        val_ds = val_ds.map(lambda x, y: (normalization_layer(x), y))
        
        return train_ds, val_ds
    
    @staticmethod
    def create_augmentation_pipeline():
        """
        Create data augmentation pipeline.
        
        Returns:
            Sequential model with augmentation layers
        """
        return keras.Sequential([
            layers.RandomFlip("horizontal"),
            layers.RandomRotation(0.1),
            layers.RandomZoom(0.1),
            layers.RandomContrast(0.1),
        ])

class TextPreprocessor:
    """
    Text preprocessing utilities.
    Author: RSK World - https://rskworld.in
    """
    
    @staticmethod
    def create_text_vectorization_layer(vocab_size=10000, max_length=100, output_mode='int'):
        """
        Create text vectorization layer.
        
        Args:
            vocab_size: Vocabulary size
            max_length: Maximum sequence length
            output_mode: Output mode ('int', 'binary', 'count', 'tf_idf')
        
        Returns:
            TextVectorization layer
        """
        return layers.TextVectorization(
            max_tokens=vocab_size,
            output_mode=output_mode,
            output_sequence_length=max_length
        )
    
    @staticmethod
    def pad_sequences(sequences, max_length=None, padding='post', truncating='post'):
        """
        Pad sequences to the same length.
        
        Args:
            sequences: List of sequences
            max_length: Maximum length
            padding: Padding type ('pre' or 'post')
            truncating: Truncating type ('pre' or 'post')
        
        Returns:
            Padded sequences
        """
        return pad_sequences(sequences, maxlen=max_length, padding=padding, truncating=truncating)
    
    @staticmethod
    def create_tokenizer(texts, num_words=10000):
        """
        Create tokenizer from texts.
        
        Args:
            texts: List of text strings
            num_words: Maximum number of words
        
        Returns:
            Tokenizer object
        """
        tokenizer = keras.preprocessing.text.Tokenizer(num_words=num_words, oov_token="<OOV>")
        tokenizer.fit_on_texts(texts)
        return tokenizer

class TabularPreprocessor:
    """
    Tabular data preprocessing utilities.
    Author: RSK World - https://rskworld.in
    """
    
    @staticmethod
    def normalize_features(X, method='standard'):
        """
        Normalize features.
        
        Args:
            X: Feature matrix
            method: Normalization method ('standard' or 'minmax')
        
        Returns:
            Normalized features and scaler
        """
        if method == 'standard':
            scaler = StandardScaler()
        elif method == 'minmax':
            scaler = MinMaxScaler()
        else:
            raise ValueError(f"Unknown method: {method}")
        
        X_normalized = scaler.fit_transform(X)
        return X_normalized, scaler
    
    @staticmethod
    def encode_categorical_features(df, columns):
        """
        Encode categorical features.
        
        Args:
            df: DataFrame
            columns: List of categorical column names
        
        Returns:
            DataFrame with encoded features and encoders
        """
        encoders = {}
        df_encoded = df.copy()
        
        for col in columns:
            le = LabelEncoder()
            df_encoded[col] = le.fit_transform(df[col])
            encoders[col] = le
        
        return df_encoded, encoders
    
    @staticmethod
    def handle_missing_values(df, strategy='mean'):
        """
        Handle missing values.
        
        Args:
            df: DataFrame
            strategy: Strategy ('mean', 'median', 'mode', 'drop')
        
        Returns:
            DataFrame with handled missing values
        """
        df_clean = df.copy()
        
        if strategy == 'drop':
            df_clean = df_clean.dropna()
        elif strategy == 'mean':
            # numeric_only avoids errors when the frame has non-numeric columns
            df_clean = df_clean.fillna(df_clean.mean(numeric_only=True))
        elif strategy == 'median':
            df_clean = df_clean.fillna(df_clean.median(numeric_only=True))
        elif strategy == 'mode':
            df_clean = df_clean.fillna(df_clean.mode().iloc[0])
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        
        return df_clean

class DataPipeline:
    """
    Complete data preprocessing pipeline.
    Author: RSK World - https://rskworld.in
    """
    
    def __init__(self):
        self.preprocessors = {}
    
    def add_preprocessor(self, name, preprocessor):
        """
        Add a preprocessor to the pipeline.
        
        Args:
            name: Preprocessor name
            preprocessor: Preprocessor function
        """
        self.preprocessors[name] = preprocessor
    
    def process(self, data, steps=None):
        """
        Process data through the pipeline.
        
        Args:
            data: Input data
            steps: List of preprocessing steps to apply
        
        Returns:
            Processed data
        """
        if steps is None:
            steps = list(self.preprocessors.keys())
        
        processed_data = data
        for step in steps:
            if step in self.preprocessors:
                processed_data = self.preprocessors[step](processed_data)
        
        return processed_data

def create_tf_dataset(X, y=None, batch_size=32, shuffle=True, buffer_size=1000):
    """
    Create TensorFlow dataset from numpy arrays.
    
    Args:
        X: Features
        y: Labels (optional)
        batch_size: Batch size
        shuffle: Whether to shuffle
        buffer_size: Buffer size for shuffling
    
    Returns:
        TensorFlow dataset
    """
    if y is not None:
        dataset = tf.data.Dataset.from_tensor_slices((X, y))
    else:
        dataset = tf.data.Dataset.from_tensor_slices(X)
    
    if shuffle:
        dataset = dataset.shuffle(buffer_size=buffer_size)
    
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    
    return dataset

def example_usage():
    """
    Example usage of data preprocessing functions.
    """
    # Image preprocessing example
    print("Image Preprocessing Example:")
    image_preprocessor = ImagePreprocessor()
    augmentation = image_preprocessor.create_augmentation_pipeline()
    print("Augmentation pipeline created")
    
    # Text preprocessing example
    print("\nText Preprocessing Example:")
    text_preprocessor = TextPreprocessor()
    vectorization = text_preprocessor.create_text_vectorization_layer(
        vocab_size=10000, max_length=100
    )
    print("Text vectorization layer created")
    
    # Tabular preprocessing example
    print("\nTabular Preprocessing Example:")
    tabular_preprocessor = TabularPreprocessor()
    X = np.random.randn(1000, 10)
    X_normalized, scaler = tabular_preprocessor.normalize_features(X, method='standard')
    print(f"Normalized features shape: {X_normalized.shape}")
    
    # Create TF dataset
    print("\nCreating TensorFlow Dataset:")
    dataset = create_tf_dataset(X_normalized, batch_size=32, shuffle=True)
    print("Dataset created successfully")
    
    return dataset

if __name__ == '__main__':
    print("Data Preprocessing Pipeline for TensorFlow")
    print("Author: RSK World - https://rskworld.in")
    dataset = example_usage()
data/README.md
# Generated Data Directory

**Author**: RSK World
**Website**: https://rskworld.in
**Email**: help@rskworld.in
**Phone**: +91 93305 39277

This directory contains generated sample datasets for the TensorFlow Deep Learning project.

## Generated Datasets

### 1. Classification Data
- **Files**: `classification_X.npy`, `classification_y.npy`
- **Description**: Synthetic classification dataset
- **Shape**: (1000, 20) features, (1000,) labels
- **Classes**: 3
- **Usage**: For testing neural network classification models
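The labels are stored as integer class indices; a softmax output layer with categorical cross-entropy typically wants them one-hot encoded. A minimal numpy sketch of that step, using randomly generated stand-in labels with the shape described above rather than the actual `.npy` file:

```python
import numpy as np

# Stand-in labels matching the documented (1000,) shape with 3 classes
rng = np.random.default_rng(42)
y = rng.integers(0, 3, size=1000)

# One-hot encode: row i gets a 1 in column y[i]
n_classes = 3
y_onehot = np.zeros((y.size, n_classes), dtype=np.float32)
y_onehot[np.arange(y.size), y] = 1.0

print(y_onehot.shape)  # (1000, 3)
```

With integer labels you can skip this step by using `sparse_categorical_crossentropy` instead.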

### 2. Regression Data
- **Files**: `regression_X.npy`, `regression_y.npy`
- **Description**: Synthetic regression dataset
- **Shape**: (1000, 10) features, (1000,) targets
- **Usage**: For testing regression models

### 3. Image Data
- **Files**: `images_X.npy`, `images_y.npy`
- **Description**: Synthetic image dataset
- **Shape**: (200, 28, 28) images, (200,) labels
- **Classes**: 10
- **Usage**: For testing CNN models

### 4. Sequence Data
- **Files**: `sequences_X.npy`, `sequences_y.npy`
- **Description**: Synthetic sequence data for RNN/LSTM
- **Shape**: (500, 50, 10) sequences, (500,) labels
- **Classes**: 3
- **Usage**: For testing RNN, LSTM, GRU models
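RNN-style models need every sequence in a batch to have the same length; `src/data_preprocessing.py` delegates this to Keras `pad_sequences`, whose post-padding/post-truncating behaviour can be sketched in plain numpy like this (the `pad_post` helper is illustrative, not part of the project):

```python
import numpy as np

def pad_post(sequences, max_length):
    """Post-pad (and post-truncate) variable-length sequences with zeros."""
    out = np.zeros((len(sequences), max_length), dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[:max_length]       # post-truncate anything too long
        out[i, :len(trimmed)] = trimmed  # post-pad short rows with zeros
    return out

padded = pad_post([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11]], max_length=5)
print(padded.shape)  # (3, 5)
```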

### 5. Time Series Data
- **Files**: `time_series_X.npy`, `time_series_y.npy`
- **Description**: Synthetic time series data
- **Shape**: (500, 100, 1) sequences, (500,) targets
- **Usage**: For testing time series prediction models

### 6. Tabular Data
- **File**: `tabular.csv`
- **Description**: Synthetic tabular dataset with multiple features
- **Shape**: (1000, 11) including target
- **Features**: age, income, education, experience, etc.
- **Usage**: For testing tabular data models
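Features like age and income live on very different scales, so tabular models usually want them standardized first. The project's `TabularPreprocessor.normalize_features` wraps scikit-learn's `StandardScaler`; the underlying arithmetic is just this (shown on stand-in random features, not the real CSV):

```python
import numpy as np

# Stand-in feature matrix with the documented 1000-row shape
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 10))

# Standard scaling: zero mean, unit variance per column
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std
```

Keep `mean` and `std` (or the fitted scaler) so the same transform can be applied to test data.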

### 7. MNIST Subset
- **Files**: `mnist_train_subset_X.npy`, `mnist_train_subset_y.npy`
- **Files**: `mnist_test_subset_X.npy`, `mnist_test_subset_y.npy`
- **Description**: Subset of MNIST dataset (if available)
- **Shape**: (5000, 784) train, (1000, 784) test
- **Usage**: For quick testing without downloading full MNIST
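The subset is stored flattened as 784-dimensional vectors; a CNN needs the spatial layout plus a channel axis back. A quick reshape sketch on stand-in data of the same shape (the real files may not be present):

```python
import numpy as np

# Stand-in for mnist_train_subset_X.npy: 5000 flattened 28x28 images
rng = np.random.default_rng(42)
X_flat = rng.random((5000, 784), dtype=np.float32)

# Restore height, width, and a single grayscale channel for Conv2D layers
X_img = X_flat.reshape(-1, 28, 28, 1)
print(X_img.shape)  # (5000, 28, 28, 1)
```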

## Loading Data

### Python Example

```python
import numpy as np
import pandas as pd

# Load numpy arrays
X = np.load('data/classification_X.npy')
y = np.load('data/classification_y.npy')

# Load CSV
df = pd.read_csv('data/tabular.csv')

# Load with metadata
import json
with open('data/classification_metadata.json', 'r') as f:
    metadata = json.load(f)
```

### TensorFlow/Keras Example

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Load data
X_train = np.load('data/images_X.npy')
y_train = np.load('data/images_y.npy')

# Build a minimal classifier and train it on the loaded arrays
model = keras.Sequential([
    layers.Flatten(input_shape=X_train.shape[1:]),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=10)
```

## Regenerating Data

To regenerate all data:

```bash
python scripts/generate_data.py
```

Or use the module directly:

```python
from src.data_generator import generate_all_sample_data
generate_all_sample_data(data_dir='./data')
```

## Data Statistics

Each dataset includes a metadata JSON file with:
- Dataset name
- Shape information
- Data types
- Number of samples
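Writing and round-tripping a record in the same format as `images_metadata.json` takes only the standard library; this sketch writes to a temporary directory rather than `data/`:

```python
import json
import os
import tempfile
import numpy as np

# Build a metadata record matching the format of images_metadata.json
X = np.zeros((200, 28, 28), dtype=np.float32)
y = np.zeros(200, dtype=np.int64)
metadata = {
    "name": "images",
    "X_shape": list(X.shape),
    "y_shape": list(y.shape),
    "X_dtype": str(X.dtype),
    "y_dtype": str(y.dtype),
    "n_samples": int(X.shape[0]),
}

path = os.path.join(tempfile.gettempdir(), "images_metadata.json")
with open(path, "w") as f:
    json.dump(metadata, f, indent=2)

# Round-trip and check the stored shape against the array
with open(path) as f:
    loaded = json.load(f)
print(tuple(loaded["X_shape"]) == X.shape)  # True
```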

## Visualizations

Check `data/visualizations/` for sample visualizations of the generated data.

## Notes

- All data is generated with a fixed random seed (42) for reproducibility
- Data is normalized and ready to use
- Synthetic data is for testing and demonstration purposes
- For production, use real datasets appropriate to your use case
data/images_metadata.json
{
  "name": "images",
  "X_shape": [
    200,
    28,
    28
  ],
  "y_shape": [
    200
  ],
  "X_dtype": "float32",
  "y_dtype": "int64",
  "n_samples": 200
}
