RSK World
Statistical Modeling with Statsmodels (statsmodels-statistical)
  • __pycache__
  • data
  • examples
  • notebooks
  • .gitignore (458 B)
  • CHANGELOG.md (4 KB)
  • FEATURES.md (6.3 KB)
  • LICENSE (1.2 KB)
  • PROJECT_INFO.md (2.2 KB)
  • PROJECT_SUMMARY.md (4.2 KB)
  • README.md (7.4 KB)
  • RELEASE_NOTES_v1.0.0.md (6.5 KB)
  • UNIQUE_FEATURES.md (5.3 KB)
  • advanced_time_series.py (9.8 KB)
  • automated_reporting.py (8.3 KB)
  • bayesian_statistics.py (7.5 KB)
  • data_preprocessing.py (8.2 KB)
  • econometric_modeling.py (9.8 KB)
  • hypothesis_testing.py (12.5 KB)
  • index.html (10.8 KB)
  • model_evaluation.py (9.1 KB)
  • model_persistence.py (6.5 KB)
  • model_selection.py (9.7 KB)
  • panel_data_analysis.py (7.3 KB)
  • performance_benchmarking.py (7.3 KB)
  • regression_analysis.py (9 KB)
  • requirements.txt (361 B)
  • statistical_diagnostics.py (13.8 KB)
  • statsmodels-statistical.png (284 B)
  • time_series_analysis.py (10.3 KB)
  • visualization_utils.py (8.9 KB)
data_preprocessing.py
"""
Data Preprocessing Utilities

Author: RSK World
Website: https://rskworld.in
Email: help@rskworld.in
Phone: +91 93305 39277
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')


class DataPreprocessor:
    """
    Data Preprocessing Utilities
    
    Author: RSK World
    Website: https://rskworld.in
    Email: help@rskworld.in
    Phone: +91 93305 39277
    """
    
    def __init__(self):
        self.scalers = {}
        self.transformations = {}
    
    def handle_missing_values(self, data, method='mean'):
        """
        Handle missing values
        
        Parameters:
        -----------
        data : DataFrame or array
            Data with missing values
        method : str
            'mean', 'median', 'mode', 'drop', or 'forward_fill'
        """
        if isinstance(data, pd.DataFrame):
            df = data.copy()
        else:
            df = pd.DataFrame(data)
        
        if method == 'mean':
            return df.fillna(df.mean())
        elif method == 'median':
            return df.fillna(df.median())
        elif method == 'mode':
            return df.fillna(df.mode().iloc[0])
        elif method == 'drop':
            return df.dropna()
        elif method == 'forward_fill':
            return df.ffill()
        else:
            raise ValueError("method must be 'mean', 'median', 'mode', 'drop', or 'forward_fill'")
    
    def detect_outliers(self, data, method='iqr', threshold=1.5):
        """
        Detect outliers
        
        Parameters:
        -----------
        data : array-like
            Data to check for outliers
        method : str
            'iqr' or 'zscore'
        threshold : float
            Threshold for outlier detection
        """
        data = np.array(data)
        outliers = np.zeros_like(data, dtype=bool)
        
        if method == 'iqr':
            Q1 = np.percentile(data, 25)
            Q3 = np.percentile(data, 75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = (data < lower_bound) | (data > upper_bound)
        elif method == 'zscore':
            z_scores = np.abs(stats.zscore(data))
            outliers = z_scores > threshold
        else:
            raise ValueError("method must be 'iqr' or 'zscore'")
        
        return outliers
    
    def remove_outliers(self, data, method='iqr', threshold=1.5):
        """Remove outliers from data"""
        if isinstance(data, pd.DataFrame):
            df = data.copy()
            for col in df.columns:
                outliers = self.detect_outliers(df[col], method, threshold)
                df = df[~outliers]
            return df
        else:
            outliers = self.detect_outliers(data, method, threshold)
            return data[~outliers]
    
    def scale_data(self, data, method='standard', fit=True):
        """
        Scale data
        
        Parameters:
        -----------
        data : array-like
            Data to scale
        method : str
            'standard', 'minmax', or 'robust'
        fit : bool
            Whether to fit scaler or use existing
        """
        data = np.array(data)
        
        scaler_classes = {
            'standard': StandardScaler,
            'minmax': MinMaxScaler,
            'robust': RobustScaler
        }
        if method not in scaler_classes:
            raise ValueError("method must be 'standard', 'minmax', or 'robust'")
        
        # Reuse a previously fitted scaler only when fit=False and one exists
        if fit or method not in self.scalers:
            self.scalers[method] = scaler_classes[method]()
            scaled = self.scalers[method].fit_transform(data.reshape(-1, 1))
        else:
            scaled = self.scalers[method].transform(data.reshape(-1, 1))
        
        return scaled.flatten()
    
    def transform_to_stationary(self, data, method='diff'):
        """
        Transform data to stationary
        
        Parameters:
        -----------
        data : array-like or Series
            Time series data
        method : str
            'diff', 'log_diff', or 'detrend'
        """
        data = pd.Series(data) if not isinstance(data, pd.Series) else data
        
        if method == 'diff':
            return data.diff().dropna()
        elif method == 'log_diff':
            return np.log(data).diff().dropna()
        elif method == 'detrend':
            from scipy import signal
            detrended = signal.detrend(data.values)
            return pd.Series(detrended, index=data.index)
        else:
            raise ValueError("method must be 'diff', 'log_diff', or 'detrend'")
    
    def create_lags(self, data, n_lags=5):
        """
        Create lagged features
        
        Parameters:
        -----------
        data : array-like or Series
            Time series data
        n_lags : int
            Number of lags to create
        """
        data = pd.Series(data) if not isinstance(data, pd.Series) else data
        df = pd.DataFrame({'original': data})
        
        for lag in range(1, n_lags + 1):
            df[f'lag_{lag}'] = data.shift(lag)
        
        return df.dropna()
    
    def create_rolling_features(self, data, window=5):
        """
        Create rolling window features
        
        Parameters:
        -----------
        data : array-like or Series
            Time series data
        window : int
            Window size
        """
        data = pd.Series(data) if not isinstance(data, pd.Series) else data
        
        features = pd.DataFrame({
            'original': data,
            'rolling_mean': data.rolling(window=window).mean(),
            'rolling_std': data.rolling(window=window).std(),
            'rolling_min': data.rolling(window=window).min(),
            'rolling_max': data.rolling(window=window).max()
        })
        
        return features.dropna()
    
    def summary_statistics(self, data):
        """Generate comprehensive summary statistics"""
        if isinstance(data, pd.DataFrame):
            return data.describe()
        else:
            data = pd.Series(data)
            stats_dict = {
                'count': len(data),
                'mean': data.mean(),
                'std': data.std(),
                'min': data.min(),
                '25%': data.quantile(0.25),
                '50%': data.median(),
                '75%': data.quantile(0.75),
                'max': data.max(),
                'skewness': data.skew(),
                'kurtosis': data.kurtosis()
            }
            return pd.Series(stats_dict)


if __name__ == "__main__":
    # Example usage
    print("Data Preprocessing Example")
    print("=" * 70)
    
    preprocessor = DataPreprocessor()
    
    # Generate sample data with outliers
    np.random.seed(42)
    data = np.random.normal(100, 15, 100)
    data = np.append(data, [200, 50, 250])  # Add outliers
    
    print("Original Data Statistics:")
    print(preprocessor.summary_statistics(data))
    
    # Detect outliers
    outliers = preprocessor.detect_outliers(data, method='iqr')
    print(f"\nNumber of outliers detected: {outliers.sum()}")
    
    # Remove outliers
    cleaned_data = preprocessor.remove_outliers(data, method='iqr')
    print(f"\nCleaned Data Statistics:")
    print(preprocessor.summary_statistics(cleaned_data))
    
    # Scale data
    scaled = preprocessor.scale_data(data, method='standard')
    print(f"\nScaled Data Statistics:")
    print(preprocessor.summary_statistics(scaled))


About RSK World

Founded by Molla Samser, with Designer & Tester Rima Khatun, RSK World is your one-stop destination for free programming resources, source code, and development tools.


Contact Info

Nutanhat, Mongolkote
Purba Burdwan, West Bengal
India, 713147

+91 93305 39277

hello@rskworld.in
support@rskworld.in

© 2026 RSK World. All rights reserved.

Content used for educational purposes only. View Disclaimer