Text Classification Dataset - NLP + Multi-Class Classification + Machine Learning
scripts
  • __init__.py (2.3 KB)
  • active_learning.py (26.8 KB)
  • api_server.py (12.7 KB)
  • batch_processor.py (16.4 KB)
  • data_augmentation.py (18.2 KB)
  • data_quality.py (20 KB)
  • deep_learning.py (24.2 KB)
  • hyperparameter_tuning.py (22.5 KB)
  • model_explainability.py (17.9 KB)
  • preprocessing.py (8.7 KB)
  • train_classifier.py (13.8 KB)
  • train_transformers.py (12.5 KB)
  • visualizations.py (19 KB)
scripts/__init__.py
"""
================================================================================
Text Classification Dataset - Scripts Package
================================================================================
Project: Text Classification Dataset
Category: Text Data / NLP

Author: Molla Samser
Designer & Tester: Rima Khatun
Website: https://rskworld.in
Email: help@rskworld.in | support@rskworld.in
Phone: +91 93305 39277

Copyright (c) 2026 RSK World - All Rights Reserved
================================================================================

This package contains various scripts for text classification:

Modules:
--------
- preprocessing: Text preprocessing utilities
- train_classifier: Traditional ML model training
- train_transformers: Transformer-based model training (BERT)
- data_augmentation: Text data augmentation techniques
- visualizations: Data visualization utilities
- api_server: Flask REST API server
- model_explainability: LIME-based model explanations
- batch_processor: Batch prediction and evaluation
- hyperparameter_tuning: GridSearch and Optuna tuning
- deep_learning: PyTorch deep learning models
- data_quality: Data quality analysis
- active_learning: Active learning module

Usage:
------
from scripts import preprocess_text
from scripts.data_quality import DataQualityAnalyzer
from scripts.active_learning import ActiveLearner

================================================================================
"""

__version__ = "1.0.0"
__author__ = "Molla Samser"
__email__ = "help@rskworld.in"
__website__ = "https://rskworld.in"
__copyright__ = "Copyright (c) 2026 RSK World - All Rights Reserved"

# Module imports for easier access
from .preprocessing import TextPreprocessor, load_and_preprocess

# Create a simple preprocess function for convenience
def preprocess_text(text: str) -> str:
    """
    Simple text preprocessing function.
    
    Args:
        text: Input text to preprocess
        
    Returns:
        Preprocessed text
    """
    preprocessor = TextPreprocessor()
    return preprocessor.preprocess(text)

__all__ = [
    'preprocess_text',
    'TextPreprocessor',
    'load_and_preprocess',
    '__version__',
    '__author__',
    '__email__',
    '__website__',
    '__copyright__',
]
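
The `preprocess_text` wrapper above constructs a new `TextPreprocessor` on every call. If construction turns out to be expensive (loading stop-word lists, for example), one option is to build the object once and reuse it. The sketch below illustrates that idea only. `StubPreprocessor`, its cleaning rule, and `get_preprocessor` are hypothetical placeholders, since the real `TextPreprocessor` is not shown on this page.

```python
from functools import lru_cache


class StubPreprocessor:
    """Hypothetical stand-in for TextPreprocessor (the real class is not shown)."""

    instances_created = 0  # counts constructions, for demonstration only

    def __init__(self) -> None:
        StubPreprocessor.instances_created += 1

    def preprocess(self, text: str) -> str:
        # Assumed behavior for this sketch: lowercase and collapse whitespace.
        return " ".join(text.lower().split())


@lru_cache(maxsize=1)
def get_preprocessor() -> StubPreprocessor:
    """Build the preprocessor on first use and return the same one afterwards."""
    return StubPreprocessor()


def preprocess_text(text: str) -> str:
    """Convenience wrapper that shares a single preprocessor instance."""
    return get_preprocessor().preprocess(text)


if __name__ == "__main__":
    print(preprocess_text("  Hello   WORLD "))
    print(preprocess_text("Second CALL"))
    # Only one StubPreprocessor is constructed across both calls.
    print(StubPreprocessor.instances_created)
```

Whether this is worth doing depends on how costly the real constructor is. For a cheap constructor the original per-call form is simpler and equally fine.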

scripts/train_transformers.py
"""
================================================================================
Text Classification Dataset - Transformer-Based Model Training
================================================================================
Project: Text Classification Dataset
Category: Text Data / NLP

Author: Molla Samser
Designer & Tester: Rima Khatun
Website: https://rskworld.in
Email: help@rskworld.in | support@rskworld.in
Phone: +91 93305 39277

Copyright (c) 2026 RSK World - All Rights Reserved
Content used for educational purposes only.

Requirements:
    pip install transformers torch datasets scikit-learn pandas

Created: December 2026
================================================================================
"""

import os
import argparse
from typing import Dict, List, Optional
from datetime import datetime

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Transformers imports
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
from datasets import Dataset as HFDataset, DatasetDict

# Project information
__author__ = "Molla Samser"
__website__ = "https://rskworld.in"
__email__ = "help@rskworld.in"


# Category labels
CATEGORY_LABELS = {
    0: 'Technology',
    1: 'Sports',
    2: 'Politics',
    3: 'Entertainment',
    4: 'Business',
    5: 'Science'
}

NUM_LABELS = len(CATEGORY_LABELS)


class TextClassificationDataset(Dataset):
    """
    PyTorch Dataset for text classification.
    
    Author: Molla Samser | RSK World (https://rskworld.in)
    """
    
    def __init__(
        self,
        texts: List[str],
        labels: List[int],
        tokenizer,
        max_length: int = 256
    ):
        """
        Initialize the dataset.
        
        Args:
            texts: List of input texts
            labels: List of labels
            tokenizer: HuggingFace tokenizer
            max_length: Maximum sequence length
        """
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }


def compute_metrics(eval_pred):
    """
    Compute metrics for evaluation.
    
    Args:
        eval_pred: Tuple of (predictions, labels)
        
    Returns:
        Dictionary of metrics
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_score(labels, predictions)
    f1_weighted = f1_score(labels, predictions, average='weighted')
    f1_macro = f1_score(labels, predictions, average='macro')
    
    return {
        'accuracy': accuracy,
        'f1_weighted': f1_weighted,
        'f1_macro': f1_macro
    }


def load_data(data_dir: str) -> Dict[str, pd.DataFrame]:
    """
    Load train, validation, and test datasets.
    
    Args:
        data_dir: Path to data directory
        
    Returns:
        Dictionary with dataframes
    """
    data = {}
    
    for split in ['train', 'validation', 'test']:
        filepath = os.path.join(data_dir, 'csv', f'{split}.csv')
        data[split] = pd.read_csv(filepath, comment='#')
        print(f"Loaded {split}: {len(data[split])} samples")
    
    return data


def prepare_hf_dataset(data: Dict[str, pd.DataFrame]) -> DatasetDict:
    """
    Prepare HuggingFace DatasetDict from pandas DataFrames.
    
    Args:
        data: Dictionary of DataFrames
        
    Returns:
        HuggingFace DatasetDict
    """
    datasets = {}
    
    for split, df in data.items():
        # preserve_index=False keeps the pandas index out of the dataset columns
        datasets[split] = HFDataset.from_pandas(
            df[['text', 'label']], preserve_index=False
        )
    
    return DatasetDict(datasets)


def tokenize_function(examples, tokenizer, max_length=256):
    """Tokenize examples for the model."""
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=max_length
    )


def train_transformer_model(
    model_name: str = 'bert-base-uncased',
    data_dir: str = '../data',
    output_dir: str = './results',
    epochs: int = 3,
    batch_size: int = 16,
    learning_rate: float = 2e-5,
    max_length: int = 256,
    warmup_steps: int = 500,
    weight_decay: float = 0.01,
    use_fp16: bool = True
):
    """
    Train a transformer-based text classification model.
    
    Args:
        model_name: HuggingFace model name
        data_dir: Path to data directory
        output_dir: Output directory for model
        epochs: Number of training epochs
        batch_size: Batch size
        learning_rate: Learning rate
        max_length: Maximum sequence length
        warmup_steps: Warmup steps for scheduler
        weight_decay: Weight decay for optimizer
        use_fp16: Use mixed precision training
        
    Author: Molla Samser | RSK World (https://rskworld.in)
    """
    print(f"\n{'='*60}")
    print("Transformer Model Training - RSK World")
    print(f"Author: {__author__}")
    print(f"Website: {__website__}")
    print(f"Email: {__email__}")
    print(f"{'='*60}\n")
    
    # Check for GPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")
    
    # Disable fp16 on CPU
    if device == 'cpu':
        use_fp16 = False
    
    # Load data
    print("\nLoading dataset...")
    data = load_data(data_dir)
    
    # Load tokenizer
    print(f"\nLoading tokenizer: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Prepare datasets
    print("Preparing datasets...")
    dataset_dict = prepare_hf_dataset(data)
    
    # Tokenize datasets
    tokenized_datasets = dataset_dict.map(
        lambda x: tokenize_function(x, tokenizer, max_length),
        batched=True,
        remove_columns=['text']
    )
    
    # Load model
    print(f"\nLoading model: {model_name}")
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=NUM_LABELS,
        id2label=CATEGORY_LABELS,
        label2id={v: k for k, v in CATEGORY_LABELS.items()}
    )
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        warmup_steps=warmup_steps,
        weight_decay=weight_decay,
        learning_rate=learning_rate,
        logging_dir=f'{output_dir}/logs',
        logging_steps=100,
        eval_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='f1_weighted',
        greater_is_better=True,
        fp16=use_fp16,
        report_to='none',  # Disable wandb etc.
        seed=42
    )
    
    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets['train'],
        eval_dataset=tokenized_datasets['validation'],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
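        # Note: with one evaluation per epoch, patience=3 cannot trigger within
        # the default 3 epochs; it only takes effect on longer runs.
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]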
    )
    
    # Train
    print(f"\n{'='*60}")
    print("Starting Training...")
    print(f"{'='*60}\n")
    
    train_result = trainer.train()
    
    # Save model
    print(f"\nSaving model to {output_dir}...")
    trainer.save_model()
    tokenizer.save_pretrained(output_dir)
    
    # Evaluate on test set
    print(f"\n{'='*60}")
    print("Evaluating on Test Set...")
    print(f"{'='*60}\n")
    
    test_results = trainer.evaluate(tokenized_datasets['test'])
    
    print(f"Test Accuracy: {test_results['eval_accuracy']:.4f}")
    print(f"Test F1 (Weighted): {test_results['eval_f1_weighted']:.4f}")
    print(f"Test F1 (Macro): {test_results['eval_f1_macro']:.4f}")
    
    # Generate predictions for detailed report
    predictions = trainer.predict(tokenized_datasets['test'])
    pred_labels = np.argmax(predictions.predictions, axis=1)
    true_labels = predictions.label_ids
    
    print("\nClassification Report:")
    print(classification_report(
        true_labels,
        pred_labels,
        target_names=list(CATEGORY_LABELS.values())
    ))
    
    # Save results
    results = {
        'model_name': model_name,
        'train_loss': train_result.training_loss,
        'test_accuracy': test_results['eval_accuracy'],
        'test_f1_weighted': test_results['eval_f1_weighted'],
        'test_f1_macro': test_results['eval_f1_macro'],
        'epochs': epochs,
        'batch_size': batch_size,
        'learning_rate': learning_rate,
        'timestamp': datetime.now().isoformat(),
        'author': __author__,
        'website': __website__
    }
    
    import json
    results_path = os.path.join(output_dir, 'training_results.json')
    with open(results_path, 'w', encoding='utf-8') as f:
        json.dump(results, f, indent=2)
    
    print(f"\n{'='*60}")
    print("Training Complete!")
    print(f"Model saved to: {output_dir}")
    print(f"{'='*60}\n")
    
    return results


def predict_text(
    text: str,
    model_dir: str,
    device: Optional[str] = None
) -> Dict:
    """
    Predict category for a single text.
    
    Args:
        text: Input text
        model_dir: Path to saved model
        device: Device to use (auto-detected if None)
        
    Returns:
        Dictionary with prediction results
    """
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.to(device)
    model.eval()
    
    # Tokenize
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors='pt'
    ).to(device)
    
    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        pred_label = torch.argmax(probs, dim=-1).item()
        confidence = probs[0][pred_label].item()
    
    return {
        # Truncate long inputs in the returned preview only
        'text': (text[:100] + '...') if len(text) > 100 else text,
        'predicted_label': pred_label,
        'predicted_category': CATEGORY_LABELS[pred_label],
        'confidence': confidence,
        'all_probabilities': {
            CATEGORY_LABELS[i]: probs[0][i].item()
            for i in range(NUM_LABELS)
        }
    }


def main():
    parser = argparse.ArgumentParser(
        description='Train Transformer Text Classification - RSK World'
    )
    parser.add_argument(
        '--model', type=str, default='bert-base-uncased',
        help='HuggingFace model name'
    )
    parser.add_argument(
        '--data-dir', type=str, default='../data',
        help='Data directory path'
    )
    parser.add_argument(
        '--output-dir', type=str, default='./transformer_model',
        help='Output directory'
    )
    parser.add_argument(
        '--epochs', type=int, default=3,
        help='Training epochs'
    )
    parser.add_argument(
        '--batch-size', type=int, default=16,
        help='Batch size'
    )
    parser.add_argument(
        '--learning-rate', type=float, default=2e-5,
        help='Learning rate'
    )
    parser.add_argument(
        '--max-length', type=int, default=256,
        help='Max sequence length'
    )
    
    args = parser.parse_args()
    
    train_transformer_model(
        model_name=args.model,
        data_dir=args.data_dir,
        output_dir=args.output_dir,
        epochs=args.epochs,
        batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        max_length=args.max_length
    )


if __name__ == "__main__":
    main()
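
The script reports both `f1_weighted` and `f1_macro` and selects the best checkpoint by `f1_weighted`. The two averages can diverge when classes are imbalanced. The following is a from-scratch illustration of the difference, not scikit-learn's implementation. The helper names and the toy labels are invented for this sketch.

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(y_true: List[int], y_pred: List[int]) -> Dict[int, float]:
    """Return the F1 score of every class present in y_true or y_pred."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Mean of per-class F1 weighted by each class's count in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


if __name__ == "__main__":
    # Toy, imbalanced labels: four samples of class 0, two of class 1.
    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 1]
    print(f"macro F1:    {macro_f1(y_true, y_pred):.4f}")
    print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

On this toy input the weighted score is higher than the macro score because the majority class is predicted well. If minority-class performance matters, `f1_macro` may be the more informative selection metric.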

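
`train_transformer_model` defaults to `warmup_steps=500`. With a linear warmup schedule, the learning rate only reaches its configured value if training runs for more than that many optimizer steps. The sketch below estimates the step count so the two can be compared. It assumes a single device and no gradient accumulation. The sample counts are hypothetical, since the dataset size is not stated on this page.

```python
import math


def total_optimizer_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


def warmup_fraction(warmup_steps: int, total_steps: int) -> float:
    """Share of training spent in warmup, capped at 1.0."""
    return min(1.0, warmup_steps / total_steps)


if __name__ == "__main__":
    # Hypothetical training-set sizes, combined with the script's defaults
    # (batch_size=16, epochs=3, warmup_steps=500).
    for num_samples in (2_000, 20_000):
        steps = total_optimizer_steps(num_samples, batch_size=16, epochs=3)
        frac = warmup_fraction(500, steps)
        print(f"{num_samples} samples: {steps} steps, warmup share {frac:.2f}")
```

If the warmup share comes out at 1.0, the whole run happens during warmup. In that case a smaller `warmup_steps` value is worth considering.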
