# 📄 Text Classification Dataset

> Multi-class text classification dataset with labeled documents for news categorization, topic classification, and document analysis.

[![Author](https://img.shields.io/badge/Author-Molla%20Samser-red)](https://rskworld.in)
[![Website](https://img.shields.io/badge/Website-rskworld.in-blue)](https://rskworld.in)
[![License](https://img.shields.io/badge/License-Educational%20Use-green)](https://rskworld.in)
[![Difficulty](https://img.shields.io/badge/Difficulty-Advanced-orange)]()
[![Python](https://img.shields.io/badge/Python-3.8+-blue)](https://python.org)

---

## 📋 Project Information

| Property | Value |
|----------|-------|
| **Project** | Text Classification Dataset |
| **Category** | Text Data / NLP |
| **Author** | Molla Samser |
| **Designer & Tester** | Rima Khatun |
| **Website** | [https://rskworld.in](https://rskworld.in) |
| **Email** | help@rskworld.in |
| **Phone** | +91 93305 39277 |

---

## 📖 Description

This dataset contains labeled documents across six categories for text classification tasks. It is suited to:

- 📰 **News Categorization** - Classify news articles into categories
- 🏷️ **Topic Classification** - Identify main topics from text
- 📑 **Document Analysis** - Analyze and categorize documents
- 🤖 **NLP Model Training** - Train and fine-tune models

---

## ✨ Features

### Core Features
- ✅ **Multiple document categories** (6 classes)
- ✅ **Labeled dataset** (240 training samples)
- ✅ **Train/Validation/Test splits**
- ✅ **Multiple formats** (CSV, JSON, TXT)
- ✅ **Transformer-ready format** (BERT, RoBERTa)

### 🆕 Advanced Features
- 🔥 **Interactive Data Explorer** - Visual data exploration tool
- 🔥 **REST API Server** - Flask-based prediction API
- 🔥 **Data Augmentation** - 6 augmentation techniques
- 🔥 **Model Explainability** - LIME-based explanations
- 🔥 **Batch Processing** - High-throughput classification
- 🔥 **Advanced Visualizations** - Word clouds, confusion matrices
- 🔥 **Performance Benchmarking** - Model comparison tools
- 🔥 **Cross-Validation** - Robust model evaluation

---

## 📊 Dataset Statistics

| Metric | Value |
|--------|-------|
| Training Samples | 240 |
| Validation Set | 30 |
| Test Set | 30 |
| Categories | 6 |
| Avg. Text Length | ~20 words |

### Categories

| Label | Category | Description | Color |
|-------|----------|-------------|-------|
| 0 | Technology | Tech news, gadgets, software, AI | 🔵 Blue |
| 1 | Sports | Athletics, competitions, leagues | 🟢 Green |
| 2 | Politics | Government, policy, elections | 🟣 Purple |
| 3 | Entertainment | Movies, music, TV shows, celebrities | 🩷 Pink |
| 4 | Business | Finance, markets, economy | 🟡 Amber |
| 5 | Science | Research, discoveries, space, health | 🔵 Cyan |
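
To turn numeric predictions back into category names, the table above can be written as a small lookup. This is a minimal sketch; the names `LABEL_TO_CATEGORY` and `CATEGORY_TO_LABEL` are illustrative and are not taken from the project's scripts.

```python
# Label-to-category mapping, copied from the Categories table above.
LABEL_TO_CATEGORY = {
    0: "Technology",
    1: "Sports",
    2: "Politics",
    3: "Entertainment",
    4: "Business",
    5: "Science",
}

# Reverse lookup: category name -> numeric label.
CATEGORY_TO_LABEL = {name: label for label, name in LABEL_TO_CATEGORY.items()}

print(LABEL_TO_CATEGORY[5])         # Science
print(CATEGORY_TO_LABEL["Sports"])  # 1
```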

---

## 🛠️ Technologies

![CSV](https://img.shields.io/badge/Format-CSV-brightgreen)
![TXT](https://img.shields.io/badge/Format-TXT-blue)
![JSON](https://img.shields.io/badge/Format-JSON-orange)
![Transformers](https://img.shields.io/badge/Framework-Transformers-yellow)
![BERT](https://img.shields.io/badge/Model-BERT-red)
![Flask](https://img.shields.io/badge/API-Flask-lightgrey)
![Scikit-learn](https://img.shields.io/badge/ML-Scikit--learn-blue)

---

## 📁 Project Structure

```
text-classification/
├── index.html                    # Main showcase page
├── explorer.html                 # 🆕 Interactive data explorer
├── README.md                     # Documentation
├── requirements.txt              # Python dependencies
├── text-classification.svg       # Project logo
│
├── assets/
│   ├── css/
│   │   └── style.css             # Styles
│   ├── js/
│   │   └── main.js               # Scripts
│   └── favicon.svg               # Favicon
│
├── data/
│   ├── csv/
│   │   ├── train.csv             # Training data (240 samples)
│   │   ├── validation.csv        # Validation data
│   │   ├── test.csv              # Test data
│   │   └── full_dataset.csv      # Complete dataset
│   ├── json/
│   │   ├── dataset.json          # JSON format
│   │   └── full_dataset.json     # Complete JSON
│   └── txt/
│       ├── categories.txt        # Category labels
│       └── sample_documents.txt
│
├── scripts/
│   ├── preprocessing.py          # Text preprocessing
│   ├── train_classifier.py       # Traditional ML training
│   ├── train_transformers.py     # BERT/Transformer training
│   ├── data_augmentation.py      # 🆕 6 augmentation techniques
│   ├── visualizations.py         # 🆕 Word clouds, charts
│   ├── api_server.py             # 🆕 REST API server
│   ├── model_explainability.py   # 🆕 LIME explanations
│   └── batch_processor.py        # 🆕 Batch classification
│
└── notebooks/
    └── text_classification_tutorial.ipynb  # Complete tutorial
```

---

## 🚀 Quick Start

### 1. Clone or Download

```bash
# Download the dataset
wget https://rskworld.in/datasets/text-classification.zip
unzip text-classification.zip
cd text-classification
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Load Dataset

```python
import pandas as pd

# Load training data
train_df = pd.read_csv('data/csv/train.csv', comment='#')
print(f"Training samples: {len(train_df)}")
print(train_df.head())
```
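
Before training, it can help to confirm that every split loads and that the six labels are reasonably balanced. The sketch below uses only the standard library. It assumes each CSV has `text` and `label` columns (as used in the Basic Classification example) and that any comment lines start with `#`; `load_split` and `label_counts` are hypothetical helper names.

```python
import csv
from collections import Counter


def load_split(path):
    """Read one split into a list of row dicts, skipping '#' comment lines."""
    with open(path, newline="", encoding="utf-8") as f:
        # Drop whole-line comments before handing the rows to the CSV parser.
        lines = (line for line in f if not line.lstrip().startswith("#"))
        return list(csv.DictReader(lines))


def label_counts(rows):
    """Count how many rows carry each integer label."""
    return Counter(int(row["label"]) for row in rows)


# Example (run from the project root):
# for split in ("train", "validation", "test"):
#     rows = load_split(f"data/csv/{split}.csv")
#     print(split, len(rows), sorted(label_counts(rows).items()))
```

If the counts differ sharply between labels, stratified sampling or class weights may be worth considering when training.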

### 4. Train a Model

```bash
# Traditional ML model
python scripts/train_classifier.py

# Generate visualizations (data directory given relative to the project root)
python scripts/visualizations.py data
```

---

## 🆕 Advanced Features Usage

### 📊 Interactive Data Explorer

Open `explorer.html` in your browser to:
- Filter documents by category
- Search through the dataset
- View category distribution charts
- Analyze word count distributions

### 🌐 REST API Server

```bash
# Start the API server
cd scripts
python api_server.py --demo --port 5000
```

**API Endpoints:**
```
GET / - API info
GET /health - Health check
GET /categories - List all categories
POST /predict - Classify single text
POST /predict/batch - Classify multiple texts
POST /analyze - Detailed text analysis
```

**Example API Call:**
```bash
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Apple announces new iPhone with AI features"}'
```
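
The same call can be made from Python. This is a minimal sketch using only the standard library; it assumes the server is running locally on port 5000 as started above. The response schema is not documented here, so `predict` simply returns whatever JSON the server sends back. Both function names are illustrative.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://localhost:5000"):
    """Build a POST request for /predict with the JSON body from the curl example."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url="http://localhost:5000", timeout=10):
    """Send the request and return the decoded JSON response unchanged."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the API server to be running):
# print(predict("Apple announces new iPhone with AI features"))
```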

### 🔄 Data Augmentation

```python
from scripts.data_augmentation import TextAugmenter

augmenter = TextAugmenter(num_aug=5, random_state=42)
text = "Apple announces revolutionary new iPhone"

augmented = augmenter.augment(text)
for i, aug_text in enumerate(augmented, 1):
    print(f"{i}. {aug_text}")
```

**Supported Techniques:**
- Synonym Replacement (SR)
- Random Insertion (RI)
- Random Swap (RS)
- Random Deletion (RD)
- Character-level augmentation
- Keyboard error simulation
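
To give a feel for two of these techniques, here is a minimal standalone sketch of Random Swap and Random Deletion. It is not the code in `scripts/data_augmentation.py`; the function names, parameters, and the keep-at-least-one-word rule are assumptions made for illustration.

```python
import random


def random_swap(words, n_swaps, rng):
    """Return a copy of `words` with `n_swaps` random pairs of positions exchanged."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability `p`, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(list(words))]


rng = random.Random(42)  # fixed seed so runs are repeatable
tokens = "Apple announces revolutionary new iPhone".split()
print(" ".join(random_swap(tokens, n_swaps=1, rng=rng)))
print(" ".join(random_deletion(tokens, p=0.2, rng=rng)))
```

Both functions leave the input list untouched and return a new one, so the original text can be reused for further augmentations.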

### 🔍 Model Explainability

```python
from scripts.model_explainability import TextExplainer

# classifier_fn is your own prediction callable; it is not defined in this snippet
explainer = TextExplainer(classifier_fn)
explanation = explainer.explain("New AI-powered smartphone released")

print(f"Predicted: {explanation['predicted_category']}")
print("Important words:")
for item in explanation['word_importance'][:5]:
    print(f"  {item['word']}: {item['importance']:.4f}")
```
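
LIME explains a prediction by perturbing the input text and observing how the model's output changes. The sketch below shows that idea with a simpler relative, leave-one-word-out scoring. It is not the method implemented in `scripts/model_explainability.py`, and `toy_classifier` is a hypothetical keyword-based stand-in for a real `classifier_fn`.

```python
def toy_classifier(text):
    """Hypothetical stand-in for classifier_fn: score for the Technology class."""
    tech_words = {"ai", "smartphone", "software", "iphone"}
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?") in tech_words)
    return hits / len(words)


def leave_one_out_importance(text, score_fn):
    """Score each word by how much the prediction drops when that word is removed."""
    words = text.split()
    baseline = score_fn(text)
    importance = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        importance.append((word, baseline - score_fn(reduced)))
    # Largest drop first: these words supported the prediction the most.
    return sorted(importance, key=lambda pair: pair[1], reverse=True)


text = "New AI-powered smartphone released"
for word, score in leave_one_out_importance(text, toy_classifier):
    print(f"{word}: {score:+.4f}")
```

A positive score means removing the word lowered the Technology score; a negative score means the word was diluting it.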

### 📦 Batch Processing

```bash
# Process a file of texts
python scripts/batch_processor.py process \
  --input input.csv \
  --output predictions.csv \
  --model model.joblib \
  --batch-size 100

# Evaluate predictions
python scripts/batch_processor.py evaluate \
  --predictions predictions.csv \
  --ground-truth ground_truth.csv \
  --output report.json
```
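
The `--batch-size` flag suggests that inputs are classified in fixed-size chunks rather than all at once. A minimal sketch of that pattern follows; `iter_batches`, `classify_in_batches`, and `predict_fn` are hypothetical names and do not come from `batch_processor.py`.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def classify_in_batches(texts, predict_fn, batch_size=100):
    """Apply `predict_fn` (list of texts -> list of labels) one batch at a time."""
    predictions = []
    for batch in iter_batches(texts, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions
```

Because `iter_batches` accepts any iterable, the texts can be streamed from a file reader instead of being loaded into memory first.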

### 📈 Visualizations

```bash
# Generate all visualizations (run from the project root)
python scripts/visualizations.py data

# Outputs:
# - visualizations/category_distribution.png
# - visualizations/text_length_distribution.png
# - visualizations/wordcloud_all.png
# - visualizations/wordclouds_by_category/
```

---

## 📝 Usage Examples

### Basic Classification

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Load data
train_df = pd.read_csv('data/csv/train.csv', comment='#')

# Vectorize
tfidf = TfidfVectorizer(max_features=10000)
X = tfidf.fit_transform(train_df['text'])
y = train_df['label']

# Train
model = LogisticRegression()
model.fit(X, y)

# Predict
text = "Apple unveils new iPhone with AI features"
prediction = model.predict(tfidf.transform([text]))
print(f"Predicted: {prediction[0]}") # 0 = Technology
```

### Using Transformers (BERT)

```python
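import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the base model. The 6-way classification head is newly initialized here,
# so predictions are not meaningful until the model has been fine-tuned
# (see scripts/train_transformers.py).
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)
model.eval()  # make sure dropout is disabled for inference

# Tokenize
text = "Scientists discover new planet in nearby galaxy"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()
print(f"Predicted label: {prediction}")  # after fine-tuning, 5 = Science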
```

---

## 📊 Model Performance

| Model | Accuracy | F1 Score | Inference Time |
|-------|----------|----------|----------------|
| Naive Bayes | 85.2% | 0.847 | ~1ms |
| Logistic Regression | 89.7% | 0.892 | ~2ms |
| Linear SVM | 88.9% | 0.885 | ~2ms |
| BERT (fine-tuned) | 94.3% | 0.941 | ~50ms |
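
The table does not say how the F1 score is averaged across the six classes or which split it was measured on. As a reference point, the sketch below computes accuracy and a macro-averaged F1 (the unweighted mean of per-class F1) in plain Python. Macro averaging is an assumption here, and the labels in the example are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels, only to show the calls.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```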

---

## 📜 License

This dataset is provided for **educational purposes only**.

**Copyright (c) 2026 RSK World - All Rights Reserved**

---

## 👨‍💻 Author

**Molla Samser**

- 🌐 Website: [https://rskworld.in](https://rskworld.in)
- 📧 Email: help@rskworld.in
- 📱 Phone: +91 93305 39277

### Designer & Tester

**Rima Khatun**

---

## 🤝 Support

If you have any questions or need support:

- 📧 Email: support@rskworld.in
- 🌐 Contact: [https://rskworld.in/contact.php](https://rskworld.in/contact.php)

---

## 🔗 Links

- [Homepage](index.html)
- [Data Explorer](explorer.html)
- [Download Dataset](text-classification.zip)

---


Made with ❤️ by RSK World 
<a href="https://rskworld.in">rskworld.in</a>
