
<!--
================================================================================
Text Classification Dataset - README
================================================================================
Project: Text Classification Dataset
Category: Text Data / NLP

Author: Molla Samser
Designer & Tester: Rima Khatun
Website: https://rskworld.in
Email: help@rskworld.in | support@rskworld.in
Phone: +91 93305 39277

Copyright (c) 2026 RSK World - All Rights Reserved
================================================================================
-->

# 📄 Text Classification Dataset

> Multi-class text classification dataset with labeled documents for news categorization, topic classification, and document analysis.

[![Author](https://img.shields.io/badge/Author-Molla%20Samser-red)](https://rskworld.in)
[![Website](https://img.shields.io/badge/Website-rskworld.in-blue)](https://rskworld.in)
[![License](https://img.shields.io/badge/License-Educational%20Use-green)](https://rskworld.in)
![Difficulty](https://img.shields.io/badge/Difficulty-Advanced-orange)
[![Python](https://img.shields.io/badge/Python-3.8+-blue)](https://python.org)

---

## 📋 Project Information

| Property | Value |
|----------|-------|
| **Project** | Text Classification Dataset |
| **Category** | Text Data / NLP |
| **Author** | Molla Samser |
| **Designer & Tester** | Rima Khatun |
| **Website** | [https://rskworld.in](https://rskworld.in) |
| **Email** | help@rskworld.in |
| **Phone** | +91 93305 39277 |

---

## 📖 Description

This dataset contains labeled documents across six categories for text classification tasks. It is suited to:

- 📰 **News Categorization** - Classify news articles into categories
- 🏷️ **Topic Classification** - Identify main topics from text
- 📑 **Document Analysis** - Analyze and categorize documents
- 🤖 **NLP Model Training** - Train and fine-tune models

---

## ✨ Features

### Core Features
- ✅ **Multiple document categories** (6 classes)
- ✅ **Labeled dataset** (240 training samples)
- ✅ **Train/Validation/Test splits**
- ✅ **Multiple formats** (CSV, JSON, TXT)
- ✅ **Transformer-ready format** (BERT, RoBERTa)

### 🆕 Advanced Features
- 🔥 **Interactive Data Explorer** - Visual data exploration tool
- 🔥 **REST API Server** - Flask-based prediction API
- 🔥 **Data Augmentation** - 6 augmentation techniques
- 🔥 **Model Explainability** - LIME-based explanations
- 🔥 **Batch Processing** - High-throughput classification
- 🔥 **Advanced Visualizations** - Word clouds, confusion matrices
- 🔥 **Performance Benchmarking** - Model comparison tools
- 🔥 **Cross-Validation** - Robust model evaluation

---

## 📊 Dataset Statistics

| Metric | Value |
|--------|-------|
| Training Samples | 240 |
| Validation Set | 30 |
| Test Set | 30 |
| Categories | 6 |
| Avg. Text Length | ~20 words |
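
The counts above correspond to an 80/10/10 split of 300 documents. A minimal sketch of how such a split could be produced per category is shown below. It assumes balanced classes (50 documents per label), which this README does not state, and `stratified_split` is a hypothetical helper, not a function from the `scripts/` folder.

```python
import random
from collections import defaultdict

def stratified_split(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Split (text, label) rows into train/val/test, keeping label ratios."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * train_frac)
        n_val = int(len(group) * val_frac)
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test

# Synthetic stand-in: 6 labels x 50 documents = 300 rows (assumed balanced).
rows = [(f"document {i} of class {label}", label)
        for label in range(6) for i in range(50)]
train, val, test = stratified_split(rows)
print(len(train), len(val), len(test))
```

Splitting within each label keeps every category represented in the small validation and test sets.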

### Categories

| Label | Category | Description | Color |
|-------|----------|-------------|-------|
| 0 | Technology | Tech news, gadgets, software, AI | 🔵 Blue |
| 1 | Sports | Athletics, competitions, leagues | 🟢 Green |
| 2 | Politics | Government, policy, elections | 🟣 Purple |
| 3 | Entertainment | Movies, music, TV shows, celebrities | 🩷 Pink |
| 4 | Business | Finance, markets, economy | 🟡 Amber |
| 5 | Science | Research, discoveries, space, health | 🔵 Cyan |
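
For convenience, the label table can be mirrored in code. The sketch below copies the ids and names from the table; the names `ID2LABEL`, `LABEL2ID`, and `label_name` are illustrative and are not taken from the project scripts.

```python
# Label ids and category names as listed in the table above.
ID2LABEL = {
    0: "Technology",
    1: "Sports",
    2: "Politics",
    3: "Entertainment",
    4: "Business",
    5: "Science",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def label_name(label_id):
    """Return the category name for a numeric label, or raise ValueError."""
    try:
        return ID2LABEL[int(label_id)]
    except KeyError:
        raise ValueError(f"unknown label id: {label_id!r}") from None

print(label_name(5))
```

The same two dictionaries can be passed as `id2label` and `label2id` when loading a Hugging Face sequence classification model, so that predictions carry readable names.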

---

## 🛠️ Technologies

![CSV](https://img.shields.io/badge/Format-CSV-brightgreen)
![TXT](https://img.shields.io/badge/Format-TXT-blue)
![JSON](https://img.shields.io/badge/Format-JSON-orange)
![Transformers](https://img.shields.io/badge/Framework-Transformers-yellow)
![BERT](https://img.shields.io/badge/Model-BERT-red)
![Flask](https://img.shields.io/badge/API-Flask-lightgrey)
![Scikit-learn](https://img.shields.io/badge/ML-Scikit--learn-blue)

---

## 📁 Project Structure

```
text-classification/
├── index.html                  # Main showcase page
├── explorer.html               # 🆕 Interactive data explorer
├── README.md                   # Documentation
├── requirements.txt            # Python dependencies
├── text-classification.svg     # Project logo
│
├── assets/
│   ├── css/
│   │   └── style.css           # Styles
│   ├── js/
│   │   └── main.js             # Scripts
│   └── favicon.svg             # Favicon
│
├── data/
│   ├── csv/
│   │   ├── train.csv           # Training data (240 samples)
│   │   ├── validation.csv      # Validation data
│   │   ├── test.csv            # Test data
│   │   └── full_dataset.csv    # Complete dataset
│   ├── json/
│   │   ├── dataset.json        # JSON format
│   │   └── full_dataset.json   # Complete JSON
│   └── txt/
│       ├── categories.txt      # Category labels
│       └── sample_documents.txt
│
├── scripts/
│   ├── preprocessing.py        # Text preprocessing
│   ├── train_classifier.py     # Traditional ML training
│   ├── train_transformers.py   # BERT/Transformer training
│   ├── data_augmentation.py    # 🆕 6 augmentation techniques
│   ├── visualizations.py       # 🆕 Word clouds, charts
│   ├── api_server.py           # 🆕 REST API server
│   ├── model_explainability.py # 🆕 LIME explanations
│   └── batch_processor.py      # 🆕 Batch classification
│
└── notebooks/
    └── text_classification_tutorial.ipynb  # Complete tutorial
```

---

## 🚀 Quick Start

### 1. Clone or Download

```bash
# Download the dataset
wget https://rskworld.in/datasets/text-classification.zip
unzip text-classification.zip
cd text-classification
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Load Dataset

```python
import pandas as pd

# Load training data
train_df = pd.read_csv('data/csv/train.csv', comment='#')
print(f"Training samples: {len(train_df)}")
print(train_df.head())
```
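
If pandas is not available, the same file can be read with the standard library. This is a sketch only: it assumes the CSV has `text` and `label` columns (the names used in the usage examples further down) and that comment lines start with `#`. The sample content below is made up and stands in for `data/csv/train.csv`.

```python
import csv
import io

def read_labeled_csv(handle):
    """Read rows with 'text' and 'label' columns, skipping '#' comment lines."""
    lines = (line for line in handle if not line.lstrip().startswith("#"))
    reader = csv.DictReader(lines)
    return [(row["text"], int(row["label"])) for row in reader]

# In-memory stand-in for the real file; replace with open('data/csv/train.csv').
sample = io.StringIO(
    "# Text Classification Dataset\n"
    "text,label\n"
    '"New chip doubles laptop battery life",0\n'
    '"Striker scores twice in cup final",1\n'
)
rows = read_labeled_csv(sample)
print(rows)
```

Unlike the pandas `comment='#'` option, this helper only skips whole lines that begin with `#`.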

### 4. Train a Model

```bash
# Traditional ML model
python scripts/train_classifier.py

# View visualizations (run from scripts/ so that ../data resolves to the data folder)
cd scripts
python visualizations.py ../data
```

---

## 🆕 Advanced Features Usage

### 📊 Interactive Data Explorer

Open `explorer.html` in your browser to:
- Filter documents by category
- Search through the dataset
- View category distribution charts
- Analyze word count distributions
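
The two statistics the explorer charts can also be computed offline. The sketch below is not the explorer's own code; it assumes rows are `(text, label)` pairs and counts words by splitting on whitespace. The three sample rows are invented.

```python
from collections import Counter

def category_distribution(rows):
    """Count documents per label."""
    return Counter(label for _, label in rows)

def word_count_histogram(rows, bin_size=5):
    """Bucket documents by whitespace-token count, keyed by the bin's lower edge."""
    return Counter((len(text.split()) // bin_size) * bin_size for text, _ in rows)

rows = [
    ("New chip doubles laptop battery life", 0),
    ("Striker scores twice in cup final", 1),
    ("Parliament passes the annual budget bill after a long debate", 2),
]
print(category_distribution(rows))
print(sorted(word_count_histogram(rows).items()))
```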

### 🌐 REST API Server

```bash
# Start the API server
cd scripts
python api_server.py --demo --port 5000
```

**API Endpoints:**
```
GET  /               - API info
GET  /health         - Health check
GET  /categories     - List all categories
POST /predict        - Classify single text
POST /predict/batch  - Classify multiple texts
POST /analyze        - Detailed text analysis
```

**Example API Call:**
```bash
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Apple announces new iPhone with AI features"}'
```
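
The same call can be made from Python with the standard library. The sketch below mirrors the curl example (URL, header, and `{"text": ...}` body). The response schema is not documented in this README, so `predict` returns the decoded JSON unchanged. Both function names are illustrative.

```python
import json
import urllib.request

def build_predict_request(text, base_url="http://localhost:5000"):
    """Build a POST request for /predict with a JSON body {"text": ...}."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(text, base_url="http://localhost:5000", timeout=5):
    """Send the request and return the decoded JSON response as-is."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request needs no server; sending it does.
request = build_predict_request("Apple announces new iPhone with AI features")
print(request.get_method(), request.full_url)
```

With the demo server running, `predict("Apple announces new iPhone with AI features")` sends the request and returns the server's reply.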

### 🔄 Data Augmentation

```python
from scripts.data_augmentation import TextAugmenter

augmenter = TextAugmenter(num_aug=5, random_state=42)
text = "Apple announces revolutionary new iPhone"

augmented = augmenter.augment(text)
for i, aug_text in enumerate(augmented, 1):
    print(f"{i}. {aug_text}")
```

**Supported Techniques:**
- Synonym Replacement (SR)
- Random Insertion (RI)
- Random Swap (RS)
- Random Deletion (RD)
- Character-level augmentation
- Keyboard error simulation
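
To make two of these techniques concrete, here is a minimal standalone sketch of Random Swap and Random Deletion. It is not the code in `scripts/data_augmentation.py`; the function names and parameters are assumptions chosen for illustration.

```python
import random

def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)  # work on a copy
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(42)
words = "Apple announces revolutionary new iPhone".split()
print(" ".join(random_swap(words, n_swaps=1, rng=rng)))
print(" ".join(random_deletion(words, p=0.2, rng=rng)))
```

Passing a seeded `random.Random` instance keeps the augmented output reproducible, in the same spirit as the `random_state` argument above.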

### 🔍 Model Explainability

```python
from scripts.model_explainability import TextExplainer

# classifier_fn is your model's prediction callable. It is not defined in this
# snippet; see scripts/model_explainability.py for the expected signature.
explainer = TextExplainer(classifier_fn)
explanation = explainer.explain("New AI-powered smartphone released")

print(f"Predicted: {explanation['predicted_category']}")
print("Important words:")
for item in explanation['word_importance'][:5]:
    print(f"  {item['word']}: {item['importance']:.4f}")
```
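
The idea behind word-level importance can be illustrated without LIME. The sketch below uses leave-one-word-out occlusion, a simpler relative of LIME: each word is removed in turn and the drop in the predicted class probability is recorded. The toy `classifier_fn` and its keyword lists are invented for this example, and the callable signature (list of texts in, one probability row per text out) is an assumption, not the documented interface of `TextExplainer`.

```python
import math

# Toy keyword lists, invented for illustration only.
KEYWORDS = {
    0: {"ai", "smartphone", "software"},  # Technology
    1: {"match", "league", "goal"},       # Sports
}

def classifier_fn(texts):
    """Toy stand-in: one probability row per text (softmax of keyword hits)."""
    rows = []
    for text in texts:
        tokens = text.lower().split()
        scores = [sum(tok in KEYWORDS[c] for tok in tokens) for c in sorted(KEYWORDS)]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def occlusion_importance(text, predict_fn):
    """Score each word by how much removing it lowers the predicted class probability."""
    words = text.split()
    base = predict_fn([text])[0]
    target = max(range(len(base)), key=base.__getitem__)
    variants = [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]
    probs = predict_fn(variants)
    scores = [(w, base[target] - p[target]) for w, p in zip(words, probs)]
    return target, sorted(scores, key=lambda item: item[1], reverse=True)

target, ranked = occlusion_importance("New AI smartphone released", classifier_fn)
print(target, ranked[:2])
```

Words whose removal changes nothing score zero, while the keywords driving the prediction rank first.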

### 📦 Batch Processing

```bash
# Process a file of texts
python scripts/batch_processor.py process \
  --input input.csv \
  --output predictions.csv \
  --model model.joblib \
  --batch-size 100

# Evaluate predictions
python scripts/batch_processor.py evaluate \
  --predictions predictions.csv \
  --ground-truth ground_truth.csv \
  --output report.json
```
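
What `--batch-size` implies can be sketched in a few lines: the input is cut into fixed-size chunks and each chunk is passed to the model in one call. This is an illustration under that assumption, not the implementation in `scripts/batch_processor.py`; `iter_batches` and `classify_in_batches` are hypothetical names.

```python
from itertools import islice

def iter_batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def classify_in_batches(texts, predict_fn, batch_size=100):
    """Apply predict_fn (list of texts -> list of labels) batch by batch."""
    predictions = []
    for batch in iter_batches(texts, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions

# Placeholder predictor that labels every text as 0; swap in a real model.
texts = [f"document {i}" for i in range(250)]
labels = classify_in_batches(texts, lambda batch: [0] * len(batch), batch_size=100)
print(len(labels))
```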

### 📈 Visualizations

```bash
# Generate all visualizations (run from scripts/ so that ../data resolves to the data folder)
cd scripts
python visualizations.py ../data

# Outputs:
# - visualizations/category_distribution.png
# - visualizations/text_length_distribution.png
# - visualizations/wordcloud_all.png
# - visualizations/wordclouds_by_category/
```

---

## 📝 Usage Examples

### Basic Classification

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Load data
train_df = pd.read_csv('data/csv/train.csv', comment='#')

# Vectorize
tfidf = TfidfVectorizer(max_features=10000)
X = tfidf.fit_transform(train_df['text'])
y = train_df['label']

# Train
model = LogisticRegression()
model.fit(X, y)

# Predict
text = "Apple unveils new iPhone with AI features"
prediction = model.predict(tfidf.transform([text]))
print(f"Predicted: {prediction[0]}") # 0 = Technology
```

### Using Transformers (BERT)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the base model. Its 6-way classification head is randomly initialized,
# so fine-tune it first (see scripts/train_transformers.py) for meaningful output.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

# Tokenize
text = "Scientists discover new planet in nearby galaxy"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()
print(f"Predicted label: {prediction}")  # a fine-tuned model should give 5 = Science
```
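
The raw logits can be turned into a category name and a probability with a softmax. The sketch below uses plain Python and made-up logits in place of `outputs.logits[0].tolist()`; the category order follows the label table in this README.

```python
import math

CATEGORIES = ["Technology", "Sports", "Politics", "Entertainment", "Business", "Science"]

def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_category(logits):
    """Return (category name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]

# Made-up logits standing in for a real model output.
name, prob = top_category([0.1, -1.2, 0.3, -0.5, 0.0, 2.4])
print(name, round(prob, 3))
```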

---

## 📊 Model Performance

| Model | Accuracy | F1 Score | Inference time (ms) |
|-------|----------|----------|---------------------|
| Naive Bayes | 85.2% | 0.847 | ~1 |
| Logistic Regression | 89.7% | 0.892 | ~2 |
| Linear SVM | 88.9% | 0.885 | ~2 |
| BERT (fine-tuned) | 94.3% | 0.941 | ~50 |
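
For reference, the two metrics in the table can be computed as follows. The README does not say which F1 averaging was used, so the macro average below is an assumption, and the label lists are toy data rather than project results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(accuracy(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))
```

On a 30-document test set, one misclassified document shifts accuracy by about 3.3 percentage points, so small gaps between models are best confirmed with cross-validation.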

---

## 📜 License

This dataset is provided for **educational purposes only**.

**Copyright (c) 2026 RSK World - All Rights Reserved**

---

## 👨‍💻 Author

**Molla Samser**

- 🌐 Website: [https://rskworld.in](https://rskworld.in)
- 📧 Email: help@rskworld.in
- 📱 Phone: +91 93305 39277

### Designer & Tester

**Rima Khatun**

---

## 🤝 Support

If you have any questions or need support:

- 📧 Email: support@rskworld.in
- 🌐 Contact: [https://rskworld.in/contact.php](https://rskworld.in/contact.php)

---

## 🔗 Links

- [Homepage](index.html)
- [Data Explorer](explorer.html)
- [Download Dataset](text-classification.zip)

---

<p align="center">
<b>Made with ❤️ by RSK World</b><br>
<a href="https://rskworld.in">rskworld.in</a>
</p>
