Text Classification Dataset

A text classification dataset of 300+ labeled documents across 6 categories (Technology, Sports, Politics, Entertainment, Business, Science), split into training (240 samples), validation (30), and test (30) sets and available in CSV, JSON, and TXT formats. The bundle includes Python scripts for text preprocessing, traditional ML training (Naive Bayes, Logistic Regression, SVM), transformer-based training (BERT, RoBERTa), data augmentation with 6 techniques, a Flask REST API server, model explainability with LIME, batch processing, and visualizations with word clouds. An interactive data explorer, live classifier demo, and analytics dashboard are also included. The dataset is suited to news categorization, topic classification, document analysis, and NLP education projects.

This project provides a Text Classification dataset for NLP, document categorization, and machine learning work. It contains 300+ labeled documents across 6 categories: Technology, Sports, Politics, Entertainment, Business, and Science. The data comes pre-split into train, validation, and test sets and is provided in CSV, JSON, and TXT formats. The bundled Python scripts are preprocessing.py for text cleaning, train_classifier.py for traditional ML models, train_transformers.py for BERT/RoBERTa training, data_augmentation.py with 6 techniques, api_server.py for REST API deployment, model_explainability.py for LIME-based explanations, batch_processor.py for high-throughput classification, and visualizations.py for word clouds and charts. The package also includes an interactive data explorer, a live classifier demo, an analytics dashboard, a README.md, and an MIT License. It is aimed at data scientists, researchers, students, and developers working on news categorization, topic classification, document analysis, and NLP education projects.

Dataset Overview

Complete text classification dataset with 300+ labeled documents across 6 categories for NLP and machine learning.

  • 300+ labeled documents
  • 6 category classes
  • Technology, Sports, Politics
  • Entertainment, Business, Science
  • News articles included
  • Topic classification ready
  • Pre-split: Train (240), Val (30), Test (30)
  • Average text length: ~20 words
  • Balanced class distribution
  • Clean, high-quality labels
  • Perfect for NLP & ML training
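
The 240/30/30 split and the balanced class distribution listed above are consistent with an 80/10/10 stratified split. The sketch below shows one way such a split could be produced. It is not the method used to build this dataset: the exact count of 300 documents, the 50 documents per category, and the function name stratified_split are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, train=0.8, val=0.1, seed=42):
    """Split (text, label) rows per class so each split keeps the class balance."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = round(len(items) * train)
        n_val = round(len(items) * val)
        splits["train"] += items[:n_train]
        splits["validation"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]  # remainder goes to test
    return splits

CATEGORIES = ["Technology", "Sports", "Politics", "Entertainment", "Business", "Science"]
# Placeholder documents (assumption): 50 per category, 300 in total.
rows = [(f"sample {i} about {c}", c) for c in CATEGORIES for i in range(50)]
splits = stratified_split(rows)
print({name: len(part) for name, part in splits.items()})
```

Splitting per class rather than over the whole list keeps every category equally represented in the small validation and test sets.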

Dataset Structure & Files

Well-organized folder structure with training, validation, and test splits plus comprehensive data files.

  • data/csv/train.csv - Training set (240)
  • data/csv/validation.csv - Validation set
  • data/csv/test.csv - Test set
  • data/csv/full_dataset.csv - Complete data
  • data/json/dataset.json - JSON format
  • data/json/full_dataset.json - Full JSON
  • data/txt/categories.txt - Category labels
  • scripts/ - Python utilities
  • Consistent naming convention
  • Easy to load with pandas
  • Transformer-ready format
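
As a minimal sketch of reading one of the CSV splits listed above, the following uses only the standard library csv module so it runs without extra dependencies. The column names text and label are assumptions, since the page does not state the CSV schema, and the in-memory sample stands in for data/csv/train.csv.

```python
import csv
import io
from collections import Counter

def load_split(handle):
    """Read rows from an open CSV file into (text, label) tuples."""
    reader = csv.DictReader(handle)  # uses the header row for field names
    return [(row["text"], row["label"]) for row in reader]

# Stand-in for: open("data/csv/train.csv", encoding="utf-8", newline="")
# The rows below are made-up examples, not records from the dataset.
sample = io.StringIO(
    "text,label\n"
    "New smartphone chip doubles battery life,Technology\n"
    "Home team wins the championship final,Sports\n"
    "\"Markets rally, led by bank stocks\",Business\n"
)
rows = load_split(sample)
label_counts = Counter(label for _, label in rows)
print(len(rows), dict(label_counts))
```

Counting labels right after loading is a quick way to confirm that a split has the expected size and class balance.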

Machine Learning Training

Complete training pipeline with support for traditional ML and transformer-based models.

  • Naive Bayes classifier (85.2%)
  • Logistic Regression (89.7%)
  • Linear SVM (88.9%)
  • BERT fine-tuning (94.3%)
  • TF-IDF vectorization
  • Count vectorization
  • Cross-validation support
  • Model checkpointing
  • Performance metrics report
  • Hyperparameter tuning
  • Model export & persistence

Multiple File Formats

Dataset available in multiple formats for maximum compatibility with different NLP tools and frameworks.

  • CSV format (.csv files)
  • JSON format with metadata
  • Plain text format (.txt)
  • Pandas DataFrame ready
  • HuggingFace compatible
  • Transformer pipeline ready
  • Easy format conversion
  • Unicode text support
  • UTF-8 encoding
  • Comment lines for metadata
  • Header row included

Analysis & Visualization

Comprehensive analysis tools with visualization capabilities and interactive explorer.

  • Interactive Data Explorer
  • Category distribution charts
  • Word frequency analysis
  • Word cloud generation
  • Text length histogram
  • Confusion matrix plots
  • Performance benchmarking
  • Model comparison tools
  • HTML report generation
  • Export visualization images
  • Analytics Dashboard

Compatible Frameworks

Works with the NLP frameworks and libraries listed below.

  • HuggingFace Transformers
  • BERT / RoBERTa / DistilBERT
  • scikit-learn ML library
  • pandas data manipulation
  • Flask REST API
  • LIME explainability
  • matplotlib visualization
  • wordcloud generation
  • TensorFlow/Keras ready
  • PyTorch compatible
  • Jupyter Notebook support

What You Get

Complete package with all files needed for professional text classification projects.

  • 300+ labeled documents
  • 10+ Python utility scripts
  • preprocessing.py - Text preprocessor
  • train_classifier.py - Traditional ML
  • train_transformers.py - BERT training
  • data_augmentation.py - 6 techniques
  • api_server.py - REST API server
  • model_explainability.py - LIME
  • batch_processor.py - Batch processing
  • visualizations.py - Charts & clouds
  • Interactive demo website

Interactive Demo Website

Beautiful demo website with data explorer, live classifier, analytics dashboard, and comprehensive guide.

  • Modern animated design
  • Interactive Data Explorer
  • Live Text Classifier
  • Analytics Dashboard
  • Filter by category
  • Real-time predictions
  • Category visualization
  • Performance metrics display
  • Step-by-step usage guide
  • Dark theme with gradients
  • Fully responsive layout

Python Scripts Included

Professional Python scripts for preprocessing, training, augmentation, API deployment, and explainability.

  • preprocessing.py - Text cleaning & tokenization
  • train_classifier.py - Traditional ML training
  • train_transformers.py - BERT/Transformer training
  • data_augmentation.py - 6 augmentation techniques
  • api_server.py - Flask REST API server
  • model_explainability.py - LIME explanations
  • batch_processor.py - High-throughput processing
  • visualizations.py - Charts & word clouds
  • hyperparameter_tuning.py - Optimization
  • data_quality.py - Quality checks
  • deep_learning.py - Neural networks
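
The page does not name the 6 techniques in data_augmentation.py. As a hedged illustration of what text augmentation can look like, the sketch below implements two common ones, random deletion and random swap. Whether these are among the six is an assumption, and all function names here are hypothetical.

```python
import random

def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, n_swaps, rng):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = list(words)  # copy so the caller's list is not modified
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def augment(text, seed=0):
    """Return two augmented variants of the text: one deletion, one swap."""
    rng = random.Random(seed)  # seeded for reproducible variants
    words = text.split()
    return [
        " ".join(random_deletion(words, p=0.2, rng=rng)),
        " ".join(random_swap(words, n_swaps=1, rng=rng)),
    ]

original = "Central bank raises interest rates to curb inflation"
variants = augment(original)
for variant in variants:
    print(variant)
```

Each variant keeps the original label, so augmentation of this kind enlarges the training split only and should not be applied to validation or test data.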

Classification Categories

6 distinct categories covering major news and document classification domains.

  • Technology - Tech news, gadgets, AI (🔵 Blue)
  • Sports - Athletics, competitions, leagues (🟢 Green)
  • Politics - Government, policy, elections (🟣 Purple)
  • Entertainment - Movies, music, TV (🩷 Pink)
  • Business - Finance, markets, economy (🟡 Amber)
  • Science - Research, discoveries, health (🔵 Cyan)
  • Clear labeling criteria
  • Human-verified labels
  • Balanced distribution
  • Easy to extend classes
  • Total: 300+ samples

Credits & Acknowledgments

This dataset is provided for educational and research purposes. Core technologies and libraries are credited below.

  • Python 3.8+ - Programming language (PSF License)
  • HuggingFace Transformers - BERT, RoBERTa (Apache 2.0)
  • scikit-learn - Machine Learning (BSD License)
  • Flask - REST API Framework (BSD License)
  • LIME - Model Explainability (BSD License)
  • matplotlib - Data Visualization (PSF License)
  • RSK World - Dataset creator and provider
  • GitHub Repository - Source code and releases
  • Author: Molla Samser | Designer: Rima Khatun
  • MIT License - Free for learning & research

Support & Contact

For commercial use, custom datasets, or integration help, please contact us.

  • Email: help@rskworld.in
  • Phone: +91 93305 39277
  • Website: RSKWORLD.in
  • Location: Nutanhat, Mongolkote, West Bengal, India
  • Author: Molla Samser
  • Designer & Tester: Rima Khatun
  • GitHub: Coming Soon
  • Text Classification Dataset Documentation
  • Technical Support Available
  • Custom Dataset Requests Welcome

© 2026 RSK World. All rights reserved.

Content used for educational purposes only.
