# Data Directory

<!--
Project: Dask Parallel Computing
Author: Molla Samser
Designer & Tester: Rima Khatun
Website: https://rskworld.in
Email: help@rskworld.in, support@rskworld.in
Phone: +91 93305 39277
-->

This directory contains sample data files used in the Dask examples.

## Basic Data Files

### Core Examples
- **sample_data.csv** (100K rows) - Basic sample dataset for DataFrame examples
  - Columns: id, value1, value2, category, date
  - Used in: `02_dask_dataframes.ipynb`

- **advanced_data_1.csv** (100K rows) - Time series data with categories
  - Columns: id, timestamp, value, category, region, amount
  - Used in: `07_advanced_dataframes.ipynb`

- **advanced_data_2.csv** (100K rows) - Metadata dataset for joins
  - Columns: id, metadata, status, score
  - Used in: `07_advanced_dataframes.ipynb`

- **timeseries_data.csv** (100K rows) - Time series sensor data
  - Columns: timestamp, value, temperature, humidity, pressure
  - Used in: Advanced data processing scripts

- **profile_data.csv** (100K rows) - Data for performance profiling
  - Columns: id, value, category, amount
  - Used in: `performance_profiling.py`

- **complex_data.csv** (100K rows) - Complex transaction-like data
  - Columns: id, date, amount, category, region, discount, final_amount
  - Used in: `advanced_data_processing.py`

## Advanced Data Files

### Large Scale Datasets
- **timeseries_large.csv** (1M rows) - Large time series dataset
  - Columns: timestamp, sensor_id, temperature, humidity, pressure, value, status
  - Used for: Large-scale time series processing

- **transactions_large.csv** (2M rows) - Large transaction dataset
  - Columns: transaction_id, timestamp, customer_id, product_id, amount, quantity, category, region, payment_method, discount, final_amount
  - Used for: Large-scale transaction analysis

- **ml_dataset.csv** (500K rows, 100 features) - Machine learning dataset
  - Columns: feature_0 to feature_99, target
  - Used for: Machine learning examples

- **network_data.csv** (100K+ edges) - Network/graph data
  - Columns: source, target, weight, timestamp
  - Used for: Network analysis examples

### Batch Files for Parallel Processing
- **batch_file_000.csv to batch_file_009.csv** (10 files, 100K rows each)
  - Columns: id, value, category, score, date
  - Used for: Demonstrating parallel file processing

- **file_0.csv to file_4.csv** (5 files, 10K rows each)
  - Columns: id, value, category
  - Used for: Basic parallel file processing examples
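One common pattern for these batch files is per-file work fanned out with `dask.delayed`. A minimal sketch, using tiny throwaway files written to a temp directory so it runs anywhere (the real files live in `data/`, and counting rows stands in for whatever per-file processing the examples do):

```python
import os
import tempfile
import pandas as pd
import dask

# Create a few tiny stand-in batch files
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, f'batch_file_{i:03d}.csv')
    pd.DataFrame({'id': range(4), 'value': [i] * 4}).to_csv(p, index=False)
    paths.append(p)

@dask.delayed
def row_count(path):
    # Any per-file work goes here; counting rows keeps the sketch small
    return len(pd.read_csv(path))

# One delayed task per file, executed in parallel by dask.compute
counts = dask.compute(*[row_count(p) for p in paths])
print(counts)  # (4, 4, 4)
```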

### Unstructured Data
- **nested_data.jsonl** (100K records) - JSON Lines format nested data
  - Contains: User data with nested orders and metadata
  - Used in: `06_dask_bags.ipynb` and advanced data processing

## Data Generation

To regenerate all data files, run:

```bash
# Generate advanced/large datasets
python scripts/generate_advanced_data.py

# Generate basic sample datasets
python scripts/create_basic_data.py
```
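For orientation, a generation routine for `sample_data.csv` might look like the sketch below. The column names come from the schema listed above; the distributions, category labels, and row count are assumptions for illustration, not the actual logic of `create_basic_data.py`:

```python
import os
import tempfile
import numpy as np
import pandas as pd

def make_sample_data(n_rows: int = 1_000, seed: int = 0) -> pd.DataFrame:
    """Synthetic frame with sample_data.csv's columns:
    id, value1, value2, category, date. Distributions are illustrative."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'id': np.arange(n_rows),
        'value1': rng.normal(size=n_rows),
        'value2': rng.uniform(0, 100, size=n_rows),
        'category': rng.choice(['A', 'B', 'C', 'D'], size=n_rows),
        'date': pd.Timestamp('2024-01-01')
                + pd.to_timedelta(np.arange(n_rows), unit='h'),
    })

# Write to a temp dir here; the real scripts write into data/
out = os.path.join(tempfile.mkdtemp(), 'sample_data.csv')
make_sample_data().to_csv(out, index=False)
```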

## File Sizes

- Basic files: ~1-5 MB each
- Large files: ~50-200 MB each
- Total data size: ~500 MB - 1 GB

## Note

- Large data files are excluded from version control (see `.gitignore`)
- Data files are generated with random data for demonstration purposes
- You can modify the generation scripts to create custom datasets
- All timestamps and IDs are synthetic for testing purposes

## Usage Examples

### Reading Basic Data
```python
import dask.dataframe as dd
df = dd.read_csv('data/sample_data.csv')
```

### Reading Large Data
```python
df = dd.read_csv('data/timeseries_large.csv', parse_dates=['timestamp'])
```

### Reading Multiple Files
```python
# A glob pattern loads every matching file into one Dask DataFrame;
# an explicit list of paths works as well.
df = dd.read_csv('data/batch_file_*.csv')
```

### Reading JSON Data
```python
import json
import dask.bag as db

# Each line of the .jsonl file is one JSON record; parse lazily
bag = db.read_text('data/nested_data.jsonl').map(json.loads)
```
