dask-parallel
Parallel and distributed computing with Dask
notebooks/06_dask_bags.ipynb
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Dask Bags - Processing Unstructured Data\n",
        "\n",
        "<!--\n",
        "Project: Dask Parallel Computing\n",
        "Author: Molla Samser\n",
        "Designer & Tester: Rima Khatun\n",
        "Website: https://rskworld.in\n",
        "Email: help@rskworld.in, support@rskworld.in\n",
        "Phone: +91 93305 39277\n",
        "-->\n",
        "\n",
        "This notebook demonstrates Dask Bags, which process unstructured or semi-structured data such as text, JSON records, and log lines in parallel. The examples build bags from in-memory Python lists.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import dask.bag as db\n",
        "import json\n",
        "\n",
        "print(\"Dask Bags Demo\")\n",
        "print(\"=\" * 50)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Creating Bags from Lists\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create a bag from a list\n",
        "data = list(range(1000))\n",
        "bag = db.from_sequence(data, npartitions=4)\n",
        "\n",
        "print(f\"Bag partitions: {bag.npartitions}\")\n",
        "# Bags are lazy and do not support len(); count() builds the task, compute() runs it\n",
        "print(f\"Total elements: {bag.count().compute()}\")\n",
        "print(f\"First 10 elements: {bag.take(10)}\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Processing Text Data\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Process text data in parallel\n",
        "text_data = [\n",
        "    \"Python is a great programming language\",\n",
        "    \"Dask enables parallel computing\",\n",
        "    \"Data science requires efficient tools\",\n",
        "    \"Machine learning needs scalable solutions\",\n",
        "    \"Big data processing is essential\"\n",
        "] * 200  # Repeat to create more data\n",
        "\n",
        "text_bag = db.from_sequence(text_data, npartitions=4)\n",
        "\n",
        "# Word count example\n",
        "def count_words(text):\n",
        "    return len(text.split())\n",
        "\n",
        "word_counts = text_bag.map(count_words)\n",
        "total_words = word_counts.sum().compute()\n",
        "\n",
        "print(f\"Total words: {total_words}\")\n",
        "print(f\"Average words per sentence: {total_words / len(text_data):.2f}\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Processing JSON Data\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create sample JSON data\n",
        "json_data = [\n",
        "    {\"id\": i, \"name\": f\"User_{i}\", \"score\": i * 10, \"category\": [\"A\", \"B\", \"C\"][i % 3]}\n",
        "    for i in range(1000)\n",
        "]\n",
        "\n",
        "# Convert to JSON strings\n",
        "json_strings = [json.dumps(item) for item in json_data]\n",
        "json_bag = db.from_sequence(json_strings, npartitions=4)\n",
        "\n",
        "# Parse each JSON string, then keep only records with score > 500\n",
        "filtered = json_bag.map(json.loads).filter(lambda record: record[\"score\"] > 500)\n",
        "results = filtered.compute()\n",
        "\n",
        "print(f\"Filtered {len(results)} items with score > 500\")\n",
        "print(f\"Sample result: {results[0] if results else 'None'}\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Advanced Bag Operations\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Advanced operations: map, filter, reduce, groupby\n",
        "numbers = list(range(1, 10001))\n",
        "number_bag = db.from_sequence(numbers, npartitions=8)\n",
        "\n",
        "# Map: square all numbers\n",
        "squared = number_bag.map(lambda x: x ** 2)\n",
        "\n",
        "# Filter: keep only even squares\n",
        "even_squares = squared.filter(lambda x: x % 2 == 0)\n",
        "\n",
        "# Reduce: sum all even squares\n",
        "total = even_squares.sum().compute()\n",
        "\n",
        "print(f\"Sum of even squares from 1-10000: {total}\")\n",
        "\n",
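        "# Groupby example: sum the values that share a key\n",
        "# Note: groupby shuffles data between partitions; for simple reductions, foldby is usually cheaper\n",
        "data_with_keys = [(i % 5, i) for i in range(100)]\n",
        "keyed_bag = db.from_sequence(data_with_keys, npartitions=4)\n",
        "grouped = keyed_bag.groupby(lambda pair: pair[0]).map(\n",
        "    lambda group: (group[0], sum(value for _, value in group[1]))\n",
        ")\n",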
        "grouped_results = grouped.compute()\n",
        "\n",
        "print(f\"\\nGrouped results: {dict(grouped_results)}\")\n"
      ]
    }
  ],
  "metadata": {
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
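
The groupby cell in the notebook collects every value for a key before summing it. A cheaper pattern for simple reductions is to fold each partition into per-key partial results and then combine those partials, which is the idea behind `Bag.foldby(key, binop, initial, combine, combine_initial)`. The sketch below illustrates that idea in plain Python and is not Dask's implementation. The helper names `fold_partition` and `combine_partials` and the four list slices standing in for partitions are hypothetical, introduced only for this example.

```python
from operator import add


def fold_partition(partition, key, binop, initial):
    """Fold one partition into a {key: accumulated value} dict."""
    acc = {}
    for item in partition:
        k = key(item)
        acc[k] = binop(acc.get(k, initial), item)
    return acc


def combine_partials(partials, combine, initial):
    """Merge the per-partition dicts into one {key: total} dict."""
    totals = {}
    for partial in partials:
        for k, value in partial.items():
            totals[k] = combine(totals.get(k, initial), value)
    return totals


# Same data as the notebook's groupby example
data_with_keys = [(i % 5, i) for i in range(100)]

# Assumption: four contiguous slices stand in for the bag's four partitions
partitions = [data_with_keys[start:start + 25] for start in range(0, 100, 25)]

# Step 1: reduce each partition independently (this part can run in parallel)
partials = [
    fold_partition(
        part,
        key=lambda pair: pair[0],
        binop=lambda total, pair: total + pair[1],
        initial=0,
    )
    for part in partitions
]

# Step 2: combine the small per-partition results
grouped_totals = combine_partials(partials, combine=add, initial=0)

print(f"Grouped totals: {dict(sorted(grouped_totals.items()))}")
```

Only the small per-key dictionaries move between the two steps, never the full list of values for a key, which is why this pattern tends to scale better than a full groupby.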