RSK World - Polars Fast DataFrames - Project Files | RSK World - Free Programming Resources & Source Code

notebooks/02_lazy_evaluation.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Lazy Evaluation and Optimization in Polars\n",
        "\n",
        "<!--\n",
        "Author: RSK World\n",
        "Website: https://rskworld.in\n",
        "Email: help@rskworld.in\n",
        "Phone: +91 93305 39277\n",
        "-->\n",
        "\n",
        "This notebook demonstrates Polars' lazy evaluation capabilities, which allow for query optimization and efficient processing of large datasets.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Author: RSK World\n",
        "# Website: https://rskworld.in\n",
        "# Email: help@rskworld.in\n",
        "# Phone: +91 93305 39277\n",
        "\n",
        "import polars as pl\n",
        "import numpy as np\n",
        "from datetime import datetime, timedelta\n",
        "\n",
        "print(\"Polars version:\", pl.__version__)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 1. Understanding Lazy Evaluation\n",
        "\n",
        "Lazy evaluation means that operations are not executed immediately. Instead, Polars builds a query plan that is optimized before execution.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create a sample DataFrame\n",
        "df = pl.DataFrame({\n",
        "    'id': range(1, 10001),\n",
        "    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], 10000),\n",
        "    'value1': np.random.randn(10000) * 100,\n",
        "    'value2': np.random.randn(10000) * 50,\n",
        "    'value3': np.random.randint(1, 1000, 10000)\n",
        "})\n",
        "\n",
        "print(\"DataFrame shape:\", df.shape)\n",
        "print(\"\\nFirst few rows:\")\n",
        "df.head()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 2. Creating a LazyFrame\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Convert DataFrame to LazyFrame\n",
        "lazy_df = df.lazy()\n",
        "\n",
        "print(\"Type:\", type(lazy_df))\n",
        "print(\"\\nLazyFrame operations are not executed until .collect() is called\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 3. Building a Lazy Query\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Build a complex query (not executed yet)\n",
        "query = (lazy_df\n",
        "    .filter(pl.col('value1') > 50)\n",
        "    .filter(pl.col('value2') < 20)\n",
        "    .select(['id', 'category', 'value1', 'value2'])\n",
        "    .group_by('category')\n",
        "    .agg([\n",
        "        pl.col('value1').mean().alias('avg_value1'),\n",
        "        pl.col('value2').mean().alias('avg_value2'),\n",
        "        pl.count().alias('count')\n",
        "    ])\n",
        "    .sort('avg_value1', descending=True)\n",
        ")\n",
        "\n",
        "print(\"Query built but not executed yet!\")\n",
        "print(\"Type:\", type(query))\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 4. Viewing the Query Plan\n",
        "\n",
        "Polars can show you the optimized query plan before execution.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Show the query plan\n",
        "print(\"Query Plan:\")\n",
        "print(query.explain())\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 5. Executing the Query\n",
        "\n",
        "Now we execute the query using `.collect()`\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Execute the query\n",
        "result = query.collect()\n",
        "print(\"Query executed!\")\n",
        "result\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 6. Query Optimization Benefits\n",
        "\n",
        "Lazy evaluation allows Polars to optimize queries by:\n",
        "- Pushing predicates down (filtering early)\n",
        "- Projection pushdown (selecting only needed columns)\n",
        "- Predicate combination (combining multiple filters)\n",
        "- Join reordering\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Example: Reading from CSV with lazy evaluation\n",
        "# This is more efficient for large files\n",
        "try:\n",
        "    lazy_from_csv = pl.scan_csv('data/sample_data.csv')\n",
        "    print(\"LazyFrame from CSV created\")\n",
        "    print(\"\\nQuery plan:\")\n",
        "    print(lazy_from_csv.filter(pl.col('price') > 100).select(['name', 'price']).explain())\n",
        "except FileNotFoundError:\n",
        "    print(\"Sample data file not found. Run data_generator.py first.\")\n"
      ]
    }
  ],
  "metadata": {
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
