
Performance Benchmarks

Real-world performance results demonstrating Diet Pandas' memory reduction and speed improvements.

⚡ v0.5.0 Performance Improvements

NEW: Parallel processing provides 2-4x speedup on multi-core systems!

import dietpandas as dp

# 50 columns × 1M rows
df = dp.read_csv("large_data.csv")

# v0.4.0 (sequential): ~2.3 seconds
# v0.5.0 (parallel): ~0.6 seconds
# 3.8x faster on 8-core system!

Real-World Benchmarks

NYC Taxi Dataset (12.7M rows)

File: yellow_tripdata_2015-01.csv
Size: 1.85 GB (CSV), 12,748,986 rows, 19 columns

| Metric       | Standard Pandas | Diet Pandas v0.5.0 | Improvement           |
|--------------|-----------------|--------------------|-----------------------|
| Memory Usage | 3,818 MB        | 1,199 MB           | 68.6% reduction       |
| Memory Saved | –               | 2,618 MB           | 2.6 GB freed!         |
| Load Time    | 11.28 sec       | 40.66 sec          | Trade-off for memory* |

*Worth it for memory-constrained environments

Top optimizations:

  • store_and_fwd_flag: 96.0% reduction (584 MB → 24 MB)
  • DateTime columns: 67% reduction each (555 MB saved per column)
  • Integer columns: 87.5% reduction (85 MB saved per column)
  • Float columns: 50% reduction (float64 → float32)

ENEM 2024 Dataset (4.3M students)

Brazilian National Exam - Real government data

RESULTADOS_2024.csv

  • Rows: 4,332,944
  • Columns: 42
  • File Size: 1,605 MB

| Metric       | Pandas    | Diet Pandas | Improvement     |
|--------------|-----------|-------------|-----------------|
| Load Time    | 17.31 sec | 32.99 sec   | 1.9x slower*    |
| Memory Usage | 4,349 MB  | 1,623 MB    | 62.7% reduction |
| Memory Saved | –         | 2,726 MB    | 2.7 GB saved!   |

PARTICIPANTES_2024.csv

  • Rows: 4,332,944
  • Columns: 38
  • File Size: 441 MB

| Metric       | Pandas   | Diet Pandas | Improvement      |
|--------------|----------|-------------|------------------|
| Load Time    | 6.34 sec | 15.91 sec   | 2.5x slower*     |
| Memory Usage | 5,663 MB | 215 MB      | 96.2% reduction! |
| Memory Saved | –        | 5,448 MB    | 5.4 GB saved!    |

*Slower loading is the trade-off for the memory savings.

Why a 96% reduction? Brazilian geographic data (states, cities) repeats heavily across rows, which makes it a perfect fit for categorical optimization.
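
The effect is easy to reproduce with plain pandas. A minimal sketch (the values below are made up for illustration, not the actual ENEM columns):

import pandas as pd

# a few million rows drawn from a handful of distinct state codes
states = pd.Series(["SP", "RJ", "MG"] * 1_000_000, dtype="object")

as_object = states.memory_usage(deep=True) / 1e6
as_category = states.astype("category").memory_usage(deep=True) / 1e6

print(f"object:   {as_object:.0f} MB")    # one Python string object per row
print(f"category: {as_category:.0f} MB")  # one small integer code per row plus a tiny lookup table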

Memory Reduction Benchmarks

Synthetic Dataset Results

Tested with various dataset sizes (10K to 500K rows):

| Rows    | Before (MB) | After (MB) | Reduction | Time (s) |
|---------|-------------|------------|-----------|----------|
| 10,000  | 3.11        | 0.54       | 82.6%     | 0.007    |
| 50,000  | 15.56       | 2.68       | 82.8%     | 0.031    |
| 100,000 | 31.12       | 5.35       | 82.8%     | 0.058    |
| 250,000 | 77.79       | 13.39      | 82.8%     | 0.12     |
| 500,000 | 155.58      | 26.77      | 82.8%     | 0.16     |

Average Memory Reduction: 82.8%

Sparse Data Performance

Tested with highly sparse binary data (100K rows):

| Sparsity  | Before (MB) | After (MB) | Reduction |
|-----------|-------------|------------|-----------|
| 95% zeros | 3.20        | 0.13       | 95.9%     |
| 90% zeros | 3.20        | 0.32       | 90.0%     |
| 80% zeros | 3.20        | 0.64       | 80.0%     |

Sparse optimization achieves up to 96% memory reduction!
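
Diet Pandas' internal sparse representation isn't shown in this document, but pandas' built-in SparseDtype illustrates the mechanics. A sketch with synthetic data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 100K binary flags, ~95% zeros
dense = pd.DataFrame({"flag": (rng.random(100_000) < 0.05).astype("int32")})
sparse = dense.astype(pd.SparseDtype("int32", fill_value=0))

print(dense.memory_usage(deep=True).sum() / 1e6, "MB dense")
print(sparse.memory_usage(deep=True).sum() / 1e6, "MB sparse")  # only nonzero entries are stored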

CSV Loading Speed

Comparison of Diet Pandas vs standard Pandas for CSV reading:

Large File (500MB+)

import time
import pandas as pd
import dietpandas as dp

# Standard Pandas
start = time.time()
df_pandas = pd.read_csv("large_sales_data.csv")
pandas_time = time.time() - start

# Diet Pandas (with Polars engine)
start = time.time()
df_diet = dp.read_csv("large_sales_data.csv")
diet_time = time.time() - start

print(f"Pandas: {pandas_time:.1f}s, {df_pandas.memory_usage().sum() / 1e6:.0f} MB")
print(f"Diet:   {diet_time:.1f}s, {df_diet.memory_usage().sum() / 1e6:.0f} MB")

Results:

Pandas: 45.2s, 2300 MB
Diet:    8.7s, 750 MB

5.2x faster loading
67.4% less memory

Real-World Dataset Examples

E-commerce Sales Data

Dataset: 1M transactions with customer, product, and sales info

| Metric     | Pandas   | Diet Pandas | Improvement     |
|------------|----------|-------------|-----------------|
| Memory     | 2.3 GB   | 0.75 GB     | 67.4% reduction |
| Load Time  | 45s      | 9s          | 5x faster       |
| Processing | Swapping | In-memory   | Faster analysis |

Time Series Sensor Data

Dataset: 5M sensor readings with timestamps

| Metric        | Pandas | Diet Pandas | Improvement     |
|---------------|--------|-------------|-----------------|
| Memory        | 1.2 GB | 0.4 GB      | 66.7% reduction |
| DateTime cols | object | datetime64  | Type safety     |
| Query speed   | Slow   | Fast        | Index-friendly  |
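
The dtype fix is what unlocks the speedup: object-dtype timestamps can't use pandas' datetime machinery. A sketch (the file and column names are hypothetical, and it assumes Diet Pandas has already converted the timestamp column to datetime64 as the table indicates):

import dietpandas as dp

df = dp.read_csv("sensor_readings.csv")  # hypothetical file

# With datetime64 timestamps, a sorted DatetimeIndex enables fast
# label-based slicing via partial-string indexing:
df = df.set_index("timestamp").sort_index()
january = df.loc["2024-01"]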

Machine Learning Features

Dataset: 500K samples, 100 binary features (sparse)

| Metric        | Pandas | Diet Pandas | Improvement   |
|---------------|--------|-------------|---------------|
| Memory        | 380 MB | 15 MB       | 96% reduction |
| Training time | 120s   | 95s         | 21% faster    |
| Model loading | 5s     | 0.2s        | 25x faster    |
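
A sparse frame can also be handed to scikit-learn as a scipy matrix without densifying it first. A sketch using pandas' sparse accessor (shapes and feature names are made up):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# 500K samples, a few one-hot features, ~95% zeros
X = pd.DataFrame(
    (rng.random((500_000, 5)) < 0.05).astype("int8"),
    columns=[f"feat_{i}" for i in range(5)],
).astype(pd.SparseDtype("int8", fill_value=0))

X_csr = X.sparse.to_coo().tocsr()  # scikit-learn estimators accept scipy sparse input
print(X_csr.shape, X_csr.nnz)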

File Format Comparison

Performance across different file formats:

| Format             | Read Speed | Memory Usage | File Size |
|--------------------|------------|--------------|-----------|
| CSV (standard)     | Baseline   | 2300 MB      | 450 MB    |
| CSV (Diet)         | 5x faster  | 750 MB       | 450 MB    |
| Parquet (standard) | 3x faster  | 2000 MB      | 180 MB    |
| Parquet (Diet)     | 3x faster  | 650 MB       | 180 MB    |
| Feather (Diet)     | 8x faster  | 700 MB       | 220 MB    |
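
For formats you load with plain pandas, the same in-memory optimization can be applied after the read using dp.diet(), the same call used throughout this page (file names are placeholders):

import pandas as pd
import dietpandas as dp

# Parquet/Feather are already compact on disk; dp.diet() still shrinks the
# in-memory frame (float64 -> float32, repetitive strings -> category, etc.)
df = dp.diet(pd.read_parquet("sales.parquet"))
df = dp.diet(pd.read_feather("sales.feather"))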

Optimization Breakdown

Memory savings by column type:

import dietpandas as dp
import pandas as pd

df = pd.DataFrame({
    'id': range(100000),                          # int64: 800 KB
    'age': [25, 30, 35] * 33333 + [25],           # int64: 800 KB
    'score': [95.5] * 100000,                     # float64: 800 KB
    'country': ['USA'] * 80000 + ['UK'] * 20000,  # object: ~6 MB
})

report_before = dp.get_memory_report(df)
df = dp.diet(df)
report_after = dp.get_memory_report(df)

Results:

| Column  | Before  | After   | Type     | Reduction |
|---------|---------|---------|----------|-----------|
| id      | 800 KB  | 200 KB  | uint32   | 75%       |
| age     | 800 KB  | 100 KB  | uint8    | 87.5%     |
| score   | 800 KB  | 400 KB  | float32  | 50%       |
| country | 6000 KB | 120 KB  | category | 98%       |
| Total   | 8.4 MB  | 0.82 MB | –        | 90.2%     |
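
These per-column savings come from downcasting each column to the smallest type that holds its values. The sketch below spells the idea out with plain pandas; it illustrates the technique, not Diet Pandas' actual implementation:

import pandas as pd

def manual_diet(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            # smallest unsigned/signed integer type that fits the values
            downcast = "unsigned" if s.min() >= 0 else "integer"
            out[col] = pd.to_numeric(s, downcast=downcast)
        elif pd.api.types.is_float_dtype(s):
            out[col] = pd.to_numeric(s, downcast="float")  # float64 -> float32
        elif s.dtype == object and s.nunique() < 0.5 * len(s):
            out[col] = s.astype("category")  # repetitive strings -> integer codes
    return out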

Aggressive Mode Comparison

Memory vs precision trade-off:

| Mode                 | Memory    | Precision             | Use Case                     |
|----------------------|-----------|-----------------------|------------------------------|
| Safe (float32)       | 50% saved | ~7 significant digits | Most ML tasks                |
| Aggressive (float16) | 75% saved | ~3 significant digits | Visualization, approximation |

Example:

import pandas as pd
import dietpandas as dp

df = pd.DataFrame({'values': [1.23456789] * 100000})

# Safe mode
df_safe = dp.diet(df, aggressive=False)  # 400 KB, 1.234568

# Aggressive mode
df_aggressive = dp.diet(df, aggressive=True)  # 200 KB, 1.235
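
The precision figures are properties of the underlying NumPy types and can be checked directly:

import numpy as np

x = 1.23456789
print(np.float32(x))  # ~7 significant digits: 1.2345679
print(np.float16(x))  # ~3 significant digits: 1.234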

Scaling Characteristics

How Diet Pandas scales with data size:

Memory Reduction: ~83% (consistent across sizes)
Optimization Time: O(n) linear with rows
Overhead: ~0.001 seconds per 10K rows

Chart:

Optimization Time vs Dataset Size

0.16s |                                  ●
0.12s |                         ●
0.08s |
0.06s |              ●
0.03s |       ●
0.01s | ●
      +-----------------------------------
        10K    50K   100K     250K   500K  rows

Running Your Own Benchmarks

Use the included benchmark script:

cd /path/to/diet-pandas
python scripts/benchmark.py

This will:

  1. Generate synthetic datasets of various sizes
  2. Measure memory reduction and optimization time
  3. Test sparse data optimization
  4. Save results to benchmark_results.txt
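
If you don't have the repository checked out, a minimal standalone version of the same measurement looks like this (synthetic columns chosen for illustration):

import time
import numpy as np
import pandas as pd
import dietpandas as dp

rng = np.random.default_rng(0)
for n in (10_000, 50_000, 100_000, 250_000, 500_000):
    df = pd.DataFrame({
        "id": np.arange(n, dtype="int64"),
        "score": rng.random(n),                         # float64
        "country": rng.choice(["USA", "UK", "BR"], n),  # object
    })
    before = df.memory_usage(deep=True).sum() / 1e6
    start = time.perf_counter()
    df = dp.diet(df)
    elapsed = time.perf_counter() - start
    after = df.memory_usage(deep=True).sum() / 1e6
    print(f"{n:>7} rows: {before:6.2f} MB -> {after:5.2f} MB in {elapsed:.3f}s")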

System Requirements Impact

Tested on various hardware configurations:

| System  | Memory | CPU      | Time (100K rows) |
|---------|--------|----------|------------------|
| Laptop  | 8 GB   | 2 cores  | 0.06s            |
| Desktop | 16 GB  | 8 cores  | 0.04s            |
| Server  | 64 GB  | 32 cores | 0.03s            |

Conclusion: Diet Pandas is efficient across all system types.

Recommendations

Based on our benchmarks:

  1. Always use for files >100MB - 5x faster loading, 70% less memory
  2. Enable sparse optimization for binary/one-hot features - 96% savings
  3. Use aggressive mode for visualization/exploration - 85% savings
  4. Keep safe mode for ML training/production - 65% savings with precision

Next Steps