File I/O¶
Learn how to load and save data efficiently with Diet Pandas.
Reading Files¶
Diet Pandas provides drop-in replacements for pandas I/O functions that automatically optimize memory usage.
CSV Files¶
The most common use case: CSV reading with the Polars engine for speed:
import dietpandas as dp
# Basic usage - 5-10x faster than pandas.read_csv
df = dp.read_csv("data.csv")
# Disable optimization if needed
df = dp.read_csv("data.csv", optimize=False)
# Aggressive mode for maximum compression
df = dp.read_csv("data.csv", aggressive=True)
# Pass through pandas arguments
df = dp.read_csv("data.csv", sep=";", encoding="utf-8")
Parquet Files¶
Fast columnar format with built-in compression:
import dietpandas as dp
df = dp.read_parquet("data.parquet")
# Still optimizes further!
# 🥗 Diet Complete: Memory reduced by 45.2%
Excel Files¶
import dietpandas as dp
# Read specific sheet
df = dp.read_excel("data.xlsx", sheet_name="Sales")
# Read all sheets
dfs = dp.read_excel("data.xlsx", sheet_name=None)
# Returns dict of optimized DataFrames
Note: Requires openpyxl: pip install openpyxl
JSON Files¶
import dietpandas as dp
# JSON Lines format (recommended for large files)
df = dp.read_json("data.jsonl", lines=True)
# Standard JSON
df = dp.read_json("data.json")
HDF5 Files¶
Hierarchical data format for large datasets:
Note: Requires tables: pip install tables
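A minimal sketch, assuming Diet Pandas exposes a read_hdf wrapper with the same signature as pandas.read_hdf (the function name and arguments here are assumptions; check the API Reference for the exact interface):
import dietpandas as dp
# Assumed to mirror pandas.read_hdf (hypothetical; see the API Reference)
df = dp.read_hdf("data.h5", key="dataset")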
Feather Files¶
Apache Arrow format - extremely fast:
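Reading follows the same pattern as the other readers (dp.read_feather also appears in the caching example later on this page):
import dietpandas as dp
# Loads the Feather file and optimizes dtypes on the way in
df = dp.read_feather("data.feather")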
Writing Files¶
Save optimized DataFrames to disk:
CSV¶
import dietpandas as dp
dp.to_csv_optimized(df, "output.csv")
# Pass pandas arguments
dp.to_csv_optimized(df, "output.csv", index=False, sep="|")
Parquet¶
import dietpandas as dp
dp.to_parquet_optimized(df, "output.parquet")
# With compression
dp.to_parquet_optimized(df, "output.parquet", compression="gzip")
Feather¶
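The Feather writer follows the same pattern as the CSV and Parquet writers (dp.to_feather_optimized is also used in the caching example below):
import dietpandas as dp
dp.to_feather_optimized(df, "output.feather")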
Performance Comparison¶
Loading a 500MB CSV file:
| Method | Time | Memory | Notes |
|---|---|---|---|
| pd.read_csv() | 45s | 2.3 GB | Standard |
| pd.read_csv() + diet() | 47s | 750 MB | Manual opt |
| dp.read_csv() | 8s | 750 MB | Best! |
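Exact numbers depend on your hardware and data. To time the readers on your own file, a minimal sketch (benchmark.csv is an illustrative path, and only wall-clock time is measured):
import time
import pandas as pd
import dietpandas as dp

def timed_read(reader, path):
    # Wall-clock time for a single read; memory is not measured here
    start = time.perf_counter()
    reader(path)
    return time.perf_counter() - start

# "benchmark.csv" is an illustrative local file
t_pd = timed_read(pd.read_csv, "benchmark.csv")
t_dp = timed_read(dp.read_csv, "benchmark.csv")
print(f"pandas: {t_pd:.1f}s, dietpandas: {t_dp:.1f}s")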
Choosing a File Format¶
| Format | Speed | Compression | Use Case |
|---|---|---|---|
| CSV | Medium | None | Human-readable, universal |
| Parquet | Fast | Excellent | Long-term storage |
| Feather | Very Fast | Good | Temporary storage |
| Excel | Slow | None | Business reports |
| JSON | Medium | None | Web APIs |
| HDF5 | Fast | Good | Scientific data |
Recommendations:

- Fast iteration: Feather
- Long-term storage: Parquet
- Sharing with non-Python users: CSV
- Large datasets: Parquet or HDF5
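For example, a common pattern for long-term storage is to convert incoming CSVs to Parquet once and read the Parquet copy afterwards (the file paths below are illustrative):
import dietpandas as dp
# One-time conversion: CSV in, optimized Parquet out (illustrative paths)
df = dp.read_csv("raw/events.csv")
dp.to_parquet_optimized(df, "store/events.parquet")
# Subsequent loads read the compressed, columnar copy
df = dp.read_parquet("store/events.parquet")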
Working with Multiple Files¶
Loading Multiple CSVs¶
import pandas as pd  # needed for pd.concat below
import dietpandas as dp
import glob
dfs = []
for filepath in glob.glob("data/*.csv"):
df = dp.read_csv(filepath)
dfs.append(df)
combined = pd.concat(dfs, ignore_index=True)
Batch Processing¶
import dietpandas as dp
def process_file(filepath):
df = dp.read_csv(filepath)
# Process
result = df.groupby('category')['sales'].sum()
return result
results = [process_file(f) for f in file_list]
Chunked Reading¶
For files too large to fit in memory:
import pandas as pd
import dietpandas as dp
# Read in chunks
for chunk in pd.read_csv("huge_file.csv", chunksize=10000):
# Optimize each chunk
chunk = dp.diet(chunk)
# Process
process(chunk)
URL and Cloud Storage¶
Diet Pandas works with URLs and cloud storage:
import dietpandas as dp
# From URL
df = dp.read_csv("https://example.com/data.csv")
# From S3 (with s3fs)
df = dp.read_csv("s3://bucket/data.csv")
# From Google Cloud Storage (with gcsfs)
df = dp.read_parquet("gs://bucket/data.parquet")
Compression¶
Reading compressed files:
import dietpandas as dp
# Automatic detection
df = dp.read_csv("data.csv.gz")
df = dp.read_csv("data.csv.bz2")
df = dp.read_csv("data.csv.zip")
# Still optimized!
Advanced Patterns¶
Pipeline Pattern¶
import dietpandas as dp
def load_and_prepare(filepath):
return (
dp.read_csv(filepath)
.dropna()
.query("age > 18")
.reset_index(drop=True)
)
df = load_and_prepare("users.csv")
Caching Pattern¶
import dietpandas as dp
import os
def load_with_cache(filepath, cache_path):
if os.path.exists(cache_path):
# Load from fast format
return dp.read_feather(cache_path)
else:
# Load and cache
df = dp.read_csv(filepath)
dp.to_feather_optimized(df, cache_path)
return df
df = load_with_cache("data.csv", "cache/data.feather")
Troubleshooting¶
Polars Engine Fails¶
If Polars engine fails, Diet Pandas automatically falls back to pandas:
df = dp.read_csv("complex_file.csv")
# ⚠️ Warning: Polars engine failed, falling back to pandas
# This is automatic - no action needed
Memory Issues¶
For very large files:
import dietpandas as dp
# Load without optimization first
df = dp.read_csv("huge.csv", optimize=False)
# Drop unnecessary columns
df = df[['col1', 'col2', 'col3']]
# Then optimize
df = dp.diet(df)
Encoding Issues¶
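If a file fails to decode with the default UTF-8 codec, pass an explicit encoding; as with the CSV examples above, pandas keyword arguments are forwarded to the reader (the file name and codecs below are illustrative):
import dietpandas as dp
# Explicit codec for non-UTF-8 files (illustrative file name)
df = dp.read_csv("legacy_export.csv", encoding="latin-1")
# Files exported from Windows tools often carry a BOM
df = dp.read_csv("legacy_export.csv", encoding="utf-8-sig")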
Next Steps¶
- Learn about Advanced Optimization
- Check Memory Reports
- See API Reference