File I/O¶
Learn how to load and save data efficiently with Diet Pandas.
Reading Files¶
Diet Pandas provides drop-in replacements for pandas I/O functions that automatically optimize memory usage.
CSV Files¶
The most common use case: CSV reading with the Polars engine for speed:
import dietpandas as dp
# Basic usage - 5-10x faster than pandas.read_csv
df = dp.read_csv("data.csv")
# Disable optimization if needed
df = dp.read_csv("data.csv", optimize=False)
# Aggressive mode for maximum compression
df = dp.read_csv("data.csv", aggressive=True)
# Pass through pandas arguments
df = dp.read_csv("data.csv", sep=";", encoding="utf-8")
Parquet Files¶
Fast columnar format with built-in compression:
import dietpandas as dp
df = dp.read_parquet("data.parquet")
# Still optimizes further!
# 🥗 Diet Complete: Memory reduced by 45.2%
Excel Files¶
import dietpandas as dp
# Read specific sheet
df = dp.read_excel("data.xlsx", sheet_name="Sales")
# Read all sheets
dfs = dp.read_excel("data.xlsx", sheet_name=None)
# Returns dict of optimized DataFrames
Note: Requires openpyxl: pip install openpyxl
JSON Files¶
import dietpandas as dp
# JSON Lines format (recommended for large files)
df = dp.read_json("data.jsonl", lines=True)
# Standard JSON
df = dp.read_json("data.json")
HDF5 Files¶
Hierarchical data format for large datasets:
Note: Requires tables: pip install tables
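A minimal sketch, assuming Diet Pandas exposes a read_hdf wrapper with the same signature as pandas.read_hdf (the function name and arguments here are assumptions; check the API Reference for the exact interface):
import dietpandas as dp
# Assumed to mirror pandas.read_hdf (hypothetical; see the API Reference)
df = dp.read_hdf("data.h5", key="dataset")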
Feather Files¶
Apache Arrow format - extremely fast:
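Reading follows the same pattern as the other readers (dp.read_feather also appears in the caching example later on this page):
import dietpandas as dp
# Loads the Feather file and optimizes dtypes on the way in
df = dp.read_feather("data.feather")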
Writing Files¶
Save optimized DataFrames to disk:
CSV¶
import dietpandas as dp
dp.to_csv_optimized(df, "output.csv")
# Pass pandas arguments
dp.to_csv_optimized(df, "output.csv", index=False, sep="|")
Parquet¶
import dietpandas as dp
dp.to_parquet_optimized(df, "output.parquet")
# With compression
dp.to_parquet_optimized(df, "output.parquet", compression="gzip")
Feather¶
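The Feather writer follows the same pattern as the CSV and Parquet writers (dp.to_feather_optimized is also used in the caching example below):
import dietpandas as dp
dp.to_feather_optimized(df, "output.feather")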
Performance Comparison¶
Loading a 500MB CSV file:
| Method | Time | Memory | Notes |
|---|---|---|---|
| pd.read_csv() | 45s | 2.3 GB | Standard |
| pd.read_csv() + diet() | 47s | 750 MB | Manual opt |
| dp.read_csv() | 8s | 750 MB | Best! |
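Exact numbers depend on your hardware and data. To time the readers on your own file, a minimal sketch (benchmark.csv is an illustrative path, and only wall-clock time is measured):
import time
import pandas as pd
import dietpandas as dp

def timed_read(reader, path):
    # Wall-clock time for a single read; memory is not measured here
    start = time.perf_counter()
    reader(path)
    return time.perf_counter() - start

# "benchmark.csv" is an illustrative local file
t_pd = timed_read(pd.read_csv, "benchmark.csv")
t_dp = timed_read(dp.read_csv, "benchmark.csv")
print(f"pandas: {t_pd:.1f}s, dietpandas: {t_dp:.1f}s")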
Choosing a File Format¶
| Format | Speed | Compression | Use Case |
|---|---|---|---|
| CSV | Medium | None | Human-readable, universal |
| Parquet | Fast | Excellent | Long-term storage |
| Feather | Very Fast | Good | Temporary storage |
| Excel | Slow | None | Business reports |
| JSON | Medium | None | Web APIs |
| HDF5 | Fast | Good | Scientific data |
Recommendations:

- Fast iteration: Feather
- Long-term storage: Parquet
- Sharing with non-Python users: CSV
- Large datasets: Parquet or HDF5
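For example, a common pattern for long-term storage is to convert incoming CSVs to Parquet once and read the Parquet copy afterwards (the file paths below are illustrative):
import dietpandas as dp
# One-time conversion: CSV in, optimized Parquet out (illustrative paths)
df = dp.read_csv("raw/events.csv")
dp.to_parquet_optimized(df, "store/events.parquet")
# Subsequent loads read the compressed, columnar copy
df = dp.read_parquet("store/events.parquet")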
Working with Multiple Files¶
Loading Multiple CSVs¶
import pandas as pd  # needed for pd.concat below
import dietpandas as dp
import glob
dfs = []
for filepath in glob.glob("data/*.csv"):
df = dp.read_csv(filepath)
dfs.append(df)
combined = pd.concat(dfs, ignore_index=True)
Batch Processing¶
import dietpandas as dp
def process_file(filepath):
df = dp.read_csv(filepath)
# Process
result = df.groupby('category')['sales'].sum()
return result
results = [process_file(f) for f in file_list]
Chunked Reading¶
For files too large to fit in memory:
import pandas as pd
import dietpandas as dp
# Read in chunks
for chunk in pd.read_csv("huge_file.csv", chunksize=10000):
# Optimize each chunk
chunk = dp.diet(chunk)
# Process
process(chunk)
URL and Cloud Storage¶
Diet Pandas works with URLs and cloud storage:
import dietpandas as dp
# From URL
df = dp.read_csv("https://example.com/data.csv")
# From S3 (with s3fs)
df = dp.read_csv("s3://bucket/data.csv")
# From Google Cloud Storage (with gcsfs)
df = dp.read_parquet("gs://bucket/data.parquet")
Compression¶
Reading compressed files:
import dietpandas as dp
# Automatic detection
df = dp.read_csv("data.csv.gz")
df = dp.read_csv("data.csv.bz2")
df = dp.read_csv("data.csv.zip")
# Still optimized!
Advanced Patterns¶
Pipeline Pattern¶
import dietpandas as dp
def load_and_prepare(filepath):
return (
dp.read_csv(filepath)
.dropna()
.query("age > 18")
.reset_index(drop=True)
)
df = load_and_prepare("users.csv")
Caching Pattern¶
import dietpandas as dp
import os
def load_with_cache(filepath, cache_path):
if os.path.exists(cache_path):
# Load from fast format
return dp.read_feather(cache_path)
else:
# Load and cache
df = dp.read_csv(filepath)
dp.to_feather_optimized(df, cache_path)
return df
df = load_with_cache("data.csv", "cache/data.feather")
Troubleshooting¶
Polars Engine Fails¶
If Polars engine fails, Diet Pandas automatically falls back to pandas:
df = dp.read_csv("complex_file.csv")
# ⚠️ Warning: Polars engine failed, falling back to pandas
# This is automatic - no action needed
Memory Issues¶
For very large files:
import dietpandas as dp
# Load without optimization first
df = dp.read_csv("huge.csv", optimize=False)
# Drop unnecessary columns
df = df[['col1', 'col2', 'col3']]
# Then optimize
df = dp.diet(df)
Encoding Issues¶
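If a file fails to decode with the default UTF-8 codec, pass an explicit encoding; as with the CSV examples above, pandas keyword arguments are forwarded to the reader (the file name and codecs below are illustrative):
import dietpandas as dp
# Explicit codec for non-UTF-8 files (illustrative file name)
df = dp.read_csv("legacy_export.csv", encoding="latin-1")
# Files exported from Windows tools often carry a BOM
df = dp.read_csv("legacy_export.csv", encoding="utf-8-sig")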
Next Steps¶
- Learn about Advanced Optimization
- Check Memory Reports
- See API Reference