API Reference: I/O Functions

This page documents all file input/output functions in Diet Pandas.

All read functions optimize the loaded DataFrame by default and return a standard pandas DataFrame; the write functions can likewise optimize a DataFrame before saving it.
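For example (a minimal sketch; the exact dtypes chosen depend on the data, and events.csv is a hypothetical file):

import dietpandas as dp

df = dp.read_csv("events.csv")   # hypothetical file
print(type(df))                  # <class 'pandas.core.frame.DataFrame'>: still plain pandas
print(df.dtypes)                 # downcast dtypes, e.g. int32/float32/category instead of int64/float64/object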

Read Functions

read_csv()

Read a CSV file with automatic memory optimization. Uses the Polars engine (when available) for 5-10x faster parsing.

dietpandas.io.read_csv(filepath, optimize=True, aggressive=False, categorical_threshold=0.5, verbose=False, use_polars=True, schema_path=None, save_schema=False, memory_threshold=0.7, auto_chunk=True, chunksize=100000, **kwargs)

Reads a CSV file using Polars engine (if available), then converts to optimized Pandas.

Automatically switches to chunked reading when the file is too large to fit in memory. This function is often 5-10x faster at parsing CSVs than pandas.read_csv, and the resulting DataFrame uses significantly less memory due to automatic optimization.

Parameters:

  • filepath (Union[str, Path], required): Path to CSV file
  • optimize (bool, default True): If True, apply diet optimization after reading
  • aggressive (bool, default False): If True, use aggressive optimization (float16, etc.)
  • categorical_threshold (float, default 0.5): Threshold for converting object columns to categories
  • verbose (bool, default False): If True, print memory reduction statistics
  • use_polars (bool, default True): If True and Polars is available, use it for parsing
  • schema_path (Union[str, Path, None], default None): Optional path to a schema file for consistent typing
  • save_schema (bool, default False): If True, save the schema after optimization (only with chunked reading)
  • memory_threshold (float, default 0.7): Use chunked reading if estimated memory > threshold * available memory
  • auto_chunk (bool, default True): If True, automatically use chunked reading for large files
  • chunksize (int, default 100000): Number of rows per chunk when using chunked reading
  • **kwargs: Additional arguments passed to the CSV reader

Returns:

  • DataFrame: Optimized pandas DataFrame

Examples:

>>> df = read_csv("large_dataset.csv")
Diet Complete: Memory reduced by 67.3%
>>> # Disable optimization if needed
>>> df = read_csv("data.csv", optimize=False)
>>> # Use aggressive mode for maximum compression
>>> df = read_csv("data.csv", aggressive=True)
>>> # Use saved schema for consistent typing
>>> df = read_csv("data.csv", schema_path="data.diet_schema.json")
>>> # Large files are automatically chunked
>>> df = read_csv("huge_file.csv")  # Automatically uses chunked reading
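The auto-chunking decision reduces to one comparison, shown in the source below. A sketch with hypothetical numbers (the real estimates come from the internal helpers _estimate_csv_memory_mb and _get_available_memory_mb):

estimated_mb = 1200.0    # hypothetical in-memory size estimate for the CSV
available_mb = 1500.0    # hypothetical available RAM
memory_threshold = 0.7

use_chunked = estimated_mb > available_mb * memory_threshold
print(use_chunked)       # True: 1200 > 1050, so read_csv streams the file in chunksize-row chunks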
Source code in src/dietpandas/io.py
def read_csv(
    filepath: Union[str, Path],
    optimize: bool = True,
    aggressive: bool = False,
    categorical_threshold: float = 0.5,
    verbose: bool = False,
    use_polars: bool = True,
    schema_path: Union[str, Path, None] = None,
    save_schema: bool = False,
    memory_threshold: float = 0.7,
    auto_chunk: bool = True,
    chunksize: int = 100000,
    **kwargs,
) -> pd.DataFrame:
    """
    Reads a CSV file using Polars engine (if available), then converts to optimized Pandas.

    Automatically switches to chunked reading when file is too large to fit in memory.
    This function is often 5-10x faster at parsing CSVs than pandas.read_csv, and the
    resulting DataFrame uses significantly less memory due to automatic optimization.

    Args:
        filepath: Path to CSV file
        optimize: If True, apply diet optimization after reading (default: True)
        aggressive: If True, use aggressive optimization (float16, etc.)
        categorical_threshold: Threshold for converting objects to categories
        verbose: If True, print memory reduction statistics
        use_polars: If True and Polars is available, use it for parsing (default: True)
        schema_path: Optional path to schema file for consistent typing
        save_schema: If True, save schema after optimization
            (only with chunked reading)
        memory_threshold: Use chunked reading if estimated memory >
            threshold * available (default: 0.7)
        auto_chunk: If True, automatically use chunked reading for large files (default: True)
        chunksize: Number of rows per chunk when using chunked reading (default: 100,000)
        **kwargs: Additional arguments passed to the CSV reader

    Returns:
        Optimized pandas DataFrame

    Examples:
        >>> df = read_csv("large_dataset.csv")
        Diet Complete: Memory reduced by 67.3%

        >>> # Disable optimization if needed
        >>> df = read_csv("data.csv", optimize=False)

        >>> # Use aggressive mode for maximum compression
        >>> df = read_csv("data.csv", aggressive=True)

        >>> # Use saved schema for consistent typing
        >>> df = read_csv("data.csv", schema_path="data.diet_schema.json")

        >>> # Large files are automatically chunked
        >>> df = read_csv("huge_file.csv")  # Automatically uses chunked reading
    """
    filepath = Path(filepath)

    # Check if we should use chunked reading
    use_chunked = False
    if auto_chunk:
        try:
            estimated_memory = _estimate_csv_memory_mb(filepath)
            available_memory = _get_available_memory_mb()

            if estimated_memory > (available_memory * memory_threshold):
                use_chunked = True
                if verbose:
                    print(
                        f"File size: ~{estimated_memory:.0f}MB, "
                        f"Available: {available_memory:.0f}MB - "
                        f"Using chunked reading"
                    )
        except Exception:
            # If we can't check, proceed with normal reading
            pass

    # Use chunked reading for large files
    if use_chunked:
        return _read_csv_chunked(
            filepath,
            chunksize=chunksize,
            optimize=optimize,
            aggressive=aggressive,
            categorical_threshold=categorical_threshold,
            verbose=verbose,
            schema_path=schema_path,
            save_schema=save_schema,
            **kwargs,
        )

    # Normal reading path
    filepath_str = str(filepath)

    # Try to use Polars for fast parsing
    if use_polars and POLARS_AVAILABLE:
        try:
            # Step 1: Fast Read with Polars
            # Polars is multi-threaded and much faster at parsing CSVs
            pl_df = pl.read_csv(filepath_str, **kwargs)

            # Step 2: Convert to Pandas
            pd_df = pl_df.to_pandas()

            if verbose:
                print("Loaded with Polars engine (fast mode)")

        except Exception as e:
            if verbose:
                print(f"Polars parsing failed ({e}), falling back to Pandas")
            # Fallback to standard Pandas
            pd_df = pd.read_csv(filepath_str, **kwargs)
    else:
        # Use standard Pandas
        if verbose and use_polars and not POLARS_AVAILABLE:
            print("Polars not installed, using standard Pandas reader")
        pd_df = pd.read_csv(filepath_str, **kwargs)

    # Apply schema if provided
    if schema_path:
        from .schema import apply_schema, load_schema

        if Path(schema_path).exists():
            if verbose:
                print(f"Applying schema from {schema_path}")
            schema = load_schema(schema_path)
            pd_df = apply_schema(pd_df, schema)
            return pd_df

    # Step 3: Apply the Diet immediately
    if optimize:
        result = diet(
            pd_df,
            verbose=verbose,
            aggressive=aggressive,
            categorical_threshold=categorical_threshold,
        )

        # Save schema if requested
        if save_schema and schema_path:
            from .schema import save_schema as save_schema_func

            save_schema_func(result, schema_path)
            if verbose:
                print(f"Saved schema to {schema_path}")

        return result

    return pd_df

Example:

import dietpandas as dp

# Basic usage
df = dp.read_csv("data.csv")

# Disable optimization
df = dp.read_csv("data.csv", optimize=False)

# Aggressive mode
df = dp.read_csv("data.csv", aggressive=True)
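A typical schema workflow, sketched under the assumption that save_schema also takes effect on the non-chunked path when schema_path points to a file that does not yet exist (the source above saves the schema after optimization whenever both are set):

import dietpandas as dp

# First read: optimize, then persist the inferred schema
train = dp.read_csv("train.csv", schema_path="train.diet_schema.json", save_schema=True)

# Later reads: the saved schema is applied so dtypes match the first read exactly
test = dp.read_csv("test.csv", schema_path="train.diet_schema.json")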

read_parquet()

Read a Parquet file with automatic memory optimization.

dietpandas.io.read_parquet(filepath, optimize=True, aggressive=False, categorical_threshold=0.5, verbose=False, use_polars=True, **kwargs)

Reads a Parquet file using Polars engine (if available), then converts to optimized Pandas.

Parameters:

  • filepath (Union[str, Path], required): Path to Parquet file
  • optimize (bool, default True): If True, apply diet optimization after reading
  • aggressive (bool, default False): If True, use aggressive optimization (float16, etc.)
  • categorical_threshold (float, default 0.5): Threshold for converting object columns to categories
  • verbose (bool, default False): If True, print memory reduction statistics
  • use_polars (bool, default True): If True and Polars is available, use it for parsing
  • **kwargs: Additional arguments passed to the Parquet reader

Returns:

  • DataFrame: Optimized pandas DataFrame

Source code in src/dietpandas/io.py
def read_parquet(
    filepath: Union[str, Path],
    optimize: bool = True,
    aggressive: bool = False,
    categorical_threshold: float = 0.5,
    verbose: bool = False,
    use_polars: bool = True,
    **kwargs,
) -> pd.DataFrame:
    """
    Reads a Parquet file using Polars engine (if available), then converts to optimized Pandas.

    Args:
        filepath: Path to Parquet file
        optimize: If True, apply diet optimization after reading (default: True)
        aggressive: If True, use aggressive optimization (float16, etc.)
        categorical_threshold: Threshold for converting objects to categories
        verbose: If True, print memory reduction statistics
        use_polars: If True and Polars is available, use it for parsing (default: True)
        **kwargs: Additional arguments passed to the Parquet reader

    Returns:
        Optimized pandas DataFrame
    """
    filepath = str(filepath)

    # Try to use Polars for fast parsing
    if use_polars and POLARS_AVAILABLE:
        try:
            pl_df = pl.read_parquet(filepath, **kwargs)
            pd_df = pl_df.to_pandas()

            if verbose:
                print("Loaded with Polars engine (fast mode)")

        except Exception as e:
            if verbose:
                print(f"Polars parsing failed ({e}), falling back to Pandas")
            pd_df = pd.read_parquet(filepath, **kwargs)
    else:
        if verbose and use_polars and not POLARS_AVAILABLE:
            print("Polars not installed, using standard Pandas reader")
        pd_df = pd.read_parquet(filepath, **kwargs)

    if optimize:
        return diet(
            pd_df,
            verbose=verbose,
            aggressive=aggressive,
            categorical_threshold=categorical_threshold,
        )

    return pd_df

Example:

import dietpandas as dp

df = dp.read_parquet("data.parquet")
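If you need reader options that only one engine understands, you can bypass Polars explicitly. A sketch (columns is accepted by both the pandas and Polars Parquet readers; the column names here are hypothetical):

# Read only selected columns with the plain pandas engine
df = dp.read_parquet("data.parquet", use_polars=False, columns=["user_id", "amount"])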

read_excel()

Read an Excel file with automatic memory optimization.

dietpandas.io.read_excel(filepath, optimize=True, aggressive=False, categorical_threshold=0.5, verbose=False, **kwargs)

Reads an Excel file and returns an optimized Pandas DataFrame.

Note: Polars support for Excel is limited, so this uses pandas.read_excel.

Parameters:

  • filepath (Union[str, Path], required): Path to Excel file
  • optimize (bool, default True): If True, apply diet optimization after reading
  • aggressive (bool, default False): If True, use aggressive optimization (float16, etc.)
  • categorical_threshold (float, default 0.5): Threshold for converting object columns to categories
  • verbose (bool, default False): If True, print memory reduction statistics
  • **kwargs: Additional arguments passed to pandas.read_excel

Returns:

  • DataFrame: Optimized pandas DataFrame

Source code in src/dietpandas/io.py
def read_excel(
    filepath: Union[str, Path],
    optimize: bool = True,
    aggressive: bool = False,
    categorical_threshold: float = 0.5,
    verbose: bool = False,
    **kwargs,
) -> pd.DataFrame:
    """
    Reads an Excel file and returns an optimized Pandas DataFrame.

    Note: Polars support for Excel is limited, so this uses pandas.read_excel.

    Args:
        filepath: Path to Excel file
        optimize: If True, apply diet optimization after reading (default: True)
        aggressive: If True, use aggressive optimization (float16, etc.)
        categorical_threshold: Threshold for converting objects to categories
        verbose: If True, print memory reduction statistics
        **kwargs: Additional arguments passed to pandas.read_excel

    Returns:
        Optimized pandas DataFrame
    """
    filepath = str(filepath)
    pd_df = pd.read_excel(filepath, **kwargs)

    if optimize:
        return diet(
            pd_df,
            verbose=verbose,
            aggressive=aggressive,
            categorical_threshold=categorical_threshold,
        )

    return pd_df

Example:

import dietpandas as dp

# Read specific sheet
df = dp.read_excel("data.xlsx", sheet_name="Sheet1")

# Read all sheets (pandas returns a dict of DataFrames; see the caution below)
dfs = dp.read_excel("data.xlsx", sheet_name=None)
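Caution: with sheet_name=None, pandas.read_excel returns a dict mapping sheet names to DataFrames, and it is not clear from the source above that the optimizer handles a dict. A defensive sketch, assuming diet is exported at the package level (io.py imports it internally):

sheets = dp.read_excel("data.xlsx", sheet_name=None, optimize=False)
dfs = {name: dp.diet(frame) for name, frame in sheets.items()}  # optimize each sheet yourself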

read_json()

Read a JSON file with automatic memory optimization.

dietpandas.io.read_json(filepath, optimize=True, aggressive=False, categorical_threshold=0.5, verbose=False, **kwargs)

Reads a JSON file and returns an optimized Pandas DataFrame.

Parameters:

  • filepath (Union[str, Path], required): Path to JSON file
  • optimize (bool, default True): If True, apply diet optimization after reading
  • aggressive (bool, default False): If True, use aggressive optimization (float16, etc.)
  • categorical_threshold (float, default 0.5): Threshold for converting object columns to categories
  • verbose (bool, default False): If True, print memory reduction statistics
  • **kwargs: Additional arguments passed to pandas.read_json

Returns:

  • DataFrame: Optimized pandas DataFrame

Examples:

>>> df = read_json("data.json")
🥗 Diet Complete: Memory reduced by 45.2%
Source code in src/dietpandas/io.py
def read_json(
    filepath: Union[str, Path],
    optimize: bool = True,
    aggressive: bool = False,
    categorical_threshold: float = 0.5,
    verbose: bool = False,
    **kwargs,
) -> pd.DataFrame:
    """
    Reads a JSON file and returns an optimized Pandas DataFrame.

    Args:
        filepath: Path to JSON file
        optimize: If True, apply diet optimization after reading (default: True)
        aggressive: If True, use aggressive optimization (float16, etc.)
        categorical_threshold: Threshold for converting objects to categories
        verbose: If True, print memory reduction statistics
        **kwargs: Additional arguments passed to pandas.read_json

    Returns:
        Optimized pandas DataFrame

    Examples:
        >>> df = read_json("data.json")
        🥗 Diet Complete: Memory reduced by 45.2%
    """
    filepath = str(filepath)
    pd_df = pd.read_json(filepath, **kwargs)

    if optimize:
        return diet(
            pd_df,
            verbose=verbose,
            aggressive=aggressive,
            categorical_threshold=categorical_threshold,
        )

    return pd_df

Example:

import dietpandas as dp

# Read JSON lines format
df = dp.read_json("data.jsonl", lines=True)

# Read standard JSON
df = dp.read_json("data.json")
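read_json forwards kwargs to pandas.read_json, which expects tabular JSON. For nested documents, one option is to flatten with pandas first and then optimize. A sketch, assuming diet is exported at the package level and nested.json is a hypothetical file:

import json
import pandas as pd
import dietpandas as dp

with open("nested.json") as f:
    records = json.load(f)

flat = pd.json_normalize(records)   # flatten nested fields into dotted columns
df = dp.diet(flat)                  # then apply the same memory optimization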

read_hdf()

Read an HDF5 file with automatic memory optimization.

dietpandas.io.read_hdf(filepath, key, optimize=True, aggressive=False, categorical_threshold=0.5, verbose=False, **kwargs)

Reads an HDF5 file and returns an optimized Pandas DataFrame.

Parameters:

  • filepath (Union[str, Path], required): Path to HDF5 file
  • key (str, required): Group identifier in the HDF5 file
  • optimize (bool, default True): If True, apply diet optimization after reading
  • aggressive (bool, default False): If True, use aggressive optimization (float16, etc.)
  • categorical_threshold (float, default 0.5): Threshold for converting object columns to categories
  • verbose (bool, default False): If True, print memory reduction statistics
  • **kwargs: Additional arguments passed to pandas.read_hdf

Returns:

  • DataFrame: Optimized pandas DataFrame

Examples:

>>> df = read_hdf("data.h5", key="dataset1")
🥗 Diet Complete: Memory reduced by 52.1%
Source code in src/dietpandas/io.py
def read_hdf(
    filepath: Union[str, Path],
    key: str,
    optimize: bool = True,
    aggressive: bool = False,
    categorical_threshold: float = 0.5,
    verbose: bool = False,
    **kwargs,
) -> pd.DataFrame:
    """
    Reads an HDF5 file and returns an optimized Pandas DataFrame.

    Args:
        filepath: Path to HDF5 file
        key: Group identifier in the HDF5 file
        optimize: If True, apply diet optimization after reading (default: True)
        aggressive: If True, use aggressive optimization (float16, etc.)
        categorical_threshold: Threshold for converting objects to categories
        verbose: If True, print memory reduction statistics
        **kwargs: Additional arguments passed to pandas.read_hdf

    Returns:
        Optimized pandas DataFrame

    Examples:
        >>> df = read_hdf("data.h5", key="dataset1")
        🥗 Diet Complete: Memory reduced by 52.1%
    """
    filepath = str(filepath)
    pd_df = pd.read_hdf(filepath, key=key, **kwargs)

    if optimize:
        return diet(
            pd_df,
            verbose=verbose,
            aggressive=aggressive,
            categorical_threshold=categorical_threshold,
        )

    return pd_df

Example:

import dietpandas as dp

df = dp.read_hdf("data.h5", key="dataset1")

Note: Requires the optional tables dependency:

pip install "diet-pandas[hdf]"
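If you are unsure which key to pass, plain pandas can list the groups in a store first:

import pandas as pd
import dietpandas as dp

with pd.HDFStore("data.h5", mode="r") as store:
    print(store.keys())   # e.g. ['/dataset1']

df = dp.read_hdf("data.h5", key="dataset1")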


read_feather()

Read a Feather file with automatic memory optimization.

dietpandas.io.read_feather(filepath, optimize=True, aggressive=False, categorical_threshold=0.5, verbose=False, **kwargs)

Reads a Feather file and returns an optimized Pandas DataFrame.

Feather is a fast, lightweight columnar data format.

Parameters:

  • filepath (Union[str, Path], required): Path to Feather file
  • optimize (bool, default True): If True, apply diet optimization after reading
  • aggressive (bool, default False): If True, use aggressive optimization (float16, etc.)
  • categorical_threshold (float, default 0.5): Threshold for converting object columns to categories
  • verbose (bool, default False): If True, print memory reduction statistics
  • **kwargs: Additional arguments passed to pandas.read_feather

Returns:

  • DataFrame: Optimized pandas DataFrame

Examples:

>>> df = read_feather("data.feather")
🥗 Diet Complete: Memory reduced by 38.7%
Source code in src/dietpandas/io.py
def read_feather(
    filepath: Union[str, Path],
    optimize: bool = True,
    aggressive: bool = False,
    categorical_threshold: float = 0.5,
    verbose: bool = False,
    **kwargs,
) -> pd.DataFrame:
    """
    Reads a Feather file and returns an optimized Pandas DataFrame.

    Feather is a fast, lightweight columnar data format.

    Args:
        filepath: Path to Feather file
        optimize: If True, apply diet optimization after reading (default: True)
        aggressive: If True, use aggressive optimization (float16, etc.)
        categorical_threshold: Threshold for converting objects to categories
        verbose: If True, print memory reduction statistics
        **kwargs: Additional arguments passed to pandas.read_feather

    Returns:
        Optimized pandas DataFrame

    Examples:
        >>> df = read_feather("data.feather")
        🥗 Diet Complete: Memory reduced by 38.7%
    """
    filepath = str(filepath)
    pd_df = pd.read_feather(filepath, **kwargs)

    if optimize:
        return diet(
            pd_df,
            verbose=verbose,
            aggressive=aggressive,
            categorical_threshold=categorical_threshold,
        )

    return pd_df

Example:

import dietpandas as dp

df = dp.read_feather("data.feather")
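Feather pairs naturally with to_feather_optimized() (documented below) for fast local caching. A sketch; whether every optimized dtype survives the round trip depends on the pyarrow version, though categoricals and downcast numerics normally do:

import dietpandas as dp
import pandas as pd

df = pd.DataFrame({"id": range(1_000_000), "tag": ["a", "b"] * 500_000})
dp.to_feather_optimized(df, "cached.feather")    # shrink dtypes, then write
df2 = dp.read_feather("cached.feather")          # reload with dtypes intact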

Write Functions

to_csv_optimized()

Write a DataFrame to CSV with memory optimization.

dietpandas.io.to_csv_optimized(df, filepath, optimize_before_save=True, **kwargs)

Saves a DataFrame to CSV, optionally optimizing it first.

Parameters:

  • df (DataFrame, required): DataFrame to save
  • filepath (Union[str, Path], required): Path where the CSV will be saved
  • optimize_before_save (bool, default True): If True, optimize the DataFrame before saving
  • **kwargs: Additional arguments passed to pandas.to_csv
Source code in src/dietpandas/io.py
def to_csv_optimized(
    df: pd.DataFrame, filepath: Union[str, Path], optimize_before_save: bool = True, **kwargs
) -> None:
    """
    Saves a DataFrame to CSV, optionally optimizing it first.

    Args:
        df: DataFrame to save
        filepath: Path where CSV will be saved
        optimize_before_save: If True, optimize the DataFrame before saving
        **kwargs: Additional arguments passed to pandas.to_csv
    """
    if optimize_before_save:
        df = diet(df, verbose=False)

    df.to_csv(filepath, **kwargs)

Example:

import dietpandas as dp
import pandas as pd

df = pd.DataFrame({'col': range(1000)})
dp.to_csv_optimized(df, "output.csv")
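Continuing the example above: **kwargs are forwarded to pandas.to_csv, so the usual writer options apply. Note that CSV stores no dtype information, so the optimization affects the in-memory frame, not what a later plain read_csv infers:

dp.to_csv_optimized(df, "output.csv", index=False)   # index=False is passed through to pandas.to_csv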

to_parquet_optimized()

Write a DataFrame to Parquet with memory optimization.

dietpandas.io.to_parquet_optimized(df, filepath, optimize_before_save=True, **kwargs)

Saves a DataFrame to Parquet format, optionally optimizing it first.

Parameters:

  • df (DataFrame, required): DataFrame to save
  • filepath (Union[str, Path], required): Path where the Parquet file will be saved
  • optimize_before_save (bool, default True): If True, optimize the DataFrame before saving
  • **kwargs: Additional arguments passed to pandas.to_parquet
Source code in src/dietpandas/io.py
def to_parquet_optimized(
    df: pd.DataFrame, filepath: Union[str, Path], optimize_before_save: bool = True, **kwargs
) -> None:
    """
    Saves a DataFrame to Parquet format, optionally optimizing it first.

    Args:
        df: DataFrame to save
        filepath: Path where Parquet file will be saved
        optimize_before_save: If True, optimize the DataFrame before saving
        **kwargs: Additional arguments passed to pandas.to_parquet
    """
    if optimize_before_save:
        df = diet(df, verbose=False)

    df.to_parquet(filepath, **kwargs)

Example:

import dietpandas as dp
import pandas as pd

df = pd.DataFrame({'col': range(1000)})
dp.to_parquet_optimized(df, "output.parquet")
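Continuing the example above: because kwargs reach pandas.to_parquet, optimization can be combined with on-disk compression (zstd assumes a pyarrow build that supports it):

dp.to_parquet_optimized(df, "output.parquet", compression="zstd")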

to_feather_optimized()

Write a DataFrame to Feather format with memory optimization.

dietpandas.io.to_feather_optimized(df, filepath, optimize_before_save=True, **kwargs)

Saves a DataFrame to Feather format, optionally optimizing it first.

Parameters:

  • df (DataFrame, required): DataFrame to save
  • filepath (Union[str, Path], required): Path where the Feather file will be saved
  • optimize_before_save (bool, default True): If True, optimize the DataFrame before saving
  • **kwargs: Additional arguments passed to pandas.to_feather
Source code in src/dietpandas/io.py
def to_feather_optimized(
    df: pd.DataFrame, filepath: Union[str, Path], optimize_before_save: bool = True, **kwargs
) -> None:
    """
    Saves a DataFrame to Feather format, optionally optimizing it first.

    Args:
        df: DataFrame to save
        filepath: Path where Feather file will be saved
        optimize_before_save: If True, optimize the DataFrame before saving
        **kwargs: Additional arguments passed to pandas.to_feather
    """
    if optimize_before_save:
        df = diet(df, verbose=False)

    df.to_feather(filepath, **kwargs)

Example:

import dietpandas as dp
import pandas as pd

df = pd.DataFrame({'col': range(1000)})
dp.to_feather_optimized(df, "output.feather")
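Continuing the example above: pandas.to_feather forwards options to pyarrow's Feather writer, so compression can be requested at write time (again assuming pyarrow support):

dp.to_feather_optimized(df, "output.feather", compression="zstd")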

Supported File Formats

Format    Read Function     Write Function           Optional Dependency
CSV       read_csv()        to_csv_optimized()       None (built-in)
Parquet   read_parquet()    to_parquet_optimized()   pyarrow
Excel     read_excel()      N/A                      openpyxl
JSON      read_json()       N/A                      None (built-in)
HDF5      read_hdf()        N/A                      tables
Feather   read_feather()    to_feather_optimized()   pyarrow

Performance Comparison

CSV Reading Performance

import time
import pandas as pd
import dietpandas as dp

# Standard pandas
start = time.time()
df_pandas = pd.read_csv("large_file.csv")
pandas_time = time.time() - start

# Diet pandas
start = time.time()
df_diet = dp.read_csv("large_file.csv")
diet_time = time.time() - start

print(f"Pandas: {pandas_time:.2f}s, Memory: {df_pandas.memory_usage().sum() / 1e6:.1f} MB")
print(f"Diet:   {diet_time:.2f}s, Memory: {df_diet.memory_usage().sum() / 1e6:.1f} MB")
# Pandas: 45.2s, Memory: 2300.0 MB
# Diet:   8.7s, Memory: 750.0 MB

Common Parameters

Most read functions support these common parameters:

  • optimize (bool, default=True): Whether to optimize memory usage
  • aggressive (bool, default=False): Use aggressive optimization mode
  • categorical_threshold (float, default=0.5): Threshold for converting object columns to categories
  • verbose (bool, default=False): Print memory reduction statistics
  • **kwargs: Additional parameters passed to the underlying pandas function

See Also