API Reference: Core Functions

This page documents all core optimization functions in Diet Pandas.

Main Functions

diet()

Optimize a pandas DataFrame by downcasting data types to reduce memory usage.

NEW in v0.5.0: supports parallel processing via the parallel and max_workers parameters for a 2-4x speedup.

dietpandas.core.diet(df, verbose=True, aggressive=False, categorical_threshold=0.5, sparse_threshold=0.9, optimize_datetimes=True, optimize_sparse_cols=False, optimize_bools=True, float_to_int=True, inplace=False, skip_columns=None, force_categorical=None, force_aggressive=None, warn_on_issues=False, parallel=True, max_workers=None)

Main function to optimize DataFrame memory usage.

This function iterates over all columns and applies appropriate optimizations:

- Booleans: Convert integer/object columns with boolean values to bool dtype
- Integers: Downcast to smallest safe type (int8, int16, uint8, etc.)
- Floats: Convert to integers if they contain only whole numbers, else float32 (or float16 in aggressive mode)
- Objects: Convert to category if cardinality is low
- DateTime: Optimize datetime representations
- Sparse: Convert to sparse arrays for columns with many repeated values

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | Input DataFrame to optimize | required |
| verbose | bool | If True, print memory reduction statistics | True |
| aggressive | bool | If True, use more aggressive optimization (may lose precision) | False |
| categorical_threshold | float | Threshold for converting objects to categories | 0.5 |
| sparse_threshold | float | Threshold for converting to sparse format | 0.9 |
| optimize_datetimes | bool | If True, optimize datetime columns | True |
| optimize_sparse_cols | bool | If True, check for sparse optimization opportunities | False |
| optimize_bools | bool | If True, convert boolean-like columns to bool dtype | True |
| float_to_int | bool | If True, convert float columns to integers when they contain only whole numbers | True |
| inplace | bool | If True, modify the DataFrame in place | False |
| skip_columns | list | Column names to skip during optimization | None |
| force_categorical | list | Column names to force categorical conversion | None |
| force_aggressive | list | Column names to force aggressive optimization | None |
| warn_on_issues | bool | If True, emit warnings for potential issues | False |
| parallel | bool | If True, use parallel processing for column optimization | True |
| max_workers | int | Maximum number of worker threads for parallel processing (None uses the number of CPU cores) | None |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | Optimized DataFrame with reduced memory usage |

Examples:

>>> df = pd.DataFrame({'year': [2020, 2021, 2022], 'val': [1.1, 2.2, 3.3]})
>>> optimized = diet(df)
🥗 Diet Complete: Memory reduced by 62.5%
   0.00MB -> 0.00MB
>>> # Skip specific columns
>>> df = diet(df, skip_columns=['id', 'uuid'])
>>> # Force categorical conversion on high-cardinality column
>>> df = diet(df, force_categorical=['country_code'])
>>> # Use aggressive mode only for specific columns
>>> df = diet(df, force_aggressive=['approximation_field'])
>>> # Enable warnings for potential issues
>>> df = diet(df, warn_on_issues=True)
Source code in src/dietpandas/core.py
def diet(
    df: pd.DataFrame,
    verbose: bool = True,
    aggressive: bool = False,
    categorical_threshold: float = 0.5,
    sparse_threshold: float = 0.9,
    optimize_datetimes: bool = True,
    optimize_sparse_cols: bool = False,
    optimize_bools: bool = True,
    float_to_int: bool = True,
    inplace: bool = False,
    skip_columns: list = None,
    force_categorical: list = None,
    force_aggressive: list = None,
    warn_on_issues: bool = False,
    parallel: bool = True,
    max_workers: int = None,
) -> pd.DataFrame:
    """
    Main function to optimize DataFrame memory usage.

    This function iterates over all columns and applies appropriate optimizations:
    - Booleans: Convert integer/object columns with boolean values to bool dtype
    - Integers: Downcast to smallest safe type (int8, int16, uint8, etc.)
    - Floats: Convert to integers if they contain only whole numbers, else
      float32 (or float16 in aggressive mode)
    - Objects: Convert to category if cardinality is low
    - DateTime: Optimize datetime representations
    - Sparse: Convert to sparse arrays for columns with many repeated values

    Args:
        df: Input DataFrame to optimize
        verbose: If True, print memory reduction statistics
        aggressive: If True, use more aggressive optimization (may lose precision)
        categorical_threshold: Threshold for converting objects to categories
        sparse_threshold: Threshold for converting to sparse format (default: 0.9)
        optimize_datetimes: If True, optimize datetime columns (default: True)
        optimize_sparse_cols: If True, check for sparse optimization
            opportunities (default: False)
        optimize_bools: If True, convert boolean-like columns to bool dtype
            (default: True)
        float_to_int: If True, convert float columns to integers when they
            contain only whole numbers (default: True)
        inplace: If True, modify the DataFrame in place (default: False)
        skip_columns: List of column names to skip optimization (default: None)
        force_categorical: List of column names to force categorical conversion (default: None)
        force_aggressive: List of column names to force aggressive optimization (default: None)
        warn_on_issues: If True, emit warnings for potential issues (default: False)
        parallel: If True, use parallel processing for column optimization (default: True)
        max_workers: Maximum number of worker threads for parallel processing
            (default: None, uses number of CPU cores)

    Returns:
        Optimized DataFrame with reduced memory usage

    Examples:
        >>> df = pd.DataFrame({'year': [2020, 2021, 2022], 'val': [1.1, 2.2, 3.3]})
        >>> optimized = diet(df)
        🥗 Diet Complete: Memory reduced by 62.5%
           0.00MB -> 0.00MB

        >>> # Skip specific columns
        >>> df = diet(df, skip_columns=['id', 'uuid'])

        >>> # Force categorical conversion on high-cardinality column
        >>> df = diet(df, force_categorical=['country_code'])

        >>> # Use aggressive mode only for specific columns
        >>> df = diet(df, force_aggressive=['approximation_field'])

        >>> # Enable warnings for potential issues
        >>> df = diet(df, warn_on_issues=True)
    """
    if not inplace:
        df = df.copy()

    # Initialize lists if None
    skip_columns = skip_columns or []
    force_categorical = force_categorical or []
    force_aggressive = force_aggressive or []

    start_mem = df.memory_usage(deep=True).sum()

    # Parallel processing for column optimization
    if parallel and len(df.columns) > 1:
        # Create partial function with fixed parameters
        optimize_func = partial(
            _optimize_single_column,
            aggressive=aggressive,
            categorical_threshold=categorical_threshold,
            sparse_threshold=sparse_threshold,
            optimize_datetimes=optimize_datetimes,
            optimize_sparse_cols=optimize_sparse_cols,
            optimize_bools=optimize_bools,
            float_to_int=float_to_int,
            skip_columns=skip_columns,
            force_categorical=force_categorical,
            force_aggressive=force_aggressive,
            warn_on_issues=warn_on_issues,
        )

        # Process columns in parallel
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # Submit all column optimization tasks
            future_to_col = {
                executor.submit(optimize_func, col, df[col].copy()): col for col in df.columns
            }

            # Collect results as they complete
            for future in as_completed(future_to_col):
                col, optimized_series, warning = future.result()
                df[col] = optimized_series

                # Emit warnings if any
                if warning:
                    warnings.warn(warning[0], warning[1], stacklevel=2)
    else:
        # Sequential processing (fallback or for single column)
        for col in df.columns:
            col, optimized_series, warning = _optimize_single_column(
                col,
                df[col],
                aggressive=aggressive,
                categorical_threshold=categorical_threshold,
                sparse_threshold=sparse_threshold,
                optimize_datetimes=optimize_datetimes,
                optimize_sparse_cols=optimize_sparse_cols,
                optimize_bools=optimize_bools,
                float_to_int=float_to_int,
                skip_columns=skip_columns,
                force_categorical=force_categorical,
                force_aggressive=force_aggressive,
                warn_on_issues=warn_on_issues,
            )
            df[col] = optimized_series

            # Emit warnings if any
            if warning:
                warnings.warn(warning[0], warning[1], stacklevel=2)

    end_mem = df.memory_usage(deep=True).sum()

    if verbose:
        reduction = 100 * (start_mem - end_mem) / start_mem if start_mem > 0 else 0
        print(f"🥗 Diet Complete: Memory reduced by {reduction:.1f}%")
        print(f"   {start_mem/1e6:.2f}MB -> {end_mem/1e6:.2f}MB")

    return df

Example:

import dietpandas as dp
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3]})

# Standard optimization
df_optimized = dp.diet(df)

# Parallel processing (default, 2-4x faster)
df_optimized = dp.diet(df, parallel=True)

# Control number of threads
df_optimized = dp.diet(df, parallel=True, max_workers=4)

# Sequential processing
df_optimized = dp.diet(df, parallel=False)

optimize_int()

Optimize integer columns to the smallest safe integer type.

dietpandas.core.optimize_int(series)

Downcasts integer series to the smallest possible safe type.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| series | Series | A pandas Series with integer dtype | required |

Returns:

| Type | Description |
|------|-------------|
| Series | Optimized Series with smallest safe integer type |

Examples:

>>> s = pd.Series([1, 2, 3], dtype='int64')
>>> optimized = optimize_int(s)
>>> optimized.dtype
dtype('uint8')
Source code in src/dietpandas/core.py
def optimize_int(series: pd.Series) -> pd.Series:
    """
    Downcasts integer series to the smallest possible safe type.

    Args:
        series: A pandas Series with integer dtype

    Returns:
        Optimized Series with smallest safe integer type

    Examples:
        >>> s = pd.Series([1, 2, 3], dtype='int64')
        >>> optimized = optimize_int(s)
        >>> optimized.dtype
        dtype('uint8')
    """
    # Early exit: already optimal
    if series.dtype in [np.uint8, np.int8]:
        return series

    c_min, c_max = series.min(), series.max()

    # Check if unsigned is possible (positive numbers only)
    if c_min >= 0:
        if c_max <= np.iinfo(np.uint8).max:
            return series.astype(np.uint8)
        if c_max <= np.iinfo(np.uint16).max:
            return series.astype(np.uint16)
        if c_max <= np.iinfo(np.uint32).max:
            return series.astype(np.uint32)
        return series.astype(np.uint64)
    # Otherwise use signed integers
    else:
        if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
            return series.astype(np.int8)
        if c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
            return series.astype(np.int16)
        if c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
            return series.astype(np.int32)

    return series

Example:

import pandas as pd
from dietpandas import optimize_int

s = pd.Series([1, 2, 3, 4, 5])  # int64
s_optimized = optimize_int(s)    # uint8

optimize_float()

Optimize float columns to smaller precision when safe.

dietpandas.core.optimize_float(series, aggressive=False, float_to_int=True)

Downcasts float series to float32, float16, or integer types when possible.

First checks whether the float values are actually integers (no decimal part). If so, converts to the appropriate integer type; otherwise, downcasts to float32 (or float16 in aggressive mode).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| series | Series | A pandas Series with float dtype | required |
| aggressive | bool | If True, use float16 for maximum compression (may lose precision) | False |
| float_to_int | bool | If True, convert floats to integers when they have no decimal part | True |

Returns:

| Type | Description |
|------|-------------|
| Series | Optimized Series with smaller float type or integer type |

Examples:

>>> s = pd.Series([1.0, 2.0, 3.0, 4.0])
>>> optimized = optimize_float(s, float_to_int=True)
>>> optimized.dtype
dtype('int8')
>>> s = pd.Series([1.5, 2.5, 3.5])
>>> optimized = optimize_float(s, float_to_int=True)
>>> optimized.dtype
dtype('float32')
Source code in src/dietpandas/core.py
def optimize_float(
    series: pd.Series, aggressive: bool = False, float_to_int: bool = True
) -> pd.Series:
    """
    Downcasts float series to float32, float16, or integer types when possible.

    First checks if the float values are actually integers (no decimal part).
    If so, converts to the appropriate integer type. Otherwise, downcasts to
    float32 or float16 (if aggressive mode).

    Args:
        series: A pandas Series with float dtype
        aggressive: If True, use float16 for maximum compression (may lose precision)
        float_to_int: If True, convert floats to integers when they have no decimal part

    Returns:
        Optimized Series with smaller float type or integer type

    Examples:
        >>> s = pd.Series([1.0, 2.0, 3.0, 4.0])
        >>> optimized = optimize_float(s, float_to_int=True)
        >>> optimized.dtype
        dtype('int8')

        >>> s = pd.Series([1.5, 2.5, 3.5])
        >>> optimized = optimize_float(s, float_to_int=True)
        >>> optimized.dtype
        dtype('float32')
    """
    # Check if all values are integers (no decimal part)
    if float_to_int:
        # Check if series contains only integer values (excluding NaN)
        is_integer = series.dropna().apply(lambda x: x == int(x)).all()

        if is_integer:
            has_na = series.isna().any()

            if has_na:
                # Use nullable integer type to preserve NaN
                # First convert to Int64, then optimize to smaller nullable int
                int_series = series.astype("Int64")

                # Manually optimize nullable integers
                c_min, c_max = int_series.min(), int_series.max()

                if c_min >= 0:
                    if c_max <= np.iinfo(np.uint8).max:
                        return int_series.astype("UInt8")
                    if c_max <= np.iinfo(np.uint16).max:
                        return int_series.astype("UInt16")
                    if c_max <= np.iinfo(np.uint32).max:
                        return int_series.astype("UInt32")
                    return int_series.astype("UInt64")
                else:
                    if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                        return int_series.astype("Int8")
                    if c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                        return int_series.astype("Int16")
                    if c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                        return int_series.astype("Int32")
                    return int_series.astype("Int64")
            else:
                # No NaN, use regular integer type
                int_series = series.astype(np.int64)
                # Now optimize the integer series
                return optimize_int(int_series)

    # If not convertible to int, optimize as float
    if aggressive:
        return series if series.dtype == np.float16 else series.astype(np.float16)
    else:
        return series if series.dtype == np.float32 else series.astype(np.float32)

Example:

import pandas as pd
from dietpandas import optimize_float

s = pd.Series([1.1, 2.2, 3.3])  # float64
s_optimized = optimize_float(s)  # float32

optimize_obj()

Optimize object columns by converting low-cardinality strings to category type.

dietpandas.core.optimize_obj(series, categorical_threshold=0.5)

Converts object columns to categories if unique ratio is low.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| series | Series | A pandas Series with object dtype | required |
| categorical_threshold | float | If unique_ratio < threshold, convert to category | 0.5 |

Returns:

| Type | Description |
|------|-------------|
| Series | Optimized Series (categorical if beneficial, otherwise unchanged) |

Examples:

>>> s = pd.Series(['A', 'B', 'A', 'B', 'A', 'B'])
>>> optimized = optimize_obj(s)
>>> optimized.dtype.name
'category'
Source code in src/dietpandas/core.py
def optimize_obj(series: pd.Series, categorical_threshold: float = 0.5) -> pd.Series:
    """
    Converts object columns to categories if unique ratio is low.

    Args:
        series: A pandas Series with object dtype
        categorical_threshold: If unique_ratio < threshold, convert to category

    Returns:
        Optimized Series (categorical if beneficial, otherwise unchanged)

    Examples:
        >>> s = pd.Series(['A', 'B', 'A', 'B', 'A', 'B'])
        >>> optimized = optimize_obj(s)
        >>> optimized.dtype.name
        'category'
    """
    num_unique = series.nunique()
    num_total = len(series)

    # Avoid division by zero
    if num_total == 0:
        return series

    unique_ratio = num_unique / num_total

    if unique_ratio < categorical_threshold:
        return series.astype("category")

    return series

Example:

import pandas as pd
from dietpandas import optimize_obj

s = pd.Series(['A', 'B', 'A', 'B', 'C'] * 100)  # object
s_optimized = optimize_obj(s)                     # category

optimize_datetime()

Optimize datetime columns for better memory efficiency.

dietpandas.core.optimize_datetime(series)

Optimizes datetime columns by converting to more efficient datetime64 types.

For datetime columns, attempts to use more memory-efficient representations:

- If all datetimes are dates (no time component), suggests conversion
- Removes unnecessary precision (e.g., nanosecond to microsecond)

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| series | Series | A pandas Series with datetime64 dtype | required |

Returns:

| Type | Description |
|------|-------------|
| Series | Optimized Series with more efficient datetime representation |

Examples:

>>> dates = pd.Series(pd.date_range('2020-01-01', periods=100))
>>> optimized = optimize_datetime(dates)
Source code in src/dietpandas/core.py
def optimize_datetime(series: pd.Series) -> pd.Series:
    """
    Optimizes datetime columns by converting to more efficient datetime64 types.

    For datetime columns, attempts to use more memory-efficient
    representations:
    - If all datetimes are dates (no time component), suggests conversion
    - Removes unnecessary precision (e.g., nanosecond to microsecond)

    Args:
        series: A pandas Series with datetime64 dtype

    Returns:
        Optimized Series with more efficient datetime representation

    Examples:
        >>> dates = pd.Series(pd.date_range('2020-01-01', periods=100))
        >>> optimized = optimize_datetime(dates)
    """
    # If the series is already datetime64[ns], check if we can downcast
    if pd.api.types.is_datetime64_any_dtype(series):
        # Remove timezone info for memory efficiency if present
        if series.dt.tz is not None:
            # Keep timezone but note that tz-naive uses less memory
            pass

        # Pandas datetime64[ns] is already quite efficient
        # The main optimization is ensuring it's in the right format
        return series

    # Try to convert to datetime if it's an object
    if series.dtype == "object":
        try:
            return pd.to_datetime(series, errors="coerce")
        except Exception:
            return series

    return series

Example:

import pandas as pd
from dietpandas import optimize_datetime

# Object column with datetime strings
s = pd.Series(['2020-01-01', '2020-02-01', '2020-03-01'])
s_optimized = optimize_datetime(s)  # datetime64[ns]

optimize_sparse()

Convert columns with many repeated values to sparse format.

dietpandas.core.optimize_sparse(series, sparse_threshold=0.9)

Converts series to sparse format if it has many repeated values (especially zeros/NaNs).

Sparse arrays are highly memory-efficient when a series contains mostly one value. Common for binary features, indicator variables, or data with many missing values.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| series | Series | A pandas Series | required |
| sparse_threshold | float | If the most common value appears in at least this fraction of rows, use sparse | 0.9 |

Returns:

| Type | Description |
|------|-------------|
| Series | Optimized Series (sparse if beneficial, otherwise unchanged) |

Examples:

>>> s = pd.Series([0, 0, 1, 0, 0, 0, 2, 0, 0, 0])
>>> optimized = optimize_sparse(s)
>>> isinstance(optimized.dtype, pd.SparseDtype)
True
Source code in src/dietpandas/core.py
def optimize_sparse(series: pd.Series, sparse_threshold: float = 0.9) -> pd.Series:
    """
    Converts series to sparse format if it has many repeated values (especially zeros/NaNs).

    Sparse arrays are highly memory-efficient when a series contains mostly one value.
    Common for binary features, indicator variables, or data with many missing values.

    Args:
        series: A pandas Series
        sparse_threshold: If most common value appears >= threshold% of time, use sparse

    Returns:
        Optimized Series (sparse if beneficial, otherwise unchanged)

    Examples:
        >>> s = pd.Series([0, 0, 1, 0, 0, 0, 2, 0, 0, 0])
        >>> optimized = optimize_sparse(s)
        >>> isinstance(optimized.dtype, pd.SparseDtype)
        True
    """
    if len(series) == 0:
        return series

    # Check if already sparse
    if isinstance(series.dtype, pd.SparseDtype):
        return series

    # Calculate the most common value's frequency
    value_counts = series.value_counts(dropna=False)
    if len(value_counts) == 0:
        return series

    most_common_freq = value_counts.iloc[0] / len(series)

    # If one value dominates, convert to sparse
    if most_common_freq >= sparse_threshold:
        try:
            fill_value = value_counts.index[0]
            return series.astype(pd.SparseDtype(series.dtype, fill_value=fill_value))
        except Exception:
            # If conversion fails, return original
            return series

    return series

Example:

import pandas as pd
from dietpandas import optimize_sparse

# Column with 95% zeros
s = pd.Series([0] * 950 + [1] * 50)
s_optimized = optimize_sparse(s)  # Sparse[int8, 0]

get_memory_report()

Generate a detailed memory usage report for a DataFrame.

dietpandas.core.get_memory_report(df)

Generate a detailed memory usage report for each column.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | Input DataFrame | required |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | DataFrame with memory statistics per column |

Examples:

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
>>> report = get_memory_report(df)
>>> print(report)
Source code in src/dietpandas/core.py
def get_memory_report(df: pd.DataFrame) -> pd.DataFrame:
    """
    Generate a detailed memory usage report for each column.

    Args:
        df: Input DataFrame

    Returns:
        DataFrame with memory statistics per column

    Examples:
        >>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
        >>> report = get_memory_report(df)
        >>> print(report)
    """
    mem_usage = df.memory_usage(deep=True)

    report = pd.DataFrame(
        {
            "column": mem_usage.index,
            "dtype": [df[col].dtype if col != "Index" else "Index" for col in mem_usage.index],
            "memory_bytes": mem_usage.values,
            "memory_mb": mem_usage.values / 1e6,
        }
    )

    report["percent_of_total"] = 100 * report["memory_bytes"] / report["memory_bytes"].sum()
    report = report.sort_values("memory_bytes", ascending=False).reset_index(drop=True)

    return report

Example:

import dietpandas as dp
import pandas as pd

df = pd.DataFrame({
    'a': range(1000),
    'b': ['text'] * 1000
})

report = dp.get_memory_report(df)
print(report)
#   column    dtype  memory_bytes  memory_mb  percent_of_total
# 0      b   object         59000      0.059              88.1
# 1      a    int64          8000      0.008              11.9

Type Optimization Rules

Integer Optimization

| Value Range | Optimized Type | Bytes Saved per Value |
|-------------|----------------|-----------------------|
| 0 to 255 | uint8 | 7 bytes (from int64) |
| 0 to 65,535 | uint16 | 6 bytes |
| -128 to 127 | int8 | 7 bytes |
| -32,768 to 32,767 | int16 | 6 bytes |
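The savings in the table are easy to verify with plain pandas/NumPy (no dietpandas required). One million int64 values that all fit in 0-255 shrink from 8 bytes to 1 byte each when downcast to uint8:

```python
import numpy as np
import pandas as pd

# One million values, all within the uint8 range 0-255, stored as int64
s = pd.Series(np.random.randint(0, 256, size=1_000_000), dtype="int64")

before = s.memory_usage(index=False)                  # 8 bytes per value
after = s.astype("uint8").memory_usage(index=False)   # 1 byte per value

print(f"{before / 1e6:.1f}MB -> {after / 1e6:.1f}MB")  # 8.0MB -> 1.0MB
```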

Float Optimization

| Mode | Conversion | Precision | Use Case |
|------|------------|-----------|----------|
| Safe | float64 → float32 | ~7 decimal digits | Most ML tasks |
| Aggressive | float64 → float16 | ~3 decimal digits | Extreme compression |
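A quick NumPy illustration of the precision trade-off. Note that float16 also has a narrow range (max finite value ~65504), so aggressive mode can turn large magnitudes into inf:

```python
import numpy as np

x = 3.14159265358979

# float32 keeps about 7 significant decimal digits
print(float(np.float32(x)))

# float16 keeps only about 3 significant decimal digits
print(float(np.float16(x)))  # 3.140625

# float16 range is narrow: values above ~65504 overflow to infinity
print(np.isinf(np.float16(100000.0)))  # True
```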

Object Optimization

| Condition | Optimization | Memory Savings |
|-----------|--------------|----------------|
| Unique ratio < 50% | object → category | 50-90% typical |
| All datetime strings | object → datetime64 | 50-70% typical |
| High cardinality | No change | Keep as object |
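The category savings can be checked directly with pandas: a low-cardinality object column stores one Python string object per row, while a categorical column stores only small integer codes plus one copy of each unique value:

```python
import pandas as pd

# 100,000 strings drawn from only 4 unique values (unique ratio = 0.00004)
s = pd.Series(["US", "GB", "DE", "FR"] * 25_000)

obj_mem = s.memory_usage(deep=True, index=False)
cat_mem = s.astype("category").memory_usage(deep=True, index=False)

print(f"object: {obj_mem / 1e6:.2f}MB, category: {cat_mem / 1e6:.2f}MB")
```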

Sparse Optimization

Condition Optimization Memory Savings
>90% repeated values Dense → Sparse 90-99% typical
Binary features (0/1) Dense → Sparse[int8] ~96% typical
Low sparsity No change Keep as dense
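Sparse savings are also easy to confirm with plain pandas: a sparse array stores only the values that differ from the fill value (plus their positions), so a mostly-zero column shrinks dramatically. Exact savings depend on the dtype and the index overhead for the non-fill values:

```python
import pandas as pd

# Binary indicator column: 95% zeros
s = pd.Series([0] * 9_500 + [1] * 500, dtype="int8")

dense_mem = s.memory_usage(index=False)

# Store only the 500 non-zero entries; zeros become implicit
sparse = s.astype(pd.SparseDtype("int8", fill_value=0))
sparse_mem = sparse.memory_usage(index=False)

print(f"dense: {dense_mem} bytes, sparse: {sparse_mem} bytes")
```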

See Also