Basic Usage¶
Learn how to use Diet Pandas effectively for everyday data science tasks.
The diet() Function¶
The core function of Diet Pandas is diet(), which optimizes a DataFrame's memory usage.
Default Behavior (Safe Mode)¶
By default, diet() runs in safe mode, which preserves your data for most use cases:
import pandas as pd
import dietpandas as dp

df = pd.DataFrame({
    'id': range(1000),                           # int64
    'age': [25, 30, 35] * 333 + [25],            # int64
    'score': [95.5, 87.3, 92.1] * 333 + [95.5],  # float64
    'city': ['NYC', 'LA', 'SF'] * 333 + ['NYC']  # object
})

print("Before:")
print(df.memory_usage(deep=True))
# Index      132 bytes
# id        8000 bytes
# age       8000 bytes
# score     8000 bytes
# city     60000 bytes

df_optimized = dp.diet(df)

print("\nAfter:")
print(df_optimized.memory_usage(deep=True))
# Index     132 bytes
# id       2000 bytes (uint16)
# age      1000 bytes (uint8)
# score    4000 bytes (float32)
# city     1200 bytes (category)
Understanding the Output¶
When optimization completes, diet() prints a summary of the memory it saved. (See Silent Mode below if you want to suppress these messages.)
In-Place Optimization¶
To modify the DataFrame directly without creating a copy:
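A minimal sketch, assuming diet() accepts a pandas-style inplace flag; the parameter name is an assumption, so check the signature in your installed version:

import dietpandas as dp

# Assumed flag: with inplace=True, df is modified directly and no copy is returned
dp.diet(df, inplace=True)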
This is useful when working with large DataFrames where you don't want to duplicate memory.
Silent Mode¶
To suppress output messages:
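A minimal sketch, assuming a verbose flag controls the printed summary; the parameter name is an assumption, so check your version's API reference:

import dietpandas as dp

# Assumed flag: verbose=False suppresses the optimization summary
df = dp.diet(df, verbose=False)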
Optimization Modes¶
Safe Mode (Default)¶
Preserves precision for most use cases:
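Safe mode requires no extra arguments; the default call applies the conversions shown in the example above:

import dietpandas as dp

df = dp.diet(df)
# int64   -> smallest integer type that fits the values (e.g. uint8, uint16)
# float64 -> float32
# object  -> category (for repetitive string columns)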
Aggressive Mode (Keto Diet)¶
Maximum compression with some precision loss:
df = dp.diet(df, aggressive=True)
# float64 -> float16 (~3 decimal digits of precision)
# Use for: visualization, approximate calculations
# Avoid for: financial calculations, scientific computing
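To see why aggressive mode is unsuitable for exact arithmetic, this plain NumPy snippet illustrates float16 rounding:

import numpy as np

x = np.float64(123.456)
y = np.float16(x)  # rounds to the nearest representable float16

print(y)                            # ~123.44: only about 3 significant decimal digits survive
print(x == y)                       # False: exact equality is lost
print(np.isclose(x, y, rtol=1e-3))  # True at a coarse tolerance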
Smart Float-to-Integer Conversion¶
By default, diet() automatically detects when float columns contain only whole numbers (like 1.0, 2.0, 3.0) and converts them to integer types for better memory efficiency.
When It Helps¶
This is particularly useful for:

- IDs and counts loaded as floats
- Year columns (2020.0, 2021.0, etc.)
- Categorical codes stored as floats
- Survey responses (1.0, 2.0, 3.0, 4.0, 5.0)
import pandas as pd
import dietpandas as dp

# Common scenario: CSV with mixed types
df = pd.DataFrame({
    'user_id': [1.0, 2.0, 3.0, 4.0],           # float64
    'year': [2020.0, 2021.0, 2022.0, 2023.0],  # float64
    'rating': [4.5, 3.8, 4.2, 3.9],            # float64 (has decimals)
})

df_optimized = dp.diet(df)  # float_to_int=True by default
print(df_optimized.dtypes)
# user_id    uint8    ✓ Converted to integer
# year       uint16   ✓ Converted to integer
# rating     float32  ✓ Stays float (has decimals)
Handling NaN Values¶
The conversion preserves NaN values using nullable integer types:
import numpy as np
import pandas as pd
import dietpandas as dp

df = pd.DataFrame({
    'ratings': [5.0, 4.0, np.nan, 3.0, 5.0]  # float64 with NaN
})
df_optimized = dp.diet(df)
print(df_optimized.dtypes)
# ratings UInt8 ✓ Nullable integer preserves NaN
Disabling Float-to-Int Conversion¶
If you want floats to remain as floats:
df = dp.diet(df, float_to_int=False)
# All floats stay as float32/float16 (depending on aggressive mode)
Customizing Optimization¶
Categorical Threshold¶
Control when strings are converted to categories:
# Convert to category if <30% unique values (stricter)
df = dp.diet(df, categorical_threshold=0.3)
# Convert to category if <70% unique values (more aggressive)
df = dp.diet(df, categorical_threshold=0.7)
Rule of thumb:

- Low threshold (0.3): Only very repetitive data becomes categorical
- High threshold (0.7): More columns become categorical
- Default (0.5): Balanced approach
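The threshold is compared against each column's unique-value ratio, which you can inspect yourself with plain pandas before picking a value:

import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'LA', 'SF'] * 1000})
unique_ratio = df['city'].nunique() / len(df)
print(unique_ratio)  # 0.001 -- far below any threshold above, so 'city' would become categorical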
DateTime Optimization¶
Enable datetime string detection and conversion:
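A minimal sketch; the flag name here is an assumption, so check your version's signature:

import dietpandas as dp

# Assumed flag: convert_datetime=True enables datetime string detection
df = dp.diet(df, convert_datetime=True)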
This automatically detects object columns with datetime strings and converts them to datetime64.
Sparse Data Optimization¶
For data with many repeated values:
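A minimal sketch; as with the other flags on this page, the parameter name is an assumption:

import dietpandas as dp

# Assumed flag: sparse=True enables sparse conversion
df = dp.diet(df, sparse=True)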
This converts columns where >90% of values are the same to sparse format.
Best for:

- Binary features (0/1)
- Indicator variables
- Data with many zeros or NaNs
- One-hot encoded features
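For reference, this plain-pandas sketch shows where the savings come from: a sparse column stores only the values that differ from the fill value.

import numpy as np
import pandas as pd

s = pd.Series(np.zeros(100_000, dtype='int8'))
s.iloc[::1000] = 1  # 100 ones among 100,000 zeros

sparse = s.astype(pd.SparseDtype('int8', fill_value=0))
print(s.memory_usage(deep=True))       # ~100,000 bytes dense
print(sparse.memory_usage(deep=True))  # a few hundred bytes: only the 100 ones are stored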
Common Patterns¶
Pattern 1: Load and Optimize¶
import pandas as pd
import dietpandas as dp
# Load with standard pandas
df = pd.read_csv("data.csv")
# Clean and transform
df = df.dropna()
df['new_col'] = df['col1'] + df['col2']
# Optimize before analysis
df = dp.diet(df)
# Now analyze with less memory
print(df.describe())
Pattern 2: Optimize in Pipeline¶
import dietpandas as dp

def load_and_clean(filepath):
    df = dp.read_csv(filepath)  # Already optimized
    df = df.dropna()
    df = df[df['age'] > 18]
    return df

df = load_and_clean("users.csv")
Pattern 3: Selective Optimization¶
import pandas as pd
import dietpandas as dp
# Don't optimize high-cardinality ID columns
df_ids = df[['user_id', 'transaction_id']]
df_data = df.drop(['user_id', 'transaction_id'], axis=1)
# Optimize only the data columns
df_data = dp.diet(df_data)
# Recombine
df = pd.concat([df_ids, df_data], axis=1)
Pattern 4: Iterative Optimization¶
import pandas as pd
import dietpandas as dp

# Load data
df = pd.read_csv("large_file.csv")

# Process in chunks
for i in range(0, len(df), 10000):
    chunk = df.iloc[i:i+10000]
    chunk = dp.diet(chunk)
    # Process chunk
    process(chunk)
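Note that this pattern still loads the entire file before optimizing, so peak memory is unchanged. If peak memory is the bottleneck, a sketch using pandas' own chunksize keeps only one chunk resident at a time:

import pandas as pd
import dietpandas as dp

# Stream the file so only one chunk is in memory at once
for chunk in pd.read_csv("large_file.csv", chunksize=10000):
    chunk = dp.diet(chunk)
    process(chunk)  # your processing here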
Memory Reports¶
Basic Report¶
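Use get_memory_report() (also shown in the comparison below) to get a per-column breakdown:

import dietpandas as dp

report = dp.get_memory_report(df)
print(report)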
Output:
column dtype memory_bytes memory_mb percent_of_total
0 description object 450000000 450.00 67.3
1 user_id int64 32000000 32.00 4.8
2 timestamp datetime64 32000000 32.00 4.8
Comparing Before/After¶
import dietpandas as dp
# Before optimization
report_before = dp.get_memory_report(df)
print("Before:")
print(report_before)
# Optimize
df = dp.diet(df)
# After optimization
report_after = dp.get_memory_report(df)
print("\nAfter:")
print(report_after)
Data Preservation¶
Diet Pandas preserves your data:
import numpy as np
import pandas as pd
import dietpandas as dp

# Original data
df = pd.DataFrame({'values': [1.1, 2.2, 3.3]})

# Optimize
df_opt = dp.diet(df)

# Data is preserved within float32 precision (an exact == comparison would
# fail after the float64 -> float32 downcast, so compare with a tolerance)
assert np.isclose(df['values'].sum(), df_opt['values'].sum())
assert np.isclose(df['values'].mean(), df_opt['values'].mean())
When NOT to Use Optimization¶
Avoid optimization for:
- ID Columns: High-cardinality strings or large integers
- Precise Calculations: Financial data requiring exact decimal precision
- Small DataFrames: Optimization overhead not worth it (<1MB)
- Streaming Data: Optimize in batches instead
Next Steps¶
- Learn about File I/O
- Explore Advanced Optimization
- Check Memory Reports