API Reference: Analysis Functions
This page documents the analysis and inspection functions in Diet Pandas.
Analysis Functions
analyze()
Analyze a DataFrame and return optimization recommendations without modifying it.
dietpandas.analysis.analyze(df, aggressive=False, categorical_threshold=0.5, sparse_threshold=0.9, optimize_datetimes=True, optimize_sparse_cols=False, optimize_bools=True)
This function performs a "dry run" of the optimization process, providing insights into potential memory savings and recommended data type changes without actually modifying the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame to analyze | required |
| `aggressive` | `bool` | If True, simulate aggressive optimization (float16) | `False` |
| `categorical_threshold` | `float` | Threshold for converting objects to categories | `0.5` |
| `sparse_threshold` | `float` | Threshold for converting to sparse format | `0.9` |
| `optimize_datetimes` | `bool` | If True, include datetime optimization analysis | `True` |
| `optimize_sparse_cols` | `bool` | If True, check for sparse optimization opportunities | `False` |
| `optimize_bools` | `bool` | If True, check for boolean optimization opportunities | `True` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with one row per input column and the columns `column`, `current_dtype`, `recommended_dtype`, `current_memory_mb`, `optimized_memory_mb`, `savings_mb`, `savings_percent`, and `reasoning` |
Examples:
>>> import pandas as pd
>>> from dietpandas.analysis import analyze
>>> df = pd.DataFrame({'age': [25, 30, 35], 'name': ['A', 'B', 'A']})
>>> analysis = analyze(df)
>>> print(analysis)
Source code in src/dietpandas/analysis.py
Example:
import pandas as pd
import dietpandas as dp
df = pd.DataFrame({
'id': range(1000),
'amount': [1.1, 2.2, 3.3] * 333 + [1.1],
'category': ['A', 'B', 'C'] * 333 + ['A']
})
# Get detailed analysis
analysis_df = dp.analyze(df)
print(analysis_df)
# column current_dtype recommended_dtype current_memory_mb optimized_memory_mb savings_mb savings_percent reasoning
# 0 id int64 uint16 0.008 0.002 0.006 75.0 Integer range fits in uint16
# 1 amount float64 float32 0.008 0.004 0.004 50.0 Standard float optimization
# 2 category object category 0.057 0.001 0.056 98.2 Low cardinality (3 unique values)
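The `Integer range fits in uint16` reasoning in the output above can be illustrated with a small, library-independent sketch. The helper `smallest_int_dtype` is hypothetical and is not part of Diet Pandas. It only assumes that the recommended integer type is the narrowest fixed-width type whose range covers the column's minimum and maximum, preferring unsigned types when no value is negative.

```python
# Hypothetical helper, not Diet Pandas' actual implementation.
# (name, min, max) per type: unsigned types first, then signed,
# each group ordered from narrowest to widest.
INT_TYPES = [
    ("uint8", 0, 2**8 - 1),
    ("uint16", 0, 2**16 - 1),
    ("uint32", 0, 2**32 - 1),
    ("uint64", 0, 2**64 - 1),
    ("int8", -(2**7), 2**7 - 1),
    ("int16", -(2**15), 2**15 - 1),
    ("int32", -(2**31), 2**31 - 1),
    ("int64", -(2**63), 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the name of the first listed type that can hold every value."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_TYPES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in a 64-bit integer")


print(smallest_int_dtype(range(1000)))  # uint16
print(smallest_int_dtype([-5, 120]))    # int8
```

For the `id` column above, the values 0 to 999 exceed the `uint8` maximum of 255 but fit in `uint16`, which takes 2 bytes per value instead of 8. That matches the 75% saving shown in the table.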
get_optimization_summary()
Analyze a DataFrame and return summary statistics of its optimization opportunities.
dietpandas.analysis.get_optimization_summary(df, **kwargs)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame to analyze | required |
| `**kwargs` | | Additional arguments passed to `analyze()` | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary with summary statistics, including `total_savings_percent` |
Examples:
>>> import pandas as pd
>>> from dietpandas.analysis import get_optimization_summary
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
>>> summary = get_optimization_summary(df)
>>> print(f"Potential savings: {summary['total_savings_percent']:.1f}%")
Example:
import pandas as pd
import dietpandas as dp
df = pd.DataFrame({
'id': range(1000),
'value': [1.5, 2.5, 3.5] * 333 + [1.5]
})
# Pass the original DataFrame; per the signature, analyze() is run internally
summary = dp.get_optimization_summary(df)
print(summary)
# {
# 'total_columns': 2,
# 'optimizable_columns': 2,
# 'current_memory_mb': 0.016,
# 'optimized_memory_mb': 0.006,
# 'total_savings_mb': 0.010,
# 'total_savings_percent': 62.5
# }
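The totals in the printed summary are consistent with simple sums over per-column figures. The sketch below reproduces that arithmetic. The `summarize` helper is hypothetical, and the per-column numbers are assumptions: `id` going from `int64` to `uint16` and `value` from `float64` to `float32`, with the same sizes as the matching columns in the `analyze()` example. It is not Diet Pandas' actual implementation.

```python
# Hypothetical aggregation, not Diet Pandas' actual implementation.
def summarize(rows):
    """Combine per-column memory figures (in MB) into overall totals."""
    current = sum(row["current_memory_mb"] for row in rows)
    optimized = sum(row["optimized_memory_mb"] for row in rows)
    savings = current - optimized
    return {
        "total_columns": len(rows),
        # A column counts as optimizable if its optimized size is smaller.
        "optimizable_columns": sum(
            row["optimized_memory_mb"] < row["current_memory_mb"] for row in rows
        ),
        "current_memory_mb": round(current, 3),
        "optimized_memory_mb": round(optimized, 3),
        "total_savings_mb": round(savings, 3),
        "total_savings_percent": round(100 * savings / current, 1) if current else 0.0,
    }


# Assumed per-column figures for the `id` and `value` columns.
rows = [
    {"column": "id", "current_memory_mb": 0.008, "optimized_memory_mb": 0.002},
    {"column": "value", "current_memory_mb": 0.008, "optimized_memory_mb": 0.004},
]

summary = summarize(rows)
print(summary["total_savings_percent"])  # 62.5
```

The percentage is the total savings divided by the current total (0.010 / 0.016), not an average of the per-column percentages (75% and 50%).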
estimate_memory_reduction()
Quickly estimate the potential memory reduction as a percentage.
dietpandas.analysis.estimate_memory_reduction(df, **kwargs)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame to analyze | required |
| `**kwargs` | | Additional arguments passed to `analyze()` | `{}` |
Returns:
| Type | Description |
|---|---|
| `float` | Estimated memory reduction as a percentage (0-100) |
Examples:
>>> import pandas as pd
>>> from dietpandas.analysis import estimate_memory_reduction
>>> df = pd.DataFrame({'year': [2020, 2021], 'val': [1.1, 2.2]})
>>> reduction = estimate_memory_reduction(df)
>>> print(f"Estimated reduction: {reduction:.1f}%")
Example:
import pandas as pd
import dietpandas as dp
df = pd.DataFrame({
'int_col': [1, 2, 3, 4, 5] * 200,
'float_col': [1.1, 2.2, 3.3, 4.4, 5.5] * 200,
'str_col': ['A', 'B', 'C', 'A', 'B'] * 200
})
# Quick estimate without detailed analysis
reduction = dp.estimate_memory_reduction(df)
print(f"Estimated reduction: {reduction:.1f}%")
# Estimated reduction: 78.3%
# Compare with the figure from the full summary
# (get_optimization_summary() takes the original DataFrame)
summary = dp.get_optimization_summary(df)
print(f"Actual reduction: {summary['total_savings_percent']:.1f}%")
Workflow Example
Analyze Before Optimizing
import pandas as pd
import dietpandas as dp
# Load your data
df = pd.read_csv("data.csv")
# 1. Quick estimate
print(f"Expected reduction: {dp.estimate_memory_reduction(df):.1f}%")
# 2. Detailed analysis
analysis = dp.analyze(df)
print(analysis)
# 3. Review summary (computed from the original DataFrame)
summary = dp.get_optimization_summary(df)
print(f"Total savings: {summary['total_savings_mb']:.2f} MB")
print(f"Reduction: {summary['total_savings_percent']:.1f}%")
# 4. Apply optimization
df_optimized = dp.diet(df)
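After step 4 it can be worth confirming the realized savings with pandas' own `DataFrame.memory_usage(deep=True)`. The sketch below shows one way to do that. So that it runs without `data.csv` or Diet Pandas installed, it builds a small DataFrame and uses a manual `astype` downcast as a stand-in for `dp.diet(df)`. That substitution, the `memory_mb` helper, and the chosen dtypes are assumptions for illustration only.

```python
import pandas as pd


def memory_mb(frame):
    """Total memory of a DataFrame in MB (10**6 bytes), including string contents."""
    return frame.memory_usage(deep=True).sum() / 1e6


df = pd.DataFrame({
    "id": range(1000),
    "category": ["A", "B", "C"] * 333 + ["A"],
})

# Stand-in for dp.diet(df): apply the dtypes an analysis might recommend.
df_optimized = df.astype({"id": "uint16", "category": "category"})

before, after = memory_mb(df), memory_mb(df_optimized)
reduction = 100 * (before - after) / before
print(f"{before:.3f} MB -> {after:.3f} MB ({reduction:.1f}% smaller)")

# The values themselves should be unchanged by the downcast.
assert (df_optimized["id"] == df["id"]).all()
```

Passing `deep=True` matters for string columns: without it, pandas may count only the per-row pointers rather than the string contents, which understates the original size.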
Aggressive Mode Analysis
import pandas as pd
import dietpandas as dp
df = pd.DataFrame({
'metric': [1.123456789] * 1000
})
# Compare normal vs aggressive mode
normal_analysis = dp.analyze(df, aggressive=False)
aggressive_analysis = dp.analyze(df, aggressive=True)
print("Normal mode:")
print(normal_analysis)
# float64 -> float32 (50% reduction)
print("\nAggressive mode:")
print(aggressive_analysis)
# float64 -> float16 (75% reduction, but possible precision loss)
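To make the "possible precision loss" concrete, the following sketch round-trips the same value through single and half precision. It is independent of Diet Pandas and uses only Python's standard `struct` module, whose `f` and `e` format characters are the IEEE 754 binary32 and binary16 formats. The `round_trip` helper is a hypothetical name introduced for this example.

```python
import struct


def round_trip(value, fmt):
    """Store a Python float in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


original = 1.123456789
as_float32 = round_trip(original, "<f")  # 4 bytes, same layout as float32
as_float16 = round_trip(original, "<e")  # 2 bytes, same layout as float16

print(f"float32: {as_float32:.9f} (error {abs(as_float32 - original):.1e})")
print(f"float16: {as_float16:.9f} (error {abs(as_float16 - original):.1e})")
```

Half precision keeps only about three significant decimal digits, so the extra 25% of memory saved by `aggressive=True` comes at the cost of a much larger rounding error. Checking a few representative values this way before committing to `float16` is a cheap safeguard.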
See Also
- Core Functions - Main optimization functions
- I/O Functions - File reading with automatic optimization
- Exceptions - Custom warnings and exceptions