API Reference: Core Functions
This page documents all core optimization functions in Diet Pandas.
Main Functions
diet()
Optimize a pandas DataFrame by downcasting data types to reduce memory usage.
New in v0.5.0: supports parallel processing through the parallel and max_workers parameters, for a 2-4x speedup.
dietpandas.core.diet(df, verbose=True, aggressive=False, categorical_threshold=0.5, sparse_threshold=0.9, optimize_datetimes=True, optimize_sparse_cols=False, optimize_bools=True, float_to_int=True, inplace=False, skip_columns=None, force_categorical=None, force_aggressive=None, warn_on_issues=False, parallel=True, max_workers=None)
Main function to optimize DataFrame memory usage.
This function iterates over all columns and applies the appropriate optimization to each:
- Booleans: Convert integer/object columns with boolean values to bool dtype
- Integers: Downcast to smallest safe type (int8, int16, uint8, etc.)
- Floats: Convert to integers if they contain only whole numbers, else float32 (or float16 in aggressive mode)
- Objects: Convert to category if cardinality is low
- DateTime: Optimize datetime representations
- Sparse: Convert to sparse arrays for columns with many repeated values
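The per-column rules above amount to a dispatch on dtype. The following is a minimal sketch of that idea using only public pandas type checks. It is not Diet Pandas' actual implementation: the helper name `rule_for` and its return labels are hypothetical, and the sketch ignores boolean-like and sparse detection, which depend on a column's values rather than its dtype.

```python
import pandas as pd
from pandas.api import types as ptypes


def rule_for(series: pd.Series) -> str:
    """Return a label for the rule family a column would fall under.

    Hypothetical helper for illustration; the labels are not part of Diet Pandas.
    """
    if ptypes.is_bool_dtype(series):
        return "bool (nothing to do)"
    if ptypes.is_integer_dtype(series):
        return "integer downcast"
    if ptypes.is_float_dtype(series):
        return "float downcast or float-to-int"
    if ptypes.is_datetime64_any_dtype(series):
        return "datetime"
    if ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series):
        return "object-to-category check"
    return "unchanged"


df = pd.DataFrame({
    "year": [2020, 2021, 2022],
    "val": [1.1, 2.2, 3.3],
    "name": ["a", "b", "a"],
    "when": pd.to_datetime(["2020-01-01", "2021-01-01", "2022-01-01"]),
})
rules = {col: rule_for(df[col]) for col in df.columns}
```

Here `rules` maps each column name to the rule family that would be considered for it.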
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame to optimize | required |
| verbose | bool | If True, print memory reduction statistics | True |
| aggressive | bool | If True, use more aggressive optimization (may lose precision) | False |
| categorical_threshold | float | Unique-value ratio below which object columns are converted to category | 0.5 |
| sparse_threshold | float | Fraction of values that the most common value must reach for conversion to sparse format | 0.9 |
| optimize_datetimes | bool | If True, optimize datetime columns | True |
| optimize_sparse_cols | bool | If True, check for sparse optimization opportunities | False |
| optimize_bools | bool | If True, convert boolean-like columns to bool dtype | True |
| float_to_int | bool | If True, convert float columns to integers when they contain only whole numbers | True |
| inplace | bool | If True, modify the DataFrame in place | False |
| skip_columns | list | Column names to exclude from optimization | None |
| force_categorical | list | Column names to always convert to category | None |
| force_aggressive | list | Column names to optimize in aggressive mode | None |
| warn_on_issues | bool | If True, emit warnings for potential issues | False |
| parallel | bool | If True, use parallel processing for column optimization | True |
| max_workers | int | Maximum number of worker threads for parallel processing; None uses the number of CPU cores | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | Optimized DataFrame with reduced memory usage |
Examples:
>>> import pandas as pd
>>> from dietpandas import diet
>>> df = pd.DataFrame({'year': [2020, 2021, 2022], 'val': [1.1, 2.2, 3.3]})
>>> optimized = diet(df)
🥗 Diet Complete: Memory reduced by 62.5%
0.00MB -> 0.00MB
>>> # Force categorical conversion on a high-cardinality column
>>> # (assumes df has a 'country_code' column)
>>> df = diet(df, force_categorical=['country_code'])
>>> # Use aggressive mode only for specific columns
>>> # (assumes df has an 'approximation_field' column)
>>> df = diet(df, force_aggressive=['approximation_field'])
Source code in src/dietpandas/core.py
Example:
import dietpandas as dp
import pandas as pd
df = pd.DataFrame({'col': [1, 2, 3]})
# Standard optimization
df_optimized = dp.diet(df)
# Parallel processing (default, 2-4x faster)
df_optimized = dp.diet(df, parallel=True)
# Control number of threads
df_optimized = dp.diet(df, parallel=True, max_workers=4)
# Sequential processing
df_optimized = dp.diet(df, parallel=False)
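The parallel and max_workers parameters describe a pool of worker threads that handle columns independently. How Diet Pandas schedules columns internally is not shown on this page, so the following is only a sketch of the general pattern using the standard library. The names `column_range` and `process_columns` are hypothetical, and plain lists stand in for DataFrame columns.

```python
from concurrent.futures import ThreadPoolExecutor


def column_range(values):
    """Stand-in for per-column work: return (min, max) of a column."""
    return min(values), max(values)


def process_columns(columns, func, parallel=True, max_workers=None):
    """Apply func to every column, optionally on a thread pool."""
    if not parallel:
        return {name: func(values) for name, values in columns.items()}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        results = list(pool.map(func, columns.values()))
    return dict(zip(columns.keys(), results))


columns = {"a": [1, 2, 3], "b": [-5, 0, 5], "c": [100, 200, 300]}
sequential = process_columns(columns, column_range, parallel=False)
threaded = process_columns(columns, column_range, parallel=True, max_workers=4)
```

Both calls produce the same mapping; only the scheduling differs. Any speedup from threads depends on how much of the per-column work releases the GIL.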
optimize_int()
Optimize integer columns to the smallest safe integer type.
dietpandas.core.optimize_int(series)
Downcasts integer series to the smallest possible safe type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| series | Series | A pandas Series with integer dtype | required |
Returns:
| Type | Description |
|---|---|
| Series | Optimized Series with smallest safe integer type |
Examples:
>>> s = pd.Series([1, 2, 3], dtype='int64')
>>> optimized = optimize_int(s)
>>> optimized.dtype
dtype('uint8')
Source code in src/dietpandas/core.py
Example:
import pandas as pd
from dietpandas import optimize_int
s = pd.Series([1, 2, 3, 4, 5]) # int64
s_optimized = optimize_int(s) # uint8
optimize_float()
Optimize float columns to smaller precision when safe.
dietpandas.core.optimize_float(series, aggressive=False, float_to_int=True)
Downcasts float series to float32, float16, or integer types when possible.
First checks if the float values are actually integers (no decimal part). If so, converts to the appropriate integer type. Otherwise, downcasts to float32 or float16 (if aggressive mode).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| series | Series | A pandas Series with float dtype | required |
| aggressive | bool | If True, use float16 for maximum compression (may lose precision) | False |
| float_to_int | bool | If True, convert floats to integers when they have no decimal part | True |
Returns:
| Type | Description |
|---|---|
| Series | Optimized Series with smaller float type or integer type |
Examples:
>>> s = pd.Series([1.0, 2.0, 3.0, 4.0])
>>> optimized = optimize_float(s, float_to_int=True)
>>> optimized.dtype
dtype('int8')
>>> s = pd.Series([1.5, 2.5, 3.5])
>>> optimized = optimize_float(s, float_to_int=True)
>>> optimized.dtype
dtype('float32')
Source code in src/dietpandas/core.py
Example:
import pandas as pd
from dietpandas import optimize_float
s = pd.Series([1.1, 2.2, 3.3]) # float64
s_optimized = optimize_float(s) # float32
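The first step described for optimize_float() is a check that every value has no decimal part. A minimal standard-library sketch of such a check follows. The helper name `holds_only_whole_numbers` is hypothetical, and how Diet Pandas treats NaN is not documented on this page; the sketch treats NaN and infinity as blocking conversion because NumPy integer dtypes cannot hold them.

```python
import math


def holds_only_whole_numbers(values):
    """True if every value is finite and has no fractional part."""
    return all(math.isfinite(v) and float(v).is_integer() for v in values)


whole = holds_only_whole_numbers([1.0, 2.0, 3.0, 4.0])
fractional = holds_only_whole_numbers([1.5, 2.5, 3.5])
with_nan = holds_only_whole_numbers([1.0, float("nan")])
```

Only the first list would be a candidate for integer conversion under this check.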
optimize_obj()
Optimize object columns by converting low-cardinality strings to category type.
dietpandas.core.optimize_obj(series, categorical_threshold=0.5)
Converts object columns to categories if unique ratio is low.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| series | Series | A pandas Series with object dtype | required |
| categorical_threshold | float | If unique_ratio < threshold, convert to category | 0.5 |
Returns:
| Type | Description |
|---|---|
| Series | Optimized Series (categorical if beneficial, otherwise unchanged) |
Examples:
>>> s = pd.Series(['A', 'B', 'A', 'B', 'A', 'B'])
>>> optimized = optimize_obj(s)
>>> optimized.dtype.name
'category'
Source code in src/dietpandas/core.py
Example:
import pandas as pd
from dietpandas import optimize_obj
s = pd.Series(['A', 'B', 'A', 'B', 'C'] * 100) # object
s_optimized = optimize_obj(s) # category
optimize_datetime()
Optimize datetime columns for better memory efficiency.
dietpandas.core.optimize_datetime(series)
Optimizes datetime columns by converting to more efficient datetime64 types.
For datetime columns, attempts to use more memory-efficient representations:
- If all datetimes are dates (no time component), suggests conversion
- Removes unnecessary precision (e.g., nanosecond to microsecond)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| series | Series | A pandas Series with datetime64 dtype | required |
Returns:
| Type | Description |
|---|---|
| Series | Optimized Series with more efficient datetime representation |
Examples:
>>> dates = pd.Series(pd.date_range('2020-01-01', periods=100))
>>> optimized = optimize_datetime(dates)
Source code in src/dietpandas/core.py
Example:
import pandas as pd
from dietpandas import optimize_datetime
# Parse the datetime strings first: optimize_datetime() expects datetime64 input
s = pd.to_datetime(pd.Series(['2020-01-01', '2020-02-01', '2020-03-01']))
s_optimized = optimize_datetime(s)
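The "no time component" condition described for optimize_datetime() can be tested by comparing each timestamp's time of day with midnight. The sketch below uses only the standard library. The helper name `is_date_only` is hypothetical, and this is not necessarily how Diet Pandas performs the check.

```python
from datetime import datetime, time


def is_date_only(stamps):
    """True if every timestamp falls exactly on midnight."""
    return all(ts.time() == time(0, 0) for ts in stamps)


midnights = [datetime(2020, 1, 1), datetime(2020, 2, 1)]
mixed = [datetime(2020, 1, 1), datetime(2020, 2, 1, 13, 30)]

dates_only = is_date_only(midnights)
has_times = not is_date_only(mixed)
```

For a datetime64 Series, a comparable check in pandas is `(s == s.dt.normalize()).all()`.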
optimize_sparse()
Convert columns with many repeated values to sparse format.
dietpandas.core.optimize_sparse(series, sparse_threshold=0.9)
Converts series to sparse format if it has many repeated values (especially zeros/NaNs).
Sparse arrays are highly memory-efficient when a series contains mostly one value. Common for binary features, indicator variables, or data with many missing values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| series | Series | A pandas Series | required |
| sparse_threshold | float | Convert to sparse if the most common value makes up at least this fraction of the values | 0.9 |
Returns:
| Type | Description |
|---|---|
| Series | Optimized Series (sparse if beneficial, otherwise unchanged) |
Examples:
>>> s = pd.Series([0, 0, 1, 0, 0, 0, 2, 0, 0, 0])
>>> optimized = optimize_sparse(s)
>>> isinstance(optimized.dtype, pd.SparseDtype)
True
Source code in src/dietpandas/core.py
Example:
import pandas as pd
from dietpandas import optimize_sparse
# Column with 95% zeros
s = pd.Series([0] * 950 + [1] * 50)
s_optimized = optimize_sparse(s) # Sparse[int8, 0]
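The sparse_threshold rule compares the share of the most common value against the threshold. A minimal sketch of that comparison using the standard library follows. The helper name `most_common_fraction` is hypothetical, and the library's exact criterion may differ.

```python
from collections import Counter


def most_common_fraction(values):
    """Return the most frequent value and the fraction of entries it fills."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


values = [0] * 950 + [1] * 50
fill_value, fraction = most_common_fraction(values)
qualifies = fraction >= 0.9  # compare against the default sparse_threshold
```

With 950 zeros out of 1000 entries, the zero value fills 95% of the column, which clears the default threshold of 0.9.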
get_memory_report()
Generate a detailed memory usage report for a DataFrame.
dietpandas.core.get_memory_report(df)
Generate a detailed memory usage report for each column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with memory statistics per column |
Examples:
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
>>> report = get_memory_report(df)
>>> print(report)
Source code in src/dietpandas/core.py
Example:
import dietpandas as dp
import pandas as pd
df = pd.DataFrame({
'a': range(1000),
'b': ['text'] * 1000
})
report = dp.get_memory_report(df)
print(report)
# column dtype memory_bytes memory_mb percent_of_total
# 0 b object 59000 0.059 88.1
# 1 a int64 8000 0.008 11.9
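A report with the same columns as the sample output can be assembled from pandas' own `DataFrame.memory_usage(deep=True)`. The sketch below is an illustration, not the source of get_memory_report(): the function name `memory_report_sketch` is hypothetical, and it assumes decimal megabytes (bytes / 1,000,000) and descending order by size, which is what the sample output suggests.

```python
import pandas as pd


def memory_report_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column memory statistics, largest column first."""
    # deep=True also counts the Python objects held by object columns.
    mem = df.memory_usage(deep=True, index=False)
    report = pd.DataFrame({
        "column": list(mem.index),
        "dtype": [str(df[col].dtype) for col in mem.index],
        "memory_bytes": mem.to_numpy(),
    })
    report["memory_mb"] = report["memory_bytes"] / 1_000_000
    report["percent_of_total"] = 100 * report["memory_bytes"] / mem.sum()
    return report.sort_values("memory_bytes", ascending=False).reset_index(drop=True)


df = pd.DataFrame({"a": range(1000), "b": ["text"] * 1000})
report = memory_report_sketch(df)
```

Exact byte counts vary with the Python and pandas versions, so the figures will not necessarily match the sample output above.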
Type Optimization Rules
Integer Optimization
| Value Range | Optimized Type | Bytes Saved per Value |
|---|---|---|
| 0 to 255 | uint8 | 7 bytes (from int64) |
| 0 to 65,535 | uint16 | 6 bytes |
| -128 to 127 | int8 | 7 bytes |
| -32,768 to 32,767 | int16 | 6 bytes |
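The ranges in this table follow from the bit width of each type. The following sketch picks the narrowest type for a given minimum and maximum. The helper name `smallest_int_type` is hypothetical, and preferring unsigned types for non-negative data is an assumption, chosen because it matches the optimize_int() example where [1, 2, 3] becomes uint8.

```python
def smallest_int_type(lo: int, hi: int) -> str:
    """Name the narrowest integer dtype that can hold every value in [lo, hi]."""
    for bits in (8, 16, 32, 64):
        if lo >= 0:
            # Non-negative data: the unsigned range is 0 .. 2**bits - 1.
            if hi <= 2**bits - 1:
                return f"uint{bits}"
        elif lo >= -(2 ** (bits - 1)) and hi <= 2 ** (bits - 1) - 1:
            # Signed range is -2**(bits-1) .. 2**(bits-1) - 1.
            return f"int{bits}"
    raise OverflowError("range does not fit in a 64-bit integer")


small_positive = smallest_int_type(1, 3)
just_over_a_byte = smallest_int_type(0, 256)
small_signed = smallest_int_type(-128, 127)
wider_signed = smallest_int_type(-129, 0)
```

A single value outside a range pushes the whole column to the next width, which is why the minimum and maximum are all the function needs.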
Float Optimization
| Mode | Conversion | Precision | Use Case |
|---|---|---|---|
| Safe | float64 → float32 | ~7 decimal digits | Most ML tasks |
| Aggressive | float64 → float16 | ~3 decimal digits | Extreme compression |
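The precision column can be checked without any third-party library by round-tripping a value through the IEEE 754 single and half precision formats with Python's `struct` module. This is an illustration of the number formats only, not of Diet Pandas itself; the helper name `roundtrip` is hypothetical.

```python
import struct


def roundtrip(value: float, fmt: str) -> float:
    """Store value in the given struct format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 3.14159265358979
as_float32 = roundtrip(value, "f")  # IEEE 754 single precision (4 bytes)
as_float16 = roundtrip(value, "e")  # IEEE 754 half precision (2 bytes)

err32 = abs(as_float32 - value)
err16 = abs(as_float16 - value)
```

The half precision error is several orders of magnitude larger than the single precision error, which is why aggressive mode is described as possibly losing precision.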
Object Optimization
| Condition | Optimization | Memory Savings |
|---|---|---|
| Unique ratio < 50% | object → category | 50-90% typical |
| All datetime strings | object → datetime64 | 50-70% typical |
| High cardinality | No change | Keep as object |
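The unique ratio in the first row is the number of distinct values divided by the number of values. A minimal sketch of the decision follows, with plain lists standing in for object columns. The helper names `unique_ratio` and `should_convert_to_category` are hypothetical; the default of 0.5 mirrors categorical_threshold.

```python
def unique_ratio(values):
    """Distinct values divided by total values (expects a non-empty sequence)."""
    return len(set(values)) / len(values)


def should_convert_to_category(values, categorical_threshold=0.5):
    """Apply the documented rule: convert when unique_ratio < threshold."""
    return unique_ratio(values) < categorical_threshold


repeated = ["A", "B", "A", "B", "C"] * 100    # 3 distinct values in 500 rows
distinct = [f"id-{i}" for i in range(500)]     # every value is different

convert_repeated = should_convert_to_category(repeated)
convert_distinct = should_convert_to_category(distinct)
```

The repeated column has a ratio of 3/500 and qualifies; the identifier-like column has a ratio of 1.0 and stays as object.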
Sparse Optimization
| Condition | Optimization | Memory Savings |
|---|---|---|
| >90% repeated values | Dense → Sparse | 90-99% typical |
| Binary features (0/1) | Dense → Sparse[int8] | ~96% typical |
| Low sparsity | No change | Keep as dense |