Performance Optimization¶
This guide covers performance optimization techniques for the Climate Diagnostics Toolkit, with a focus on sophisticated chunking strategies and memory management for large climate datasets.
🚀 Overview¶
The Climate Diagnostics Toolkit includes advanced performance optimization features:
Disk-aware chunking that adapts to on-disk file structure (see the sketch after this list)
Operation-specific optimization for different analysis types
Memory-conscious processing with automatic scaling
Dynamic chunk calculation based on system resources
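To see what "disk-aware" means in practice, you can inspect the chunk layout a netCDF4/HDF5 file was written with; the toolkit aligns its Dask chunks with this layout. A minimal sketch using plain xarray, assuming a netCDF4 file with a 'temperature' variable as in the examples below:

import xarray as xr

ds = xr.open_dataset("large_climate_data.nc")

# netCDF4/HDF5 files record their on-disk chunk layout in the variable
# encoding; None means the variable was written contiguously
disk_chunks = ds['temperature'].encoding.get('chunksizes')
print(f"On-disk chunks: {disk_chunks}")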
⚡ Quick Performance Tips¶
Always optimize chunking for your specific analysis type
Use operation-specific methods like optimize_for_trends()
Monitor memory usage with built-in analysis tools
Choose appropriate chunk sizes based on your data frequency
🔧 Chunking Strategies¶
Basic Chunking Optimization¶
Start with basic chunking optimization for any analysis:
import xarray as xr
import climate_diagnostics
# Load your dataset
ds = xr.open_dataset("large_climate_data.nc")
# Apply basic optimization
ds_optimized = ds.climate_timeseries.optimize_chunks(
    target_mb=50,  # Target 50 MB chunks
    variable='temperature'
)
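To check that the result actually lands near the 50 MB target, you can compute the footprint of the largest chunk yourself. A minimal sketch using only NumPy (assumes ds_optimized is Dask-backed, as above):

import numpy as np

def chunk_size_mb(da):
    """Approximate in-memory size of the largest chunk of a DataArray."""
    if da.chunks is None:  # not Dask-backed: the whole array is one "chunk"
        return da.nbytes / 1e6
    n_elems = np.prod([max(sizes) for sizes in da.chunks])
    return n_elems * da.dtype.itemsize / 1e6

print(f"Largest chunk: {chunk_size_mb(ds_optimized['temperature']):.1f} MB")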
Advanced Chunking Strategies¶
For more sophisticated optimization:
# Advanced optimization with operation-specific tuning
ds_advanced = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='timeseries',      # 'timeseries', 'spatial', 'statistical'
    performance_priority='balanced',  # 'memory', 'speed', 'balanced'
    memory_limit_gb=8.0,              # Set memory limit
    use_disk_chunks=True              # Preserve spatial disk chunks
)
Operation-Specific Optimization¶
Different analysis types benefit from different chunking strategies:
Time Series Analysis:
# Optimize for time series operations
ds_ts = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='timeseries',
    performance_priority='memory'
)
# Or use the dedicated method
ds_ts = ds.climate_timeseries.optimize_for_decomposition()
Trend Analysis:
# Optimize for trend calculations
ds_trends = ds.climate_trends.optimize_for_trends(
    variable='temperature',
    use_case='spatial_trends'
)
Spatial Analysis:
# Optimize for spatial operations and plotting
ds_spatial = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='spatial',
    performance_priority='speed'
)
📊 Performance Analysis¶
Chunking Analysis Tools¶
Analyze your current chunking strategy:
# Print detailed chunking information
ds.climate_timeseries.print_chunking_info(detailed=True)
# Get chunking recommendations for different use cases
ds.climate_timeseries.analyze_chunking_strategy()
Example output:
Climate Data Chunking Analysis
================================================
Recommended chunking strategies:

Time Series:
    Target: 25 MB chunks
    Max: 100 MB chunks
    Chunks: {'time': 48, 'lat': 73, 'lon': 144}
    Use: Optimized for time series analysis with smaller chunks

Spatial Analysis:
    Target: 100 MB chunks
    Max: 500 MB chunks
    Chunks: {'time': 12, 'lat': 145, 'lon': 288}
    Use: Larger chunks for spatial operations and mapping
Memory Management¶
Monitor and control memory usage:
# Check system memory
from climate_diagnostics.utils.chunking_utils import get_system_memory_info
memory_info = get_system_memory_info()
print(f"Available memory: {memory_info['available']:.1f} GB")
# Optimize for memory-constrained systems
ds_memory = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='general',
    performance_priority='memory',
    memory_limit_gb=4.0  # Limit to 4 GB
)
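For an independent check on those numbers, the psutil package reports the same system figures directly. A quick sketch:

import psutil

vm = psutil.virtual_memory()
print(f"Total:     {vm.total / 1e9:.1f} GB")
print(f"Available: {vm.available / 1e9:.1f} GB")

# Leave headroom: a common rule of thumb is to cap chunked computation
# at roughly half of the currently free memory
safe_limit_gb = 0.5 * vm.available / 1e9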
🎯 Best Practices by Data Type¶
Daily Data (High Frequency)¶
# For daily data (365+ time steps per year)
ds_daily = ds.climate_timeseries.optimize_chunks(
    target_mb=75,
    time_freq='daily'
)
Monthly Data (Standard Climate)¶
# For monthly data (12 time steps per year)
ds_monthly = ds.climate_timeseries.optimize_chunks(
    target_mb=50,
    time_freq='monthly'
)
High-Resolution Spatial Data¶
# For high-resolution grids (>1000x1000)
ds_hires = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='spatial',
    performance_priority='memory',
    memory_limit_gb=8.0
)
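If you are unsure which of these categories your file falls into, xarray can usually infer the time frequency from a CF-style time coordinate. A quick check (sketch):

import xarray as xr

ds = xr.open_dataset("large_climate_data.nc")

# 'D' = daily, 'MS' = monthly, 'h' = hourly (exact codes vary with the
# xarray/pandas version); None means no regular frequency was detected
freq = xr.infer_freq(ds.indexes['time'])
print(f"Time frequency: {freq}, {ds.sizes['time']} time steps")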
🔍 Troubleshooting Performance Issues¶
Common Issues and Solutions¶
Memory Errors:
# Reduce chunk sizes
ds_safe = ds.climate_timeseries.optimize_chunks_advanced(
    performance_priority='memory',
    memory_limit_gb=2.0  # Conservative limit
)
Slow Processing:
# Increase chunk sizes for speed
ds_fast = ds.climate_timeseries.optimize_chunks_advanced(
    performance_priority='speed',
    operation_type='spatial'
)
Poor Parallelization:
# Ensure sufficient chunks for parallel processing
ds_parallel = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='general',
    memory_limit_gb=16.0  # Allow larger memory for more chunks
)
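Before reaching for larger memory limits, it helps to confirm there are actually enough chunks to keep every core busy: Dask parallelizes across chunks, so fewer chunks than cores means idle workers. A quick check (sketch; 'temperature' is a placeholder variable name):

import multiprocessing

da = ds_parallel['temperature']
n_chunks = da.data.npartitions  # number of Dask blocks
n_cores = multiprocessing.cpu_count()
print(f"{n_chunks} chunks across {n_cores} cores")
# Aim for at least a few chunks per core for good load balancing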
📈 Performance Monitoring¶
Track Performance Improvements¶
import time
from dask.diagnostics import ProgressBar

# Time operations with different chunking strategies
def time_operation(dataset, operation_name):
    start = time.time()
    with ProgressBar():
        # Assumes the dataset has an 'air' variable; adjust for your data
        result = dataset.air.mean(['lat', 'lon']).compute()
    end = time.time()
    print(f"{operation_name}: {end - start:.2f} seconds")
    return result

# Compare performance
time_operation(ds_original, "Original chunking")
time_operation(ds_optimized, "Optimized chunking")
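Wall-clock timing alone will not tell you whether a strategy is memory-bound or CPU-bound; Dask's built-in ResourceProfiler tracks both alongside the computation. A sketch (reuses the 'air' variable from above; visualize() requires bokeh to be installed):

from dask.diagnostics import ResourceProfiler

with ResourceProfiler(dt=0.25) as rprof:  # sample every 0.25 s
    ds_optimized.air.mean(['lat', 'lon']).compute()

rprof.visualize()  # interactive plot of CPU % and memory over time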
Real-World Examples¶
Large Climate Model Output:
# For CMIP6-style data (>10 GB files)
ds_cmip = xr.open_dataset("cmip6_tas_daily.nc")
ds_cmip_opt = ds_cmip.climate_timeseries.optimize_chunks_advanced(
    operation_type='timeseries',
    performance_priority='balanced',
    memory_limit_gb=12.0,
    variable='tas'
)
Observational Gridded Data:
# For observational products (ERA5, etc.)
ds_obs = xr.open_dataset("era5_temperature.nc")
ds_obs_opt = ds_obs.climate_timeseries.optimize_chunks(
    target_mb=100,
    time_freq='hourly',
    variable='t2m'
)
🎛️ Advanced Configuration¶
Custom Chunking Strategies¶
For specialized use cases, you can create custom chunking:
from climate_diagnostics.utils.chunking_utils import (
    calculate_optimal_chunks_from_disk,
    dynamic_chunk_calculator
)

# Custom disk-aware chunking
custom_chunks = calculate_optimal_chunks_from_disk(
    ds,
    target_mb=150,
    variable='precipitation'
)
ds_custom = ds.chunk(custom_chunks)

# Dynamic chunking with custom parameters
adaptive_chunks = dynamic_chunk_calculator(
    ds,
    operation_type='statistical',
    memory_limit_gb=6.0,
    performance_priority='speed'
)
ds_adaptive = ds.chunk(adaptive_chunks)  # Apply the computed chunks
See Also¶
Quickstart - Getting started with basic optimization
Utilities API Reference - Full API reference for chunking utilities
Plotting Guide - Visualization performance tips