Performance Optimization

This guide covers performance optimization techniques for the Climate Diagnostics Toolkit, focusing on chunking strategies and memory management for large climate datasets.

🚀 Overview

The Climate Diagnostics Toolkit includes advanced performance optimization features:

  • Disk-aware chunking that adapts to file structure

  • Operation-specific optimization for different analysis types

  • Memory-conscious processing with automatic scaling

  • Dynamic chunk calculation based on system resources

Quick Performance Tips

  1. Always optimize chunking for your specific analysis type

  2. Use operation-specific methods like optimize_for_trends()

  3. Monitor memory usage with built-in analysis tools

  4. Choose appropriate chunk sizes based on your data frequency (see the combined sketch below)
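
Putting these tips together, a minimal quick-start sketch might look like the following. It only uses the accessor methods shown later in this guide; the file name and variable are placeholders:

import xarray as xr
import climate_diagnostics

# Load lazily and optimize for the intended analysis (tips 1-2)
ds = xr.open_dataset("large_climate_data.nc")
ds = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='timeseries',
    performance_priority='balanced'
)

# Inspect the resulting layout and memory footprint (tip 3)
ds.climate_timeseries.print_chunking_info(detailed=True)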

🔧 Chunking Strategies

Basic Chunking Optimization

Start with basic chunking optimization for any analysis:

import xarray as xr
import climate_diagnostics

# Load your dataset
ds = xr.open_dataset("large_climate_data.nc")

# Apply basic optimization
ds_optimized = ds.climate_timeseries.optimize_chunks(
    target_mb=50,  # Target 50 MB chunks
    variable='temperature'
)
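
After optimization you can confirm the layout with standard xarray and Dask attributes (this assumes the returned dataset is Dask-backed, which chunked datasets are):

# Inspect the resulting chunk layout with plain xarray
print(ds_optimized.chunks)                         # per-dimension chunk sizes
print(ds_optimized['temperature'].data.chunksize)  # chunk shape of the underlying dask array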

Advanced Chunking Strategies

For more sophisticated optimization:

# Advanced optimization with operation-specific tuning
ds_advanced = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='timeseries',        # 'timeseries', 'spatial', 'statistical', or 'general'
    performance_priority='balanced',    # 'memory', 'speed', 'balanced'
    memory_limit_gb=8.0,               # Set memory limit
    use_disk_chunks=True               # Preserve spatial disk chunks
)

Operation-Specific Optimization

Different analysis types benefit from different chunking strategies:

Time Series Analysis:

# Optimize for time series operations
ds_ts = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='timeseries',
    performance_priority='memory'
)

# Or use the dedicated method
ds_ts = ds.climate_timeseries.optimize_for_decomposition()

Trend Analysis:

# Optimize for trend calculations
ds_trends = ds.climate_trends.optimize_for_trends(
    variable='temperature',
    use_case='spatial_trends'
)

Spatial Analysis:

# Optimize for spatial operations and plotting
ds_spatial = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='spatial',
    performance_priority='speed'
)

📊 Performance Analysis

Chunking Analysis Tools

Analyze your current chunking strategy:

# Print detailed chunking information
ds.climate_timeseries.print_chunking_info(detailed=True)

# Get chunking recommendations for different use cases
ds.climate_timeseries.analyze_chunking_strategy()

Example output:

Climate Data Chunking Analysis
================================================

Recommended chunking strategies:

Time Series:
  Target: 25 MB chunks
  Max: 100 MB chunks
  Chunks: {'time': 48, 'lat': 73, 'lon': 144}
  Use: Optimized for time series analysis with smaller chunks

Spatial Analysis:
  Target: 100 MB chunks
  Max: 500 MB chunks
  Chunks: {'time': 12, 'lat': 145, 'lon': 288}
  Use: Larger chunks for spatial operations and mapping
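
The printed chunk dictionaries can also be applied manually with xarray's standard chunk method if you want full control over the layout:

# Apply the recommended time-series layout directly (plain xarray)
ds_ts = ds.chunk({'time': 48, 'lat': 73, 'lon': 144})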

Memory Management

Monitor and control memory usage:

# Check system memory
from climate_diagnostics.utils.chunking_utils import get_system_memory_info

memory_info = get_system_memory_info()
print(f"Available memory: {memory_info['available']:.1f} GB")

# Optimize for memory-constrained systems
ds_memory = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='general',
    performance_priority='memory',
    memory_limit_gb=4.0  # Limit to 4 GB
)

🎯 Best Practices by Data Type

Daily Data (High Frequency)

# For daily data (365+ time steps per year)
ds_daily = ds.climate_timeseries.optimize_chunks(
    target_mb=75,
    time_freq='daily'
)

Monthly Data (Standard Climate)

# For monthly data (12 time steps per year)
ds_monthly = ds.climate_timeseries.optimize_chunks(
    target_mb=50,
    time_freq='monthly'
)

High-Resolution Spatial Data

# For high-resolution grids (>1000x1000)
ds_hires = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='spatial',
    performance_priority='memory',
    memory_limit_gb=8.0
)

🔍 Troubleshooting Performance Issues

Common Issues and Solutions

Memory Errors:

# Reduce chunk sizes
ds_safe = ds.climate_timeseries.optimize_chunks_advanced(
    performance_priority='memory',
    memory_limit_gb=2.0  # Conservative limit
)

Slow Processing:

# Increase chunk sizes for speed
ds_fast = ds.climate_timeseries.optimize_chunks_advanced(
    performance_priority='speed',
    operation_type='spatial'
)

Poor Parallelization:

# Ensure sufficient chunks for parallel processing
ds_parallel = ds.climate_timeseries.optimize_chunks_advanced(
    operation_type='general',
    memory_limit_gb=16.0  # Higher limit allows more chunks to be held in memory concurrently
)
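
To verify that a layout actually leaves enough blocks for parallel execution, you can compare the Dask block count of a variable against your core count. This is a minimal sketch using standard Dask and Python attributes, assuming the optimized dataset is Dask-backed; 'temperature' is a placeholder variable name:

import os
import numpy as np

# Count dask blocks for one variable and compare with available cores
n_blocks = int(np.prod(ds_parallel['temperature'].data.numblocks))
n_cores = os.cpu_count()
if n_blocks < n_cores:
    print(f"Only {n_blocks} chunks for {n_cores} cores -- consider smaller chunks")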

📈 Performance Monitoring

Track Performance Improvements

import time
from dask.diagnostics import ProgressBar

# Time operations with different chunking strategies
def time_operation(dataset, operation_name):
    start = time.time()
    with ProgressBar():
        result = dataset.air.mean(['lat', 'lon']).compute()  # 'air' is the example variable name
    end = time.time()
    print(f"{operation_name}: {end - start:.2f} seconds")
    return result

# Compare performance (ds_original is the dataset before chunk optimization)
time_operation(ds_original, "Original chunking")
time_operation(ds_optimized, "Optimized chunking")
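
To capture memory use alongside wall time, Dask's local diagnostics can wrap the same computation. A sketch using ResourceProfiler (requires the psutil package and the default local scheduler):

from dask.diagnostics import ResourceProfiler

# Record memory and CPU usage while the computation runs
with ResourceProfiler(dt=0.25) as rprof:
    ds_optimized.air.mean(['lat', 'lon']).compute()

# rprof.results holds (time, mem, cpu) samples; mem is reported in MB
peak_mb = max(sample.mem for sample in rprof.results)
print(f"Peak memory during computation: {peak_mb:.0f} MB")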

Real-World Examples

Large Climate Model Output:

# For CMIP6-style data (>10 GB files)
ds_cmip = xr.open_dataset("cmip6_tas_daily.nc")
ds_cmip_opt = ds_cmip.climate_timeseries.optimize_chunks_advanced(
    operation_type='timeseries',
    performance_priority='balanced',
    memory_limit_gb=12.0,
    variable='tas'
)

Observational Gridded Data:

# For observational products (ERA5, etc.)
ds_obs = xr.open_dataset("era5_temperature.nc")
ds_obs_opt = ds_obs.climate_timeseries.optimize_chunks(
    target_mb=100,
    time_freq='hourly',
    variable='t2m'
)
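
For very large files, it can also pay to apply chunking when the dataset is opened so nothing is read eagerly before optimization. This uses xarray's standard chunks argument rather than a toolkit method:

# Open lazily with dask chunks; "auto" lets dask pick a reasonable default size
ds_obs = xr.open_dataset("era5_temperature.nc", chunks="auto")

# The toolkit optimizer can then refine the layout without loading data
ds_obs_opt = ds_obs.climate_timeseries.optimize_chunks(target_mb=100, variable='t2m')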

🎛️ Advanced Configuration

Custom Chunking Strategies

For specialized use cases, you can create custom chunking:

from climate_diagnostics.utils.chunking_utils import (
    calculate_optimal_chunks_from_disk,
    dynamic_chunk_calculator
)

# Custom disk-aware chunking
custom_chunks = calculate_optimal_chunks_from_disk(
    ds,
    target_mb=150,
    variable='precipitation'
)
ds_custom = ds.chunk(custom_chunks)

# Dynamic chunking with custom parameters
adaptive_chunks = dynamic_chunk_calculator(
    ds,
    operation_type='statistical',
    memory_limit_gb=6.0,
    performance_priority='speed'
)
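
Assuming dynamic_chunk_calculator returns a chunk dictionary (like calculate_optimal_chunks_from_disk above), the result is not applied automatically; pass it to xarray's standard chunk method to use it:

# Apply the computed layout with plain xarray
ds_adaptive = ds.chunk(adaptive_chunks)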

See Also