🌌 Maya4 - Multi-Level SAR Dataset 📡

Made with Love By

Roberto Del Prete, Gabriele Daga, Nicolas Longépé
ESA Φ-lab, European Space Research Institute (ESRIN), Frascati, Italy

🎯 Overview

🌌 Maya4 📡 is a project dedicated to curating and providing multi-level intermediate SAR representations from Sentinel-1 acquisitions, spanning the entire chain from Level 0 to Level 1.

The name Maya4 draws inspiration from the Māyā veil in philosophy, where reality is hidden behind successive layers—just as radar echoes undergo transformations before forming a final SAR image.

Key Features

🎚️ Multi-Level Access

Complete processing chain from raw echoes to focused imagery (raw → rc → rcmc → az)

🚀 Performance

Zarr-based storage with intelligent chunk caching and lazy loading

☁️ Cloud-Native

Native HuggingFace Hub integration with 68TB+ of curated data

📊 ML-Ready

PyTorch-compatible dataloaders optimized for pre-training workflows

Quick Start

from maya4 import get_sar_dataloader

# Create a basic dataloader
loader = get_sar_dataloader(
    data_dir='./data',
    level_from='rcmc',
    level_to='az',
    batch_size=16,
    patch_size=(1000, 1000),
    online=True
)

# Iterate through batches
for x_batch, y_batch in loader:
    # x_batch: input (rcmc)
    # y_batch: target (az)
    pass

📦 Installation

Quick Install

🚀 Recommended: Install from PyPI

The easiest way to get started with Maya4 is through pip:

# Install from PyPI (recommended)
pip install maya4

Alternative Installation Methods

📦 Using PDM

For development with PDM package manager:

pdm install

🔧 Development Mode

For local development and editable install:

pip install -e .

Environment-Specific Installation

📓 Jupyter Environment

Includes Jupyter notebook and lab dependencies for interactive development

pdm install -G jupyter_env

🗺️ Geospatial Features

Adds geographic processing tools and coordinate system support

pdm install -G geospatial

🛠️ Development Setup

Installs testing, linting, and development utilities

pdm install -G dev

🌟 Complete Installation

Installs all optional dependencies for full functionality

pdm install -G :all

Requirements

| Package | Version | Purpose |
|---|---|---|
| Python | 3.8+ | Core interpreter |
| PyTorch | 2.0+ | Deep learning framework |
| zarr | Latest | Cloud-native array storage |
| huggingface_hub | Latest | Dataset downloading |
| CUDA | Optional | GPU acceleration |

Verification

✅ Test Your Installation

After installation, verify that Maya4 is working correctly:

import maya4
from maya4 import get_sar_dataloader

# Print version
print(f"Maya4 version: {maya4.__version__}")

# Test dataloader creation
loader = get_sar_dataloader(
    data_dir='./data',
    level_from='rcmc',
    level_to='az',
    max_products=1,
    online=True,
    verbose=True
)
print("✓ Installation successful!")

Troubleshooting

🔒 HuggingFace Authentication

For private datasets, authenticate with HuggingFace:

huggingface-cli login

💾 Disk Space

Each SAR product is ~20GB. Ensure adequate storage for your use case. Online mode downloads chunks on-demand.

🐛 Issues?

Found a bug or need help? Open an issue on our GitHub repository.

🆕 Updates

Keep Maya4 up to date:

pip install --upgrade maya4

⚙️ Parameters Reference

Data Source Configuration

data_dir

str

Local directory for storing/loading SAR products.
Products are organized as: data_dir/[part]/[product_name].zarr/.
This is where Maya4 will look for local Zarr files or download new products when online mode is enabled.

online

bool

Enable automatic downloading from HuggingFace Hub.
True: downloads missing products from Maya4 HF organization (requires HF authentication for private datasets)
False: only uses locally available products.
In online mode, only required chunks are downloaded on-demand (lazy loading), metadata downloaded first.

max_products

int | None

Maximum number of SAR products to load. Useful for testing or limiting dataset size.
Set to None to load all available products matching filters.
Start with 1 for prototyping, then scale up.

filters

SampleFilter

Metadata-based filtering by year, polarization, stripmap mode, and geographic location.
years: [2023, 2024] for acquisition years
polarizations: ["hh", "hv", "vh", "vv"] where H=horizontal, V=vertical
stripmap_modes: [1,2,3] for narrow swaths/higher resolution, [4,5,6] for wider swaths
parts: ["PT1", "PT2", etc.] for geographic regions.

Processing Levels

level_from

str

Input processing level: "raw" (L0), "rc", "rcmc", or "az" (L1). This is X in your training pair.

level_to

str

Target processing level: "raw", "rc", "rcmc", or "az". This is Y in your training pair.

Patch Extraction

patch_mode

str

How to extract patches from the full SAR image. "rectangular": Extract rectangular patches (most common). Other modes may be added in future versions.

patch_size

tuple(int, int)

(height, width) of extracted patches in pixels.
(H, W): fixed size
(H, -1): full width, fixed height (entire rows)
(-1, W): full height, fixed width (entire columns)
(-1, -1): entire image (not recommended)
Example: (1, 1000) extracts 1×1000 horizontal slices.

stride

tuple(int, int)

(vertical, horizontal) step size between patches.
stride < patch_size: overlapping patches
stride = patch_size: non-overlapping tiles
stride > patch_size: skip regions
Example: (1000, 1000) with patch_size (1000, 1000) = no overlap.

buffer

tuple(int, int)

(vertical, horizontal) buffer zones at image boundaries.
Excludes this many pixels from edges to avoid boundary artifacts.
Example: (1000, 1000) excludes 1000 pixels from each edge.
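To see how patch_size, stride, and buffer interact, here is a small illustrative helper (count_patches is a hypothetical name, not part of the Maya4 API) that computes how many patches fit along each axis:

```python
def count_patches(image_size, patch_size, stride, buffer=(0, 0)):
    """How many rectangular patches fit along each axis.

    Illustrative sketch only: excludes `buffer` pixels at both edges,
    then steps by `stride`; -1 means "use the full extent".
    """
    counts = []
    for dim, p, s, b in zip(image_size, patch_size, stride, buffer):
        usable = dim - 2 * b
        counts.append(1 if p == -1 else (usable - p) // s + 1)
    return tuple(counts)

# 20000x20000 image, 1000x1000 patches, 500-pixel stride, 1000-pixel buffer
print(count_patches((20000, 20000), (1000, 1000), (500, 500), (1000, 1000)))
# (35, 35)
```

With stride equal to patch_size, the same call yields (18, 18): the patches tile the usable area without overlap.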

patch_order

str

Order in which patches are extracted.
"row": left→right, top→bottom (horizontal raster)
"col": top→bottom, left→right (vertical raster)
"chunk": follows Zarr storage chunks (I/O efficient, typically fastest).

max_base_sample_size

tuple(int, int) | None

(height, width) maximum size for base samples.
When concatenate_patches=True, limits the size of concatenated blocks.
Helps manage memory usage. Set to (-1,-1) for no limit.

block_pattern

tuple(int, int) | None

(vertical_blocks, horizontal_blocks) for block sampling.
Divides each product into a grid and samples within blocks, ensuring representative coverage.
Example: (32, -1) = 32 vertical blocks. Set to None for standard extraction.

shuffle_files

bool

Shuffle the order of SAR products.
True: random product order each epoch (recommended for training)
False: deterministic order (useful for reproducibility).

use_balanced_sampling

bool

Balance samples across geographic locations using K-means clustering on lat/lon.
Requires sklearn and ~10+ products.
True: equal representation from different areas
False: no geographic balancing.

Data Representation

complex_valued

bool

Return data as complex numbers or separate real/imag channels.
True: Complex64 tensors (native SAR representation, single channel)
False: Float32 tensors with separate channels (shape changes from (B, H, W) to (B, 2, H, W)).
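The relationship between the two representations can be sketched with plain PyTorch (this mirrors the shape change described above; it is not Maya4 code):

```python
import torch

# A toy complex SAR patch (batch of 2, 4x4 pixels)
x = torch.randn(2, 4, 4, dtype=torch.complex64)

# complex_valued=False corresponds to splitting into real/imag channels:
# (B, H, W) complex -> (B, 2, H, W) float
x_split = torch.view_as_real(x).permute(0, 3, 1, 2).contiguous()
print(x_split.shape, x_split.dtype)  # torch.Size([2, 2, 4, 4]) torch.float32
```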

positional_encoding

bool

Add spatial position information to samples.
Appends normalized (row, col) coordinates as additional channels.
Useful for transformer models and spatial-aware architectures.
True: adds 2 channels
False: only SAR data.
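A minimal sketch of what such an encoding looks like, assuming coordinates normalized to [0, 1] (the helper name is hypothetical, not Maya4's implementation):

```python
import torch

def add_positional_channels(patch):
    """Append normalized (row, col) coordinate channels to a (C, H, W) patch.
    Illustrative only; mirrors the behaviour described above."""
    _, h, w = patch.shape
    rows = torch.linspace(0, 1, h).view(h, 1).expand(h, w)
    cols = torch.linspace(0, 1, w).view(1, w).expand(h, w)
    return torch.cat([patch, rows.unsqueeze(0), cols.unsqueeze(0)], dim=0)

x = torch.zeros(2, 8, 8)                 # e.g. real/imag channels
print(add_positional_channels(x).shape)  # torch.Size([4, 8, 8])
```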

concatenate_patches

bool

Stack multiple patches into larger samples for sequence learning.
True: concatenate patches along specified axis
False: return individual patches.

concat_axis

int

Axis along which to concatenate patches. Only used when concatenate_patches=True.
0: vertical concatenation (stack vertically)
1: horizontal concatenation (stack horizontally).
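The effect of the two axis choices can be seen with a plain torch.cat sketch (illustrative shapes, not Maya4 internals):

```python
import torch

patches = [torch.zeros(100, 1) for _ in range(8)]  # eight (100, 1) column patches

tall = torch.cat(patches, dim=0)  # concat_axis=0: vertical stack   -> (800, 1)
wide = torch.cat(patches, dim=1)  # concat_axis=1: horizontal stack -> (100, 8)
print(tall.shape, wide.shape)
```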

transform

SARTransform

Transformation pipeline defining normalization for each processing level.
SARTransform can specify: transform_raw (Level 0), transform_rc (Range Compressed), transform_rcmc (RCMC), transform_az (focused).
Use minmax_normalize, z-score, robust, or custom functions. Set to None to disable normalization.

DataLoader Configuration

batch_size

int

Number of samples per batch. Standard PyTorch DataLoader parameter.
Typical values: 8-32 for large patches, 64-128 for small patches.
Adjust based on GPU memory.

num_workers

int

Number of parallel workers for data loading.
0: load in main process (good for debugging)
>0: use multiprocessing (faster but harder to debug)
Recommended: 2-4 for most use cases.

samples_per_prod

int

Number of patches to extract per SAR product.
Controls how many samples each product contributes per epoch.
Higher values = more thorough coverage but longer epochs.
Typical: 100-1000.

cache_size

int

Number of products to keep in memory cache.
Larger cache = fewer disk reads but more RAM usage.
Recommended: 10-100 depending on available RAM.
Uses LRU caching at chunk level for optimal performance.
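The chunk-level LRU behaviour described above can be sketched as follows (a toy cache for illustration, not the Maya4 implementation):

```python
from collections import OrderedDict

class ChunkCache:
    """Toy LRU cache illustrating chunk-level caching (hypothetical helper)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key, loader):
        if key in self._store:
            self._store.move_to_end(key)     # cache hit: mark as recently used
            return self._store[key]
        value = loader(key)                  # cache miss: load from disk/Hub
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        return value

cache = ChunkCache(capacity=2)
cache.get("a", lambda k: k.upper())
cache.get("b", lambda k: k.upper())
cache.get("a", lambda k: k.upper())          # refreshes "a"
cache.get("c", lambda k: k.upper())          # evicts "b"
print(list(cache._store))                    # ['a', 'c']
```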

backend

str

Storage format for SAR products.
"zarr": Zarr format (default, supports cloud streaming).
Zarr provides scalable, chunked storage with lazy loading capabilities.

verbose

bool

Print detailed information during initialization and loading.
True: show download progress, cache info, debugging messages
False: minimal output (recommended for training).

save_samples

bool

Save extracted patches to disk for inspection.
Useful for debugging and visualizing what the model sees.
True: save patches as image files
False: no saving (recommended for training).

📋 Configuration Examples

🎯 Fast Prototyping

Quick testing with a single product

loader = get_sar_dataloader(
    data_dir='./data',
    level_from='rcmc',
    level_to='az',
    batch_size=4,
    patch_size=(1000, 1000),
    max_products=1,
    online=True,
    verbose=True
)

🚀 Full Training

Production-ready multi-product training

loader = get_sar_dataloader(
    data_dir='./data',
    level_from='rcmc',
    level_to='az',
    batch_size=32,
    num_workers=4,
    patch_size=(1000, 1000),
    stride=(500, 500),
    shuffle_files=True,
    samples_per_prod=1000,
    cache_size=50,
    online=True
)

🔬 Vertical Slice Analysis

Azimuth direction analysis with vertical slices

loader = get_sar_dataloader(
    data_dir='./data',
    level_from='rcmc',
    level_to='az',
    patch_size=(1, -1),  # Full width
    patch_order='row',
    complex_valued=True
)

🎓 Sequence Learning

Transformer-ready with positional encoding

loader = get_sar_dataloader(
    data_dir='./data',
    level_from='rcmc',
    level_to='az',
    patch_size=(100, 1),
    patch_order='col',
    complex_valued=False,
    positional_encoding=True,
    concatenate_patches=True,
    concat_axis=0
)

🎚️ Processing Levels

SAR Processing Pipeline

Maya4 exposes the complete SAR processing chain through intermediate signal representations. Each level represents a step in transforming raw radar echoes into focused SAR imagery.

| Level Pair | Task | Description | Complexity |
|---|---|---|---|
| raw → rc | Range Compression | Learn how radar echoes are compressed in the range direction | Low (1D processing) |
| rc → rcmc | Range Cell Migration Correction | Learn to correct for target motion during acquisition | Medium (geometric correction) |
| rcmc → az | Azimuth Compression (Focusing) | Learn the final focusing step to create the SAR image | High (2D focusing) |
| raw → rcmc | Combined RC + RCMC | Multi-stage processing in one step | Medium-High (combined operations) |
| raw → az | End-to-End SAR Processing | Complete processing chain from echoes to image | Very High (full focusing pipeline) |

Level Characteristics

🔴 raw (Level 0)

Raw radar echoes as received by the satellite. Unprocessed time-domain signal with range and azimuth dimensions.

🟡 rc (Range Compressed)

After pulse compression in range direction. Improves range resolution by correlating with transmitted chirp.

🟠 rcmc (Range Cell Migration Corrected)

Corrected for range cell migration caused by platform motion. Straightens signal trajectory.

🟢 az (Azimuth Focused)

Fully focused SAR image after azimuth compression. This is the final Level 1 product.
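To make the rc step concrete: range compression is a matched filter against the transmitted chirp, which can be sketched in a few lines of NumPy (toy signal parameters, not Sentinel-1's actual chirp):

```python
import numpy as np

# Toy linear FM chirp over n range samples
n = 1024
t = np.linspace(-0.5, 0.5, n)
chirp = np.exp(1j * np.pi * 200 * t**2)

# A single point target: the chirp delayed by 300 samples in range
echo = 0.5 * np.roll(chirp, 300)

# Matched filter: correlate the echo with the chirp via FFTs
rc = np.fft.ifft(np.fft.fft(echo) * np.conj(np.fft.fft(chirp)))

# The compressed response peaks at the target's delay
print(int(np.argmax(np.abs(rc))))  # 300
```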

🎲 Sampling Strategies & Data Processing

🔍 Data Filtering with SampleFilter

Selective Product Loading

The SampleFilter class allows you to precisely select which SAR products to include in your dataset based on acquisition metadata.

This is essential for creating focused datasets that match your research requirements.

📅 Year Filter

Parameter: years=[2023, 2024]

Select products by acquisition year.

Useful for temporal analysis, studying seasonal changes, or ensuring temporal consistency in training data.

📡 Polarization Filter

Parameter: polarizations=["hh", "vv"]

Options: "hh", "hv", "vh", "vv"
HH: Horizontal transmit/receive (co-pol)
VV: Vertical transmit/receive (co-pol)
HV/VH: Cross-polarization (depolarization)

📶 Stripmap Mode Filter

Parameter: stripmap_modes=[1, 2, 3]

Sentinel-1 beam modes (1-6):
Modes 1-3: Narrow swaths, higher resolution
Modes 4-6: Wide swaths, lower resolution

🗺️ Geographic Parts Filter

Parameter: parts=["PT1", "PT2"]

Maya4 organizes products into geographic partitions (PT1, PT2, PT3, PT4).

Filter by region to focus on specific areas of interest or ensure geographic diversity.

from maya4 import SampleFilter

# Create a filter for HH polarization data from 2023-2024
# in high-resolution modes from the PT1 region
filters = SampleFilter(
    years=[2023, 2024],        # Only recent acquisitions
    polarizations=["hh"],      # Co-polarized horizontal
    stripmap_modes=[1, 2, 3],  # High-resolution modes only
    parts=["PT1"]              # Specific geographic region
)

loader = get_sar_dataloader(
    ...,
    filters=filters  # Apply the filter
)

📊 Normalization with SARTransform

Level-Specific Data Normalization

The SARTransform class provides processing-level-aware transformations.

Each SAR processing level (raw, rc, rcmc, az) has different signal characteristics and dynamic ranges, requiring tailored normalization strategies.

🎯 Why Level-Specific Normalization?

Each processing level represents different signal stages:
Raw/RC/RCMC: Compressed signal domain (similar ranges)
Azimuth (AZ): Focused image domain (different range)

📈 Transform Functions

Each transform is a callable function applied to tensors:
transform_raw: Applied to Level 0 data
transform_rc: Applied to range compressed
transform_rcmc: Applied to RCMC data
transform_az: Applied to focused data

🔧 Min-Max Normalization

Maya4 provides pre-computed statistics:
RC_MIN, RC_MAX: For compressed levels
GT_MIN, GT_MAX: For ground truth (AZ)

These ensure consistent normalization across the dataset.

⚡ Custom Transforms

You can provide any callable transformation:
• Standard scaling (z-score)
• Log scaling for amplitude
• Custom domain-specific preprocessing
• Augmentation pipelines

from maya4 import SARTransform, minmax_normalize
from maya4 import RC_MIN, RC_MAX, GT_MIN, GT_MAX
import functools

# Create level-specific normalization transforms
transforms = SARTransform(
    # Raw data normalization (Level 0)
    transform_raw=functools.partial(
        minmax_normalize,
        array_min=RC_MIN,  # Pre-computed minimum for compressed data
        array_max=RC_MAX   # Pre-computed maximum for compressed data
    ),
    # Range Compressed normalization
    transform_rc=functools.partial(
        minmax_normalize,
        array_min=RC_MIN,
        array_max=RC_MAX
    ),
    # RCMC normalization (same range as RC)
    transform_rcmc=functools.partial(
        minmax_normalize,
        array_min=RC_MIN,
        array_max=RC_MAX
    ),
    # Azimuth focused normalization (different range!)
    transform_az=functools.partial(
        minmax_normalize,
        array_min=GT_MIN,  # Different statistics for focused data
        array_max=GT_MAX
    )
)

loader = get_sar_dataloader(
    ...,
    transform=transforms  # Apply level-aware transforms
)

💡 Pro Tip: Custom Normalization

For custom normalization strategies, you can pass any callable or create your own SARTransform:

# Example: Custom log-scale normalization for amplitude
def log_normalize(x):
return torch.log(torch.abs(x) + 1e-10)

# Example: Standard scaling (z-score)
def standard_scale(x, mean, std):
return (x - mean) / std

# Apply to all levels
transforms = SARTransform(
transform_rcmc=log_normalize,
transform_az=functools.partial(standard_scale, mean=0.5, std=0.2)
)

Patch Order Comparison

| Strategy | Description | Use Case | I/O Efficiency |
|---|---|---|---|
| row | Left→right, top→bottom (horizontal raster scan) | Sequential horizontal features, coherent spatial ordering | Medium |
| col | Top→bottom, left→right (vertical raster scan) | Range direction analysis, vertical features | Medium |
| chunk | Follows Zarr storage chunks | Maximum I/O performance, minimizes cache misses | High ⚡ |

Block Pattern Sampling

Stratified Coverage

Block pattern divides each product into a grid of blocks and samples within each block.

This ensures representative sampling across the entire SAR product, capturing diverse spatial characteristics.

loader = get_sar_dataloader(
    ...,
    block_pattern=(32, 32),  # 32×32 grid of blocks
    patch_order='row'        # Within each block
)

Advanced Features

Patch Concatenation

Stack multiple patches into larger samples for sequence learning.

Specify axis: 0 (vertical) or 1 (horizontal).

Balanced Sampling

Use K-means clustering on lat/lon to ensure geographic diversity.

Requires sklearn and ~10+ products.
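Assuming product centroids are available as lat/lon pairs, the balancing idea can be sketched with scikit-learn's KMeans; this is an illustration of the strategy, not the Maya4 internals:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic product centroids: many from one region, few from another
rng = np.random.default_rng(0)
latlon = np.vstack([
    rng.normal((45, 10), 0.5, (20, 2)),   # an over-represented region
    rng.normal((-5, 120), 0.5, (4, 2)),   # an under-represented region
])

# Cluster products geographically
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(latlon)

# Draw the same number of products from each geographic cluster
per_cluster = 3
balanced = np.concatenate(
    [np.flatnonzero(labels == c)[:per_cluster] for c in np.unique(labels)]
)
print(len(balanced))  # 6
```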

Lazy Loading

In online mode, only required chunks are downloaded on-demand.

Metadata downloaded first, data chunks as needed.
