🌌 Maya4 📡 is a project dedicated to curating and providing multi-level intermediate SAR representations from Sentinel-1 acquisitions, spanning the entire chain from Level 0 to Level 1.
The name Maya4 draws inspiration from the veil of Māyā in philosophy, behind which reality is hidden in successive layers, just as radar echoes pass through successive transformations before forming a final SAR image.
Complete processing chain from raw echoes to focused imagery (raw → rc → rcmc → az)
Zarr-based storage with intelligent chunk caching and lazy loading
Native HuggingFace Hub integration with 68TB+ of curated data
PyTorch-compatible dataloaders optimized for pre-training workflows
The easiest way to get started with Maya4 is through pip:
For development with PDM package manager:
For local development and editable install:
Includes Jupyter notebook and lab dependencies for interactive development
Adds geographic processing tools and coordinate system support
Installs testing, linting, and development utilities
Installs all optional dependencies for full functionality
| Package | Version | Purpose |
|---|---|---|
| Python | 3.8+ | Core interpreter |
| PyTorch | 2.0+ | Deep learning framework |
| zarr | Latest | Cloud-native array storage |
| huggingface_hub | Latest | Dataset downloading |
| CUDA | Optional | GPU acceleration |
After installation, verify that Maya4 is working correctly:
For private datasets, authenticate with HuggingFace:
Each SAR product is ~20GB. Ensure adequate storage for your use case. Online mode downloads chunks on-demand.
Found a bug or need help? Open an issue on our GitHub repository.
Keep Maya4 up to date:
Local directory for storing/loading SAR products.
Products are organized as: data_dir/[part]/[product_name].zarr/.
This is where Maya4 will look for local Zarr files or download new products when online mode is enabled.
Enable automatic downloading from HuggingFace Hub.
True: downloads missing products from Maya4 HF organization (requires HF authentication for private datasets)
False: only uses locally available products.
In online mode, metadata is downloaded first and only the required data chunks are fetched on demand (lazy loading).
Maximum number of SAR products to load. Useful for testing or limiting dataset size.
Set to None to load all available products matching filters.
Start with 1 for prototyping, then scale up.
Metadata-based filtering by year, polarization, stripmap mode, and geographic location.
years: [2023, 2024] for acquisition years
polarizations: ["hh", "hv", "vh", "vv"] where H=horizontal, V=vertical
stripmap_modes: [1, 2, 3] for narrow swaths with higher resolution, [4, 5, 6] for wider swaths with lower resolution
parts: ["PT1", "PT2", etc.] for geographic regions.
Input processing level: "raw" (L0), "rc", "rcmc", or "az" (L1). This is X in your training pair.
Target processing level: "raw", "rc", "rcmc", or "az". This is Y in your training pair.
How to extract patches from the full SAR image. "rectangular": Extract rectangular patches (most common). Other modes may be added in future versions.
(height, width) of extracted patches in pixels.
(H, W): fixed size
(H, -1): full width, fixed height (entire rows)
(-1, W): full height, fixed width (entire columns)
(-1, -1): entire image (not recommended)
Example: (1, 1000) extracts 1×1000 horizontal slices.
(vertical, horizontal) step size between patches.
stride < patch_size: overlapping patches
stride = patch_size: non-overlapping tiles
stride > patch_size: regions between patches are skipped
Example: (1000, 1000) with patch_size (1000, 1000) = no overlap.
(vertical, horizontal) buffer zones at image boundaries.
Excludes this many pixels from edges to avoid boundary artifacts.
Example: (1000, 1000) excludes 1000 pixels from each edge.
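A minimal NumPy sketch of rectangular patch extraction combining patch_size, stride, and border offsets (illustrative only; the actual Maya4 implementation may differ):

```python
import numpy as np

def extract_patches(img, patch_size, stride, border=(0, 0)):
    """Yield (top-left corner, patch) pairs from a 2-D image,
    excluding a border of pixels at every edge."""
    H, W = img.shape
    ph, pw = patch_size
    sv, sh = stride
    bv, bh = border
    for r in range(bv, H - bv - ph + 1, sv):
        for c in range(bh, W - bh - pw + 1, sh):
            yield (r, c), img[r:r + ph, c:c + pw]

img = np.arange(100 * 100).reshape(100, 100)
# Non-overlapping 20x20 tiles, skipping a 10-pixel border on every edge:
# the 80x80 interior yields a 4x4 grid of 16 tiles
patches = list(extract_patches(img, (20, 20), (20, 20), border=(10, 10)))
```

With stride smaller than patch_size the same generator produces overlapping patches, and a larger stride skips regions between them.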
Order in which patches are extracted.
"row": left→right, top→bottom (horizontal raster)
"col": top→bottom, left→right (vertical raster)
"chunk": follows Zarr storage chunks (I/O efficient, typically fastest).
(height, width) maximum size for base samples.
When concatenate_patches=True, limits the size of concatenated blocks.
Helps manage memory usage. Set to (-1, -1) for no limit.
(vertical_blocks, horizontal_blocks) for block sampling.
Divides each product into a grid and samples within blocks, ensuring representative coverage.
Example: (32, -1) = 32 vertical blocks. Set to None for standard extraction.
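The block-sampling idea can be sketched as follows: divide the image into a grid and draw one random patch per block, so every region of the product is represented (a simplified stand-in for the real sampler):

```python
import numpy as np

rng = np.random.default_rng(0)

def block_sample(img, grid, patch_size):
    """Sample one random patch per grid block for representative coverage."""
    H, W = img.shape
    gv, gh = grid
    ph, pw = patch_size
    bh, bw = H // gv, W // gh          # block dimensions
    samples = []
    for i in range(gv):
        for j in range(gh):
            # random top-left corner, kept inside the current block
            r0 = i * bh + rng.integers(0, max(1, bh - ph + 1))
            c0 = j * bw + rng.integers(0, max(1, bw - pw + 1))
            samples.append(img[r0:r0 + ph, c0:c0 + pw])
    return samples

img = np.zeros((256, 256))
samples = block_sample(img, grid=(4, 4), patch_size=(16, 16))  # 16 patches
```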
Shuffle the order of SAR products.
True: random product order each epoch (recommended for training)
False: deterministic order (useful for reproducibility).
Balance samples across geographic locations using K-means clustering on lat/lon.
Requires scikit-learn and roughly 10 or more products.
True: equal representation from different areas
False: no geographic balancing.
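A sketch of the balancing idea with scikit-learn: cluster product centers on (lat, lon), then draw the same number of products from every cluster. The coordinates below are fabricated for illustration; real ones would come from product metadata.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two fabricated geographic groups of 10 products each
latlon = np.vstack([
    rng.normal((45, 8), 0.5, (10, 2)),     # products over Europe
    rng.normal((-33, 151), 0.5, (10, 2)),  # products over Australia
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(latlon)

# Balance: take an equal number of product indices from every cluster
per_cluster = min(np.bincount(labels))
balanced = np.concatenate(
    [np.flatnonzero(labels == k)[:per_cluster] for k in range(2)]
)
```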
Return data as complex numbers or separate real/imag channels.
True: Complex64 tensors (native SAR representation, single channel)
False: Float32 tensors with separate channels (shape changes from (B, H, W) to (B, 2, H, W)).
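The channel split can be illustrated in NumPy (in PyTorch the equivalent is `torch.view_as_real` followed by a permute to channels-first):

```python
import numpy as np

# A complex64 SAR patch of shape (H, W) — a single complex channel
patch = (np.arange(6, dtype=np.float32)
         + 1j * np.ones(6, dtype=np.float32)).astype(np.complex64).reshape(2, 3)

# complex_valued=False: split into two float32 channels, shape (2, H, W)
as_channels = np.stack([patch.real, patch.imag]).astype(np.float32)
```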
Add spatial position information to samples.
Appends normalized (row, col) coordinates as additional channels.
Useful for transformer models and spatial-aware architectures.
True: adds 2 channels
False: only SAR data.
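One way to build such position channels (a sketch, not necessarily how Maya4 does it internally): append normalized row and column coordinate grids to the patch.

```python
import numpy as np

def add_position_channels(patch):
    """Append normalized (row, col) coordinate grids as extra channels."""
    H, W = patch.shape
    rows = np.linspace(0.0, 1.0, H, dtype=np.float32)[:, None].repeat(W, axis=1)
    cols = np.linspace(0.0, 1.0, W, dtype=np.float32)[None, :].repeat(H, axis=0)
    return np.stack([patch, rows, cols])  # (3, H, W): data + 2 position channels

x = np.zeros((4, 8), dtype=np.float32)
y = add_position_channels(x)  # shape (3, 4, 8)
```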
Stack multiple patches into larger samples for sequence learning.
True: concatenate patches along specified axis
False: return individual patches.
Axis along which to concatenate patches. Only used when concatenate_patches=True.
0: vertical concatenation (stack vertically)
1: horizontal concatenation (stack horizontally).
Transformation pipeline defining normalization for each processing level.
SARTransform can specify: transform_raw (Level 0), transform_rc (Range Compressed), transform_rcmc (RCMC), transform_az (focused).
Use minmax_normalize, z-score, robust, or custom functions. Set to None to disable normalization.
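A sketch of a level-aware transform container. The field names (transform_raw, transform_rc, transform_rcmc, transform_az) follow the documentation above, but the class body and the min-max bounds here are illustrative, not the library's actual implementation:

```python
import numpy as np

def minmax_normalize(x, lo, hi):
    """Scale values into [0, 1] using precomputed per-level statistics."""
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

class SARTransform:
    """Dispatches a per-level callable; None means no normalization."""
    def __init__(self, transform_raw=None, transform_rc=None,
                 transform_rcmc=None, transform_az=None):
        self._t = {"raw": transform_raw, "rc": transform_rc,
                   "rcmc": transform_rcmc, "az": transform_az}

    def __call__(self, x, level):
        fn = self._t[level]
        return x if fn is None else fn(x)

# Hypothetical bounds (0, 50) stand in for the real GT_MIN / GT_MAX stats
t = SARTransform(transform_az=lambda x: minmax_normalize(x, 0.0, 50.0))
out = t(np.array([25.0, 100.0]), "az")
```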
Number of samples per batch. Standard PyTorch DataLoader parameter.
Typical values: 8-32 for large patches, 64-128 for small patches.
Adjust based on GPU memory.
Number of parallel workers for data loading.
0: load in main process (good for debugging)
>0: use multiprocessing (faster but harder to debug)
Recommended: 2-4 for most use cases.
Number of patches to extract per SAR product.
Controls how many samples each product contributes per epoch.
Higher values = more thorough coverage but longer epochs.
Typical: 100-1000.
Number of products to keep in memory cache.
Larger cache = fewer disk reads but more RAM usage.
Recommended: 10-100 depending on available RAM.
Uses LRU caching at chunk level for optimal performance.
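The chunk-level LRU idea can be sketched with `functools.lru_cache`: repeated reads of the same chunk hit the cache instead of disk. The chunk loader below fabricates data; real code would slice a Zarr array.

```python
from functools import lru_cache
import numpy as np

CHUNK = 64
reads = {"count": 0}             # counts simulated disk/network reads

@lru_cache(maxsize=32)           # LRU cache keyed by chunk coordinates
def load_chunk(row_chunk, col_chunk):
    reads["count"] += 1
    # A real implementation would read this chunk from a Zarr store
    return np.full((CHUNK, CHUNK), row_chunk * 1000 + col_chunk)

load_chunk(0, 0)
load_chunk(0, 1)
load_chunk(0, 0)                 # cache hit: no new read
```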
Storage format for SAR products.
"zarr": Zarr format (default, supports cloud streaming).
Zarr provides scalable, chunked storage with lazy loading capabilities.
Print detailed information during initialization and loading.
True: show download progress, cache info, debugging messages
False: minimal output (recommended for training).
Save extracted patches to disk for inspection.
Useful for debugging and visualizing what the model sees.
True: save patches as image files
False: no saving (recommended for training).
Quick testing with a single product
Production-ready multi-product training
Azimuth direction analysis with vertical slices
Transformer-ready with positional encoding
Maya4 exposes the complete SAR processing chain through intermediate signal representations. Each level represents a step in transforming raw radar echoes into focused SAR imagery.
| Level Pair | Task | Description | Complexity |
|---|---|---|---|
| raw → rc | Range Compression | Learn how radar echoes are compressed in range direction | Low - 1D processing |
| rc → rcmc | Range Cell Migration Correction | Learn to correct for target motion during acquisition | Medium - Geometric correction |
| rcmc → az | Azimuth Compression (Focusing) | Learn final focusing step to create SAR image | High - 2D focusing |
| raw → rcmc | Combined RC + RCMC | Multi-stage processing in one step | Medium-High - Combined operations |
| raw → az | End-to-End SAR Processing | Complete processing chain from echoes to image | Very High - Full focusing pipeline |
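The level pairs above map directly onto (X, Y) training pairs. A hypothetical dataset sketch (class name and structure invented for illustration; the real Maya4 dataset reads patches from Zarr stores):

```python
import numpy as np

class LevelPairDataset:
    """Serve (input_level, target_level) pairs, e.g. raw -> az."""
    def __init__(self, products, input_level="raw", target_level="az"):
        self.products = products          # {name: {level: 2-D array}}
        self.keys = list(products)
        self.input_level = input_level
        self.target_level = target_level

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, i):
        p = self.products[self.keys[i]]
        return p[self.input_level], p[self.target_level]

# Fabricated stand-in for one product with two processing levels
fake = {"S1_demo": {"raw": np.zeros((8, 8)), "az": np.ones((8, 8))}}
ds = LevelPairDataset(fake, "raw", "az")
x, y = ds[0]   # X = raw echoes, Y = focused image
```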
Raw radar echoes as received by the satellite. Unprocessed time-domain signal with range and azimuth dimensions.
After pulse compression in range direction. Improves range resolution by correlating with transmitted chirp.
Corrected for range cell migration caused by platform motion. Straightens signal trajectory.
Fully focused SAR image after azimuth compression. This is the final Level 1 product.
The SampleFilter class allows you to precisely select which SAR products to include in your dataset based on acquisition metadata.
This is essential for creating focused datasets that match your research requirements.
Parameter: years=[2023, 2024]
Select products by acquisition year.
Useful for temporal analysis, studying seasonal changes, or ensuring temporal consistency in training data.
Parameter: polarizations=["hh", "vv"]
Options: "hh", "hv", "vh", "vv"
• HH: Horizontal transmit/receive (co-pol)
• VV: Vertical transmit/receive (co-pol)
• HV/VH: Cross-polarization (depolarization)
Parameter: stripmap_modes=[1, 2, 3]
Sentinel-1 beam modes (1-6):
• Modes 1-3: Narrow swaths, higher resolution
• Modes 4-6: Wide swaths, lower resolution
Parameter: parts=["PT1", "PT2"]
Maya4 organizes products into geographic partitions (PT1, PT2, PT3, PT4).
Filter by region to focus on specific areas of interest or ensure geographic diversity.
The SARTransform class provides processing-level-aware transformations.
Each SAR processing level (raw, rc, rcmc, az) has different signal characteristics and dynamic ranges, requiring tailored normalization strategies.
Each processing level represents different signal stages:
• Raw/RC/RCMC: Compressed signal domain (similar ranges)
• Azimuth (AZ): Focused image domain (different range)
Each transform is a callable function applied to tensors:
• transform_raw: Applied to Level 0 data
• transform_rc: Applied to range compressed
• transform_rcmc: Applied to RCMC data
• transform_az: Applied to focused data
Maya4 provides pre-computed statistics:
• RC_MIN, RC_MAX: For compressed levels
• GT_MIN, GT_MAX: For ground truth (AZ)
These ensure consistent normalization across the dataset.
You can provide any callable transformation:
• Standard scaling (z-score)
• Log scaling for amplitude
• Custom domain-specific preprocessing
• Augmentation pipelines
For custom normalization strategies, you can pass any callable or create your own SARTransform:
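For example, a custom log-amplitude transform (illustrative, not a library default) compresses SAR dynamic range and works for complex or real input:

```python
import numpy as np

def log_amplitude(x, eps=1e-6):
    """Log-compress the amplitude; eps avoids log(0)."""
    return np.log10(np.abs(x) + eps).astype(np.float32)

x = np.array([1 + 0j, 10 + 0j, 100 + 0j], dtype=np.complex64)
y = log_amplitude(x)   # amplitudes 1, 10, 100 map to ~0, 1, 2
```

Any such callable can be plugged into the per-level transform slots in place of the built-in normalizations.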
| Strategy | Description | Use Case | I/O Efficiency |
|---|---|---|---|
| row | Left→right, top→bottom (horizontal raster scan) | Sequential horizontal features, coherent spatial ordering | Medium |
| col | Top→bottom, left→right (vertical raster scan) | Range direction analysis, horizontal features | Medium |
| chunk | Follows Zarr storage chunks | Maximum I/O performance, minimizes cache misses | High ⚡ |
Block pattern divides each product into a grid of blocks and samples within each block.
This ensures representative sampling across the entire SAR product, capturing diverse spatial characteristics.
Stack multiple patches into larger samples for sequence learning.
Specify axis: 0 (vertical) or 1 (horizontal).
Use K-means clustering on lat/lon to ensure geographic diversity.
Requires scikit-learn and roughly 10 or more products.
In online mode, only required chunks are downloaded on-demand.
Metadata is downloaded first; data chunks follow as needed.