Radar precipitation source data
data_stage: source_data
product: radar_precipitation
version: 0.2.0
Run the validator directly with uv from the release on pypi.org:
uvx --with mlcast-dataset-validator@0.3.0.dev7+g24e2f3c58 mlcast.validate_dataset source_data radar_precipitation <PATH_OR_URL>
Warning: this build uses a pre-release or local version (0.3.0.dev7+g24e2f3c58), which means main may include changes not yet released on PyPI. You can run directly from the GitHub source instead:
uvx --from "git+https://github.com/mlcast-community/mlcast-sourcedata-validator" mlcast.validate_dataset source_data radar_precipitation <PATH_OR_URL>
1. Introduction
This document specifies the requirements for 2D radar precipitation and reflectivity composite datasets to be included in the MLCast data collection. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
2. Scope
This specification applies to 2D radar composite datasets (merged from multiple radar sources) intended for machine learning applications in weather and climate research. Single-radar datasets are explicitly excluded from this specification.
(see inline comments below for rest of specification)
3. Coordinate Requirements
3.1 Coordinate Variables
- The dataset MUST expose CF-compliant coordinates: latitude/longitude and projected x/y.
- Coordinate metadata MUST provide standard_name/axis/units per CF (with a valid time coordinate as well).
3.2 Spatial Requirements
- The dataset MUST provide 2D radar composites with a spatial resolution of 1 kilometer or finer.
- The valid sensing area MUST support at least one 256×256 pixel square crop that is fully contained within the radar sensing range.
- The spatial domain, including resolution, size, and geographical coverage, MUST remain constant across all times in the archive.
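The 256×256 crop requirement can be checked with a standard "largest all-valid square" dynamic program. The sketch below is not the validator's implementation; it assumes the valid sensing area is available as a 2D boolean mask (True inside radar range), and the function names are hypothetical.

```python
from typing import Sequence


def largest_valid_square(mask: Sequence[Sequence[bool]]) -> int:
    """Return the side length of the largest square containing only True cells."""
    if not mask or not mask[0]:
        return 0
    width = len(mask[0])
    prev = [0] * (width + 1)  # DP values for the previous row (1-based columns)
    best = 0
    for row in mask:
        curr = [0] * (width + 1)
        for j, valid in enumerate(row, start=1):
            if valid:
                # A square ending at this cell extends the smallest of the
                # squares ending at its upper, upper-left and left neighbours.
                curr[j] = 1 + min(prev[j], prev[j - 1], curr[j - 1])
                best = max(best, curr[j])
        prev = curr
    return best


def supports_crop(mask: Sequence[Sequence[bool]], crop_size: int = 256) -> bool:
    """True if at least one crop_size x crop_size square lies fully in the mask."""
    return largest_valid_square(mask) >= crop_size
```

The pass is linear in the number of pixels and keeps only two rows in memory, so it stays cheap even for large composites.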
3.3 Temporal Requirements
- The timestep (the duration between successive time values for which a single data-point is valid) MAY be variable throughout the archive, but in that case a global attribute named consistent_timestep_start MUST be included to indicate the first timestamp where regular timestepping begins. In the absence of this attribute, the timestep MUST be regular throughout the archive.
- Times for which data is missing MUST be given explicitly in the variable missing_times as CF-compliant time values. The timestep is defined as the interval between consecutive times (including missing times). Missing times MUST NOT be included in the main time coordinate.
- Time values MUST be strictly monotonically increasing.
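One way to read the temporal rules together is sketched below. It is an illustration under stated assumptions, not the validator's code: times are assumed to be already decoded into Python datetime objects, consistent_timestep_start is assumed to be already parsed, and the function name check_time_axis is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def check_time_axis(
    times: Sequence[datetime],
    missing_times: Sequence[datetime] = (),
    consistent_timestep_start: Optional[datetime] = None,
) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if any(b <= a for a, b in zip(times, times[1:])):
        problems.append("time values are not strictly increasing")
    if set(times) & set(missing_times):
        problems.append("missing_times overlap the main time coordinate")

    # The timestep is measured on the merged axis, so gaps that are listed
    # in missing_times do not count as irregular steps.
    merged = sorted(set(times) | set(missing_times))
    if consistent_timestep_start is not None:
        # Only the part of the archive from this timestamp on must be regular.
        merged = [t for t in merged if t >= consistent_timestep_start]
    steps = {b - a for a, b in zip(merged, merged[1:])}
    if len(steps) > 1:
        problems.append(f"irregular timestep: {sorted(steps)}")
    return problems
```

For example, a 5-minute archive with one gap passes as long as the gap is listed in missing_times, and fails with an "irregular timestep" message if it is not.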
4. Data Variable Requirements
4.1 Chunking Strategy
- The dataset MUST use a chunking strategy of 1 × height × width (one chunk per timestep).
4.2 Compression
- The main data arrays MUST use compression to reduce storage requirements.
- ZSTD compression is RECOMMENDED for optimal performance of the main data arrays.
- Coordinate arrays MAY use different compression algorithms (e.g., lz4) as appropriate.
4.3 Data Structure
- The main data variable MUST be encoded with dimensions in the order: time × height (y, lat) × width (x, lon).
- The data type MUST be floating-point (float16, float32, or float64).
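The layout rules in sections 4.1 and 4.3 (one chunk per timestep, time-first dimension order, floating-point data) can be expressed as a small check on the array's metadata. This is a minimal sketch, not the validator's implementation: the function name is hypothetical, and the accepted dimension names (y or lat, x or lon) are an assumption taken from the wording above.

```python
FLOAT_DTYPES = {"float16", "float32", "float64"}
HEIGHT_NAMES = {"y", "lat"}  # assumption: names taken from the spec's wording
WIDTH_NAMES = {"x", "lon"}   # assumption: names taken from the spec's wording


def check_layout(dims, shape, chunks, dtype_name) -> list:
    """Return a list of layout problems for a (time, height, width) array."""
    problems = []
    if (
        len(dims) != 3
        or dims[0] != "time"
        or dims[1] not in HEIGHT_NAMES
        or dims[2] not in WIDTH_NAMES
    ):
        problems.append(f"unexpected dimension order: {tuple(dims)}")
    # One chunk per timestep means chunks == (1, height, width).
    if len(shape) != 3 or tuple(chunks) != (1, *shape[1:]):
        problems.append(f"chunks {tuple(chunks)} are not 1 x height x width")
    if dtype_name not in FLOAT_DTYPES:
        problems.append(f"dtype {dtype_name} is not floating-point")
    return problems
```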
4.4 Data Variable Naming and Attributes
- The data variable name SHOULD be a CF convention standard name or use a sensible name from the ECMWF parameter database.
- The data variable MUST include the long_name, standard_name and units attributes following CF conventions.
4.5 Georeferencing
- The dataset MUST include proper georeferencing information following the GeoZarr specification.
- The data variable MUST include a grid_mapping attribute that references the coordinate reference system (crs) variable.
- The crs variable MUST include both a spatial_ref and a crs_wkt attribute with a WKT string.
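The attribute chain described above (data variable, then grid_mapping, then the crs variable and its WKT attributes) can be followed with plain dictionaries. The sketch below assumes the attributes have already been read into dicts and only checks presence; it does not parse or validate the WKT itself. The function name and the dict layout are hypothetical.

```python
def check_georeferencing(data_attrs: dict, variables: dict) -> list:
    """Check the grid_mapping -> crs variable -> WKT attribute chain.

    data_attrs: attributes of the main data variable.
    variables:  mapping of variable name -> attribute dict for the dataset.
    """
    crs_name = data_attrs.get("grid_mapping")
    if crs_name is None:
        return ["data variable has no grid_mapping attribute"]
    crs_attrs = variables.get(crs_name)
    if crs_attrs is None:
        return [f"grid_mapping points to missing variable {crs_name!r}"]
    problems = []
    for key in ("spatial_ref", "crs_wkt"):
        value = crs_attrs.get(key)
        # Presence check only: the string is not parsed as WKT here.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{crs_name}.{key} is missing or empty")
    return problems
```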
5. Global Attribute Requirements
5.1 Conditional Global Attributes
- The global attribute consistent_timestep_start is CONDITIONALLY REQUIRED if the dataset uses a variable timestep. It MUST be an ISO formatted datetime string indicating the first timestamp where regular timestepping begins.
5.2 Licensing Requirements
- The dataset MUST include a global license attribute containing a valid SPDX identifier.
- The following licenses are RECOMMENDED: CC-BY, CC-BY-SA, OGL.
- Licenses with NC or ND restrictions SHOULD generate warnings but MAY be accepted after review.
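The three licensing outcomes above can be illustrated with a small classifier over the hyphen-separated parts of an SPDX-style identifier. This is a sketch under assumptions: it does not check the identifier against the official SPDX license list, and the function name and return labels are hypothetical.

```python
RESTRICTED_PARTS = {"NC", "ND"}  # non-commercial / no-derivatives markers


def classify_license(spdx_id: str) -> str:
    """Return 'restricted', 'recommended', or 'other' for an SPDX-style id."""
    parts = spdx_id.upper().split("-")
    if RESTRICTED_PARTS & set(parts):
        return "restricted"  # SHOULD warn, MAY be accepted after review
    if parts[0] == "OGL" or parts[:2] == ["CC", "BY"]:
        return "recommended"  # CC-BY, CC-BY-SA and OGL families
    return "other"
```

The restriction check runs first so that identifiers such as CC-BY-NC-4.0 are not mistaken for the recommended CC-BY family.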
5.3 Zarr Format
- The dataset MUST use Zarr version 2 or version 3 format.
- If Zarr version 2 is used, the dataset MUST include consolidated metadata.
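For a store on the local filesystem, the format rules can be checked from the files in the root directory: Zarr v3 keeps root metadata in zarr.json, Zarr v2 keeps group metadata in .zgroup, and v2 consolidated metadata is stored under .zmetadata. The sketch below assumes a local directory store (remote URLs would need a different access layer); the function name is hypothetical.

```python
from pathlib import Path


def check_zarr_layout(store: Path) -> list:
    """Return a list of problems for a local Zarr directory store."""
    if (store / "zarr.json").is_file():
        return []  # Zarr v3 root metadata found
    if (store / ".zgroup").is_file():
        # Zarr v2: consolidated metadata lives in the .zmetadata key.
        if not (store / ".zmetadata").is_file():
            return ["Zarr v2 store lacks consolidated metadata (.zmetadata)"]
        return []
    return ["no Zarr v2 or v3 root group metadata found"]
```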
5.4 MLCast Metadata
The dataset MUST include the following global attributes:
- mlcast_created_on: ISO formatted datetime of dataset creation.
- mlcast_created_by: Creator contact in Name <email> format.
- mlcast_created_with: GitHub URL of the creating software including version (e.g., https://github.com/mlcast-community/mlcast-dataset-radklim@v0.1.0). The repository/revision MUST exist.
- mlcast_dataset_version: Dataset specification version (semver or calver).
- mlcast_dataset_identifier: Unique dataset identifier formatted as <country_code>-<entity>-<physical_variable> by default.
- mlcast_dataset_identifier_format: OPTIONAL format string that MUST start with <country_code>-<entity>-<physical_variable> and MAY include only the approved identifier parts: country_code, entity, physical_variable, time_resolution, common_name.
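The format rules for these attributes can be sketched with a few regular expressions. This is an illustration, not the validator's implementation: the patterns are assumptions about what "Name <email>" and a versioned GitHub URL look like, the check does not verify that the repository or revision exists (that would need network access), and it does not validate the version string or the identifier value itself. The function name is hypothetical.

```python
import re
from datetime import datetime

REQUIRED = (
    "mlcast_created_on",
    "mlcast_created_by",
    "mlcast_created_with",
    "mlcast_dataset_version",
    "mlcast_dataset_identifier",
)
APPROVED_PARTS = {
    "country_code", "entity", "physical_variable", "time_resolution", "common_name",
}
DEFAULT_FORMAT = "<country_code>-<entity>-<physical_variable>"
# Assumed shapes; the specification does not give exact patterns.
CREATED_BY = re.compile(r"^[^<>]+ <[^<>@\s]+@[^<>@\s]+>$")
CREATED_WITH = re.compile(r"^https://github\.com/[^/\s]+/[^/\s@]+@\S+$")


def check_mlcast_attrs(attrs: dict) -> list:
    """Return a list of problems with the MLCast global attributes."""
    problems = [f"missing attribute: {name}" for name in REQUIRED if name not in attrs]
    if problems:
        return problems
    try:
        datetime.fromisoformat(attrs["mlcast_created_on"])
    except ValueError:
        problems.append("mlcast_created_on is not an ISO datetime")
    if not CREATED_BY.match(attrs["mlcast_created_by"]):
        problems.append("mlcast_created_by is not in 'Name <email>' format")
    if not CREATED_WITH.match(attrs["mlcast_created_with"]):
        problems.append("mlcast_created_with is not a versioned GitHub URL")
    # The format attribute is optional; fall back to the default pattern.
    fmt = attrs.get("mlcast_dataset_identifier_format", DEFAULT_FORMAT)
    if not fmt.startswith(DEFAULT_FORMAT):
        problems.append("identifier format does not start with the default parts")
    unknown = set(re.findall(r"<([^<>]+)>", fmt)) - APPROVED_PARTS
    if unknown:
        problems.append(f"unapproved identifier parts: {sorted(unknown)}")
    return problems
```

Note that datetime.fromisoformat accepts only a subset of ISO 8601 on Python versions before 3.11 (for example, it rejects a trailing Z), so a stricter or more lenient parser may be preferable in practice.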
6. Tool Compatibility Requirements
Practical interoperability checks derived from the standalone validator.
6.1 GDAL Compatibility
- The dataset SHOULD expose georeferencing metadata readable by GDAL, including a CRS WKT.
- A basic GeoTIFF export SHOULD roundtrip through GDAL with geotransform/projection metadata.
6.2 Cartopy Compatibility
- The CRS WKT SHOULD be parseable by cartopy.
- Coordinate grids SHOULD transform cleanly into PlateCarree for mapping workflows.