Radar precipitation source data

data_stage: source_data
product: radar_precipitation
version: 0.2.0

Run the validator directly with uv, using a release from pypi.org:

uvx --with mlcast-dataset-validator@0.3.0.dev7+g24e2f3c58 mlcast.validate_dataset source_data radar_precipitation <PATH_OR_URL>

Warning: this build uses a pre-release or local version (0.3.0.dev7+g24e2f3c58), which means main may include changes not yet released on PyPI. You can run directly from the GitHub source instead:

uvx --from "git+https://github.com/mlcast-community/mlcast-sourcedata-validator" mlcast.validate_dataset source_data radar_precipitation <PATH_OR_URL>

1. Introduction

This document specifies the requirements for 2D radar precipitation and reflectivity composite datasets to be included in the MLCast data collection. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

2. Scope

This specification applies to 2D radar composite datasets (merged from multiple radar sources) intended for machine learning applications in weather and climate research. Single-radar datasets are explicitly excluded from this specification.

The remaining sections give the rest of the specification.

3. Coordinate Requirements

3.1 Coordinate Variables

  • The dataset MUST expose CF-compliant coordinates: latitude/longitude and projected x/y.
  • Coordinate metadata MUST provide the standard_name, axis, and units attributes per CF, and the dataset MUST also have a valid time coordinate.

3.2 Spatial Requirements

  • The dataset MUST provide 2D radar composites with a spatial resolution of 1 kilometer or finer.
  • The valid sensing area MUST support at least one 256×256 pixel square crop that is fully contained within the radar sensing range.
  • The spatial domain, including resolution, size, and geographical coverage, MUST remain constant across all times in the archive.
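
The crop requirement can be illustrated with a standard largest-square dynamic program. This is a minimal sketch, not the validator's implementation: it assumes a hypothetical boolean mask valid (a list of equal-length rows) in which True marks pixels inside the radar sensing range, and the function names are made up for illustration.

```python
def largest_valid_square(valid):
    """Return the side length of the largest all-True square in a 2D boolean mask."""
    best = 0
    prev = []  # DP values for the previous mask row
    for row in valid:
        cur = [0] * len(row)
        for j, ok in enumerate(row):
            if ok:
                if j == 0 or not prev:
                    # First row or first column: a square can only have side 1.
                    cur[j] = 1
                else:
                    # Extend the smallest of the three neighbouring squares.
                    cur[j] = 1 + min(prev[j], prev[j - 1], cur[j - 1])
                best = max(best, cur[j])
        prev = cur
    return best


def supports_crop(valid, size=256):
    """True if at least one size x size crop lies fully inside the valid area."""
    return largest_valid_square(valid) >= size
```

The scan visits each pixel once, so it stays cheap even for large composites.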

3.3 Temporal Requirements

  • The timestep (the duration between successive time values for which a single data-point is valid) MAY be variable throughout the archive, but in that case a global attribute named consistent_timestep_start MUST be included to indicate the first timestamp where regular timestepping begins. In the absence of this attribute, the timestep MUST be regular throughout the archive.
  • Times for which data is missing MUST be given explicitly in the variable missing_times as CF-compliant time values and MUST NOT be included in the main time coordinate. The timestep is defined as the interval between consecutive times (including missing times).
  • Time values MUST be strictly monotonically increasing.
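
The temporal rules above can be sketched as a single check over already-decoded timestamps. This is an illustration under stated assumptions rather than the validator's logic: the function name check_time_axis is hypothetical, times and missing_times are assumed to be lists of datetime objects, and regularity is judged on the merged axis of present and missing times, which is one reading of the timestep definition.

```python
from datetime import datetime, timedelta


def check_time_axis(times, missing_times=(), consistent_timestep_start=None):
    """Return a list of problems found on the time axis (empty if none)."""
    problems = []
    # Strictly increasing: every value must be greater than its predecessor.
    if any(b <= a for a, b in zip(times, times[1:])):
        problems.append("time values are not strictly increasing")
    # Missing times must not also appear in the main time coordinate.
    if set(times) & set(missing_times):
        problems.append("missing_times overlaps the main time coordinate")
    # Regularity is checked on present and missing times together.
    full = sorted(set(times) | set(missing_times))
    if consistent_timestep_start is not None:
        full = [t for t in full if t >= consistent_timestep_start]
    steps = {b - a for a, b in zip(full, full[1:])}
    if len(steps) > 1:
        problems.append("timestep is not regular")
    return problems
```

For example, a 5-minute archive with one gap passes only when the gap is listed in missing_times.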

4. Data Variable Requirements

4.1 Chunking Strategy

  • The dataset MUST use a chunking strategy of 1 × height × width (one chunk per timestep).
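
As a minimal sketch of the chunking rule, the check below compares an array's shape with its chunk shape. The function name is hypothetical; shape and chunks are assumed to be tuples such as those reported by a Zarr array's shape and chunks attributes.

```python
def has_per_timestep_chunks(shape, chunks):
    """True if chunks == (1, height, width) for a (time, height, width) array."""
    if len(shape) != 3 or len(chunks) != 3:
        return False
    _, height, width = shape
    return tuple(chunks) == (1, height, width)
```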

4.2 Compression

  • The main data arrays MUST use compression to reduce storage requirements.
  • ZSTD compression is RECOMMENDED for optimal performance of the main data arrays.
  • Coordinate arrays MAY use different compression algorithms (e.g., lz4) as appropriate.

4.3 Data Structure

  • The main data variable MUST be encoded with dimensions in the order: time × height (y, lat) × width (x, lon).
  • The data type MUST be floating-point (float16, float32, or float64).
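
The two structural rules can be expressed as a small check on dimension names and the dtype name. This is a hedged sketch, not the validator's code: the function name is hypothetical, the leading dimension is assumed to be called time, and the accepted height and width names are taken only from the wording above.

```python
ALLOWED_DTYPES = {"float16", "float32", "float64"}
# Dimension names taken from the wording of section 4.3 (an assumption).
HEIGHT_NAMES = {"y", "lat"}
WIDTH_NAMES = {"x", "lon"}


def check_data_structure(dims, dtype_name):
    """Return a list of problems with dimension order and data type."""
    problems = []
    ordered = (
        len(dims) == 3
        and dims[0] == "time"
        and dims[1] in HEIGHT_NAMES
        and dims[2] in WIDTH_NAMES
    )
    if not ordered:
        problems.append(f"dimensions {tuple(dims)} are not time x height x width")
    if dtype_name not in ALLOWED_DTYPES:
        problems.append(f"dtype {dtype_name} is not floating-point")
    return problems
```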

4.4 Data Variable Naming and Attributes

  • The data variable name SHOULD be a CF convention standard name or use a sensible name from the ECMWF parameter database.
  • The data variable MUST include the long_name, standard_name and units attributes following CF conventions.

4.5 Georeferencing

  • The dataset MUST include proper georeferencing information following the GeoZarr specification.
  • The data variable MUST include a grid_mapping attribute that references the coordinate reference system (crs) variable.
  • The crs variable MUST include both a spatial_ref and a crs_wkt attribute with a WKT string.
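
The attribute chain described above (data variable, then grid_mapping, then the crs variable's WKT attributes) can be sketched with plain dictionaries. The function name and the variables mapping (variable name to attribute dict) are assumptions made for illustration, and the sketch only checks that the attributes are present and non-empty; it does not parse the WKT.

```python
def check_georeferencing(data_attrs, variables):
    """Return a list of georeferencing problems (empty if none).

    data_attrs: attribute dict of the main data variable.
    variables: mapping of variable name -> attribute dict.
    """
    name = data_attrs.get("grid_mapping")
    if not name:
        return ["data variable has no grid_mapping attribute"]
    crs_attrs = variables.get(name)
    if crs_attrs is None:
        return [f"grid_mapping references missing variable {name!r}"]
    problems = []
    for key in ("spatial_ref", "crs_wkt"):
        value = crs_attrs.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{name!r} lacks a non-empty {key} attribute")
    return problems
```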

5. Global Attribute Requirements

5.1 Conditional Global Attributes

  • The global attribute consistent_timestep_start is CONDITIONALLY REQUIRED if the dataset uses a variable timestep. It MUST be an ISO formatted datetime string indicating the first timestamp where regular timestepping begins.

5.2 Licensing Requirements

  • The dataset MUST include a global license attribute containing a valid SPDX identifier.
  • The following licenses are RECOMMENDED: CC-BY, CC-BY-SA, OGL.
  • Licenses with NC or ND restrictions SHOULD generate warnings but MAY be accepted after review.
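
One possible way to separate acceptable licenses from those needing review is to look for NC or ND components in the identifier. This is a rough sketch with a hypothetical function name; it does not verify that the string is actually on the SPDX license list, which a real check would also need to do.

```python
import re

# Matches "NC" or "ND" as a whole hyphen-separated component,
# e.g. CC-BY-NC-4.0 or CC-BY-NC-ND-4.0, but not NCSA.
RESTRICTED = re.compile(r"(^|-)(NC|ND)(-|$)")


def classify_license(license_id):
    """Return 'error', 'warning', or 'ok' for a license attribute value."""
    if not license_id or not license_id.strip():
        return "error"  # attribute missing or empty
    if RESTRICTED.search(license_id.strip().upper()):
        return "warning"  # NC/ND restriction: accept only after review
    return "ok"
```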

5.3 Zarr Format

  • The dataset MUST use Zarr version 2 or version 3 format.
  • If Zarr version 2 is used, the dataset MUST include consolidated metadata.
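
For a store kept in a local directory, the format rules can be checked by looking at the metadata files at the store root: Zarr version 3 writes zarr.json, while version 2 writes .zgroup and, when consolidated, .zmetadata. The sketch below assumes such a local directory store and uses a hypothetical function name; remote stores would need a different access path.

```python
from pathlib import Path


def detect_zarr_layout(store_path):
    """Inspect a local directory store and report format and consolidation."""
    root = Path(store_path)
    if (root / "zarr.json").is_file():
        # Version 3: consolidation is not required by the rule above, so not checked.
        return {"zarr_format": 3, "consolidated": None}
    if (root / ".zgroup").is_file():
        # Version 2: consolidated metadata lives in .zmetadata at the root.
        return {"zarr_format": 2, "consolidated": (root / ".zmetadata").is_file()}
    return {"zarr_format": None, "consolidated": None}
```

A version 2 result with consolidated set to False would violate the requirement above.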

5.4 MLCast Metadata

The dataset MUST include the following global attributes:

  • mlcast_created_on: ISO formatted datetime of dataset creation.
  • mlcast_created_by: Creator contact in Name <email> format.
  • mlcast_created_with: GitHub URL of the creating software including version (e.g., https://github.com/mlcast-community/mlcast-dataset-radklim@v0.1.0). The repository and revision MUST exist.
  • mlcast_dataset_version: Dataset specification version (semver or calver).
  • mlcast_dataset_identifier: Unique dataset identifier formatted as <country_code>-<entity>-<physical_variable> by default.
  • mlcast_dataset_identifier_format: OPTIONAL format string that MUST start with <country_code>-<entity>-<physical_variable> and MAY include only the approved identifier parts: country_code, entity, physical_variable, time_resolution, common_name.
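
The attribute formats listed above lend themselves to simple pattern checks. The following is a minimal sketch under stated assumptions: the function name and regular expressions are illustrative rather than the validator's own, the existence of the repository and revision is not checked (that would need network access), and the version and identifier values are only checked for presence.

```python
import re
from datetime import datetime

REQUIRED = (
    "mlcast_created_on",
    "mlcast_created_by",
    "mlcast_created_with",
    "mlcast_dataset_version",
    "mlcast_dataset_identifier",
)
DEFAULT_FORMAT = "<country_code>-<entity>-<physical_variable>"
APPROVED_PARTS = {"country_code", "entity", "physical_variable", "time_resolution", "common_name"}
# Illustrative patterns (assumptions), e.g. "Name <email>" and "https://github.com/org/repo@rev".
CREATED_BY = re.compile(r"^[^<>]+ <[^<>@\s]+@[^<>@\s]+>$")
CREATED_WITH = re.compile(r"^https://github\.com/[^/\s]+/[^/@\s]+@\S+$")


def check_mlcast_attrs(attrs):
    """Return a list of problems with the MLCast global attributes."""
    problems = [f"missing {key}" for key in REQUIRED if key not in attrs]
    if "mlcast_created_on" in attrs:
        try:
            datetime.fromisoformat(str(attrs["mlcast_created_on"]))
        except ValueError:
            problems.append("mlcast_created_on is not an ISO datetime")
    if "mlcast_created_by" in attrs and not CREATED_BY.match(str(attrs["mlcast_created_by"])):
        problems.append("mlcast_created_by is not in 'Name <email>' format")
    if "mlcast_created_with" in attrs and not CREATED_WITH.match(str(attrs["mlcast_created_with"])):
        problems.append("mlcast_created_with is not a versioned GitHub URL")
    fmt = str(attrs.get("mlcast_dataset_identifier_format", DEFAULT_FORMAT))
    if not fmt.startswith(DEFAULT_FORMAT):
        problems.append("identifier format does not start with the default parts")
    unknown = set(re.findall(r"<([^<>]*)>", fmt)) - APPROVED_PARTS
    if unknown:
        problems.append(f"identifier format uses unapproved parts: {sorted(unknown)}")
    return problems
```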

6. Tool Compatibility Requirements

Practical interoperability checks derived from the standalone validator.

6.1 GDAL Compatibility

  • The dataset SHOULD expose georeferencing metadata readable by GDAL, including a CRS WKT.
  • A basic GeoTIFF export SHOULD roundtrip through GDAL with geotransform/projection metadata.

6.2 Cartopy Compatibility

  • The CRS WKT SHOULD be parseable by cartopy.
  • Coordinate grids SHOULD transform cleanly into PlateCarree for mapping workflows.