Skip to content

Raw, Published, And Archive Artifacts

Purpose

This page explains the three main artifact types in the pipeline and why they are intentionally separate.

One Demo, One Raw Episode

The first durable artifact for a take is always:

  • one raw episode folder under raw_episodes/<episode_id>/

That folder contains:

  • bag/
  • episode_manifest.json
  • notes.md

The raw episode is the source-of-truth record of what happened during the take.

Current raw-episode contents are:

  • bag/
  • episode_manifest.json
  • notes.md

Artifact Types

1. Capture bag

The capture bag is the immediate ROS-native output of recording.

Current policy:

  • one bag per demo
  • plain mcap
  • no live trim
  • no live bag rewrite

Why

Live recording should optimize for:

  • reliability
  • faithful topic preservation
  • post-take debugging
  • safe downstream conversion

It should not try to be the final storage-optimized artifact.

2. Archive bag

The archive bag is a derived offline artifact created later from the preserved capture bag.

Its purpose is:

  • long-term storage reduction
  • ROS-native playback and inspection
  • lossless compression of visual topics

Why archive is offline

Compression, trim, and transcode add runtime risk during recording. The pipeline separates that work so:

  • the demo-to-demo critical path stays simple
  • failures in archive generation do not corrupt the original capture
  • capture bags remain available for debugging and conversion

See:

  • data_pipeline/docs/internal/raw-storage.md
  • data_pipeline/docs/internal/archive-bag.md

Current archive outputs live under:

  • raw_episodes/<episode_id>/archive/
  • raw_episodes/<episode_id>/archive/bag/
  • raw_episodes/<episode_id>/archive/archive_manifest.json

3. Published dataset

The published dataset is the fixed-schema learning artifact under published/.

It is derived from the raw episode, not from the operator UI state and not from the archive bag by default.

Its purpose is:

  • fixed-rate aligned training data
  • stable LeRobot-compatible schema
  • long-term provenance for converted episodes

Why it is separate from archive

The archive bag stays ROS-native. The published dataset is model- and tooling-facing.

Keeping them separate avoids a false choice between:

  • ROS-native debugging
  • training-ready dataset layout

Source Of Truth Rule

The raw episode remains authoritative because it preserves:

  • the original asynchronous topic streams
  • the resolved per-episode manifest snapshot
  • notes attached to the take

The published dataset may copy source provenance, but it is still a derived view.

Current conversion artifacts also include:

  • published/<dataset_id>/meta/spark_conversion/<episode_id>/diagnostics.json
  • published/<dataset_id>/meta/spark_conversion/<episode_id>/conversion_summary.json
  • published/<dataset_id>/meta/spark_conversion/<episode_id>/effective_profile.yaml

Published Provenance Rule

The published dataset now keeps a copy of the raw source snapshot for each converted episode under:

  • meta/spark_source/<episode_id>/episode_manifest.json
  • meta/spark_source/<episode_id>/notes.md

This keeps the learning artifact tied back to the exact raw episode truth without pretending that dataset-level metadata alone is enough.

If published depth is enabled for the effective schema, the dataset also carries derived sidecars under:

  • published/<dataset_id>/depth/
  • published/<dataset_id>/depth_preview/
  • published/<dataset_id>/meta/depth_info.json

Design Consequences

  • do not optimize live recording around long-term archive size first
  • do not treat the published dataset as a substitute for the raw episode
  • do not assume the archive bag is interchangeable with the raw capture unless the pipeline is explicitly redesigned around that choice