Skip to content

Archive And Compression Strategy

Purpose

This page explains why live recording, archive generation, and published export are separate stages with different compression policies.

Core Decision

The pipeline does not try to make the live capture bag also be the final storage-optimized artifact.

Instead, it separates:

  • raw capture
  • offline archive generation
  • published dataset export

That separation is deliberate.

Why Live Capture Stays Simple

The capture path is optimized for:

  • recording reliability
  • low runtime overhead
  • faithful raw topic preservation
  • safe post-take debugging

That is why the capture bag stays:

  • one bag per demo
  • plain mcap
  • untrimmed
  • not rewritten in place after recording

The design assumption is that recording integrity matters more than squeezing maximum storage savings out of the first write.

Why Archive Is Offline

Archive generation does the work that is too risky or too expensive to put on the live recording path:

  • head/tail trim
  • lossless image transcode
  • MCAP chunk compression
  • archive verification and provenance

Doing this offline means:

  • the original capture stays preserved if archive generation fails
  • one bad archive job does not corrupt the source-of-truth artifact
  • heavy transcode work does not sit on the demo-to-demo critical path

archive_episode.py also writes a separate archive_manifest.json so trim, transcode, and final archive settings are auditable without mutating the raw episode manifest.

Why Compression Policy Depends On Artifact Type

Compression is not one global policy. It depends on what the artifact is for.

Capture bag

Primary goal:

  • trustworthy recording

Preferred properties:

  • minimal runtime work
  • simple storage backend
  • no in-place mutation

Archive bag

Primary goal:

  • smaller long-term ROS-native artifact

Preferred properties:

  • lossless image compression
  • offline trim
  • MCAP chunk compression

Implementation note:

  • archive_episode.py exposes zstd_fast and zstd_small
  • the default archive preset is zstd_small

Published dataset

Primary goal:

  • aligned learning artifact

Preferred properties:

  • fixed-rate frames
  • model-facing schema
  • copied source provenance

Why The Archive Path Is Lossless

The archive path is designed to stay lossless for the important visual modalities.

That is why the archive path uses:

  • RGB and tactile archived with PNG-backed compressed image transport
  • depth archived with lossless compressedDepth PNG

The design reasoning is simple:

  • archive should reduce storage cost
  • but it should not give up future debugging or geometry fidelity casually

Why Head/Tail Trim Moved Out Of Recording

Trim is now an archive-time decision, not a record-time mutation.

That avoids two problems:

  • recording logic doing too much on the critical path
  • source-of-truth bags being rewritten to fit later storage preferences

The raw capture remains what was recorded. The archive is the curated long-term ROS-native derivative.

Design Consequence

Future work should preserve this artifact logic:

  • capture first
  • archive later
  • publish separately

If one stage starts trying to impersonate another, the pipeline will become harder to trust and harder to debug.