Published Dataset Contract¶
Purpose¶
This document defines how raw episode bags are converted into the published LeRobot dataset.
The raw bag preserves asynchronous truth. The published dataset is a fixed-rate aligned view of that raw data.
Core Design Rule¶
The published dataset is derived from the raw episode.
That means:
- raw recording preserves asynchronous source truth
- conversion builds one explicit aligned learning view
- published artifacts must not silently redefine what the raw episode meant
This is why the published layer is allowed to be stricter and smaller than the raw layer without pretending the raw layer never existed.
Conversion Profile¶
The current pipeline uses one checked-in conversion profile at 20 Hz:
multisensor_20hz
That file defines:
- published frame rate
- timestamp and alignment policy
- missing-data policy
- diagnostics policy
It does not hardcode one fixed embodiment or one fixed sensor set.
Current implementation note:
- raw recording uses multisensor_20hz.yaml plus the session's active-arm set and enabled sensor keys
- conversion uses multisensor_20hz.yaml plus the manifest's active-arm set and the sensor_key values from sensors.devices
- --profile now overrides the generic conversion policy; it no longer chooses between arm-specific profile files
Why¶
If GelSight is a first-class published modality, then 20 Hz is the most honest default common rate. A faster published rate would either duplicate tactile frames too aggressively or claim more temporal precision than the raw streams actually support.
The checked-in policy is intentionally generic:
- one conversion policy
- schema derived from manifest active arms and the sensor_key values under sensors.devices
That is simpler and more honest than maintaining near-copy profile files only to encode embodiment differences.
Raw vs Published¶
The raw bag may contain one active arm or two active arms.
That does not mean all raw episodes should be coerced into one published schema.
Rules:
- raw recording should preserve whichever /spark/... robot topics are actually present
- published conversion should derive its effective schema from the recorded embodiment and sensors
- do not zero-fill an inactive arm into a bimanual schema by default
- do not append episodes from different active-arm or sensor layouts into the same dataset_id
Why¶
The storage cost of zero-filling an inactive arm is small, but the semantic cost is not. It mixes single-arm and bimanual behavior into one schema and makes downstream training depend on implicit padding conventions instead of explicit embodiment choice.
Published Folder Contract¶
One published folder must represent one coherent dataset contract.
In practice that means:
- one effective low-dimensional schema
- one image/depth field set
- one embodiment and sensor-layout interpretation
Do not append episodes with incompatible published schemas into the same folder.
If the effective schema changes, the correct action is:
- use a new published folder
not:
- silently append and hope downstream code tolerates shape drift
Effective Schema Resolution¶
For each raw episode:
- Read the generic conversion profile.
- Read the manifest active-arm set.
- Read the recorded sensor keys from sensors.devices.
- Derive the effective published schema from those two pieces of episode truth.
- Fail conversion if the arm presence is ambiguous or inconsistent.
Examples of inconsistent episodes that should fail:
- lightning state exists but lightning action does not
- thunder action exists but thunder state does not
- an arm comes and goes in a way that makes the effective published schema ambiguous for the episode
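The state/action consistency check could be sketched as below. This is only an illustration, not the converter's actual code: the function name `resolve_active_arms` and its three set-valued arguments are hypothetical, and the "arm comes and goes" case is not covered because it needs per-sample timing.

```python
from typing import List, Set

# Fixed arm order used by this contract for low-dimensional features.
ARM_ORDER = ("lightning", "thunder")


def resolve_active_arms(
    manifest_arms: Set[str],
    arms_with_state: Set[str],
    arms_with_action: Set[str],
) -> List[str]:
    """Return the active arms in fixed order, or raise if the episode is inconsistent.

    All three arguments are hypothetical inputs: the manifest's active-arm set
    and the arms for which state / action topics were found in the raw bag.
    """
    unknown = manifest_arms - set(ARM_ORDER)
    if unknown:
        raise ValueError(f"unknown arms in manifest: {sorted(unknown)}")

    for arm in ARM_ORDER:
        has_state = arm in arms_with_state
        has_action = arm in arms_with_action
        # State without action (or the reverse) makes arm presence ambiguous.
        if has_state != has_action:
            raise ValueError(
                f"{arm}: state present={has_state}, action present={has_action}"
            )
        # Recorded topics must agree with what the manifest declares.
        if has_state != (arm in manifest_arms):
            raise ValueError(f"{arm}: recorded topics disagree with the manifest")

    active = [arm for arm in ARM_ORDER if arm in manifest_arms]
    if not active:
        raise ValueError("episode has no active arm")
    return active
```

Raising on the first mismatch keeps the failure explicit instead of guessing which source is right.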
Canonical Published Time Grid¶
For each raw episode:
- Load all required published streams.
- Compute:
  t_start = max(first timestamp of each required published stream)
  t_end = min(last timestamp of each required published stream)
- Define:
  t_k = t_start + k / 20.0
- Keep frame indices while t_k <= t_end.
Why¶
This creates one explicit frame timeline for the published episode. The timeline is no longer implicit in whichever modality happened to be processed first. This is the cleanest form for LeRobot and for downstream learning code.
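A minimal sketch of the grid construction, assuming each stream's timestamps are available as an ascending sequence of seconds. The function name `build_time_grid` and the dictionary input are hypothetical, not taken from the converter.

```python
from typing import Dict, List, Sequence

PUBLISHED_HZ = 20.0  # published frame rate from the conversion profile


def build_time_grid(stream_timestamps: Dict[str, Sequence[float]]) -> List[float]:
    """Return the frame times t_k = t_start + k / 20.0 with t_k <= t_end.

    `stream_timestamps` maps a stream name to its sorted timestamps in seconds
    (an assumed input shape for this sketch).
    """
    if not stream_timestamps or any(len(ts) == 0 for ts in stream_timestamps.values()):
        raise ValueError("every required stream needs at least one sample")

    # The grid starts once every stream has data and ends when the first one stops.
    t_start = max(ts[0] for ts in stream_timestamps.values())
    t_end = min(ts[-1] for ts in stream_timestamps.values())

    grid: List[float] = []
    k = 0
    # Recompute t_k from k each time to avoid accumulating floating-point drift.
    while t_start + k / PUBLISHED_HZ <= t_end:
        grid.append(t_start + k / PUBLISHED_HZ)
        k += 1
    return grid
```

If the streams do not overlap at all (t_start > t_end), this sketch returns an empty grid; whether that should be a hard conversion failure is left to the caller.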
Teleop Activity And Valid Published Frames¶
/spark/session/teleop_active is now part of the raw conversion contract for supported episodes.
Why:
- pedal-off spans are intentional inactivity, not missing-action failures
- published conversion should remove those spans from the usable interval
- missing activity should fail conversion instead of inventing fallback behavior
So the published dataset is not just “raw topics sampled at 20 Hz.” It is the usable active teleoperation interval sampled at 20 Hz under explicit validity rules.
Published Observation Schema¶
The published dataset includes:
- observation.state
- action
- one image field for each recorded sensor with a color stream
- one depth field for each recorded sensor with a depth stream that the effective schema includes
Why¶
This keeps the published schema honest to what was actually recorded while still using one stable conversion policy.
Field names are derived mechanically from sensor keys. Examples:
- /spark/cameras/lightning/wrist_1
  - observation.images.lightning.wrist_1
  - observation.depth.lightning.wrist_1
- /spark/cameras/world/scene_1
  - observation.images.world.scene_1
  - observation.depth.world.scene_1
- /spark/tactile/lightning/finger_left
  - observation.images.tactile.lightning.finger_left
For arm-dependent low-dimensional features, the effective schema uses a fixed arm order:
- lightning
- thunder
That ordering must not change across episodes.
If only one arm is active, only that arm's low-dimensional slice appears in the effective schema.
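The mechanical mapping from sensor keys to field names can be illustrated as follows. The rule is inferred only from the three examples given above (camera keys drop the `cameras` segment, tactile keys keep a `tactile` segment); the function name `published_fields` and the `has_depth` flag are hypothetical.

```python
from typing import List


def published_fields(sensor_key: str, has_depth: bool = False) -> List[str]:
    """Map a sensor key such as /spark/cameras/world/scene_1 to field names.

    Assumes keys have the shape /spark/<family>/<attachment>/<slot>, which is
    what the examples in this document show.
    """
    parts = sensor_key.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "spark":
        raise ValueError(f"unexpected sensor key: {sensor_key!r}")
    _, family, attachment, slot = parts

    if family == "cameras":
        fields = [f"observation.images.{attachment}.{slot}"]
        # A depth field is only published when the effective schema includes it.
        if has_depth:
            fields.append(f"observation.depth.{attachment}.{slot}")
        return fields
    if family == "tactile":
        # Tactile sensors publish an image field only in the examples above.
        return [f"observation.images.tactile.{attachment}.{slot}"]
    raise ValueError(f"unsupported sensor family: {family!r}")
```

For example, `published_fields("/spark/tactile/lightning/finger_left")` yields the single field `observation.images.tactile.lightning.finger_left`.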
Published Provenance¶
The published dataset keeps a copy of the raw source snapshot per episode under:
- meta/spark_source/<episode_id>/episode_manifest.json
- meta/spark_source/<episode_id>/notes.md
And the converter writes episode-level conversion artifacts under:
- meta/spark_conversion/<episode_id>/diagnostics.json
- meta/spark_conversion/<episode_id>/conversion_summary.json
- meta/spark_conversion/<episode_id>/effective_profile.yaml
This is deliberate.
Dataset-level metadata alone is not enough to reconstruct the exact episode truth later. The copied raw snapshot keeps the learning artifact tied back to the original source-of-truth episode.
Observation State Definition¶
The current bimanual multisensor_20hz profile uses this flat observation.state order:
lightning¶
- lightning_joint_pos_1
- lightning_joint_pos_2
- lightning_joint_pos_3
- lightning_joint_pos_4
- lightning_joint_pos_5
- lightning_joint_pos_6
- lightning_eef_x
- lightning_eef_y
- lightning_eef_z
- lightning_eef_rx
- lightning_eef_ry
- lightning_eef_rz
- lightning_gripper_position
- lightning_ft_fx
- lightning_ft_fy
- lightning_ft_fz
- lightning_ft_tx
- lightning_ft_ty
- lightning_ft_tz
thunder¶
- thunder_joint_pos_1
- thunder_joint_pos_2
- thunder_joint_pos_3
- thunder_joint_pos_4
- thunder_joint_pos_5
- thunder_joint_pos_6
- thunder_eef_x
- thunder_eef_y
- thunder_eef_z
- thunder_eef_rx
- thunder_eef_ry
- thunder_eef_rz
- thunder_gripper_position
- thunder_ft_fx
- thunder_ft_fy
- thunder_ft_fz
- thunder_ft_tx
- thunder_ft_ty
- thunder_ft_tz
Why¶
This keeps all robot-side low-dimensional state in one compact feature, which is the easiest shape for LeRobot-style datasets and policy code. It also avoids prematurely creating many custom low-dimensional namespaces.
For both arms:
- *_gripper_position is normalized measured opening on 0..1
  - 0.0 = fully open
  - 1.0 = fully closed
Action Definition¶
The current bimanual multisensor_20hz profile uses this flat action order:
lightning¶
- lightning_cmd_joint_1
- lightning_cmd_joint_2
- lightning_cmd_joint_3
- lightning_cmd_joint_4
- lightning_cmd_joint_5
- lightning_cmd_joint_6
- lightning_cmd_gripper
thunder¶
- thunder_cmd_joint_1
- thunder_cmd_joint_2
- thunder_cmd_joint_3
- thunder_cmd_joint_4
- thunder_cmd_joint_5
- thunder_cmd_joint_6
- thunder_cmd_gripper
Why¶
The published action is the command sent by the teleoperation/runtime stack. This is more stable and more semantically honest than silently replacing action with a derived delta later in the pipeline.
For both arms:
- *_cmd_gripper is commanded gripper opening on 0..1
  - 0.0 = fully open
  - 1.0 = fully closed
For single-arm episodes, use only the corresponding arm slice.
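As an illustration of how the flat observation.state and action orders follow from the active-arm set, the name lists can be generated rather than hand-written. The helper `feature_names` and the constant names are hypothetical; the suffixes and the arm order are copied from the lists above.

```python
from typing import Iterable, List

ARM_ORDER = ("lightning", "thunder")  # fixed order; must not change across episodes

# Per-arm observation.state layout: 6 joints, 6 EEF pose terms, gripper, 6 F/T terms.
STATE_SUFFIXES = (
    [f"joint_pos_{i}" for i in range(1, 7)]
    + [f"eef_{axis}" for axis in ("x", "y", "z", "rx", "ry", "rz")]
    + ["gripper_position"]
    + [f"ft_{axis}" for axis in ("fx", "fy", "fz", "tx", "ty", "tz")]
)

# Per-arm action layout: 6 commanded joints plus the commanded gripper.
ACTION_SUFFIXES = [f"cmd_joint_{i}" for i in range(1, 7)] + ["cmd_gripper"]


def feature_names(active_arms: Iterable[str], suffixes: List[str]) -> List[str]:
    """Return flat feature names for the active arms, always in ARM_ORDER."""
    active = set(active_arms)
    return [f"{arm}_{suffix}" for arm in ARM_ORDER if arm in active for suffix in suffixes]
```

With both arms active this gives 38 state names and 14 action names; a thunder-only episode gets just the 19-element and 7-element thunder slices.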
Per-Topic Alignment Rules¶
Robot state¶
Sources:
- /spark/{arm}/robot/joint_state for each active arm
- /spark/{arm}/robot/eef_pose for each active arm
- /spark/{arm}/robot/tcp_wrench for each active arm
- /spark/{arm}/robot/gripper_state for each active arm
Alignment rule:
- choose the latest sample with timestamp <= t_k
Validity threshold:
- max age 50 ms
Why¶
Robot state is causal and should not look into the future relative to the published frame time. Latest-before is the correct rule for a state signal that is being sampled onto a coarser grid.
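A latest-before lookup with a bounded age might look like the sketch below, assuming sorted timestamps in seconds. The function name `latest_before` is hypothetical; the max age is a parameter so the caller supplies the threshold for the stream in question (0.050 s for robot state).

```python
from bisect import bisect_right
from typing import Optional, Sequence


def latest_before(
    timestamps: Sequence[float], t_k: float, max_age_s: float
) -> Optional[int]:
    """Return the index of the latest sample with timestamp <= t_k.

    Returns None when no such sample exists or when it is older than
    max_age_s, so the caller can mark the frame invalid.
    """
    # bisect_right gives the insertion point after any samples equal to t_k,
    # so the element just before it is the latest sample at or before t_k.
    i = bisect_right(timestamps, t_k) - 1
    if i < 0:
        return None  # every sample is in the future relative to t_k
    if t_k - timestamps[i] > max_age_s:
        return None  # sample exists but is too stale
    return i
```

Returning None rather than the stale index keeps the validity decision explicit at the call site.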
Action¶
Sources:
- /spark/{arm}/teleop/cmd_joint_state for each active arm
- /spark/{arm}/teleop/cmd_gripper_state for each active arm
Alignment rule:
- choose the latest sample with timestamp <= t_k
Validity threshold:
- max age 150 ms
Why¶
Action is also causal. A nearest-future command would make the published sample look as though the system already knew a command that had not yet been issued.
The action threshold is intentionally looser than the state threshold because the current Spark command path can exhibit isolated command gaps even when the demonstration is still semantically valid. The published action still uses bounded latest-before hold, but the bound is wide enough to tolerate short runtime hiccups without silently allowing large stale spans.
Teleop activity mask¶
Raw source:
/spark/session/teleop_active
Alignment rule:
- treat the Boolean value as a zero-order-held teleop-activity signal until the next sample
- keep published frames only while the held value is true
Why¶
The action topics encode what command was issued, not whether the operator intended teleoperation to be active continuously. When the foot pedal is intentionally released in the middle of a raw episode, those pedal-off spans should be removed from the published demonstration rather than counted as stale-action failures.
The teleop-activity topic is part of the required raw contract for supported episodes. Conversion does not fall back to a command-only interpretation when that signal is missing.
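The zero-order-hold mask can be sketched as follows. The function names and the parallel-list input are hypothetical. Raising for a frame that precedes the first activity sample is an assumption of this sketch, chosen to match the "no fallback" rule; the document does not specify that case.

```python
from bisect import bisect_right
from typing import List, Sequence


def teleop_active_at(
    sample_times: Sequence[float], sample_values: Sequence[bool], t_k: float
) -> bool:
    """Zero-order hold: the value of the latest sample at or before t_k."""
    i = bisect_right(sample_times, t_k) - 1
    if i < 0:
        # Assumption: no sample yet means the activity state is unknown.
        raise ValueError("no teleop_active sample at or before the frame time")
    return bool(sample_values[i])


def filter_active_frames(
    grid: Sequence[float],
    sample_times: Sequence[float],
    sample_values: Sequence[bool],
) -> List[float]:
    """Keep only the frame times at which the held teleop_active value is true."""
    if len(sample_times) == 0:
        # The signal is part of the required raw contract; do not fall back.
        raise ValueError("teleop_active signal is missing")
    return [t for t in grid if teleop_active_at(sample_times, sample_values, t)]
```

Frames removed here are dropped before any action-age check runs, so pedal-off spans never show up as stale-action failures.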
Camera RGB¶
Sources:
- every recorded camera color topic /spark/cameras/{attachment}/{camera_slot}/color/image_raw
Alignment rule:
- choose the nearest sample to t_k
Validity threshold:
- max skew 25 ms
Why¶
Image streams are observations, not control signals. Nearest is the correct rule for selecting the frame that best represents the scene around the target time.
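For comparison with the causal rule used for state and action, a nearest-sample lookup with a skew bound could be written like this. It assumes sorted timestamps in seconds; the name `nearest_within` is hypothetical. Unlike latest-before, it may select a sample slightly after t_k.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def nearest_within(
    timestamps: Sequence[float], t_k: float, max_skew_s: float
) -> Optional[int]:
    """Return the index of the sample nearest to t_k, or None if beyond max skew."""
    if len(timestamps) == 0:
        return None
    # j is the first sample at or after t_k; the nearest sample is j or j - 1.
    j = bisect_left(timestamps, t_k)
    candidates = [i for i in (j - 1, j) if 0 <= i < len(timestamps)]
    # On an exact tie, min() keeps the earlier candidate.
    best = min(candidates, key=lambda i: abs(timestamps[i] - t_k))
    if abs(timestamps[best] - t_k) > max_skew_s:
        return None
    return best
```

With the 25 ms threshold, a call such as `nearest_within(ts, t_k, 0.025)` returns None for any grid time that has no image within 25 ms on either side.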
Tactile RGB¶
Sources:
- every recorded tactile color topic /spark/tactile/{arm}/{finger_slot}/color/image_raw
Alignment rule:
- choose the nearest sample to t_k
Validity threshold:
- max skew 25 ms
Why¶
GelSight RGB is treated like a tactile image stream. Nearest-to-grid keeps the published episode visually coherent without pretending tactile updates happen exactly on the grid.
Missing Data Policy¶
If a required modality is outside its validity threshold:
- if the failure occurs only at the episode tail, truncate the episode at the last valid frame
- otherwise fail conversion for that episode
Exception:
- frames masked out by the teleop-activity signal are not treated as failures
- they are removed from the published timeline before action-age validity is applied
Why¶
Silent filling of large gaps hides real collection problems and makes the dataset look healthier than it is. Tail truncation is acceptable because it only shortens the usable interval. Mid-episode failures should be made explicit.
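The truncate-or-fail decision can be sketched on a per-frame validity list. The input is assumed to cover only frames that survived the teleop-activity mask, with each entry true when every required modality met its threshold. The function name is hypothetical, and treating an episode with no valid frames as a failure is an assumption of this sketch.

```python
from typing import Sequence


def frames_to_keep(frame_valid: Sequence[bool]) -> int:
    """Return how many leading frames to publish, or raise on a mid-episode gap."""
    valid = list(frame_valid)

    # Tail truncation: drop the trailing run of invalid frames.
    n_keep = len(valid)
    while n_keep > 0 and not valid[n_keep - 1]:
        n_keep -= 1

    if n_keep == 0:
        # Assumption: nothing usable is treated as a conversion failure.
        raise ValueError("episode has no valid frames")

    # Any invalid frame left after truncation is a mid-episode failure.
    if not all(valid[:n_keep]):
        first_bad = valid.index(False)
        raise ValueError(f"mid-episode validity failure at frame {first_bad}")
    return n_keep
```

For example, `[True, True, False, False]` keeps two frames, while `[True, False, True]` fails the episode.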
Raw-Only Modalities¶
The following topics remain raw-only unless the effective schema explicitly publishes them:
- /spark/session/teleop_active
- optional point cloud or debugging topics
Depth is not automatically discarded. If a recorded sensor has a publishable depth stream under the effective schema, it becomes a published depth field derived mechanically from that sensor key.
Why¶
These topics are valuable to preserve, but they complicate the first published dataset contract without being necessary for the first training and visualization workflows.
Multi-Sensor Rules¶
The current pipeline supports:
- multiple RealSense sensors
- multiple GelSight sensors
The effective published schema includes every recorded sensor that resolves to a publishable stream under the shared topic contract.
The raw manifest should still preserve every recorded sensor as a sensor instance with:
- sensor_key
- serial_number
- sensor-specific metadata captured at record time when available
- for RealSense: stream profiles, intrinsics, firmware version, and depth_scale_meters_per_unit
Why¶
Support for multiple sensors comes from the raw-first design, not from trying to cram every modality into one live fused message. The generic conversion policy keeps alignment rules stable while still allowing each session to publish the sensors it actually recorded.
Conversion Outputs¶
For each raw episode, conversion produces:
- one published episode in the shared LeRobot dataset
- episode-level diagnostics
- a source snapshot under:
  - meta/spark_source/<episode_id>/episode_manifest.json
  - meta/spark_source/<episode_id>/notes.md
The copied source manifest is the canonical per-episode provenance record inside the published dataset. Dataset-level sidecars like meta/depth_info.json are only indexes for the published layout, not replacements for the raw manifest.
Diagnostics should include:
- usable interval
- number of published frames
- per-modality alignment error summary
- count of invalid or dropped frames
Why¶
A successful conversion should mean more than "the script finished." We need enough diagnostics to judge whether the episode quality is acceptable.