Before You Start: What You Need

Data collection is the foundation of every imitation learning pipeline. Poor data produces poor policies regardless of how sophisticated your training algorithm is. Before recording a single episode, verify you have the following:

Hardware Checklist

  • Robot arm with position control at 50 Hz: ViperX-300 S2, Koch v1.1, OpenArm, Franka Emika Panda, or similar. The arm must accept joint position commands at 50 Hz or higher and report current joint positions at the same rate.
  • Teleoperation interface: Leader-follower arm (highest data quality), VR controller (Meta Quest 3, lowest cost), or 3D mouse (SpaceMouse, simplest setup). See the method selection section below.
  • Cameras: Minimum 2 cameras — 1 wrist-mounted (Intel RealSense D405 recommended) and 1 overhead (RealSense D435 or Logitech C920). Mount rigidly — cameras must not move between collection sessions.
  • Workstation: Ubuntu 22.04 or 24.04, NVIDIA GPU (any modern card works for data collection — GPU is for training later), 500 GB+ NVMe SSD for episode storage. USB 3.0 ports for camera and arm connections.
  • Recording software: LeRobot recording stack, ALOHA recording scripts, or custom ROS2 data logger. Must output HDF5 files with synchronized joint data (50 Hz) and camera frames (30 fps).
  • Task objects: The objects you will manipulate during data collection. Have multiples if possible — you will want to vary object instances for generalization.
  • Backup storage: External SSD or NAS for nightly backups. A single day of data collection produces 20–50 GB. Losing a day of collection to a disk failure is expensive.

Choosing Your Teleoperation Method

The teleoperation method you choose directly affects data quality, collection throughput, and operator fatigue. Use this decision tree:

Decision Tree: Task Precision to Method

  • Sub-millimeter precision required (insertion, threading, fine assembly) → Leader-follower arm. The kinematic matching between leader and follower provides the best positional accuracy. Cost: $6,000–$10,000 for the leader arm.
  • Centimeter-level precision sufficient (pick-place, sorting, packing) → VR controller (Meta Quest 3). Natural hand movement, lower cost ($500), faster operator onboarding. Suitable for gross manipulation tasks.
  • 6-DOF end-effector control with force feedback (contact-rich tasks, surface following) → Leader-follower arm with gravity compensation. The operator feels resistance through the leader arm, enabling force-sensitive demonstrations.
  • Dexterous hand manipulation (finger-level grasping, in-hand rotation) → Exoskeleton gloves (SenseGlove Nova 2 or HaptX G1). See our bimanual setup guide.
  • Quick prototyping, single-axis tasks → SpaceMouse or keyboard. Lowest barrier to entry. Suitable for validating your recording pipeline before investing in better hardware.

For most teams starting out: begin with a leader-follower setup if you can afford it, or a VR controller if budget is constrained. The data quality difference is measurable — leader-follower data typically produces 10–15% higher policy success rates for precision tasks.

Camera and Sensor Setup

Camera setup is one of the most underestimated factors in data collection quality. A camera that shifts by 5 mm between collection and deployment can drop your policy's grasp success rate by 20% or more.

Camera Types and When to Use Each

  • Wrist camera (mounted on robot end-effector): Provides a close-up view of the manipulation point. Essential for precision tasks (insertion, fine grasping). Moves with the robot, so it always frames the action. Recommended: Intel RealSense D405 (compact, global shutter, 640x480 at 30 fps). Mount with a rigid 3D-printed bracket — avoid tape or adhesive mounts that shift over time.
  • Overhead camera (fixed, looking down at workspace): Provides global scene context. Essential for pick-place tasks where the policy needs to localize objects in the workspace. Mount 120–150 cm above the table, centered over the workspace. Recommended: Intel RealSense D435 or Logitech C920.
  • Side camera (fixed, at table height): Captures approach trajectories and grasp profiles that are occluded from overhead. Useful for tasks involving vertical approach (stacking, inserting from above). Mount on the far side of the workspace from the operator, at table height.
  • Multi-view setup (3+ cameras): The standard for high-quality collection. ALOHA records four views: top, front, and one wrist camera per arm. Adding cameras increases data volume linearly but improves policy robustness to viewpoint variation.

Mounting and Calibration Tips

  • Use rigid aluminum extrusion or steel brackets. Never use clamps that can slip or adhesive mounts.
  • After mounting, mark the camera position with a paint pen or engraving. If a camera gets bumped, you can verify whether it moved.
  • Calibrate camera extrinsics using an ArUco board before every multi-day collection campaign. Store the calibration alongside your episode data.
  • Set cameras to manual exposure and fixed white balance. Auto-exposure causes brightness flickering between frames that degrades visual feature learning.
  • Verify frame synchronization: cameras and joint data must be timestamped from the same clock. Use hardware triggers (RealSense multi-cam sync) or software synchronization via a shared NTP clock.
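The synchronization check can be done offline by comparing each camera timestamp against the nearest joint-state timestamp. A minimal sketch, assuming both streams expose timestamps in seconds from the same clock (array names and the 10 ms tolerance are illustrative):

```python
import numpy as np

def check_sync(joint_ts, cam_ts, max_offset_s=0.010):
    """Worst-case offset between each camera frame and its nearest joint sample.

    joint_ts: (T,) joint-state timestamps in seconds (50 Hz expected)
    cam_ts:   (T_img,) camera-frame timestamps in seconds (30 fps expected)
    Raises AssertionError if any frame is further than max_offset_s from
    the nearest joint sample.
    """
    # For each camera frame, locate the neighbouring joint timestamps
    idx = np.searchsorted(joint_ts, cam_ts)
    idx = np.clip(idx, 1, len(joint_ts) - 1)
    nearest = np.minimum(np.abs(joint_ts[idx] - cam_ts),
                         np.abs(joint_ts[idx - 1] - cam_ts))
    worst = float(nearest.max())
    assert worst <= max_offset_s, f"Sync off by {worst * 1000:.1f} ms"
    return worst
```

With properly synchronized 50 Hz / 30 fps streams the worst-case offset is bounded by half the joint period (10 ms); anything larger points to clock drift or dropped samples.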

Episode Design

An "episode" is a single demonstration of the task from start to finish. How you design episodes determines the distribution your policy learns from — and therefore what it can and cannot do at deployment time.

Defining the Task

Write a precise task specification before collecting any data. Include:

  • Task goal: "Pick up the red cube from the table and place it in the blue bowl." Be specific about which object, where it starts, and where it ends.
  • Success criteria: "The cube is inside the bowl, the gripper is open, and the arm has returned to home position." Define success unambiguously so every operator and every QA reviewer applies the same standard.
  • Allowed strategies: "Approach the cube from above with a top-down grasp." Or explicitly: "Any grasp approach is acceptable." If you constrain strategy, you get more consistent data (better for ACT). If you allow multiple strategies, you need more data but the policy may be more robust (better for Diffusion Policy).
  • Episode duration: Define a maximum episode length. Typical tabletop pick-place: 5–15 seconds. If an episode exceeds 2x the median duration, something went wrong — discard it.
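The duration rule is easy to automate once episode lengths are logged. A minimal sketch of the 2x-median filter:

```python
import numpy as np

def flag_over_length(durations_s):
    """Indices of episodes longer than 2x the median duration.

    These episodes should be reviewed and usually discarded, per the
    episode-design rule above.
    """
    median = float(np.median(durations_s))
    return [i for i, d in enumerate(durations_s) if d > 2 * median]
```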

Start State Distribution

The start state distribution determines your policy's deployment robustness. If every demonstration starts with the object in the exact same position, the policy will fail when the object is 3 cm to the left.

  • Phase 1 (first 50 demos): Use a small start-state distribution. Vary object position within a 5 cm radius of a central point. This ensures consistent demonstrations while the operator builds skill.
  • Phase 2 (demos 51–200): Expand to the full intended deployment distribution. Vary object position across the entire reachable workspace. Vary object orientation if relevant.
  • Phase 3 (demos 200+): Add distractors, object variants, lighting variation, and background changes. This phase targets generalization.
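The phase boundaries can be encoded directly in a start-state sampler. A sketch, assuming a planar (x, y) object position in metres; the centre point and workspace bounds are placeholder values to replace with your robot's reachable area:

```python
import numpy as np

def sample_start(phase, rng, center=(0.30, 0.00),
                 workspace=((0.15, 0.45), (-0.20, 0.20))):
    """Sample an object start position (x, y) for a collection phase.

    Phase 1: within a 5 cm radius of the central point.
    Phase 2+: uniform over the full workspace bounds.
    """
    if phase == 1:
        r = 0.05 * np.sqrt(rng.uniform())   # sqrt gives uniform density over the disc
        theta = rng.uniform(0, 2 * np.pi)
        return (center[0] + r * np.cos(theta), center[1] + r * np.sin(theta))
    (x_lo, x_hi), (y_lo, y_hi) = workspace
    return (rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi))
```

Printing each sampled position on the operator's screen before an episode also doubles as a start-state log for the coverage metrics discussed later.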

Reset Protocol

Between episodes, you must reset the scene to a valid start state. Define the reset protocol explicitly:

  1. Return robot to home configuration (predefined joint positions).
  2. Place the task object in a new position within the start-state distribution.
  3. Verify the scene matches the task specification (object visible, no occlusions, gripper open).
  4. Wait 1 second for the scene to settle (no residual object motion).
  5. Begin recording.

Sloppy resets are the number-one source of bad episodes. Automate the reset where possible — at minimum, automate the robot returning to home position.
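The five steps above can be sketched as a small driver function. The three callables are stand-ins for your robot- and task-specific code (hypothetical interfaces, not a real API):

```python
import time

def reset_scene(go_home, place_object, scene_ok, settle_s=1.0, max_attempts=3):
    """Run the reset protocol between episodes.

    go_home, place_object and scene_ok are supplied by the caller;
    scene_ok() returns True when the start state is valid.
    """
    for _ in range(max_attempts):
        go_home()                 # 1. predefined home joint configuration
        place_object()            # 2. new start state from the distribution
        if not scene_ok():        # 3. object visible, no occlusion, gripper open
            continue              #    invalid scene: re-place and re-check
        time.sleep(settle_s)      # 4. let residual object motion settle
        return True               # 5. caller starts recording
    return False
```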

Recording Your First 10 Episodes

The first 10 episodes are about validating your pipeline, not collecting training data. Follow this walkthrough:

Step-by-Step Walkthrough

  1. Launch the recording script. Verify that HDF5 output is being written to the correct directory. Check that the file size grows during recording (indicates data is flowing).
  2. Position the operator. The operator should sit or stand comfortably at the teleoperation interface with a clear view of the workspace. For leader-follower: hands on the leader arms. For VR: headset on, controllers in hand with workspace visible in passthrough mode.
  3. Record episode 1. Execute the task smoothly. Do not rush — smooth, deliberate motions produce better policies than fast, jerky ones. Typical speed: 50–70% of the fastest you could go.
  4. Stop recording and inspect the episode. Open the HDF5 file. Verify:
    • Joint position data (/observations/qpos) has the expected number of timesteps (episode_duration_seconds x 50 Hz).
    • Camera images (/observations/images/cam_high) have the expected frame count (duration x 30 fps).
    • Action data (/action) matches joint position dimensions.
    • Play back the images and verify they are in sync with the joint data.
  5. Record episodes 2–10. Vary start positions slightly. After each episode, do a quick visual check of the recorded images.
  6. Run a training test. Feed these 10 episodes into your training script. You do not expect a good policy — you are verifying that the training pipeline accepts your data format without errors. If training completes without crashing, your pipeline is validated.
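The count checks in step 4 can be automated so they run after every episode. A minimal sketch, with an assumed tolerance of a couple of samples for start/stop jitter:

```python
def check_counts(n_qpos, n_frames, n_actions, duration_s,
                 joint_hz=50, cam_fps=30, slack=2):
    """Sanity-check recorded stream lengths against the episode duration.

    Returns a list of problem descriptions (empty if everything matches).
    slack allows a few samples of recorder start/stop jitter.
    """
    problems = []
    if abs(n_qpos - duration_s * joint_hz) > slack:
        problems.append(f"expected ~{duration_s * joint_hz:.0f} joint samples, got {n_qpos}")
    if abs(n_frames - duration_s * cam_fps) > slack:
        problems.append(f"expected ~{duration_s * cam_fps:.0f} frames, got {n_frames}")
    if n_actions != n_qpos:
        problems.append("action length != qpos length")
    return problems
```

This kind of check is exactly what catches the 25 Hz-instead-of-50 Hz configuration error described in the common mistakes section.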

Quality Control for Each Episode

Every episode should pass this 8-point quality checklist before being added to the training dataset. Reviewing takes 30–60 seconds per episode and prevents weeks of debugging bad policies later.

8-Point Episode QA Checklist

  1. Task completed successfully? The episode ends with the task goal achieved per the success criteria. Discard partial completions unless you are specifically collecting recovery data.
  2. No operator hesitation >2 seconds? Long pauses mid-task create "stationary" data points that confuse the policy. If the operator paused to think, re-record.
  3. Smooth trajectory? Play back the joint position data. Look for sudden jumps (teleoperation glitches), oscillations (controller instability), or unnatural speed changes.
  4. Camera images clear? Check for motion blur, occlusion of the manipulation point, and correct exposure. If the wrist camera image is blurry during the grasp, that episode is useless for visual policy learning.
  5. Correct start state? Verify the episode starts from within the intended start-state distribution. An episode that starts from a weird configuration introduces out-of-distribution data.
  6. Episode duration within bounds? Duration should be within 2 standard deviations of the mean. Anomalously short episodes may indicate incomplete tasks. Anomalously long episodes indicate operator difficulty.
  7. Data integrity? HDF5 file is not corrupted. All expected data fields are present. Timestamps are monotonically increasing. No NaN values in joint data.
  8. Consistent strategy? If you are constraining strategy (recommended for ACT), verify the episode follows the designated approach. Mixed strategies in a small dataset cause mode averaging failures.
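Checks 3 and 7 lend themselves to automation over the joint stream. A minimal sketch; the per-step jump threshold is an illustrative value to tune for your arm:

```python
import numpy as np

def qa_joint_stream(qpos, jump_rad=0.15):
    """Automated pieces of the checklist: NaN scan (item 7) and
    teleoperation-glitch detection (item 3).

    qpos: (T, num_joints) joint positions at 50 Hz.
    Returns a list of failure reasons (empty if the stream is clean).
    """
    failures = []
    if np.any(np.isnan(qpos)):
        failures.append("NaN in joint data")
    steps = np.abs(np.diff(qpos, axis=0))
    if steps.size and steps.max() > jump_rad:
        failures.append(f"per-step joint jump of {steps.max():.2f} rad")
    return failures
```

Run it as part of the post-episode script so glitched episodes are routed to rejected/ before a human reviewer even opens them.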

Common Data Collection Mistakes

1. Inconsistent Camera Position Between Sessions

What happens: You collect 100 episodes on Monday, then bump the overhead camera while cleaning up. Tuesday's 100 episodes have a slightly different viewpoint. The mixed dataset trains a policy that performs poorly from both viewpoints.

Fix: Rigidly mount cameras. Photograph the camera positions. At the start of each session, verify camera extrinsics against a reference ArUco board calibration. If calibration has shifted, re-mount before collecting.

2. Operator Fatigue Degrading Data Quality

What happens: The first 50 episodes of the day are smooth and precise. By episode 150, the operator is tired — demonstrations are jerkier, slower, and less consistent. The policy learns an average of good and bad demonstrations.

Fix: Enforce 10-minute breaks every 45 minutes. Limit collection sessions to 4 hours per operator per day. Track quality metrics (episode duration, trajectory smoothness) over time and stop collection when quality degrades.

3. Narrow Start-State Distribution

What happens: All demonstrations start with the object in roughly the same position. The policy achieves 90% success at that position but 10% success at any other position.

Fix: Systematically randomize object positions during collection. Use a grid pattern: divide the workspace into a 4x4 grid and collect at least 3 episodes starting from each grid cell. Track start-state coverage as a dataset metric.
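Grid coverage is straightforward to track in code. A sketch, assuming planar (x, y) start positions; the workspace bounds are placeholder values:

```python
import numpy as np

def cell_counts(starts, workspace=((0.15, 0.45), (-0.20, 0.20)), n=4):
    """Histogram episode start positions (x, y) over an n x n workspace grid."""
    (x_lo, x_hi), (y_lo, y_hi) = workspace
    counts = np.zeros((n, n), dtype=int)
    for x, y in starts:
        i = min(int((x - x_lo) / (x_hi - x_lo) * n), n - 1)
        j = min(int((y - y_lo) / (y_hi - y_lo) * n), n - 1)
        counts[i, j] += 1
    return counts

def undersampled_cells(counts, min_per_cell=3):
    """Grid cells that still need episodes to reach the per-cell quota."""
    return [tuple(c) for c in np.argwhere(counts < min_per_cell)]
```

Re-run this after each session and direct operators toward the cells that `undersampled_cells` returns.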

4. Not Validating the Recording Pipeline Before Scaling

What happens: You collect 500 episodes over a week, then discover the action data was recorded at 25 Hz instead of 50 Hz due to a configuration error. The entire dataset is unusable for your target policy.

Fix: Always validate with 10 test episodes first. Run the complete training pipeline on those 10 episodes. Verify data dimensions, frequencies, and format compatibility before investing in large-scale collection.

5. Mixed Strategies in a Small Dataset

What happens: Three operators each use a different grasp approach. With only 150 total episodes (50 per approach), no single approach has enough data. ACT averages the approaches and produces an invalid middle trajectory.

Fix: For datasets under 300 episodes, standardize on a single strategy. Train operators to use the same approach. For larger datasets (300+), multiple strategies are fine — consider Diffusion Policy which handles multi-modality natively.

6. Ignoring Lighting Variation

What happens: All data is collected under consistent lab lighting. The policy fails when deployed in a slightly different lighting condition (afternoon sun, different overhead light).

Fix: During Phase 2 and 3 collection, deliberately vary lighting. Turn off overhead lights for some episodes. Add a desk lamp from different angles. Close/open window blinds. This forces the visual encoder to learn lighting-invariant features.

Scaling from 50 to 500 Episodes

Scaling data collection introduces organizational challenges that do not exist at small scale. Here is how to manage the transition:

Operator Training

When adding new operators beyond the original 1–2:

  • Have new operators watch 10 example episodes before they start.
  • Require 20 practice episodes that pass QA before counting toward the dataset.
  • Pair new operators with an experienced operator for the first session.
  • Track per-operator quality metrics: mean episode duration, trajectory smoothness score, QA rejection rate.

Shift Scheduling

For large-scale collection (500+ episodes):

  • Schedule 2–3 operators per day in 3-hour shifts with 30-minute overlap for handoff.
  • Each shift produces approximately 40–60 episodes (at 3–4 minutes per episode including reset and QA).
  • Designate one person as the "data lead" who reviews quality metrics at end of each day and flags degradation.
  • Budget: at SVRC rates, professional operators collect at $15–25 per validated episode including QA. In-house collection costs vary but typically run $8–15 per episode when accounting for operator time and overhead.

Quality Degradation Detection

Monitor these metrics daily during large-scale collection:

  • Mean episode duration: Should remain stable (±10%). Increasing duration indicates operator fatigue or increasing task difficulty.
  • QA rejection rate: Should stay below 15%. If it rises above 20%, stop collection and investigate.
  • Trajectory smoothness: Compute the jerk (third derivative of position) of each episode. Flag episodes with mean jerk >2x the dataset average.
  • Start-state coverage: Plot start positions on a 2D heatmap. Verify uniform coverage across the intended distribution. Redirect operators to under-sampled regions.
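The smoothness metric can be computed directly from logged joint positions using the third finite difference at 50 Hz (dt = 0.02 s). A sketch:

```python
import numpy as np

def mean_jerk(qpos, dt=0.02):
    """Mean absolute jerk (third finite difference / dt^3) of a (T, J) trajectory."""
    return float(np.abs(np.diff(qpos, n=3, axis=0) / dt ** 3).mean())

def flag_jerky(episodes, factor=2.0, dt=0.02):
    """Indices of episodes whose mean jerk exceeds factor x the dataset average."""
    jerks = np.array([mean_jerk(q, dt) for q in episodes])
    return list(np.where(jerks > factor * jerks.mean())[0])
```

Because the threshold is relative to the dataset average, it adapts to each task's natural motion profile rather than requiring an absolute jerk limit.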

Data Storage and Organization

Folder Structure

project-name/
  raw/
    2026-04-12/
      episode_001.hdf5
      episode_002.hdf5
      ...
    2026-04-13/
      episode_101.hdf5
      ...
  validated/
    episode_001.hdf5    # passed QA
    episode_002.hdf5
    ...
  rejected/
    episode_045.hdf5    # failed QA, kept for reference
    ...
  metadata/
    collection_log.csv  # operator, date, start_state, duration, QA status
    camera_calibration_2026-04-12.json
    task_specification.md
  training/
    train/              # 85–90% of validated episodes
    val/                # 10–15% of validated episodes

Naming Conventions

  • Episode files: episode_{sequential_number:04d}.hdf5 (e.g., episode_0001.hdf5).
  • Include metadata in the HDF5 file attributes: operator ID, collection date, start state description, QA status, task name.
  • Never reuse episode numbers. If you delete episode 45, the next episode is still 46, not 45.
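The never-reuse rule is easiest to enforce by always allocating one past the highest number on disk, rather than counting files. A minimal sketch:

```python
import os
import re

def next_episode_number(directory):
    """Next sequential episode number: one past the highest existing number,
    so deleted episode numbers are never reused."""
    nums = [int(m.group(1)) for name in os.listdir(directory)
            if (m := re.fullmatch(r"episode_(\d{4})\.hdf5", name))]
    return max(nums, default=0) + 1
```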

Backup Strategy

  • Daily: rsync the raw/ directory to an external SSD or NAS.
  • Weekly: upload the validated/ directory to cloud storage (AWS S3, Google Cloud Storage).
  • After each collection campaign: create a versioned archive (e.g., dataset_v1.0_2026-04-15.tar.gz) and upload to your team's shared storage.
  • Never edit HDF5 files in place. If you need to modify data (e.g., crop episodes, fix timestamps), create a new file and keep the original in raw/.

Uploading to Training Pipeline

LeRobot Ingestion

To convert your HDF5 dataset to LeRobot format for training with ACT or Diffusion Policy:

# Convert HDF5 episodes to LeRobot Parquet format
python -m lerobot.scripts.convert_dataset \
  --raw-dir ./validated/ \
  --repo-id my-org/my-task-v1 \
  --raw-format hdf5_aloha

# Verify the converted dataset
python -m lerobot.scripts.visualize_dataset \
  --repo-id my-org/my-task-v1 \
  --episode-index 0

HDF5 Structure Verification

Before converting, verify your HDF5 files contain the expected structure:

# Quick verification script
import h5py
import numpy as np

with h5py.File('episode_0001.hdf5', 'r') as f:
    # Check required fields
    assert 'observations' in f
    assert 'action' in f

    qpos = f['observations/qpos'][:]
    images = f['observations/images/cam_high'][:]
    actions = f['action'][:]

    print(f"Joint data: {qpos.shape}")    # expect (T, num_joints)
    print(f"Images: {images.shape}")       # expect (T_img, H, W, 3)
    print(f"Actions: {actions.shape}")     # expect (T, num_joints)

    # Check for NaN values
    assert not np.any(np.isnan(qpos)), "NaN in joint data!"
    assert not np.any(np.isnan(actions)), "NaN in actions!"

    # Check timestamp monotonicity
    if 'timestamps' in f:
        ts = f['timestamps'][:]
        assert np.all(np.diff(ts) > 0), "Non-monotonic timestamps!"

For detailed format specifications and conversion between HDF5, RLDS, and LeRobot formats, see our data format guide.