Reliable & Scalable Synthetic Data for Physical AI (Part 1): Taming NVIDIA Cosmos with RoBoost Agent

Scaling Physical AI requires reliable synthetic data. Learn how RoBoost Agent integrates NVIDIA Cosmos to transform world models into trustworthy data engines for robotics and autonomous driving.

Introduction

Figure 1. RoBoost Agent orchestrates NVIDIA Cosmos into a reliable & scalable synthetic data pipeline for Physical AI.

Asymmetry of Physical AI Data

Three years after the emergence of ChatGPT, the engineering takeaway is clear. Large Language Models (LLMs) are effective at solving problems in the digital domain. They can write, code, reason, and automate complex workflows. This success is based on their massive data advantage. LLMs were trained on an internet-scale record of human text, effectively learning centuries of accumulated knowledge.
However, Physical AI does not benefit from this data advantage. The high-fidelity real-world data required for training physical AI systems has no such pre-existing source.

The Edge Case Bottleneck

Unlike in the digital domain, data collection for Physical AI is fundamentally expensive. Training robots or autonomous vehicles requires not only visual observations but also corresponding control signals, such as steering angles, joint torques, and gripper forces. These signals can only be captured by physically operating the real system, which introduces significant hardware costs and maintenance overhead.
While gathering standard operating data is already costly under these constraints, capturing the long tail of rare, safety-critical edge cases becomes prohibitively so. Scenarios like heavy rain at night, near-miss collisions, or objects with unusual surface textures rarely occur and cannot be reliably staged. Ultimately, this data bottleneck limits model performance in practice and explains why Physical AI is progressing much slower than its theoretical potential suggests.

World Models as a Simulation Engine

To address this bottleneck, world models have emerged as a promising solution. A world model is a generative model that predicts how a physical scene will evolve over time, rendering those predictions into video. By capturing complex dynamics such as object motion, material properties, and environmental interactions, world models go beyond standard video generation. While conventional models prioritize visual aesthetics, world models simulate the physical mechanisms behind the pixels. This physical grounding makes them well suited for producing high-fidelity synthetic data in scenarios where real-world collection is impractical or dangerous.
NVIDIA Cosmos is a concrete example of this approach, offering three capabilities for physical systems. Cosmos-Predict 2.5 simulates future states by generating video frames, determining what happens next in a physical sequence. Cosmos-Transfer 2.5 converts a known scenario into entirely new conditions, such as different weather, lighting, or material properties. Cosmos-Reason 2 analyzes what is happening in a scene and evaluates whether the dynamics are physically consistent.

The Practical Limits of the Cosmos Workflow

The NVIDIA Cosmos Cookbook offers hands-on guides to easily leverage Cosmos through complete end-to-end workflows. In particular, the Cosmos-Transfer 2.5 recipes demonstrate how to transfer the style of a video while maintaining the physical structure of the original input.
However, there is a critical gap between visually convincing augmentation and training-grade data for Physical AI. In production pipelines, the naive loop of generating a video, visually inspecting it, and adding it to the dataset fails systematically. The primary failure mode is not poor visual fidelity but broken semantic consistency and task-label integrity. Once a generated sample diverges from the targeted task semantics, it becomes harmful to training even if it appears physically plausible and temporally consistent.
To illustrate these limitations concretely, we examine representative failure cases in two domains: autonomous driving and robotic manipulation.

Autonomous Driving: Semantic Consistency Problem

In autonomous driving, the breakdown happens when global restyling fails to preserve critical details. Cosmos-Transfer 2.5 convincingly changes the overall look, but doesn't reliably keep safety-critical semantics unchanged. Small elements that govern driving behavior (e.g., traffic lights and lane markings) may be inconsistently rendered or even hallucinated. This creates videos that look coherent yet no longer match the original behavioral label.
 
Figure 2. Cosmos-Transfer 2.5 variations of an autonomous driving video under different environmental conditions (top-left: original; top-center: rainy night; top-right: snowy daytime; bottom-left: foggy dawn; bottom-center: sunset; bottom-right: rainy daytime).
As shown in Figure 2, Cosmos-Transfer 2.5 unintentionally modifies or introduces critical artifacts despite following the Cosmos Cookbook prompting style. In this example, a traffic light absent from the original video appears in the generated output. This creates a semantically invalid scenario where the vehicle appears to cross an intersection against a red signal. While the video remains visually coherent and temporally stable, its behavior no longer matches the original label.

Robotic Manipulation: Task Label Integrity Problem

This limitation is even more critical in robotic manipulation. Manipulation tasks are easily influenced by small, action-critical details that directly affect the policy's next action (e.g., contact points, object boundaries, and material cues). Here, the goal is not just restyling the entire scene, but maintaining alignment between the transferred video and the intended task labels. However, even though Cosmos-Transfer 2.5 is designed to maintain physical structure, a prompt-driven one-shot workflow can trigger background drift, lighting shifts, or distorted shapes in nearby objects.
Figure 3. Variations of robot manipulation videos generated by Cosmos-Transfer 2.5. In each grid, the top-left panel is the original video, and the remaining three panels show table-texture transfer results (top-right: blue plastic; bottom-left: marble; bottom-right: glass). (Left) Task label: “move the green cloth from left to right”. (Right) Task label: “move the blue cloth off the stove to the lower right side”.
In Figure 3, the goal was to modify only the table's material, but the output introduces unintended changes elsewhere in the scene. This creates a mismatch between the described action in the task’s label and the actual content shown in the video. When trained on such data, the policy may learn from contradictory supervision and rely on spurious visual cues rather than robust manipulation skills.

Cosmos-Reason 2 as a critic: physics checks can still pass the wrong videos

The Cosmos Cookbook provides a comprehensive prompting guide for utilizing Cosmos-Reason 2 as a quality filter for synthetic data rejection. Acting as an automated video critic, the model evaluates generated videos for physical plausibility via physically grounded reasoning. While this baseline successfully catches obvious generation failures, such as objects disappearing or broken temporal consistency, the filter is fundamentally limited by what it is designed to verify.
Figure 4. Qualitative analysis of evaluation results from Cosmos-Reason 2. The evaluation protocol follows the “Cosmos Reason as Reward” recipe from the Cosmos Cookbook.
As shown in Figure 4, even if a generated video is physically plausible but semantically misaligned with the task, it will still pass the filter. In autonomous driving, a traffic signal might remain temporally consistent frame-to-frame while contradicting the intended behavioral label or driving decision. In robotic manipulation, a subtle material change might follow real-world physics perfectly but alter the object's physical affordances in ways the existing task labels don't account for. The critical takeaway is that physical plausibility is necessary but insufficient; without semantic alignment to the task labels, a physically correct video may become harmful to the training process.

RoBoost Agent: An Orchestration Layer for Reliable Data

Figure 5. Overview of RoBoost Agent pipeline
While world models can already produce photorealistic videos, strict control is required to generate reliable real-world datasets for Physical AI training. However, the vanilla Cosmos workflow struggles to preserve the original context when applying modifications.
To solve this problem, we introduce the RoBoost Agent. The RoBoost Agent provides a controllable pipeline by adding an orchestration layer on top of the original workflow. This enables the pipeline to generate variants from a real or simulated video (such as changes in weather, time of day, or textures) while avoiding unintended changes and physical inconsistencies.
The core design is straightforward: analyze the source semantics, apply minimal modification, constrain generation with structural signals, and verify every output. Each step exists to keep the augmented video aligned with the original task labels, not just visually plausible.
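This control loop can be sketched as a small orchestration function. The names below are illustrative, not the actual RoBoost API; each stage is a pluggable callable so that the real Cosmos models could sit behind the same interface, and generation is retried when validation rejects a candidate.

```python
# Hypothetical sketch of the four-stage loop (illustrative names only).
def run_pipeline(source_video, target_edit, analyze, modify, generate, validate,
                 max_attempts=3):
    """Analyze -> Modify -> Generate -> Validate; retry generation on failure."""
    reference = analyze(source_video)            # Step 1: grounded description
    prompt = modify(reference, target_edit)      # Step 2: minimal semantic delta
    for _ in range(max_attempts):
        candidate = generate(source_video, prompt)  # Step 3: constrained generation
        if validate(candidate, reference):          # Step 4: strict gate
            return candidate
    return None  # no candidate survived validation
```

The retry loop matters because generation is non-deterministic: a rejected candidate is discarded and regenerated rather than patched.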
The pipeline consists of the following four steps:

Step 1: Analysis

Figure 6. Step 1 (Analysis) converts the source video into a grounded scene description that captures task-relevant structure and semantics. This output is used as the reference for controlled editing.
Step 1 captures the semantics of the source video before any modification happens. We employ Cosmos-Reason 2 to extract a grounded description of the original video, capturing only the structural and task-relevant information required for downstream augmentation. This description serves as the baseline that Step 2 edits against, so anything not explicitly targeted for modification carries through unchanged.
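One way to picture the Step 1 output is as a structured record rather than free-form text. The schema below is an assumption for illustration (not the actual Cosmos-Reason 2 output format): objects, layout, and dynamics are treated as frozen task semantics, while only appearance attributes are exposed for editing downstream.

```python
# Illustrative schema for a grounded scene description (field names assumed).
from dataclasses import dataclass, field

@dataclass
class SceneDescription:
    objects: list          # task-relevant objects, e.g. ["traffic light: red"]
    layout: str            # coarse spatial structure, e.g. "four-way intersection"
    dynamics: str          # what happens, e.g. "ego vehicle stops at the line"
    attributes: dict = field(default_factory=dict)  # editable appearance attrs

    def editable_keys(self):
        # Only appearance attributes may change downstream; objects, layout,
        # and dynamics carry through to the output unchanged.
        return set(self.attributes)
```

Structuring the description this way makes "anything not explicitly targeted carries through unchanged" a property of the data model, not a hope about prompt wording.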

Step 2: Modification

Figure 7. Step 2 (Modification) edits the Step 1 scene description with a minimal semantic delta, changing only the target attributes while preserving the original task context.
Step 2 introduces the target variation as a minimal delta on the reference description, rather than rewriting it from scratch. We use NVIDIA Nemotron-3-Nano as a prompt editor to change only the attributes that need to differ (like weather or lighting) while leaving everything else intact. By keeping edits small, the pipeline minimizes semantic drift and preserves task-label consistency.
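The minimal-delta contract can be sketched with plain dicts. In the real pipeline the editor is NVIDIA Nemotron-3-Nano; this toy version (attribute whitelist included) just shows the invariant: whitelisted attributes may change, and any edit touching frozen semantics is rejected rather than silently rewriting the scene.

```python
# Toy minimal-delta editor (the whitelist is an illustrative assumption).
EDITABLE = {"weather", "time_of_day", "lighting", "table_material"}

def apply_minimal_delta(reference, edit):
    """Return reference with only whitelisted attributes changed."""
    illegal = set(edit) - EDITABLE
    if illegal:
        raise ValueError(f"edit touches frozen semantics: {sorted(illegal)}")
    out = dict(reference)   # everything not targeted carries through unchanged
    out.update(edit)
    return out
```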

Step 3: Generation

Figure 8. Step 3 (Generation) generates candidate videos based on the Step 2 prompt and structure-preserving signals to maintain consistency.
Step 3 turns the edited description into candidate videos under explicit structural constraints. Using the modified prompt, Cosmos-Transfer 2.5 produces new videos while being conditioned on the source video’s structure-preserving signals. We calculate the depth maps, segmentation masks, and edge outputs using external models, and dynamically utilize these signals depending on the task. This spatial conditioning prevents unintended changes to the background or objects that must remain unchanged.
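To make the task-dependent use of these signals concrete, here is a toy selector. The signal names mirror the post (depth, segmentation, edges), but the weightings are illustrative assumptions, not tuned RoBoost values.

```python
# Sketch of task-dependent conditioning-signal selection (weights assumed).
def select_conditioning(task):
    if task == "autonomous_driving":
        # Small semantics (lights, lane markings) and road layout matter most:
        # lean on segmentation and edges.
        return {"depth": 0.3, "segmentation": 1.0, "edge": 0.7}
    if task == "robotic_manipulation":
        # Contact geometry and object boundaries matter: lean on depth and edges.
        return {"depth": 1.0, "segmentation": 0.5, "edge": 0.8}
    raise ValueError(f"unknown task: {task}")
```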

Step 4: Validation

Figure 9. Step 4 (Validation) verifies each generated video for physical scores and task-specific scores, with optional object-level checks.
Step 4 acts as the final gate, where every candidate must pass strict verification before entering the dataset. Because generation is non-deterministic, some outputs still break training-grade requirements despite the upstream constraints. We use Cosmos-Reason 2 as the primary critic to evaluate physical scores and task-specific scores. When a dataset demands higher precision, the agent adds object-level procedures. By integrating detection models and local masks, we score specific regions of interest rather than relying on global judgments alone. Only videos that survive this stage are accepted.
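The gate itself reduces to a simple rule, sketched below with the scoring function as a stand-in for Cosmos-Reason 2: every physical and task-specific criterion must clear the threshold, and the optional object-level region scores tighten the gate beyond global judgments.

```python
# Sketch of the Step 4 acceptance gate (scores are on a 1-5 scale).
PASS_THRESHOLD = 3  # below 3 is a failure

def passes_gate(scores, region_scores=None):
    """Accept a candidate only if all criteria clear the threshold."""
    if any(v < PASS_THRESHOLD for v in scores.values()):
        return False
    # Optional object-level checks on detected regions of interest,
    # e.g. the bounding box around a traffic light.
    if region_scores and any(v < PASS_THRESHOLD for v in region_scores.values()):
        return False
    return True
```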
Ultimately, the RoBoost Agent is designed for scalable reliability. By anchoring source semantics, employing guided editing, and verifying outputs at every stage, the pipeline ensures consistent data augmentation at scale. The final output is not just a realistic video, but validated, high-fidelity data suitable for training.

Measuring Dataset Yield and Pipeline Efficiency

Experiment Setup

We compared the RoBoost Agent against the vanilla Cosmos workflow to understand its practical impact. To prevent human bias in the evaluation, we utilized a Vision-Language Model (VLM)-as-a-Judge approach using Cosmos-Reason 2 within our validation pipeline to assess the quality of the generated videos.
Our evaluation framework consisted of:
  • Dataset: We evaluated 600 videos across two domains: Autonomous Driving and Robotic Manipulation. In each domain, we compared 150 generations from the vanilla Cosmos workflow with 150 from the RoBoost Agent. The Autonomous Driving videos were derived from the nuScenes dataset, and the Robotic Manipulation videos came from BridgeData V2.
  • Evaluation Metrics: We measured the ratio of high-quality videos usable for training using two categories of scores. The metric design followed the evaluation framework in the Cosmos Cookbook’s Cosmos Reason as Reward guideline, which emphasizes physical plausibility and spatial-temporal reasoning in generated videos.
    • Physical Scores: Gravity, object interaction, motion consistency, lighting coherence, object permanence, and temporal consistency.
    • Task-Specific Scores: Traffic rules (autonomous driving) and task consistency (robotic manipulation).
  • Scoring System: Quality was graded on a scale of 1 to 5, with 3 indicating moderate quality. Any samples with scores lower than 3 were classified as a failure.
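Under this rubric, the dataset yield reported later is just the fraction of samples whose every criterion scores 3 or above. A minimal sketch, with example criterion names:

```python
# Per-sample pass rule and dataset-level yield under the 1-5 rubric.
def is_pass(sample_scores):
    # A sample passes only if every criterion scores 3 or above.
    return all(score >= 3 for score in sample_scores.values())

def dataset_yield(samples):
    # Fraction of generated samples usable for training.
    return sum(is_pass(s) for s in samples) / len(samples)
```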

Autonomous Driving Results

 
Figure 10. Variations of autonomous driving video generated by the RoBoost Agent (top-left: original; top-center: rainy night; top-right: snowy daytime; bottom-left: foggy dawn; bottom-center: sunset; bottom-right: rainy daytime).
 
The quality improvements directly address the semantic breakdowns observed in the vanilla Cosmos workflow. For autonomous driving data augmentation, the RoBoost Agent preserved critical traffic semantics. As shown in Figure 10, this reduced decision-critical artifacts, including hallucinated red lights that appeared in the vanilla Cosmos results (Figure 2).
Figure 11. Validation results for autonomous driving samples, comparing the video outputs generated by the vanilla Cosmos workflow (left) and the RoBoost Agent (right).
These gains were consistent throughout the RoBoost validation step. While output videos from both the vanilla Cosmos workflow and the RoBoost Agent demonstrated physical plausibility, the validation step revealed unintended traffic rule violations that could compromise data reliability. As previously noted, the vanilla Cosmos workflow hallucinated red lights and thus failed the validation, whereas the RoBoost Agent successfully preserved the traffic light and passed.

Robotic Manipulation Results

Figure 12. Variations of robotic manipulation video generated by the RoBoost Agent. In each grid, the top-left panel is the original video, and the remaining three panels show table-texture transfer results (top-right: blue plastic; bottom-left: marble; bottom-right: glass). (Left) Task label: “move the green cloth from left to right”. (Right) Task label: “move the blue cloth off the stove to the lower right side”.
The trend remained consistent in the robotic manipulation domain. While the vanilla Cosmos workflow often introduced background drift and material distortion outside the intended edit regions, the RoBoost Agent kept changes more tightly localized. Figure 12 demonstrates a clear enhancement in context consistency. Compared to the previous results in Figure 3, only the targeted table material has changed.
Figure 13. Validation results for robotic manipulation samples, comparing the video outputs generated by the vanilla Cosmos workflow (left) and the RoBoost Agent (right).
These observations were confirmed during the RoBoost validation step. Task consistency was broken in the vanilla Cosmos workflow's outputs, which therefore failed the validation gate. In contrast, the RoBoost Agent preserved the core content and passed.

Quantitative Comparison Results

Figure 14. Pass rate comparison between the vanilla Cosmos workflow and the RoBoost Agent across 150 generated videos. Pass is defined as scoring 3 or above across all evaluation criteria on a 1–5 scale.
These qualitative differences directly translate into measurable improvements in dataset yield. In autonomous driving tests, the proportion of usable, training-ready videos increased from 80.7% (121/150) to 89.3% (134/150). For robotic manipulation, the gain was even larger, with yield improving from 44.7% (67/150) to 78.0% (117/150).
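The quoted percentages follow directly from the pass counts over 150 generations each:

```python
# Reproducing the reported yield percentages from the raw pass counts.
def pass_rate_pct(passed, total):
    return round(100 * passed / total, 1)

# Autonomous driving: vanilla Cosmos vs. RoBoost Agent
assert pass_rate_pct(121, 150) == 80.7
assert pass_rate_pct(134, 150) == 89.3
# Robotic manipulation: vanilla Cosmos vs. RoBoost Agent
assert pass_rate_pct(67, 150) == 44.7
assert pass_rate_pct(117, 150) == 78.0
```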
Since synthetic dataset generation relies on massive data volumes, this improved yield saves substantial compute and time. A higher proportion of usable videos means significantly fewer computational resources are wasted on failed generations, and far less manual review is required to reach a target volume of high-quality data.

The Production Reality of Inference Speed and Compute Cost

Beyond achieving semantic control, the fundamental constraint when scaling synthetic data is inference cost. Video generation demands high computational overhead and heavily consumes available resources. Even if the output quality is flawless, a pipeline is not practically viable if rendering a single video requires excessive time or rapidly exhausts the GPU budget.
To make world model data generation reliable enough for production use, we optimized the RoBoost Agent end-to-end. Relative to the vanilla Cosmos workflow under the same video length and output settings, the optimized RoBoost pipeline runs 3.3x faster and reduces total GPU compute consumption to 30% of baseline. Lowering this cost curve is the critical factor that transforms world models from impressive technical demonstrations into a practical data generation engine.
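A back-of-envelope combination of the compute reduction and the yield figures above shows how the two gains compound; this assumes the 30% compute figure applies uniformly per generated video, which is a simplification. The cost per usable video is the compute spent per attempt divided by the fraction of attempts that survive validation.

```python
# Back-of-envelope: GPU cost per *usable* video, normalized to vanilla = 1.0
# compute per generated video (assumption: the 30% figure applies per video).
def cost_per_usable(compute_per_video, yield_rate):
    return compute_per_video / yield_rate

vanilla = cost_per_usable(1.00, 0.447)   # robotic manipulation yield, vanilla
roboost = cost_per_usable(0.30, 0.780)   # 30% compute, RoBoost yield

# Roughly 5.8x fewer GPU-hours per usable manipulation video.
improvement = vanilla / roboost
```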

Conclusion

Physical AI cannot scale on real-world data collection alone. Collecting such data is simply too expensive, too slow, and too risky to capture manually, especially for the long tail of edge cases. World models offer a practical way to close this gap by expanding a small amount of real or simulated experience into a large set of controlled variants.
However, using world models as a one-shot process often produces outputs that look realistic but are not reliable enough for training. The RoBoost Agent keeps the augmentation anchored to the source video. It applies targeted semantic changes using structure-preserving controls, and aggressively filters the final outputs. Through this approach, the RoBoost Agent offers a clear path forward: scale your experience and preserve semantic reliability for training-grade data.
While we only briefly introduced compute savings in this post, Part 2 will explore these techniques in detail. We will explain how the RoBoost Agent optimizes inference speed and reduces costs to make data generation scalable and practical in deployment.
 

The official SqueezeBits Tech blog