Reliable & Scalable Synthetic Data for Physical AI (Part 2): Making Cosmos 3.1× Faster for Production

Explore why Physical AI deployment needs synthetic data at scale with SqueezeBits' research, and discover how to overcome inference bottlenecks to accelerate the RoBoost Agent.

The Economic Efficiency of Synthetic Video Pipelines: Compute Cost, Speed, and Yield

In our previous blog, Reliable & Scalable Synthetic Data for Physical AI (Part 1): Taming NVIDIA Cosmos with RoBoost Agent, we focused on what makes synthetic data truly useful for training. Visually convincing outputs are not enough. A pipeline must preserve the required semantics and physical consistency, and it must verify those properties reliably. Better quality leads to higher yield because more generated samples survive validation and remain usable for training.
This post focuses on the other half of the scaling problem. High yield improves the value of a synthetic data pipeline, but it does not make the pipeline practical at scale by itself. When a pretrained foundation model is adapted to a specific domain such as a factory floor, a logistics site, or a particular road environment, large amounts of synthetic data are required in most cases. At that point, the cost of running the generation pipeline becomes a critical constraint.
That cost depends not only on how well the pipeline generates samples, but also on how efficiently it produces them. Part 1 addressed the quality side of the equation by increasing the fraction of usable outputs. In this post, we address the inference side by asking how much synthetic video generation actually costs in practice, and how inference optimization changes that cost structure in real world deployment.

Why Physical AI deployment needs synthetic data at scale

Two challenges appear repeatedly when robotics and autonomous driving systems move toward real deployment. One is the need to cover situations that are underrepresented in existing data. The other is the need to adapt models to a new environment, region, facility, or operating condition. In both cases, the practical challenge is the same: building enough target-domain data at reasonable cost.
This is where synthetic data becomes essential. Recreating every rare or safety-critical situation in the real world is expensive, slow, and often difficult to stage at all. The recent emergence of world models such as NVIDIA GR00T N1.6 and VLAW suggests that real-world data alone is often not enough to push policy performance beyond a certain point. Synthetic data helps fill those gaps, but its value depends on two conditions. The pipeline must generate data that remains usable after validation, and it must be efficient enough to support deployment at scale.

1. Autonomous Driving: edge cases at scale and the cost of safety

Autonomous driving faces a long tail problem at deployment. Road environments are dynamic, multiple entities interact at once, and the system must make safe decisions under strict time constraints. Because the vehicle itself is safety critical, even rare situations matter. The system may need to react to an unusually dressed pedestrian, an unexpected obstacle, or an extremely rare event that almost never appears in routine driving data.
The difficulty is that these edge cases are effectively unbounded. Real world collection alone cannot cover them at the scale required for training. Reproducing rare traffic situations with physical actors and vehicles is expensive, operationally complex, and sometimes unsafe. For that reason, the industry has largely moved toward a hybrid approach that combines real driving data with simulation and synthetic generation.
This shift makes synthetic data generation a practical requirement, not just an optional tool. Waymo, for example, has reported more than 20 billion miles of simulation to replay real scenarios and to build new virtual ones. More recently, it also introduced the Waymo World Model, built on Google DeepMind Genie 3, as part of its effort to improve safety and performance through synthetic edge case generation. Tesla has also described a pipeline in which labeled real driving data is used to generate and vary simulation content for training. Taken together, these examples show a common pattern. Leading players are not relying on real-world collection alone. They are combining real data with synthetic generation to cover situations that are difficult to collect at scale.

2. Robotic Manipulation: adapting to new environments while reducing failure cost

When a robot policy model is deployed in a real workspace such as a factory or warehouse, the model must operate under highly variable conditions. Lighting shifts and reflective surfaces can interfere with perception, and obstacles may appear in complex arrangements. In these environments, a single failure can damage hardware or halt operations entirely, making trial-and-error adaptation expensive from the start.
As a result, policy model training pipelines are also shifting toward an approach where a small number of real demonstrations serve as anchors, while large-scale simulation and world-model based synthetic data are used to amplify and validate the training set. This approach becomes even more important during domain adaptation. Making a robot robust to unfamiliar lighting, cluttered layouts, or diverse multi-task settings demands a substantial amount of target-domain data.
Results from NVIDIA GR00T N1.6 and Chelsea Finn’s VLAW point in the same direction. In GR00T N1.6, a policy trained with only 100 real samples achieved a 25.6% success rate, while adding 3,000 synthetic samples increased it to 40.9%. VLAW also reported a setting in which the mean success rate increased from 46% to 86% through iterative training with world-model-generated rollouts. These results suggest that in robotics, the effective strategy is often not to collect large amounts of real data alone, but to use a smaller real dataset to improve the quality of synthetic data and then feed that back into policy learning.
Figure 2. NVIDIA Isaac GR00T: https://developer.nvidia.com/isaac/gr00t

Data generation with world models introduces a new bottleneck: compute cost

Once your team decides to use synthetic data alongside real data, the next constraint is the compute cost of running the world model itself. To make the discussion concrete, we estimate the cost based on a target of 100 hours of generated data. Readers can scale the same logic to match their own use case.
According to NVIDIA Cosmos benchmark figures, generating a 5-second video at 720p and 16 FPS with Cosmos-Transfer 2.5 takes 719.4 seconds on a single NVIDIA H100 NVL GPU under a specific benchmark setting using segmentation control. Based on that reference point, we use a simplified back-of-the-envelope estimate for generation cost:
Reference generation cost = (target video duration / 5 seconds) × 719.4 seconds × GPU hourly price
This estimate should be interpreted as a benchmark-based reference cost, not a universal deployment number. In practice, the actual cost depends on control modality, resolution, frame count, the number of views, batching strategy, and hardware configuration.
Under this assumption:
  • generating 100 hours of video requires about 14,400 H100 GPU-hours
  • on one 8-GPU H100 node, that corresponds to about 75 days
  • at an assumed price of $3.00 per GPU-hour, the total cost is about $43,200
These numbers are intended to illustrate the order of magnitude of infrastructure cost. They should not be read as a fixed production estimate for all Physical AI pipelines, especially because real deployments often use different control signals and multi-view generation settings. Still, the estimate is useful because the total cost scales roughly linearly with generated volume. However, as discussed in Part 1, not every generated video survives validation. Some samples are filtered out, so the pipeline must usually generate more than the final usable target.
The effective cost therefore becomes:
Usable-data cost = generation cost / yield
If the yield is 50%, 200 hours must be generated to obtain 100 usable hours. In other words, yield remains part of the overall cost equation. Part 1 focused on improving that yield. Part 2 focuses on the other major bottleneck: speed. Once inference is accelerated, the cost of usable data changes accordingly:
Optimized usable-data cost = generation cost / (yield × speedup)
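The cost model above can be captured in a few lines. The script below is a back-of-the-envelope calculator: the 719.4 seconds per 5-second clip figure is the Cosmos-Transfer 2.5 benchmark number cited earlier, while the yield, speedup, and GPU price values used in the example calls are illustrative placeholders.

```python
# Back-of-the-envelope calculator for usable synthetic data cost.
# The 719.4 s per 5 s clip figure is the Cosmos-Transfer 2.5 benchmark
# reference cited above; yield, speedup, and GPU price are placeholders.

SECONDS_PER_CLIP = 5.0           # benchmark clip length (seconds of video)
GPU_SECONDS_PER_CLIP = 719.4     # H100 NVL, 720p / 16 FPS, segmentation control

def generation_cost(video_hours: float, price_per_gpu_hour: float) -> float:
    """Reference generation cost for `video_hours` of output video."""
    clips = video_hours * 3600.0 / SECONDS_PER_CLIP
    gpu_hours = clips * GPU_SECONDS_PER_CLIP / 3600.0
    return gpu_hours * price_per_gpu_hour

def usable_data_cost(video_hours: float, price_per_gpu_hour: float,
                     yield_rate: float = 1.0, speedup: float = 1.0) -> float:
    """Cost of `video_hours` of *usable* data after validation and speedup."""
    return generation_cost(video_hours, price_per_gpu_hour) / (yield_rate * speedup)

base = generation_cost(100, 3.00)                       # ~14,388 GPU-hours, ~$43,164
at_half_yield = usable_data_cost(100, 3.00, yield_rate=0.5)
accelerated = usable_data_cost(100, 3.00, yield_rate=0.5, speedup=3.1)
```

Running the numbers reproduces the figures above (the blog rounds 14,388 GPU-hours and $43,164 to "about 14,400" and "$43,200"), and shows how yield and speedup divide the final cost.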
From this point on, we focus on how inference acceleration changes the cost structure in practice while preserving training-grade quality.

How SqueezeBits lowers the cost structure in a meaningful way

In practice, teams sometimes reduce cost by lowering resolution, reducing frame rate, or relying on discounted compute. Even so, the fundamental cost driver remains the heavy compute demand of the world model itself.
SqueezeBits addresses this bottleneck through model lightweighting and pipeline optimization. Under the same generation settings, this improves video generation throughput by up to 3.1× without resorting to the most common quality tradeoffs, such as lowering resolution or frame rate.
To make this concrete, Table 1 breaks the estimate into four stages: the ideal generation cost, the yield-adjusted cost, the cost after yield improvement, and the final cost after acceleration. This makes the contributions of Part 1 and Part 2 explicit in the same framework. Figure 3 then summarizes the practical outcome. In both robotic manipulation and autonomous driving settings, yield improvement lowers wasted compute, and acceleration pushes the final usable-data cost down further.
Table 1. Cost breakdown for usable synthetic data generation across robotics manipulation and autonomous driving scenarios. The table separates ideal generation cost, yield-adjusted cost, cost after yield improvement, and final cost after applying the 3.1× speedup.
In this calculation, let X be the target usable data volume and y the pipeline yield. If the ideal generation cost is Z, then the yield-adjusted cost becomes Y = Z/y. If the improved yield is y', the adjusted cost becomes Y' = Z/y'. With an additional speed-up factor s, the final infrastructure cost becomes Y'/s.
The exact values depend on the number of views assumed for each domain. For robotic manipulation, we use a conservative single-view assumption, although real training setups often use one to four views for each scene. For autonomous driving (AD), we assume six views based on the nuScenes setting, even though some real systems use more than ten.
Figure 3. Usable-data cost comparison after yield adjustment, yield improvement, and additional acceleration across robotics manipulation and autonomous driving (AD) settings.
The important point is not the exact dollar amount, which will vary by hardware price and deployment setup, but the structure of the savings: Part 1 improves yield, and Part 2 compounds that gain by reducing the cost of each generated hour.

Accelerating the RoBoost Agent: Overcoming the Inference Bottleneck

In our previous blog post, we focused on improving pipeline quality to increase yield. That work strengthened reliability and made the generated data usable by preserving semantic consistency and task label integrity. However, scaling synthetic data generation requires more than output quality. As the pipeline runs at larger volume, the bottleneck shifts from model accuracy to the end-to-end cost of inference. Since the RoBoost Agent depends on compute-intensive models throughout the pipeline, inference becomes the largest overhead in total cost for large-scale generation.
To make the system scalable and cost efficient in practice, we optimized inference with three techniques. We used step caching to avoid redundant computation, applied quantization to reduce memory and compute, and leveraged kernel level optimization to minimize execution overhead.

Bypassing Redundant Computation with Step Caching

Diffusion models, including world models, face a major computational bottleneck because every denoising step runs full attention across the entire latent space. Unlike autoregressive text generation, diffusion-based image and video generation cannot rely on KV caching to store and reuse attention states across sequential tokens. Therefore, each step repeats similar computation even when the denoising trajectory changes slightly.

Adopting Step Caching

Step caching is an effective way to reduce redundant computation by skipping steps whose update direction is nearly identical to the previous one. This behavior often appears in modern flow matching models because they learn a smooth velocity field for latent updates. When the predicted velocity remains stable across adjacent steps, the system can bypass the expensive diffusion transformer (DiT) blocks.

First-Block Caching

Figure 4. Overview of the First-Block caching method. This approach evaluates the change in the first block's residual to determine whether to compute the full network, or skip the remaining blocks and reuse the cached output from the previous step.
First-block caching is a popular step caching method introduced in the HuggingFace Diffusers library. It runs the first DiT block and compares its residual with the previous step to estimate whether the update direction has changed. If the change is small, the system skips the remaining blocks and reuses the cached outputs.
To provide an intuitive view of first-block caching, we describe it with the following simplified equations. Let $h^0_t$ be the input latent at step $t$, and let $B_1$ denote the first DiT block. The output of the first block is

$h^1_t = B_1(h^0_t)$

We then define the residual produced by the first block as

$r_t = h^1_t - h^0_t$

Let $r_{\mathrm{prev}}$ denote the most recently cached residual. We measure the relative change at step $t$ with the normalized residual difference

$\delta_t = \dfrac{\lVert r_t - r_{\mathrm{prev}} \rVert}{\lVert r_{\mathrm{prev}} \rVert + \epsilon}$
Given a predefined threshold $\tau$, the system assumes that the deeper blocks will produce a nearly identical update if $\delta_t < \tau$. In that case, the RoBoost Agent skips the remaining blocks and reuses the cached outputs from the previous step. Otherwise, the agent executes the full network, updates the residual cache, and proceeds normally. This mechanism is crucial for the RoBoost Agent because it avoids redundant computation during stable denoising steps and makes large scale data generation more cost efficient.
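As a rough sketch, this skip decision can be expressed as a small wrapper around the DiT forward pass. `FirstBlockCache`, its callables, and the threshold value below are illustrative stand-ins (with NumPy arrays standing in for latents), not the actual RoBoost Agent implementation.

```python
import numpy as np

class FirstBlockCache:
    """Sketch of the first-block caching decision described above.

    `first_block` stands in for B1 and `remaining_blocks` for the rest of
    the DiT. Names and the default threshold are illustrative only.
    """

    def __init__(self, first_block, remaining_blocks, tau=0.1, eps=1e-8):
        self.first_block = first_block
        self.remaining_blocks = remaining_blocks
        self.tau = tau
        self.eps = eps
        self.r_prev = None          # most recently cached first-block residual
        self.cached_output = None   # last full-network output

    def step(self, h0):
        """Run one denoising step; returns (output, skipped_deep_blocks)."""
        h1 = self.first_block(h0)   # B1 always runs
        r_t = h1 - h0               # residual produced by the first block
        if self.r_prev is not None:
            delta = np.linalg.norm(r_t - self.r_prev) / (
                np.linalg.norm(self.r_prev) + self.eps)
            if delta < self.tau:
                # Update direction barely changed: reuse cached output.
                return self.cached_output, True
        # Otherwise execute the full network and refresh the cache.
        self.r_prev = r_t
        self.cached_output = self.remaining_blocks(h1)
        return self.cached_output, False
```

On stable denoising steps the second return value is `True` and the expensive deep blocks are bypassed entirely; production variants additionally cap how many consecutive steps may be skipped.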

Scaling World Model Inference with Quantization

Figure 5. Overview of different quantization approaches for matrix multiplication, highlighting three methods: Baseline without Quantization (Weight 16-bit & Activation 16-bit), Weight-Only Quantization (Weight 8-bit and Activation 16-bit), and Weight-Activation Quantization (Weight 8-bit Activation 8-bit).
Quantization accelerates inference by representing tensors at lower numerical precision. This reduces memory-bandwidth demand and allows accelerators to use low-precision compute units to increase throughput. In practice, teams usually choose between two deployment patterns based on the primary hardware bottleneck.

Weight-Only Quantization

Weight-only quantization targets memory bound workloads by compressing the model parameter footprint. As shown in Figure 5, it lowers the weight precision while keeping activations at higher precision, such as BF16 or FP16. Because the hardware fetches fewer bytes for weights, this approach reduces memory-bandwidth pressure. The system still dequantizes the weights to higher precision for computation, so the speedup mainly comes from reduced data movement rather than faster arithmetic.

Weight-Activation Quantization

Weight-activation quantization targets both memory and compute bound workloads by running operations directly in low precision. By lowering the precision of both weights and activations, the system avoids dequantization before computation and unlocks low-precision compute units with higher maximum FLOPs. Earlier INT8 and INT4 formats often caused large accuracy drops when applied to both tensors, but modern accelerators address this tradeoff with FP8 and FP4 datatypes. As a result, many production systems use weight-activation quantization to balance speed and quality.
We applied weight-activation quantization to the RoBoost Agent because world model inference is typically compute bound. By quantizing both weights and activations with carefully chosen precisions in selected modules, we leveraged low-precision compute units to accelerate inference. This approach aligns with accelerator support from major vendors, including NVIDIA, AMD, and Intel.
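To illustrate the mechanics (this is a CPU simulation, not any vendor's kernel API), the sketch below applies symmetric per-tensor INT8 quantization to both operands of a matmul: accumulation happens in INT32 and a single dequantization is applied at the end, mirroring what low-precision tensor cores do in hardware.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: int8 values plus one scale."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weight-activation quantized matmul simulated on CPU.

    Both operands are quantized to INT8, the product is accumulated in
    INT32 (as low-precision tensor cores do), and the result is
    dequantized once at the end.
    """
    qa, sa = quantize_int8(a)
    qw, sw = quantize_int8(w)
    acc = qa.astype(np.int32) @ qw.astype(np.int32)   # integer accumulate
    return acc.astype(np.float32) * (sa * sw)          # single dequantization
```

Because dequantization happens only once per output tensor rather than per weight fetch, the arithmetic itself runs in the cheaper integer units; weight-only quantization, by contrast, would dequantize `qw` back to FP16/BF16 before the matmul.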

Maximizing Inference Throughput with Graph Compilation and Kernel Tuning

Kernel execution often suffers from framework overhead that limits performance. In eager mode, the system pays repeated costs from Python-level dispatch, frequent GPU kernel launches, and unfused operator sequences that trigger unnecessary memory reads and writes. Performance also depends heavily on whether the selected kernel matches the target device and tensor shape. To reduce this overhead, we compiled the forward graph and benchmarked multiple kernel implementations to maximize throughput.

Compiled Graph Execution

We use torch.compile to capture the full computation graph just in time and generate an optimized execution plan. This compilation enables operator fusion, which combines sequential operations into larger and more efficient kernels.

Backend-Aware Kernel Selection

To maximize performance across different hardware configurations, we evaluated multiple attention backends together with graph compilation. We benchmarked FlashAttention, FlashInfer, and SageAttention, and we implemented custom kernels for specific operations and tensor shapes when needed. Based on the device, hidden dimension, and sequence length in the target workload, we select the fastest backend that preserves correctness throughout data generation.
Graph compilation and kernel selection serve as throughput multipliers for the RoBoost Agent. The pipeline repeatedly executes the same models across thousands of video clips and sampling steps, so the one-time compilation cost is quickly amortized. When combined with scenario-specific attention backends, torch.compile reduces framework overhead and helps sustain high GPU utilization throughout data generation.
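The backend-selection idea can be sketched in a framework-agnostic way: benchmark interchangeable implementations on a representative input and keep the fastest one that still matches a reference output. `select_fastest_backend` and the candidate names below are hypothetical, not the actual RoBoost tooling.

```python
import time
import numpy as np

def select_fastest_backend(candidates, reference, sample_inputs,
                           trials=5, atol=1e-4):
    """Pick the fastest backend that preserves correctness.

    `candidates` maps backend names to callables with identical signatures;
    `reference` defines the expected output on `sample_inputs`. Mirrors, in
    spirit, per-device/per-shape attention backend selection.
    """
    expected = reference(*sample_inputs)
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        out = fn(*sample_inputs)
        if not np.allclose(out, expected, atol=atol):
            continue  # reject backends that change the result
        start = time.perf_counter()
        for _ in range(trials):
            fn(*sample_inputs)
        elapsed = (time.perf_counter() - start) / trials
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

In a real pipeline the "sample inputs" would be latents with the production hidden dimension and sequence length, so the choice made at warmup holds for the entire generation run.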

Evaluating Speed and Quality: Winning on Both

The optimization stack reduces inference cost by a wide margin. However, cost savings are valuable only when the generated videos preserve training-grade quality. We benchmarked the optimized RoBoost engine against the vanilla Cosmos-Transfer 2.5 model on both inference speed and output quality.

Inference Speed Benchmark Setup

For the inference speed benchmark, we measured the end-to-end latency of Cosmos-Transfer 2.5 for a single sample with a fixed number of frames and resolution. We also fixed all hardware and generation parameters to isolate inference throughput from conditioning variability.
  • Device: 1 x NVIDIA RTX Pro 6000 GPU
  • Video specification: 480p / 720p resolution, 93 frames per video
  • Control signal: A single pre-computed depth map

Quality Benchmark Setup

We used the PAI-Bench-C dataset, a specialized benchmark for conditional video generation. Specifically, we randomly sampled 50 tasks from the autonomous driving domain and 50 tasks from the robotic manipulation domain. For each of these 100 tasks, we applied 6 distinct caption variations, resulting in a total of 600 video transfers used for our metric extraction. Each source video was transferred using its corresponding caption variation, with four control modalities applied at equal weight: blur, edge, depth, and segmentation. The baseline for all quality comparisons is vanilla Cosmos-Transfer 2.5 at BF16 precision.
Our evaluation protocol follows the Cosmos Cookbook evaluation framework and PAI-Bench-C, which can be categorized into two key areas: visual quality and control fidelity.
  • Visual quality metrics: Dover Score, Diversity Score
  • Control fidelity metrics: Blur SSIM, Canny-F1, Depth RMSE, Seg mIOU

Inference Speed Benchmark Results

Figure 6. End-to-end inference latency comparison between vanilla Cosmos-Transfer 2.5 and the RoBoost Agent engine under identical generation conditions. Measured on a single video with depth control input at 480×640 and 720×1280 resolutions.
As shown in Figure 6, our optimized model achieves up to 3.1× speedup compared to vanilla Cosmos-Transfer 2.5 while maintaining strong performance at both low and high resolutions. Three core optimizations drive most of this gain: step caching reduces the total number of denoising computations, quantization lightens each computation, and kernel optimization lowers per-step DiT latency.
The optimized pipeline requires only about 30% of the GPU compute used by the vanilla baseline, resulting in a roughly 70% reduction in per-video inference cost. These gains carry over to multi-GPU deployments, where lower per-video compute directly increases aggregate throughput across the cluster. This is especially important for large-scale data generation, which may require tens of thousands of videos in each iteration.

Quality Benchmark Results

Table 2. Benchmark results on visual quality, diversity, and control fidelity for vanilla Cosmos-Transfer 2.5 and the RoBoost Agent engine. Despite substantial inference acceleration, the optimized pipeline remains close to the baseline across all metrics.
Despite the significant speedup, the optimized pipeline maintains comparable output quality across the evaluated metrics. The Dover Score, which measures clarity, compression artifacts, and motion smoothness, scored 9.694 (a marginal ~2.4% drop from the baseline), confirming that the optimization stack preserves perceptual video quality. Notably, the Diversity Score actually improved, reaching 0.287 compared to the baseline's 0.261.
Control fidelity metrics demonstrate similar results. Across all control fidelity metrics (Blur SSIM, Canny-F1, Depth RMSE, and Seg mIOU), the optimized pipeline delivered stable performance without noticeable degradation. Blur SSIM, Canny-F1, and Depth RMSE remained nearly identical to the baseline, while Seg mIOU showed a slight improvement to 0.754 (vs. baseline 0.746). This confirms that conditioning signals are faithfully preserved. For training-grade data, this is what matters most: a video that drifts from its structural controls is unusable, regardless of realism.

Qualitative Comparison of Results

Figure 7. Qualitative comparison of vanilla Cosmos-Transfer 2.5 and RoBoost Agent engine videos for a robotic manipulation task.
Figure 8. Qualitative comparison of vanilla Cosmos-Transfer 2.5 and RoBoost Agent engine videos for an autonomous driving task.
Figure 7 and Figure 8 illustrate these quantitative results in practice by comparing both pipelines on the same source videos. Regardless of the scene and prompt, the optimized outputs closely match the baseline's visual coherence and detail fidelity. The preserved edge boundaries, material textures, and spatial relationships demonstrate fine-grained consistency that raw metrics alone cannot capture.

Try RoBoost Agent: Fastest Way to Make “Training-grade” Synthetic Data

"In this new world of AI, compute equals revenues." - Jensen Huang
He's right, but there's a catch. If compute equals revenue, then wasted compute equals burned margins.
The same principle applies when scaling world-model generation pipelines for domain adaptation. Tracking the sheer volume of generated videos is no longer a viable metric. The actual economic efficiency of your pipeline depends entirely on the ratio of 'training-grade' results you actually retain per GPU-hour after verification and filtering. RoBoost Agent is a solution designed to fundamentally improve this cost structure by accelerating inference and maximizing your usable data yield, thereby eliminating the waste of computing resources.
Who this is for
RoBoost is a strong fit if your team is grappling with practical challenges like the following:
  • Limits of physical data collection: You need over 100 hours of target-domain data for a factory, warehouse, or new road region, but the cost and time of physical data collection are prohibitive.
  • Low usable data yield: You can generate videos, but the verification process rejects too many samples, making the pipeline's total cost unmanageable.
  • Bottlenecks in generation speed and compute cost: You need a production-grade generation pipeline equipped with a proper optimization stack, rather than temporary, ad-hoc fixes.
Join the waitlist
RoBoost is currently granting early access through a waitlist. If you are actively solving problems related to reliable synthetic video generation and cost structure optimization, we encourage you to reach out. You can apply at RoBoost.
To help us tailor our conversation, please share your domain, target generation hours, required resolution, and conditioning signals. It is also highly helpful if you can share your team's specific definition of trainable data.

The official SqueezeBits Tech blog