How Far is Video Generation from World Model: A Physical Law Perspective
Model | #Param |
---|---|
DiT-S | 22.5M |
DiT-B | 89.5M |
DiT-L | 310.0M |
DiT-XL | 456.0M |
Figure 3: The error in ball velocity between the ground-truth state in the simulator and the values parsed from the video generated by the diffusion model, given the first 3 frames.
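As a rough illustration of the quantity in the Figure 3 caption, the sketch below estimates a ball's velocity from per-frame positions and compares it with the simulator's ground-truth velocity. The function names, the finite-difference estimate, the mean-absolute-error metric, and the sample numbers are all assumptions made for illustration; the paper's exact parsing and error definition may differ.

```python
import numpy as np

def velocities_from_positions(positions, dt):
    """Finite-difference velocity between consecutive frames."""
    positions = np.asarray(positions, dtype=float)
    return np.diff(positions, axis=0) / dt

def velocity_error(parsed_positions, true_velocity, dt):
    """Mean absolute error between parsed and ground-truth velocity.

    MAE is an assumed metric here, not necessarily the one used in the paper.
    """
    v_hat = velocities_from_positions(parsed_positions, dt)
    return float(np.mean(np.abs(v_hat - true_velocity)))

# Hypothetical example: uniform motion at 2.0 units/s, frames 0.1 s apart.
dt = 0.1
true_v = 2.0
# Made-up positions "parsed" from generated frames; the ball slows down
# in the last two steps, which uniform motion should not allow.
parsed_x = [0.0, 0.2, 0.4, 0.55, 0.7]

err = velocity_error(parsed_x, true_v, dt)
print(f"velocity error: {err:.3f}")
```

With these made-up positions the first two steps match the true velocity and the last two are off by 0.5, so the mean error is about 0.25. A generated video that respected uniform motion would give an error near zero.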
They also trained DiT-XL on the uniform motion 3M dataset but observed no improvement in OOD generalization.
In general, simply training a model on video does not yield a world model that understands physics.
Physical Informed Driving World Model
The main framework is shown below: