World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, reproducible evaluation, and flexible task design. While recent text-to-video models generate realistic dynamics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language-guided framework that generates 4D scenes with multi-view consistency and object-level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory-guided generation with feature field distillation, allowing edits to be applied interactively without full re-generation. Experiments show that MorphoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric-ai-lab/Morph4D.
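To make the interactive workflow concrete, the sketch below shows how language-guided generation, novel-view rendering, and re-generation-free editing might be driven from Python. All names here (the `morphosim` package, `MorphoSim`, `generate`, `render`, `edit`) are hypothetical placeholders for illustration, not the repository's actual API; see the GitHub link above for the real interface.

```python
# Hypothetical usage sketch -- class and method names are illustrative
# placeholders, not the actual MorphoSim API.
from morphosim import MorphoSim  # assumed package name

sim = MorphoSim()

# 1) Generate a dynamic 4D scene from a natural language instruction.
scene = sim.generate("A red car drives past a tree on a sunny street.")

# 2) Observe the scene from arbitrary viewpoints at arbitrary times.
frame = scene.render(camera_pose=[0.0, 1.5, 4.0], time=0.5)

# 3) Apply object-level edits interactively, without full re-generation.
scene.edit("recolor the car to blue")
scene.edit("remove the tree")
```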
Figure 1: MorphoSim supports interactive, controllable generation and editing.
Figure 2: Overview of the MorphoSim pipeline with Command Parameterizer, Scene Generation, and Scene Editing modules.
Figure 3: MorphoSim supports color editing, object extraction, and object removal.
Figure 4: MorphoSim enables motion control during generation; a sketch of how a trajectory might be supplied follows below.
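The sketch below illustrates trajectory-guided motion control at generation time (cf. Figure 4): a named object is assigned waypoints, and generation is conditioned on both the prompt and the trajectory. `ObjectTrajectory` and the `trajectories=` argument are assumed names for illustration only; the real interface may differ.

```python
# Hypothetical sketch of trajectory-guided motion control.
from morphosim import MorphoSim, ObjectTrajectory  # assumed names

sim = MorphoSim()

# Waypoints (x, y, z) the target object should follow over the clip.
car_path = ObjectTrajectory(
    object_name="car",
    waypoints=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (2.0, 0.0, 1.5)],
)

# Generation conditioned on both the language prompt and the trajectory,
# so the object's motion is directed without post-hoc editing.
scene = sim.generate(
    "A red car drives through a plaza.",
    trajectories=[car_path],
)
```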
@article{he2025morphosim,
  title   = {MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator},
  author  = {He, Xuehai and Zhou, Shijie and Venkateswaran, Thivyanth and Zheng, Kaizhi and Wan, Ziyu and Kadambi, Achuta and Wang, Xin Eric},
  journal = {arXiv preprint arXiv:2510.04390},
  year    = {2025}
}