Genie 2: A Foundation World Model for Generating Diverse 3D Environments
This post introduces Genie 2, a foundation world model developed by Google DeepMind that can generate an endless variety of action-controllable, playable 3D environments. The technology addresses a long-standing bottleneck for embodied AI agents — the lack of sufficiently rich and diverse training environments — paving the way for more general and capable AI systems.
The Role of Games in AI Research
Games have historically played a crucial role in advancing AI. Their structured challenges and measurable progress make them ideal for testing and developing AI capabilities. Google DeepMind's journey began with early work on Atari games, leading to significant milestones like AlphaGo and AlphaStar. However, training more general embodied agents has been hindered by the lack of rich and diverse training environments.
Genie 2: A Leap Forward
Genie 2 builds upon previous work, extending the generation of environments from 2D to 3D. It functions as a world model, simulating virtual worlds and the consequences of actions within them. Trained on a large-scale video dataset, Genie 2 exhibits emergent capabilities such as object interactions, complex character animation, physics simulation, and the ability to predict agent behavior.
Key Capabilities and Features:
- Action Controls: Genie 2 intelligently responds to keyboard inputs, correctly identifying and moving characters within the environment, distinguishing between controllable characters and static objects.
- Counterfactual Generation: The model can generate diverse trajectories from the same starting frame, enabling the simulation of counterfactual experiences for agent training. Different actions taken by a player lead to varied outcomes.
- Long Horizon Memory: Genie 2 can remember and accurately render parts of the world that are no longer in view, ensuring consistency over time.
- Long Video Generation: The model generates new, plausible content on the fly, maintaining a consistent world for extended periods (up to a minute).
- Diverse Environments: Genie 2 can create environments from various perspectives, including first-person, isometric, and third-person views.
- 3D Scene Generation: The model has learned to create complex 3D visual scenes.
- Object Interactions: Genie 2 models various object interactions, such as bursting balloons, opening doors, and triggering explosive barrels.
- Character Animation: It can animate diverse characters performing various activities.
- NPC Modeling: Genie 2 models other agents and their complex interactions.
- Physics Simulation: The model accurately simulates physics, including water effects and gravity.
- Lighting Effects: It models point and directional lighting, as well as reflections and bloom.
- Real-World Image Prompting: Genie 2 can be prompted with real-world images, enabling it to model phenomena like grass blowing in the wind or water flowing in a river.
Rapid Prototyping and Creative Workflows
Genie 2 significantly simplifies the rapid prototyping of diverse interactive experiences. Researchers and designers can quickly experiment with novel environments, accelerating the creative process for environment design and agent training. By using concept art or drawings as prompts, users can generate fully interactive environments.
AI Agents in World Models
Genie 2 facilitates the creation of rich and diverse environments for AI agents. This allows researchers to generate evaluation tasks that agents have not encountered during training. The post demonstrates this with a SIMA agent, a generalist AI agent developed in collaboration with game developers, successfully following natural-language instructions in environments generated by Genie 2.
- SIMA's Capabilities: SIMA can complete tasks in various 3D game worlds by understanding natural-language instructions. It interacts with the environment using keyboard and mouse inputs, with Genie 2 generating the game frames.
- Testing Genie 2: SIMA is used to test Genie 2's consistency by exploring generated environments and following instructions like "Open the blue door," "Open the red door," "Turn around," and "Go behind the house."
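The agent-in-the-loop setup described above can be sketched as a simple control loop: the agent observes a frame, emits a keyboard action, and the world model produces the next frame. The sketch below is purely illustrative — the class names, the toy one-dimensional "frame", and the instruction set are assumptions, not SIMA's or Genie 2's actual interfaces.

```python
class ToyWorldModel:
    """Stand-in for Genie 2: deterministic toy dynamics over a 1-D position."""
    def __init__(self, start_frame):
        self.frame = start_frame  # here a "frame" is just an integer position

    def step(self, action):
        # 'd'/'a' nudge the position right/left; other keys do nothing
        delta = {"d": 1, "a": -1}.get(action, 0)
        self.frame += delta
        return self.frame


class ToyAgent:
    """Stand-in for SIMA: follows a natural-language-style instruction."""
    def __init__(self, instruction):
        self.target = {"go right": 3, "go left": -3}[instruction]

    def act(self, frame):
        if frame < self.target:
            return "d"
        if frame > self.target:
            return "a"
        return None  # goal reached


def rollout(instruction, start_frame=0, max_steps=20):
    """Run the agent against the world model until the goal or step limit."""
    world, agent = ToyWorldModel(start_frame), ToyAgent(instruction)
    frames = [start_frame]
    for _ in range(max_steps):
        action = agent.act(world.frame)
        if action is None:
            break
        frames.append(world.step(action))
    return frames


frames = rollout("go right")
print(frames)  # the frame sequence walks from 0 to the target, 3
```

In the real setup the "frames" are generated video frames and the actions are keyboard and mouse inputs, but the structure of the loop — observe, act, step the world model — is the same.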
Diffusion World Model Architecture
Genie 2 is built upon an autoregressive latent diffusion model. The process involves:
- Autoencoder: Video frames are passed through an autoencoder.
- Transformer Dynamics Model: Latent frames are fed into a transformer dynamics model trained with a causal mask, similar to large language models.
- Inference: At inference time, Genie 2 samples autoregressively, processing actions and past latent frames on a frame-by-frame basis.
- Classifier-Free Guidance: This technique is used to enhance action controllability.
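The inference loop above can be sketched numerically. In the sketch below, a toy linear function stands in for the transformer dynamics model, and classifier-free guidance is applied in the standard way: the conditional prediction is pushed away from the unconditional one by a guidance weight. The latent dimensionality, function names, and all constants are illustrative assumptions, not Genie 2's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8  # assumed size of a latent frame, for illustration only


def dynamics_model(past_latents, action):
    """Toy stand-in for the transformer dynamics model: predicts the next
    latent frame from past latent frames and the current action.
    Passing action=None yields the unconditional prediction."""
    context = past_latents[-1]  # causal: only past frames are visible
    drift = 0.0 if action is None else 0.1 * action
    return 0.9 * context + drift


def sample_next_latent(past_latents, action, guidance_weight=2.0):
    """Classifier-free guidance: move the prediction from the unconditional
    output toward (and past) the action-conditioned output, which
    strengthens action controllability."""
    cond = dynamics_model(past_latents, action)
    uncond = dynamics_model(past_latents, None)
    guided = uncond + guidance_weight * (cond - uncond)
    return guided + 0.01 * rng.standard_normal(LATENT_DIM)  # sampling noise


# Autoregressive rollout: start from an encoded first frame, then sample
# one latent frame per action, frame by frame.
latents = [np.zeros(LATENT_DIM)]   # encoded start frame (from the autoencoder)
actions = [1.0, 1.0, -1.0, 0.0]    # one action per generated frame
for a in actions:
    latents.append(sample_next_latent(latents, a))
# Each latent would then be decoded back to pixels by the autoencoder.
print(len(latents))  # start frame plus four sampled frames
```

The key structural points from the list above are all visible here: the model only conditions on past latents (the causal mask), generation proceeds one frame at a time, and guidance blends conditional and unconditional predictions.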
The samples showcased come from an undistilled base model, highlighting the technology's potential. A distilled version runs in real time, with a slight reduction in output quality.
Responsible Development
Google DeepMind emphasizes responsible AI development. Genie 2, like SIMA, contributes to the goal of creating more general AI systems that can understand and safely execute a wide range of tasks, benefiting people both online and in the real world.
Acknowledgements and Citation
The post acknowledges the extensive team effort behind Genie 2, including lead researcher Jack Parker-Holder and key contributors. It also provides a BibTeX link for citing the research.
Related Posts
- A generalist AI agent for 3D virtual environments: Introducing SIMA.
Original article available at: https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/