What is Genie?
According to the official Google DeepMind blog post, Genie is a foundation world model trained on videos sourced from the internet. The model can “generate an endless variety of playable (action-controllable) worlds from synthetic images, photographs, and even sketches.”
The research paper ‘Genie: Generative Interactive Environments’ describes Genie as the first generative interactive environment trained in an unsupervised manner from unlabelled internet videos. The model has 11B parameters and consists of three components: a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model.
This design lets users act in the generated environments on a frame-by-frame basis, even though the model was trained without ground-truth action labels or any other domain-specific requirements.
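To make the frame-by-frame idea concrete, here is a toy sketch of the interaction loop in Python. It is not Genie's code: every function name, the token arithmetic, and the size of the action vocabulary are assumptions made up for illustration, and the learned networks are replaced by trivial stand-ins. The sketch also assumes that at play time the user picks a discrete latent action directly, while the latent action model itself is only needed to infer actions from unlabelled video during training.

```python
from typing import List

# Assumption for this sketch: a small discrete vocabulary of latent actions.
NUM_LATENT_ACTIONS = 8
# Assumption for this sketch: size of the toy token vocabulary.
VOCAB_SIZE = 16

Tokens = List[int]


def tokenize(frame: List[int]) -> Tokens:
    """Hypothetical stand-in for the video tokenizer: pixels -> discrete tokens."""
    return [pixel % VOCAB_SIZE for pixel in frame]


def detokenize(tokens: Tokens) -> List[int]:
    """Hypothetical stand-in for the tokenizer's decoder: tokens -> frame."""
    return list(tokens)


def dynamics_step(history: List[Tokens], action: int) -> Tokens:
    """Hypothetical stand-in for the autoregressive dynamics model.

    It sees all tokens generated so far plus the chosen latent action and
    returns the tokens of the next frame. A real model would be a learned
    network; this toy rule only keeps the example deterministic.
    """
    last = history[-1]
    return [(tok + action + len(history)) % VOCAB_SIZE for tok in last]


def play(prompt_frame: List[int], actions: List[int]) -> List[List[int]]:
    """Generate one new frame per user action, starting from a prompt image."""
    history = [tokenize(prompt_frame)]
    frames = []
    for action in actions:
        if not 0 <= action < NUM_LATENT_ACTIONS:
            raise ValueError(f"unknown latent action: {action}")
        next_tokens = dynamics_step(history, action)
        history.append(next_tokens)  # each new frame conditions the next step
        frames.append(detokenize(next_tokens))
    return frames


if __name__ == "__main__":
    # A three-"pixel" prompt image and two user actions.
    for frame in play([0, 17, 35], [1, 2]):
        print(frame)
```

The point of the loop is that each generated frame is appended to the history before the next action is applied, which is what makes the process autoregressive and lets the user steer the world one frame at a time.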