This new open-source AI, CogVideoX, could change how we create videos forever


Researchers from Tsinghua University and Zhipu AI have released CogVideoX, an open-source text-to-video model that threatens to disrupt an AI video generation landscape dominated by startups like Runway, Luma AI, and Pika Labs. The model, detailed in a recent arXiv paper, puts advanced video generation capabilities into the hands of developers worldwide.

Hot New Release: CogVideoX-5B, a new text-to-video model from @thukeg group (the group behind GLM LLM series)

– More examples from the 5B model in this thread
– GPU vram requirement on Diffusers: 20.7GB for BF16 and 11.4GB for INT8
– Inference for 50 steps on BF16: 90s on… pic.twitter.com/GAyWmst5GW

— Gradio (@Gradio) August 27, 2024

CogVideoX generates high-quality, coherent videos up to six seconds long from text prompts. The model outperforms well-known competitors like VideoCrafter-2.0 and OpenSora across multiple metrics, according to the researchers’ benchmarks.

The crown jewel of the project, CogVideoX-5B, boasts 5 billion parameters and produces 720×480 resolution videos at 8 frames per second. While these specs may not match the bleeding edge of proprietary systems, CogVideoX’s open-source nature is its true innovation.
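To put those specifications in perspective, here is a short back-of-the-envelope calculation in Python. It uses only the figures quoted in this article (5 billion parameters, 720×480 resolution, 8 frames per second, six seconds) plus the standard storage sizes of 2 bytes per BF16 value and 1 byte per INT8 value. The function names are illustrative, and the results are rough estimates rather than measurements.

```python
# Back-of-the-envelope numbers derived from the specs quoted in the article.
PARAMS = 5_000_000_000      # "5 billion parameters"
WIDTH, HEIGHT = 720, 480    # output resolution in pixels
FPS = 8                     # frames per second
SECONDS = 6                 # maximum clip length

# Standard storage sizes per parameter, in bytes.
BYTES_PER_PARAM = {"bf16": 2, "int8": 1}


def weight_gigabytes(dtype: str) -> float:
    """Memory needed for the model weights alone, in GB (10**9 bytes)."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 1e9


def frame_count() -> int:
    """Number of frames in a clip of maximum length at the quoted frame rate."""
    return FPS * SECONDS


def raw_rgb_megabytes() -> float:
    """Size of one uncompressed 8-bit RGB clip, in MB (10**6 bytes)."""
    return WIDTH * HEIGHT * 3 * frame_count() / 1e6


print(weight_gigabytes("bf16"), weight_gigabytes("int8"))
print(frame_count(), raw_rgb_megabytes())
```

By this estimate the BF16 weights alone account for roughly 10 GB, about half of the 20.7GB reported in the Gradio post above. The remainder presumably covers other components and intermediate activations, though the article does not break this down.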

How open-source models are leveling the playing field

By making their code and model weights publicly available, the Tsinghua team has effectively democratized a technology that was previously the exclusive domain of well-funded tech companies. This move could accelerate progress in AI-generated video by harnessing the collective power of the global developer community.

The researchers achieved CogVideoX’s impressive performance through several technical innovations. They implemented a 3D Variational Autoencoder (VAE) to efficiently compress videos and developed an “expert transformer” to improve text-video alignment.

CogVideoX just released the weights for its 5B model! ✨

It’s the best open weights text-to-video model – competitive with Runway / Luma / Pika. With @diffuserslib, it fits on < 10GB VRAM

(ah, and they changed the smaller 2B model license to Apache 2.0) pic.twitter.com/5fxAk6BuLv

— apolinario (@multimodalart) August 27, 2024
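For readers who want to try the released weights, the following is a minimal sketch of text-to-video inference through the Hugging Face diffusers library mentioned in these posts. It assumes a recent diffusers release that ships `CogVideoXPipeline`, a CUDA-capable GPU, and that the checkpoint is published under the Hub ID `THUDM/CogVideoX-5b`. The model ID, the output filename, and the helper names `build_settings` and `generate_clip` are assumptions made for illustration, not details taken from the article. The 50 inference steps and 8 frames per second mirror the figures quoted above.

```python
# Sketch only: assumes `torch` and a recent `diffusers` release are installed.
MODEL_ID = "THUDM/CogVideoX-5b"  # assumed Hugging Face Hub ID, not from the article


def build_settings(prompt: str, steps: int = 50) -> dict:
    """Collect the keyword arguments passed to the pipeline call."""
    return {"prompt": prompt, "num_inference_steps": steps}


def generate_clip(prompt: str, out_path: str = "clip.mp4", fps: int = 8) -> str:
    """Generate one clip from a text prompt and write it to out_path."""
    # Heavy imports live inside the function so this module can be
    # imported on machines without the GPU stack installed.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    # Keep sub-models on the CPU until they are needed, lowering peak VRAM use.
    pipe.enable_model_cpu_offload()

    # The pipeline returns a batch of videos; take the first one.
    frames = pipe(**build_settings(prompt)).frames[0]
    export_to_video(frames, out_path, fps=fps)
    return out_path
```

Calling `generate_clip("a short description of the scene")` would download the weights on first use. Actual memory use depends on the data type and offloading options chosen, so the VRAM figures in the posts should be read as reported values rather than guarantees.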

“To improve the alignment between videos and texts, we propose an expert Transformer with expert adaptive LayerNorm to facilitate the fusion between the two modalities,” the paper states. This advancement allows for more nuanced interpretation of text prompts and more accurate video generation.
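The quoted sentence can be made more concrete with a small NumPy sketch. It assumes that "expert adaptive LayerNorm" means each modality gets its own scale and shift, predicted from a shared conditioning vector (for example, a diffusion timestep embedding), before the text and video tokens are concatenated for joint processing. The class name `ExpertAdaLN`, the dimensions, and the random weights are illustrative assumptions. This is not the authors' implementation.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each token (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class ExpertAdaLN:
    """Adaptive LayerNorm with a separate modulation 'expert' per modality."""

    def __init__(self, cond_dim: int, hidden_dim: int, rng: np.random.Generator):
        # One projection per modality; each yields a [scale, shift] pair.
        self.w_text = rng.normal(0.0, 0.02, (cond_dim, 2 * hidden_dim))
        self.w_video = rng.normal(0.0, 0.02, (cond_dim, 2 * hidden_dim))

    def __call__(self, text_tokens, video_tokens, cond):
        # Predict modality-specific scale and shift from the shared condition.
        t_scale, t_shift = np.split(cond @ self.w_text, 2, axis=-1)
        v_scale, v_shift = np.split(cond @ self.w_video, 2, axis=-1)
        text = layer_norm(text_tokens) * (1.0 + t_scale) + t_shift
        video = layer_norm(video_tokens) * (1.0 + v_scale) + v_shift
        # Join both modalities into one sequence for the layers that follow.
        return np.concatenate([text, video], axis=0)


rng = np.random.default_rng(0)
layer = ExpertAdaLN(cond_dim=4, hidden_dim=8, rng=rng)
tokens = layer(rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=4))
print(tokens.shape)
```

The point of the design, under these assumptions, is that text and video features have different statistics, so normalizing them with separate learned modulations lets a single transformer process both in one sequence.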

The release of CogVideoX represents a significant shift in the AI landscape. Smaller companies and individual developers now have access to capabilities that were previously out of reach due to resource constraints. This leveling of the playing field could spark a wave of innovation in industries ranging from advertising and entertainment to education and scientific visualization.

The double-edged sword: Balancing innovation and ethical concerns in AI video generation

However, the widespread availability of such powerful technology is not without risks. The potential for misuse in creating deepfakes or misleading content is a genuine concern that the AI community must address. The researchers acknowledge these ethical implications, calling for responsible use of the technology.

As AI-generated video becomes more accessible and sophisticated, we’re entering uncharted territory in the realm of digital content creation. The release of CogVideoX may mark a turning point, shifting the balance of power away from larger players in the field and towards a more distributed, open-source model of AI development.

CogVideoX 5B – Open weights Text to Video AI model is out, matching the likes of luma/ runway/ pika!

Powered by diffusers – requires less than 10GB VRAM to run inference! ⚡

Checkout the free demo below to play with it! pic.twitter.com/Q0YT0RIpGb

— Vaibhav (VB) Srivastav (@reach_vb) August 27, 2024

The true impact of this democratization remains to be seen. Will it unleash a new era of creativity and innovation, or will it exacerbate existing challenges around misinformation and digital manipulation? As the technology continues to evolve, policymakers and ethicists will need to work closely with the AI community to establish guidelines for responsible development and use.

What’s certain is that with CogVideoX now in the wild, the future of AI-generated video is no longer confined to the labs of Silicon Valley. It’s in the hands of developers around the world, for better or for worse.
