Alibaba's HappyHorse Topped the Video Leaderboard Before Anyone Knew Who Built It
An anonymous 15B-parameter video model appeared on Artificial Analysis, took #1 in text-to-video with 1389 Elo, and then turned out to be Alibaba's. HappyHorse-1.0 generates video and audio in a single pass.
Around April 7, a model with no listed affiliation appeared on Artificial Analysis and started winning. Within days, HappyHorse-1.0 sat at #1 in text-to-video generation with 1389 Elo — 115 points ahead of second-place Dreamina Seedance 2.0. In image-to-video, it claimed #1 at 1392 Elo.
Then somebody figured out who built it.
The model came from Alibaba’s Taotian Group, specifically the Future Life Lab team led by Zhang Di, former VP of Kuaishou and the technical lead behind Kling AI. The anonymous submission wasn’t an accident. It was a calculated move to let the model’s quality speak for itself before the brand name was attached.
One Model, Four Modalities
HappyHorse-1.0 is a 15-billion-parameter model built on a single unified Transformer with 40 layers. Text tokens, a reference-image latent, and noisy video and audio tokens all flow through one token sequence and are denoised together in the same forward pass. The first and last 4 layers use modality-specific projections; the middle 32 share parameters across all modalities.
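The reported layer layout can be sketched in a few lines. This is an illustrative reconstruction of the schedule described above (40 layers, modality-specific edges of 4, a shared middle of 32); the function and constant names are invented for the sketch.

```python
# Hypothetical sketch of the reported HappyHorse layer layout: 40 layers,
# first and last 4 modality-specific, middle 32 shared. Names are illustrative.
TOTAL_LAYERS = 40
EDGE_LAYERS = 4  # per the article: first and last 4 layers

def layer_kind(i: int) -> str:
    """Classify layer i (0-indexed) as modality-specific or shared."""
    if i < EDGE_LAYERS or i >= TOTAL_LAYERS - EDGE_LAYERS:
        return "modality_specific"
    return "shared"

schedule = [layer_kind(i) for i in range(TOTAL_LAYERS)]
assert schedule.count("shared") == 32
assert schedule.count("modality_specific") == 8
```

The design choice is the interesting part: only the boundary layers know about modalities, so almost all of the 15B parameters do cross-modal work.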
That’s the architectural distinction. Most video-generation pipelines chain specialized models: a text-to-video model feeds an audio model, which feeds a super-resolution upscaler. HappyHorse does it all in one pass. Video and audio are jointly generated, so the audio actually matches the visual content rather than being layered on as an afterthought.
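A toy illustration of what "one token sequence" means in practice, assuming the four inputs the article lists. Token contents and counts here are invented; the point is that video and audio tokens sit in the same sequence, so attention can align them directly rather than syncing two separate pipelines.

```python
# Illustrative only: interleaving text tokens, a reference-image latent,
# and noisy video/audio tokens into one sequence for joint denoising.
def build_sequence(text, image_latent, video, audio):
    # Tag each token with its modality so edge layers can route
    # modality-specific projections (as the architecture section describes).
    seq = []
    for modality, tokens in [("text", text), ("image", image_latent),
                             ("video", video), ("audio", audio)]:
        seq.extend((modality, tok) for tok in tokens)
    return seq

seq = build_sequence(["a", "dog"], ["img0"], ["v0", "v1"], ["a0"])
assert [m for m, _ in seq] == ["text", "text", "image", "video", "video", "audio"]
```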
It supports all four generation modes: text-to-video and image-to-video, each with and without native audio.
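The four modes are simply the cross product of two conditioning types and an audio toggle. A minimal enumeration, with mode labels invented for the sketch:

```python
# The four generation modes described in the article: {text-to-video,
# image-to-video} x {with audio, without audio}. Labels are illustrative.
from itertools import product

modes = [f"{src}{'+audio' if audio else ''}"
         for src, audio in product(["text-to-video", "image-to-video"],
                                   [True, False])]
assert len(modes) == 4
assert "image-to-video+audio" in modes
```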
Speed and Resolution
Standard output runs 5–8 seconds at 1080p. On an H100, generation takes roughly 38 seconds at full resolution — or 2 seconds at 256p if you’re iterating quickly. A DMD-2 distillation step reduces sampling to 8 denoising steps without classifier-free guidance, which is how they hit those inference speeds.
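A back-of-envelope sketch of why dropping classifier-free guidance matters. The 50-step baseline below is an assumption for illustration, not a figure from the article; CFG runs two forward passes (conditional plus unconditional) per denoising step, while the distilled 8-step sampler runs one.

```python
# Rough forward-pass accounting: assumed 50-step CFG baseline vs. the
# article's 8-step DMD-2 distilled sampler without CFG.
BASELINE_STEPS = 50   # assumed typical diffusion sampler (not from article)
DISTILLED_STEPS = 8   # per the article, after DMD-2 distillation

baseline_passes = BASELINE_STEPS * 2    # CFG: two passes per step
distilled_passes = DISTILLED_STEPS * 1  # no CFG: one pass per step

speedup = baseline_passes / distilled_passes
assert speedup == 12.5
```

Under these assumptions the distilled sampler does over 12x fewer forward passes, which is consistent with the large gap between the 38-second and 2-second figures once resolution is also reduced.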
These aren’t the longest clips in the field; Helios, a 14B-parameter model, generates video in real time at longer durations. But for quality per second of generation time, HappyHorse is setting the pace.
The Anonymous Playbook
Alibaba isn’t the first lab to submit a model to benchmarks anonymously. But the gap between HappyHorse and the second-place model, 115 Elo points in text-to-video, made the reveal genuinely surprising. That’s not a marginal win. An Elo gap that large means blind evaluators consistently preferred HappyHorse’s output across a wide range of prompts.
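To put the 115-point gap in concrete terms, the standard Elo expectation formula (an assumption here; leaderboards may use variants) gives the probability that the higher-rated model wins a head-to-head comparison:

```python
# Expected win probability under the standard Elo formula:
# E = 1 / (1 + 10^(-diff/400)). Assumes the leaderboard uses vanilla Elo.
def expected_win_prob(elo_diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

p = expected_win_prob(1389 - 1274)  # the 115-point gap from the leaderboard
assert 0.65 < p < 0.67  # favored in roughly two of every three matchups
```

Winning about two out of three blind comparisons against the next-best model is what "not a marginal win" looks like numerically.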
The model is open-source, with API access through fal.ai planned for April 30. At 15B parameters, it’s small enough to self-host on reasonable hardware — a sharp contrast to the hundreds-of-billions-parameter models that dominate text generation.
Whether HappyHorse holds #1 once the next wave of video models ship is an open question. But the unified architecture — generating video and synchronized audio in a single pass from a 15B model — is the kind of engineering that changes what developers assume is possible at this scale.