I don't usually share when things go wrong. Like most people, my public work tends to be the stuff that worked. But I learned a lot from this project, and I want to share what I learned about the problem, and about how not to do research.
I tried to train a Genie-style world model that could learn to segment two players and their independent action spaces from Pong video alone, without labels or hardcoded structure. The model kept collapsing to degenerate solutions. It would always track the ball as "player 1" or even treat the score as an agent. I eventually realized I was trying to force behavior, ignoring the core problem that unsupervised multi-actor discovery requires data diversity and scale. With only one game type and a single camera angle, there's no modelling pressure to learn the "right" segmentation. Any learned decomposition that enables good reconstruction is equally valid.
But I should have known this before I wrote any training code. Instead, I spent weeks designing custom loss functions, tweaking hyperparameters, and trying to debug specific symptoms one at a time.
Background on World Models and Latent Actions
The idea behind Genie and similar work is to learn a latent action space from video without any action labels. Given a sequence of frames $x_{1:T}$, we first learn a video tokenizer that maps frames to discrete latents: $z_t = \mathrm{enc}(x_t)$.
Then, an action posterior is learned on top of this tokenization that infers what "action" caused each transition: $\hat{a}_t = q(z_t, z_{t+1})$. This is different from what I expected on my first read; to me it made the most sense that this would be a heavily labelled dataset trained with a Veo prior! Instead, we learn actions, then learn to use those actions.
Last, with both in place, a dynamics model is learned that predicts the next state given the current state and action:

$$\hat{z}_{t+1} = p(z_{\le t}, \hat{a}_t)$$
Source: TinyWorlds by Anand Maj.
If the dynamics model is trained to minimize the reconstruction error $\lVert \hat{z}_{t+1} - z_{t+1} \rVert$, the action posterior is forced to extract whatever information is needed to predict the future that isn't already in $z_{\le t}$. This is beautiful because no action labels are required.
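To make the pipeline concrete, here is a minimal sketch of one training step. The names (`tokenizer`, `action_posterior`, `dynamics`) are hypothetical stand-ins for the components above, not TinyWorlds' actual API, and it assumes a pre-trained, frozen tokenizer.

```python
import torch
import torch.nn.functional as F

def latent_action_step(tokenizer, action_posterior, dynamics, frames):
    """One training step of the label-free latent-action objective.

    frames: (batch, time, channels, height, width) video clip.
    """
    with torch.no_grad():
        z = tokenizer.encode(frames)           # discrete latents per frame (frozen tokenizer)

    # The posterior sees both z_t and z_{t+1}, so the only way it can help
    # the dynamics model is by encoding whatever changed -- the "action".
    a = action_posterior(z[:, :-1], z[:, 1:])

    z_pred = dynamics(z[:, :-1], a)            # predict the next latents
    return F.smooth_l1_loss(z_pred, z[:, 1:])  # reconstruction drives everything
```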
This works great for single-player games where there's one character and the camera follows them. However, I wanted to push it further with multiple independently-controllable agents, visible in third-person. This means we could have multiplayer world models, with a dynamics model that learns to take in multiple latent actions. This is really cool because it creates an interesting multi-agent setup for RL training. Pong seemed like the simplest testbed, with two players, clear spatial separation, and very simple (linear) dynamics.
Intuitively, if we can predict a new frame from a state $z_t$ and a single action $a_t$, then we could just add more actions: predict the next frame from a state $z_t$ and actions $a^{(1)}_t$ and $a^{(2)}_t$.
What I Built
A lot of the heavy lifting for my implementation was done by Anand Maj's TinyWorlds, an open-source implementation of the Genie architecture. I read the paper, understood the components, and extended it for two players.
Video Tokenizer (FSQ-VAE)
This was trained with Finite Scalar Quantization (FSQ), which avoids the codebook-collapse problems of VQ-VAE by directly quantizing each dimension to a fixed number of bins. With $5$ dimensions and $4$ bins per dimension, that's $4^5 = 1024$ possible tokens per spatial position.
```python
class VideoTokenizer(nn.Module):
    def __init__(self, frame_size=(64, 64), patch=4, d=128,
                 latent_dim=5, num_bins=4):
        super().__init__()
        self.enc_pe = PatchEmbedding(frame_size, patch, d)        # frames -> patch embeddings
        self.enc = STTransformer(d, heads=8, ff=256, n_blocks=4)  # spatio-temporal encoder
        self.latent = nn.Linear(d, latent_dim)                    # project down for quantization
        self.fsq = FSQ(latent_dim, num_bins)                      # finite scalar quantizer
        self.dec = STTransformer(d, heads=8, ff=256, n_blocks=4)  # spatio-temporal decoder
        self.to_pixels = nn.Linear(d, 3 * patch * patch)          # patches back to RGB
```
The tokenizer is trained with smooth L1 reconstruction loss and then frozen for downstream training.
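FSQ itself is only a few lines. Below is a rough sketch of the quantization step, assuming the `latent_dim=5`, `num_bins=4` configuration above; the real implementation differs in details such as the exact bounding function.

```python
import torch
import torch.nn as nn

class FSQ(nn.Module):
    """Finite Scalar Quantization: bound each dimension, round it to a fixed grid."""
    def __init__(self, latent_dim=5, num_bins=4):
        super().__init__()
        self.latent_dim = latent_dim
        self.num_bins = num_bins

    def forward(self, z):
        # Squash each dimension into (0, 1), then snap it to num_bins levels.
        z = (torch.tanh(z) + 1) / 2
        z_q = torch.round(z * (self.num_bins - 1)) / (self.num_bins - 1)
        # Straight-through estimator: quantize on the forward pass,
        # pass gradients through unchanged on the backward pass.
        return z + (z_q - z).detach()
```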
Spatial Mask Network
There were two straightforward ways to approach player identification. First, render two egocentric cameras side by side, each controlled by a different actor. This would be as if you were playing a first-person game and visualizing each player's perspective. Second, you could have a third-person view of the game with multiple players visible. For the game I had selected, the second option made more sense, so I went with it.
For the second case, I created a mask network that takes tokenizer latents and outputs soft assignments to three categories: Player 1, Player 2, and Environment.
$$m_k(z, u) = \mathrm{softmax}_k\big(f(z) + b_k(u)\big)$$

where $f$ maps each latent to three logits, $u \in [0, 1]$ is the normalized horizontal position, and $b_k(u)$ is a learned per-position bias. The idea was that this soft initialization would break the left/right symmetry and guide the model toward the correct segmentation. A very forced decision based on the data I had available.
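In code, the idea looks roughly like this. This is a simplified sketch rather than my actual mask network; `grid_w` and the exact bias initialization are illustrative.

```python
import torch
import torch.nn as nn

class SpatialMaskNet(nn.Module):
    """Soft assignment of each spatial position to {P1, P2, Environment}."""
    def __init__(self, d=128, grid_w=16):
        super().__init__()
        self.head = nn.Linear(d, 3)
        # Learned per-position bias, initialized to favor P1 on the left
        # half of the frame and P2 on the right half.
        u = torch.linspace(0, 1, grid_w)             # normalized x position
        bias = torch.stack([1 - u, u, torch.zeros_like(u)], dim=-1)
        self.pos_bias = nn.Parameter(bias)           # (grid_w, 3), learnable

    def forward(self, z):
        # z: (batch, grid_h, grid_w, d) tokenizer latents
        logits = self.head(z) + self.pos_bias        # bias broadcasts over rows
        return logits.softmax(dim=-1)                # masks sum to 1 per position
```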
Action Posterior and Factorized Dynamics
Actions are inferred via masked pooling: the posterior aggregates the latents in each player's region, then predicts what action caused the transition. I also implemented a dynamics model that is factorized to respect the multi-agent structure:

$$\hat{z}_{t+1} = f_{\text{env}}(z_t) + m_1 \odot f_1(z_t, a^{(1)}_t) + m_2 \odot f_2(z_t, a^{(2)}_t) + f_{\text{int}}(z_t, a^{(1)}_t, a^{(2)}_t)$$

where $f_{\text{env}}$ predicts autonomous dynamics, $f_1$ and $f_2$ predict player-specific updates, and $f_{\text{int}}$ handles interactions. I think the architecture is reasonable, but there may be flaws I still don't understand.
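A sketch of that factorization, with plain linear layers standing in for the actual blocks and hypothetical shapes:

```python
import torch
import torch.nn as nn

class FactorizedDynamics(nn.Module):
    """Next-latent prediction split into environment, per-player, and interaction terms."""
    def __init__(self, d=128, action_dim=8):
        super().__init__()
        self.f_env = nn.Linear(d, d)                   # autonomous dynamics
        self.f_p1 = nn.Linear(d + action_dim, d)       # player-1 update
        self.f_p2 = nn.Linear(d + action_dim, d)       # player-2 update
        self.f_int = nn.Linear(d + 2 * action_dim, d)  # interaction term

    def forward(self, z, a1, a2, masks):
        # z: (B, N, d) latents; a1, a2: (B, action_dim); masks: (B, N, 3)
        a1 = a1.unsqueeze(1).expand(-1, z.size(1), -1)
        a2 = a2.unsqueeze(1).expand(-1, z.size(1), -1)
        delta = (
            masks[..., 2:3] * self.f_env(z)
            + masks[..., 0:1] * self.f_p1(torch.cat([z, a1], dim=-1))
            + masks[..., 1:2] * self.f_p2(torch.cat([z, a2], dim=-1))
            + self.f_int(torch.cat([z, a1, a2], dim=-1))
        )
        return z + delta
```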
What Actually Happened
The model refused to learn the hypothesized segmentation (specifically left paddle as P1, right paddle as P2). Instead, it found solutions I didn't want but couldn't really argue were wrong.
- The mask network puts all probability mass on the Environment category. Reconstruction still works because $f_{\text{env}}$ just learns to predict frame-to-frame changes directly, ignoring the player branches entirely. The non-player regions made up most of the variance between any two frames.
- An interesting metric was the horizontal center of mass of each player's mask (see the sketch after this list). In a successful run, you'd expect P1 near $x \approx 0.1$ (left edge) and P2 near $x \approx 0.9$ (right edge). Instead I got both clustering around 0.5 (the ball), or one at 0.5 and one at 0.9 (ball plus one paddle), or both on the same side.
- The ball has the highest motion variance between frames. If your objective rewards predicting change, the ball is the thing that changes most. Often, the model would identify the ball as one of the two players instead of as part of a constantly changing environment.
- The score counters change discretely and predictably. In one run, the model decided this was a player!
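The center-of-mass diagnostic is simple enough to show. A sketch, assuming a soft mask over a `grid_h x grid_w` token grid:

```python
import torch

def mask_center_of_mass(mask):
    """Horizontal center of mass of one player's spatial mask.

    mask: (grid_h, grid_w) soft assignment for one player, values in [0, 1].
    Returns a scalar in [0, 1]: ~0.1 means the mask sits on the left paddle,
    ~0.9 on the right paddle, ~0.5 on the ball / center of the court.
    """
    grid_w = mask.shape[-1]
    x = torch.linspace(0, 1, grid_w)
    weights = mask.sum(dim=0)  # collapse rows to a per-column weight
    return (weights * x).sum() / weights.sum().clamp_min(1e-8)
```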
Forcing Behavior
When the model's behavior diverged from my expectations, I added losses that penalized it for doing so.
```python
weights = {
    'recon': 1.0,     # reconstruction loss
    'coverage': 5.0,  # force each mask onto its expected region
    'overlap': 50.0,  # but regions cannot overlap!
    'cf': 0.0,        # counterfactual consistency
    'kl': 0.0,        # KL(posterior || prior)
    'sens': 0.5,      # sensitivity to action changes
}
```
Each loss was a patch over a symptom, and together they mostly added noise to the gradients.
The sensitivity loss is my favorite example of how this goes sideways. Simply put, I was trying to maximize the change in P1's region when P1's action changes. But think about the game I selected: in Pong, the ball has the highest motion variance in the scene whenever it is present. So the optimal solution was to make the spatial mask track the ball, because that maximizes sensitivity. I didn't even think about this because I was so focused on "naturally" learning to mask the players without forcing it (because forcing it felt like cheating).
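Here is roughly what the sensitivity term looked like, using the same hypothetical interfaces as the sketches above. Nothing in it says P1's region has to contain a paddle; a mask glued to the ball maximizes it just as well.

```python
import torch

def sensitivity_loss(dynamics, z, a1, a1_alt, a2, masks):
    """Reward change inside P1's masked region when only P1's action changes.

    The flaw: the easiest way to make the masked region change a lot is to
    put the mask on whatever already changes the most -- in Pong, the ball.
    """
    pred = dynamics(z, a1, a2, masks)
    pred_alt = dynamics(z, a1_alt, a2, masks)
    diff = (pred - pred_alt).abs().mean(dim=-1)         # (B, N) per-position change
    p1_region_change = (masks[..., 0] * diff).sum(dim=-1)
    return -p1_region_change.mean()                     # maximize, so negate
```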
I've decomposed my thinking into three lessons.
1. Read for Assumptions, Not Just Architecture
To reiterate, I read the Genie paper, understood the architecture, looked at TinyWorlds code, understood the components. Then I extended it.
But I didn't properly question the key assumptions of the design. The single-agent setup creates a natural asymmetry that makes the whole approach tractable. For the most part, there's one controllable character, the camera follows them, and everything else is just environment. The camera is attached to the player, so actions cause ego-motion. You don't need to segment multiple actors because there's only one that matters, and it's implicitly defined by the camera.
Once you go third-person with multiple visible actors, the model has to figure out which visual entities are independently controllable, learn separate action spaces for each, figure out how they interact, and do all of this without labels. I would have realized this was a much harder problem if I'd spent a day actually thinking about why the original design was the way it was, instead of just understanding what it did.
It's rare that things work due to divine benevolence beyond our understanding. The choices that seem incidental are often doing the most work. Normally, I question what breaks if a part of the design is changed, but this time I didn't, because I was convinced it would work and wanted results as soon as I could get them. That, in turn, cost me weeks.
2. Failure Modes Are Information
When the model kept finding degenerate solutions, I treated it as a bug to fix. Most often, I would add a loss to penalize it.
But the model wasn't being stupid. It was doing exactly what I asked it to do, which was to minimize reconstruction error. For example, the ball is the highest-variance object in the scene. If you're trying to predict the next frame, tracking where the ball goes is genuinely useful. The model found a valid solution to the objective I gave it!
This was information I kept ignoring. When your model consistently lands on solutions you don't want, it's telling you something about the problem. Either your objective doesn't prefer the solution you want, or multiple solutions are equally valid under your objective. For me it was the second, since the reconstruction loss doesn't care what is classified as what. Any decomposition that predicts well works fine.
Every new patch was another attempt to force the model toward my preferred solution without asking whether the problem was well-posed. Another example is the soft spatial prior I gave to the mask initialization. I knew the model needed help breaking symmetry, so I initialized the bias to favor P1 on the left and P2 on the right. But it's a learnable parameter. The model gradient-descended away from it in a few thousand steps. I was giving hints when I needed to give hard constraints, and the reason I kept giving hints is that I still believed, somewhere, that the model could figure it out with just a nudge.
3. Identifiability Is a Gating Question
I never took the time to question whether my data uniquely determined the structure I wanted to learn.
The reconstruction objective is:

$$\mathcal{L}_{\text{recon}} = \lVert \hat{z}_{t+1} - z_{t+1} \rVert$$

Any decomposition into $(m_1, m_2, m_{\text{env}})$ that achieves low reconstruction error is equally valid from the model's perspective. The true factorization (paddles are players, ball is environment) is just one of many minima. My data, being purely Pong videos, does not specify this.
This is an identifiability problem. Here I'm trying to identify independent agents from pixels, but a single game doesn't provide enough structure to make the true factorization uniquely optimal.
A single game like Pong provides nowhere near that structure. The paddles look identical, so nothing visually distinguishes player from opponent. The ball, paddles, and score all change between frames, so any of them could plausibly be "controlled." If the data doesn't demand the structure, you are in a poor position to learn a structure that is not naturally present.
When This Would Work
The research direction wasn't wrong. I'm fairly confident it can work. If you train on thousands of different games, the true factorization will emerge. The ball-as-player hack that works for Pong doesn't transfer to Street Fighter, and the score-as-player hack doesn't work in racing games. Diversity forces the model to find what actually generalizes, which is that player inputs control player avatars.
Many of the decisions I made only make sense under an assumption of data diversity and scale. For example, the counterfactual loss prevents changing P1's action from affecting P2's region (a sketch follows below). On a single game, there are many ways to satisfy this. Across thousands of games, it is the true factorization that satisfies it.
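For contrast with the sensitivity term, a sketch of that counterfactual loss, again with the same hypothetical interfaces: it penalizes any change inside P2's region when only P1's action is swapped.

```python
import torch

def counterfactual_loss(dynamics, z, a1, a1_alt, a2, masks):
    """Changing P1's action should not change the prediction inside P2's region."""
    pred = dynamics(z, a1, a2, masks)
    pred_alt = dynamics(z, a1_alt, a2, masks)
    diff = (pred - pred_alt).abs().mean(dim=-1)     # per-position change
    leakage = (masks[..., 1] * diff).sum(dim=-1)    # change leaking into P2's mask
    return leakage.mean()
```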
This is exactly what DeepMind had when they built Genie: internet-scale video data, thousands of games, many camera angles. The diversity was doing work that I was trying to replace with loss functions. With that kind of data, I'd argue you'd need just reconstruction, counterfactual consistency, and a KL term to prevent action collapse.
I was so excited about the idea, but I blinded myself because I just wanted to get results. These findings might seem obvious in hindsight, but I think that's the point—the mistakes that cost you weeks usually are.