
My Experience using Tinker

Rajan Agarwal

In the middle of a project on learned compression, I was having a lot of problems with my distributed RL training setup and reward collapse. The codebase was becoming uncomfortably large, but at the perfect time I got access to Tinker, and I was up until 3am moving my code over to it.

I was initially very skeptical of Tinker, thinking it abstracted away too much of the nuance of RL. My project was pretty different from anything I've done before: it required co-training two models, a lot of sampling, and a lot of low-level control over the training loop. I still wanted to give it a fair shot.

Onboarding to Tinker is super simple: just an API key and the documentation. They generously added $150 to my account to help me get started with projects, which I'm grateful for.

What I loved

First of all, the API is super easy to use. I don't have to worry about distributed training and inference; I can focus on designing the experiment and architecture. There is a list of pre-set models with their pricing per million tokens of sampling, prefill, and training, as well as per GB of storage. MoE support came without lifting a finger. On the platform you can store model checkpoints, see which training runs are active, and track how much you've spent.

My entire experiment shrank by ~5x in lines of code after moving to Tinker. It broke down into the three abstractions Tinker gives you (a single step is sketched after this list):

  • sample_async (inference/rollout)
  • forward_backward_async (compute loss + grads)
  • optim_step_async (apply optimizer)
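
Put together, one step of my loop looked roughly like this. It's a sketch rather than Tinker's exact API: the sampling and optimizer arguments are elided, and build_datums stands in for my experiment-specific reward/advantage code.

import asyncio

# Sketch of one RL step built from the three primitives above.
# build_datums() is a placeholder for experiment-specific logic, and the
# elided (...) arguments are not Tinker's exact signatures.
async def train_step(training_client, sampling_client, prompts):
    # 1. Rollout: sample completions for this batch of prompts
    rollouts = await asyncio.gather(
        *[sampling_client.sample_async(prompt, ...) for prompt in prompts]
    )
    # 2. Turn the rollouts into datums (tokens + loss_fn_inputs)
    datums = build_datums(rollouts)
    # 3. Compute loss and gradients server-side
    fwd_future = await training_client.forward_backward_async(datums, loss_fn="cross_entropy")
    result = await fwd_future  # loss, per-datum loss_fn_outputs, etc.
    # 4. Apply the optimizer update
    opt_future = await training_client.optim_step_async(...)
    await opt_future
    return result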

There was no hard-coded RLHF loop; Tinker doesn’t force a specific reward model or ranking pipeline, meaning I could build an entirely custom ΔlogP + length reward in plain Python. I could run multiple models in one script, all sharing the same base model and tokenizer. I used PPO for the experiment, but with the built-in loss all I had to supply was target_tokens, logprobs and advantages. Tinker handled ratio computation, clipping, and per-token policy loss server-side, so I didn't have to touch CUDA or PyTorch.
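
Concretely, each datum I handed to the PPO loss only needed three per-token arrays. The plain-dict layout below is a sketch of the idea, not the SDK's exact Datum/ModelInput types:

import numpy as np

# Sketch: the three per-token arrays the built-in PPO loss needs from you.
# The plain-dict layout is illustrative, not the SDK's exact Datum types.
def make_ppo_datum(prompt_tokens, sampled_tokens, sampled_logprobs, advantages):
    return {
        "model_input": prompt_tokens + sampled_tokens,  # token ids fed to the model
        "loss_fn_inputs": {
            "target_tokens": np.asarray(sampled_tokens),             # what the policy generated
            "logprobs": np.asarray(sampled_logprobs),                # logprobs recorded at sampling time
            "advantages": np.asarray(advantages, dtype=np.float32),  # from my custom ΔlogP + length reward
        },
    }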

As a small illustration of how much control you get, running forward_backward_async(..., loss_fn="cross_entropy") returns loss_fn_outputs[i]["logprobs"] for each datum, which was exactly what I needed to construct the reward and metrics for my experiment. This is what made the dense, token-level reward possible. Everything being async was a little confusing at first (see below), but it also made it possible to run multiple models concurrently!

# Submit forward/backward for both models; each call returns a future right away
future = await client.forward_backward_async(datums, loss_fn="cross_entropy")
future2 = await client2.forward_backward_async(datums2, loss_fn="cross_entropy")
# Await the results once both are in flight
result = await future
result2 = await future2
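
Pulling the per-token logprobs back out of those results is what fed the reward. A simplified version, where the attribute access and the reduction are approximations of what I actually did:

import numpy as np

# Simplified: recover per-token logprobs for datum i from both models and
# reduce them to a ΔlogP-style scalar. Attribute access here is approximate.
def delta_logp(result, result2, i):
    lp = result.loss_fn_outputs[i]["logprobs"].to_numpy()
    lp2 = result2.loss_fn_outputs[i]["logprobs"].to_numpy()
    return float(np.sum(lp) - np.sum(lp2))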

Using LoRA caused no problems either. It gave me cheap, lightweight adaptation on top of a large base model, and it was trivial to spin up 3 clients for different roles without worrying about full-model finetunes or sharding. The RL Environment abstraction is really nicely set up too: an Env exposes initial_observation() and step(), both operating on tokens rather than strings; an EnvGroupBuilder instantiates multiple envs per batch; and an RLDataset is a dataset of these builders (sketched below).
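
To show the shape of it, here is a schematic environment in that style. It's my own minimal mirror of the interface described above, not the actual library base classes, and the return structure of step() is simplified.

from dataclasses import dataclass

# Schematic only: mirrors the token-level Env interface described above,
# not the real base classes or their exact return types.
@dataclass
class StepOutcome:
    reward: float
    done: bool
    next_observation: list[int]  # token ids, never raw strings

class CompressionEnv:
    def __init__(self, tokenizer, document: str):
        self.tokenizer = tokenizer
        self.document_tokens = tokenizer.encode(document)

    def initial_observation(self) -> list[int]:
        # The policy only ever sees tokens
        return self.document_tokens

    def step(self, action_tokens: list[int]) -> StepOutcome:
        # Score the action (e.g. a proposed compressed encoding) however you like
        reward = float(len(self.document_tokens) - len(action_tokens))
        return StepOutcome(reward=reward, done=True, next_observation=[])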

Finally, the errors it throws are super clear. They make it obvious what isn't supported, which lines are wrong, and why they are wrong, which helped me debug my code quickly! I think this is a design choice that a lot of APIs/SDKs neglect.

What I wish I had

Right now, SamplingClient.sample() is a one-prompt-to-many-completions function. You can batch across prompts by calling sample multiple times and asyncio.gather-ing, which works, but when I'm doing RL with many prompts per step, it'd be nice to have a first-class batch sampler that understands this pattern and does the minimal number of prefill passes. For the experiment, I wrote a mini-batcher with asyncio (below), which is fine but a bit verbose.
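
The workaround is simple enough; roughly this, with the actual sampling arguments elided since only the fan-out pattern matters:

import asyncio

# Fan out one sample_async call per prompt and gather the completions.
# Sampling arguments are passed through untouched; this is just the batching pattern.
async def sample_batch(sampling_client, prompts, **sampling_kwargs):
    futures = [
        sampling_client.sample_async(prompt, **sampling_kwargs)
        for prompt in prompts
    ]
    return await asyncio.gather(*futures)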

I also had a bit of trouble with TensorData (thank you ChatGPT for helping): I build loss_fn_inputs as a dict of numpy arrays or TensorData objects, Tinker returns TensorData in loss_fn_outputs, and I then convert those back to numpy via .to_numpy() before doing any analysis. Maybe I'm missing something, but juggling three models and several different logprob tensors this way was hard to keep straight.
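
I ended up with a tiny helper along these lines (mine, not part of the SDK) so the analysis code only ever sees numpy:

import numpy as np

# My own convenience shim, not part of the SDK: accept either a numpy array
# or a TensorData-like object and always return numpy for analysis code.
def as_numpy(x):
    if isinstance(x, np.ndarray):
        return x
    return x.to_numpy()  # TensorData path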

The main concern was actually with logging. I had to wrap everything with wandb to have independent logs. The Tinker logs for the experiments that were running weren't very useful: they just told me that an experiment with some UUID existed, which model was being used, and how long ago it started. I think this is a layup for logging inside of Tinker, which is likely being worked on, but I honestly never had much reason to look at the dashboard, except to marvel at how cheap everything was. A simple tinker.log(…) call would be awesome.
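
In the meantime, the wrapping is straightforward; mine looked roughly like this, with one wandb run per model role (the metric names are from my experiment, not anything Tinker emits):

import wandb

# My own logging, not Tinker's: one wandb run per model/role keeps the curves separate.
run = wandb.init(project="learned-compression", name="policy-model")

def log_step(step, delta_logp, reward, loss):
    wandb.log({
        "delta_logp": delta_logp,
        "reward": reward,
        "loss": loss,
    }, step=step)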

Finally, there is barely any multimodal RL support in most RL or inference libraries. If easy multimodal RL is a next feature here, I'm all for it.

Conclusion

Tinker made me so much faster at doing research. Most importantly, it works, which is what made my experiment possible in the first place. It is very willing to get out of your way, which is exactly what you want when you're trying to coax a model into inventing its own compression scheme. I am super optimistic about tools like Tinker and how they will make RL research faster! This is going to become a new normal for me in future experiments.