Dense reconstruction is the scaffold of machine learning
19 Nov 2025
7 minute read
Cross-posted to my technical substack
We deconstruct the human world, stone to sand. We take the sand and use it to make glass, and then feel surprised at seeing our reflection look back at us. For now, machine learning is deconstruction for the sake of reconstruction.
I argue most ML progress comes from one kind of training objective: dense reconstruction.
- take real-world data
- decompose it into parts (this is the key step)
- train a model to generate the parts and the whole
We’re hand-building a ladder of such tasks, smoothly increasing in difficulty, so our models can learn. What’s at the top?
When a task can be framed as reconstruction, it becomes much easier to get a verified signal on the model’s performance – just check whether its reconstruction looks like the target. When we can’t frame a task this way, evaluating open-ended generations requires designing training signals that are often brittle or expensive.
Most of the wins of machine learning come from reconstruction objectives. But our exploration algorithms and optimizers are limited, so in practice we can only learn reconstructions that are dense, i.e. that can be supervised in a dense way, with the supervision signal scaling smoothly with difficulty. That is when we can really learn.
Almost every recent paradigm of machine learning comes with a new way of eking out data from the world and turning it into a reconstruction objective, along with a scale-up in compute¹:
- basic supervised learning: how can I train a model to reconstruct a labeled distribution?
- contrastive learning: how can I learn the differences between objects in the world?
- autoencoders (literally a reconstruction loss), diffusion denoising (see the sketch after this list)
- next-token prediction: compress the distribution of web text
- SFT+RL on human process and problem solving²
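To make the common shape of these objectives concrete, here is a minimal sketch in PyTorch (illustrative only; `encoder`, `decoder`, and `lm` are assumed stand-in models, not anything from a specific system). Both losses score the model purely by how well it rebuilds its input:

```python
import torch.nn.functional as F

def autoencoder_loss(encoder, decoder, x):
    """Literal reconstruction: compress x, rebuild it, compare to the original."""
    z = encoder(x)          # decompose the input into a latent code
    x_hat = decoder(z)      # reconstruct the whole from the parts
    return F.mse_loss(x_hat, x)

def next_token_loss(lm, tokens):
    """Next-token prediction: reconstruct token t+1 from tokens 0..t."""
    logits = lm(tokens[:, :-1])                 # predict each following token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),    # (batch * seq, vocab)
        tokens[:, 1:].reshape(-1),              # the tokens being rebuilt
    )
```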
As our algorithms and compute capacity improve, our models are also able to learn increasingly general concepts, representations and rules from the reconstruction signals we provide. In the limit, you could train a model on a huge set of raw physical observables (the motion of the clouds and the stars, chaotic systems, forecasting data, etc…) and have it discover things no human ever has.
But the bottleneck to training such a model is not just a question of scale. It’s also the sparsity of the learning signal – and in fact, most progress has come precisely when we made our learning tasks dense.
Reconstructions are learned when they can be made dense
Models learn to reconstruct data when the reconstruction can be made dense: the model gets supervision on a ladder of reconstruction subtasks whose difficulty increases gradually, so that each new rung is only a small step beyond what it can already do and gradients remain informative instead of vanishing into noise. (eg. first make a blob, then a few lines coming out of it, then a stick figure, then two or three, then give them faces, then make them red, then put them in a city, then make it futuristic.) This is like a curriculum, a schedule of learning problems where the pass rate is never too low: the model is trained on tasks where it already puts some probability on the answer, and learns from those to assign correct probability to harder and harder tasks.
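As a toy illustration of that “pass rate is never too low” schedule, here is a minimal sketch of a difficulty scheduler; `sample_task` and `model_passes` are hypothetical stand-ins for a real task generator and a 0/1 evaluator of whether the model’s reconstruction succeeded:

```python
def curriculum_step(model, difficulty, window=200, promote_at=0.8, demote_at=0.1):
    """Evaluate one window of tasks at the current rung, then adjust the rung."""
    passes = sum(model_passes(model, sample_task(difficulty)) for _ in range(window))
    rate = passes / window
    if rate > promote_at:                   # rung mastered: take a small step up
        difficulty += 1
    elif rate < demote_at:                  # gradients vanishing into noise: step back down
        difficulty = max(0, difficulty - 1)
    return difficulty
```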
These distributions and objectives can be made dense in several ways:
- the distribution is already dense – it manifests an implicit curriculum, a subtask structure where subtasks have different difficulties and depend on each other, so the model can gradually and smoothly solve more and more of them
- In the Quantization model of neural scaling, power-law scaling is attributed to a natural, compositional ladder of subtasks in human language. If language naturally has this structure, with tasks increasing in difficulty – local patterns, then short-range dependencies, then long-range structure and reasoning – this could explain why LLMs pre-train so well.
- the distribution has a structure that can be biased or transformed synthetically to make it manifest a curriculum
- in diffusion models, we gradually add noise to images with very careful noise scheduling, leveraging the continuous structure of image data to vary the difficulty of image reconstruction (see the first sketch after this list)
- in language model coding reinforcement learning (and other similar domains), we create inverse problems: take existing codebases and corrupt parts of them, training the agent to fix them. We can smoothly vary how much we destroy (just one function versus a whole module versus half the codebase) to vary difficulty, like in Breakpoint (see the second sketch after this list).
- even in language model in-context learning, various results suggest that biasing the algorithmic structure of the task being learned can be essential for generalization
- humans curate datapoints to create a dense curriculum – manually picking and ordering reconstruction tasks by difficulty
- Mercor and Scale, data curation, etc…
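To make the first of these concrete, here is a minimal sketch of the standard DDPM-style forward process and denoising loss (illustrative; `alpha_bar` is the usual cumulative product of the noise schedule, and `model` an assumed noise-prediction network). The timestep t is the difficulty dial:

```python
import torch
import torch.nn.functional as F

def noised(x0, t, alpha_bar):
    """Corrupt a batch of clean images x0 to difficulty level t (more t, more noise)."""
    noise = torch.randn_like(x0)
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)        # how much signal survives
    s = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)  # how much noise is mixed in
    return a * x0 + s * noise, noise

def denoising_loss(model, x0, alpha_bar):
    # Sample a random rung of the ladder per image: small t is an easy,
    # nearly clean reconstruction; large t is almost pure noise.
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],))
    x_t, noise = noised(x0, t, alpha_bar)
    return F.mse_loss(model(x_t, t), noise)
```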
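And for the inverse-problem recipe, a toy sketch (not Breakpoint’s actual implementation); `functions` is a hypothetical list of (name, source) pairs, and the destroyed fraction is the difficulty dial:

```python
import random

def make_repair_task(functions, difficulty):
    """difficulty in (0, 1]: fraction of the codebase destroyed –
    one function at the low end, half the codebase at the high end."""
    n_broken = max(1, int(difficulty * len(functions)))
    broken = set(random.sample(range(len(functions)), n_broken))
    corrupted = [
        (name, "raise NotImplementedError  # body removed" if i in broken else src)
        for i, (name, src) in enumerate(functions)
    ]
    # The agent is shown `corrupted` and rewarded for making the original
    # test suite pass again, i.e. for reconstructing the destroyed parts.
    return corrupted, broken
```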
Finally, mixing the various steps and phases of training enables a strong form of this dense training. Each phase depends on the intelligence acquired in the previous one, which lets us gradually increase the difficulty and complexity of the distributions we train on within a single model. The release of o1 and the growth of LM RL as a successful practice are in large part due to advances in the capabilities and patterns learned by the base model.
In contrast, sparse tasks, where this smooth difficulty ladder does not exist and cannot be conjured into existence, are much, much harder to learn (eg 0-1 loss image reconstruction, learning combinatorial functions, RL on human tasks without supervision or process SFT, …). Bridging that gap and solving open-ended problems without any reconstruction signal (eg the Riemann hypothesis, any kind of long-horizon 0-1 RL, especially from scratch without a foundation model, etc…) seems even harder.
But a significant, and incredibly impactful, fact about human society is that it too is largely optimized to manifest a dense difficulty distribution of tasks. In science, we make discoveries that build on the existing scientific corpus, we gradually learn harder and harder concepts, and we do this in a way where, at every timestep, our outputs must be legible to the rest of the world at its current, nearby level of knowledge. If you take all human data, you are ingesting trillions of tokens that encode tasks at many different difficulty levels, corresponding to the countless activities and competences of all the humans in the world. I think this is a big part of why LLMs have been able to train so well on our data: we are handing them our own natural ladder.
The natural next question is: just how far can we go up this ladder? When we exit the realm of problems we can manually decompose and curate for the sake of training the models, will they have bootstrapped far enough up to learn the skills to do the decomposition and create a curriculum on their own, or not?
1. One notable exception to this is self-play environments like Chess or Go, where external data can be counter-productive. Self-play environments are extremely learnable because playing against yourself naturally gives you an adjusted difficulty curriculum AND a dense signal on the effect of changes in your decision-making (ie counterfactuals).
2. You can argue that eg doing LLM RL on math is not a reconstruction task. Many recent results suggest that RL on LLMs elicits and amplifies patterns already present in the base model, patterns it learned by imitating the processes and actions humans take in its training data. Once the right actions have high probability, RL can explore and amplify them; I argue this is the kind of dense reconstruction I am referring to, though in a looser sense.