THE MODELS HAVE EYES

The old deal with image models was simple. You asked for a picture and the model gave you one: a logo concept, a fake product shot, a cinematic robot in a rainy alley. The result was sometimes beautiful and usually detached from the rest of the work.

That framing is already too small.

I use GPT Image 2 regularly, including for the cover images on this site, and the failures that interest me are never aesthetic. They are spatial. A hand meeting a tool at an impossible angle, a cable that connects to nothing.

Which is exactly why the successes matter. A good image model is not just a renderer. It is a model holding objects, labels, sequence, occlusion, and intent in the same space.

That is a different kind of intelligence. Language can tell you what to do. Images can show you what the world should look like while you are doing it.

Take a faulty dishwasher. A text model can say, "Remove the lower rack, take out the filter assembly, check the drain pump cover." That may be correct, and it may also be useless if you do not know what any of those things look like.

A diagram helps, but a diagram is generic, a simplified abstraction made for everyone, which means it is not quite made for you. Your dishwasher has the screws in slightly different places. The part you are looking at is grimy, upside down, and blocked by the exact shadow the manual never bothered to draw.

The next interface is not a manual. It is a sequence of generated and inspected visual states.

You take a picture. The model labels the visible parts: filter, impeller cover, drain channel, retaining clip. You ask what to do next, and instead of answering only in prose it shows you the next state, the rack removed, the filter lifted, the specific tab circled, then the state after that.

This is the important shift. The model can depict the future: not the future as prophecy, but the future as an intended state.

For physical tasks that is often what you actually need. Not a paragraph, but what the world should look like after the next move, held up against what is in front of you.

That makes image generation a planning surface. A model that can generate the next correct visual state is tracking relationships: that a hose runs behind a panel, that a screwdriver must meet the screw at a plausible angle, that the removed filter should now be on the counter and not still locked in place.

Those details sound small until you realize they are the whole problem. Most of the world is not made of concepts. It is made of parts in relation to other parts.

Text models are strong at symbolic movement, but the physical world punishes vague symbols. "Open the panel" is not an action until you know which panel, where the fastener is, how much force is normal, and what the safe intermediate state looks like.

This is why image generation is a prerequisite for spatial understanding. A sentence can hide an impossible instruction. A picture cannot hide it as easily, and if the generated hand reaches through a pipe, the failure is visible.

Pixels become a form of accountability.

Labels matter for the same reason. A label is not a caption pasted on top of an image. It is anchored to a thing, and if the model says "do not pull this wire," it has to point to the wire.
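One way to make "anchored to a thing" concrete is to treat a label as text plus a region of the photo, so that pointing at a spot returns whatever is labeled there. The sketch below is only an illustration: `Box`, `Label`, and `labels_at` are hypothetical names, and axis-aligned boxes in pixel coordinates are an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region of the photo, in pixels (an assumed representation)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Label:
    """A label that only exists together with the region it points at."""
    text: str
    region: Box
    warning: bool = False  # True for "do not touch" style labels


def labels_at(labels: list[Label], x: float, y: float) -> list[Label]:
    """Return every label whose region covers the point the user indicates."""
    return [label for label in labels if label.region.contains(x, y)]


labels = [
    Label("drain hose (ribbed tube)", Box(40, 200, 120, 260)),
    Label("supply line (braided), do not pull", Box(150, 200, 230, 260), warning=True),
]
print([label.text for label in labels_at(labels, 180, 230)])
```

A label with no region cannot be constructed here, which is the point: the warning has to land on the wire.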

There is a huge difference between "check the drain hose" and an image that says "this ribbed tube, not the braided supply line." The second version respects the user's actual uncertainty. It does not assume vocabulary or mechanical confidence.

The pattern goes well beyond repair. A nurse learning a new device needs to see what "properly seated" looks like from the angle she will actually have in the room. A designer told to "make the hierarchy clearer" needs a concrete visual alternative, same content, same constraints, showing what "clearer" could mean.

This is where generated images stop being outputs and start being intermediate reasoning artifacts. The model is saying, "Here is my understanding of the world state. Is this what you mean?"

The user corrects, uploads the next photo, and the model compares the actual state to the intended one. The instruction stops being a static answer and becomes a guided visual conversation: a photo of the current state, labels for what matters, an image of the next state, a warning about what not to touch, verification against the next photo, repeated until the task is done.
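The loop above can be sketched in a few lines. This is a toy under stated assumptions, not a real system: `Step`, `verify`, and `run_session` are hypothetical names, scene state is reduced to a dictionary mapping each part to a condition, and `observe` stands in for "the user uploads the next photo and a vision model labels it."

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One turn: the instruction and what the scene should look like afterwards."""
    instruction: str
    expected_state: dict[str, str]  # part name -> intended condition
    do_not_touch: list[str] = field(default_factory=list)


def verify(expected: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Return the parts whose observed condition differs from the intended one."""
    return [part for part, cond in expected.items() if observed.get(part) != cond]


def run_session(
    steps: list[Step],
    observe: Callable[[Step], dict[str, str]],
) -> list[tuple[str, list[str]]]:
    """Walk the steps, checking each against a fresh observation.

    Stops at the first mismatch so the user can correct before moving on.
    """
    log = []
    for step in steps:
        mismatches = verify(step.expected_state, observe(step))
        log.append((step.instruction, mismatches))
        if mismatches:
            break
    return log


steps = [
    Step("Remove the lower rack", {"lower rack": "removed"}),
    Step("Lift out the filter", {"filter": "on counter"}, do_not_touch=["float switch"]),
]
# Stand-in for two uploaded photos: the second shows the filter still locked.
photos = iter([{"lower rack": "removed"}, {"filter": "locked in place"}])
print(run_session(steps, lambda step: next(photos)))
```

The design choice worth noticing is that the session halts on a mismatch instead of continuing. The intended state is something to compare against, not something to assume.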

That is not "make me an image." That is "help me act in the world."

There are obvious limits. A generated repair sequence is not proof, and electricity, gas, medicine, and structural work need authority, caution, and sometimes a licensed person. A confident-looking picture can be worse than a vague paragraph if the user mistakes plausibility for truth.

So the product design matters. These systems need to distinguish what they can see in the user's photo from what they are inferring, and to say "this is a likely layout, not your exact model." A generated future state should be presented as a plan to check, not a fact to obey.
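A minimal sketch of that distinction, assuming every claim the system makes carries a tag for where it came from. The names `Source`, `Claim`, and `present` are hypothetical, and the wording of the hedge is only an example of how an inferred claim might be phrased differently from an observed one.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    OBSERVED = "observed"  # visible in the user's own photo
    INFERRED = "inferred"  # guessed from typical layouts of similar machines


@dataclass(frozen=True)
class Claim:
    part: str
    detail: str
    source: Source


def present(claim: Claim) -> str:
    """Phrase a claim so the user can tell seeing from guessing."""
    if claim.source is Source.OBSERVED:
        return f"In your photo: {claim.part} {claim.detail}."
    return (
        f"Likely layout, not your exact model: {claim.part} {claim.detail}. "
        "Check before acting."
    )


claims = [
    Claim("the filter", "is seated in the tub floor", Source.OBSERVED),
    Claim("the drain pump cover", "sits under the filter", Source.INFERRED),
]
for claim in claims:
    print(present(claim))
```

Because the tag travels with the claim, the same split could drive how a generated image is drawn, for example solid outlines for observed parts and dashed ones for inferred parts.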

Still, the direction seems clear to me. For years we talked about multimodal models as if images were just another input type: upload a photo, get a caption. That undersells the shift.

Images are not just inputs. They are working memory for space.

When a model can produce a useful image of the next step, it has created a temporary world you can inspect together. It might be a dishwasher with the pump cover removed, a circuit board with the right connector highlighted, a prototype interface with the button moved to where the user will actually look.

The model is no longer only answering. It is staging.

That is why the image model matters before the robot, before the agent, before the fully autonomous helper. Before a system can act well in space, it has to imagine space well enough to be corrected.

The hills had eyes because the landscape was watching. The models have eyes because the interface is finally learning where things are.