Diffusion Model

Every image Midjourney has ever created started as static. Pure random noise, the visual equivalent of white noise on a detuned TV. The model looked at that chaos and, step by step, carved an image out of it, the way a sculptor reveals a figure hidden inside a block of marble.

That’s a diffusion model. And the idea behind it came from watching ink dissolve in water.

What a Diffusion Model Actually Is

A diffusion model is a type of generative AI that creates images (and increasingly video, audio, and 3D objects) by learning to reverse the process of destruction. During training, the model takes real images and gradually corrupts them with random noise, step by step, until nothing remains but static. Then it learns to run that process backward: starting from noise and removing it, one small step at a time, until a clean image emerges.

If that sounds counterintuitive, think of it this way. Teaching an AI to create a photorealistic image from scratch in a single leap is absurdly hard. But teaching it to make a very noisy image slightly less noisy? That’s manageable. Do it a few hundred times in sequence, and you’ve walked all the way from static to a finished picture.

The “diffusion” in the name comes directly from physics: the same word scientists use for how molecules spread through a space, like perfume dispersing through a room or a drop of ink dissolving in a glass of water. The AI technique borrows not just the name but the actual mathematics.

From Colored Blobs to Cover Art

In 2015, a postdoctoral researcher at Stanford named Jascha Sohl-Dickstein had an unusual background for someone about to reshape AI. He studied machine learning, but his side interest was nonequilibrium thermodynamics, the branch of physics that describes how systems exchange energy with their environment. Sohl-Dickstein noticed something about physical diffusion: if you watch a drop of ink dissolve in water, you start with a complex, concentrated pattern and end with a featureless, uniform blue. The process destroys all structure. But the math said that with sufficiently small steps, it could be reversed.

He built the first diffusion model, published a paper, and produced results that were, charitably, terrible. He later described squinting at the output and saying, “I think that colored blob looks like a truck.” Nobody was impressed. The field’s attention was focused on a rival technique called Generative Adversarial Networks (GANs), which produced sharper images by pitting two neural networks against each other in a counterfeiting arms race.

The diffusion approach sat largely dormant for five years.

Then, in 2020, a UC Berkeley researcher named Jonathan Ho decided to revisit the idea. Ho was drawn to diffusion models not for practical reasons but because he thought they were “the most mathematically beautiful subdiscipline of machine learning.” His paper, “Denoising Diffusion Probabilistic Models,” combined Sohl-Dickstein’s framework with modern neural network architectures and produced images that matched, and sometimes beat, the best GANs. The colored blobs were gone. The images were sharp, detailed, and diverse.

A year later, OpenAI researchers published results under the pointed title “Diffusion Models Beat GANs on Image Synthesis.” The evidence was decisive. The old king was dead.

The final piece fell into place in Germany. Robin Rombach and his colleagues at LMU Munich developed a technique called latent diffusion, which compressed images into a smaller mathematical space before running the diffusion process. This made generation dramatically faster and cheaper. Their research became Stable Diffusion, released as open-source software in August 2022. Midjourney, DALL-E 2, and Stable Diffusion all launched within months of each other that year, powered by the same fundamental idea that Sohl-Dickstein had roughed out seven years earlier with his colored blobs.

One of the strangest details of this story: three different researchers (Sohl-Dickstein in 2015, Yang Song independently at Stanford in 2019, and Ho in 2020) arrived at essentially the same mathematical framework from completely different starting points, without initially knowing about each other’s work. Song later compared their convergence to the discovery that two seemingly different theories in quantum physics were actually the same theory in different notation. Sometimes an idea is so right that multiple people find it independently.

How the Sculpting Actually Works

The process has two halves, and understanding both makes the whole thing click.

The forward process is the easy part. Take a real image and add a tiny amount of random noise. Then add a little more. Then more. After hundreds of steps, the original image has been completely destroyed, replaced by pure static. Nothing is learned in this direction; running it on thousands of training images simply manufactures practice material, showing the model what a picture looks like at every stage of corruption.
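The forward process is simple enough to sketch in a few lines. This is a toy illustration, not production code: the “image” is just a random 8×8 grid, and the linear noise schedule and the closed-form jump to step t follow the standard DDPM-style formulation, where the cumulative product `alpha_bar` lets you corrupt an image to any noise level in one shot instead of looping:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in "image": an 8x8 grid of pixel values in [0, 1].
image = rng.random((8, 8))

def forward_diffuse(x0, t, T=1000):
    """Corrupt x0 to noise level t using the closed-form forward process:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    betas is a simple linear noise schedule; alpha_bar is its cumulative
    product, which is what makes the one-shot jump possible.
    """
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

slightly_noisy = forward_diffuse(image, t=50)   # still mostly image
mostly_static = forward_diffuse(image, t=950)   # almost pure noise
```

At step 50 the result still resembles the original; by step 950 almost none of the original signal survives, which is exactly the “completely destroyed” endpoint described above.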

The reverse process is where the magic happens. The model trains a neural network to look at a noisy image and predict what noise was added at that step. Once it can do that reliably, you can run the whole process in reverse: start with pure random noise, remove a little bit of the predicted noise, and repeat. Each step is a small, tractable problem. Stacked together, they produce something extraordinary.
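The shape of that reverse loop can also be sketched. One big caveat: the noise predictor below is a stand-in that returns zeros, purely so the loop runs; in a real system it would be a large trained neural network (typically a U-Net). The update rule itself is the standard DDPM sampling step:

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # same schedule as the forward pass
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x_t, t):
    """Stand-in for the trained network. A real model would estimate the
    noise present in x_t at step t; zeros just make the loop runnable."""
    return np.zeros_like(x_t)

# DDPM sampling: begin with pure static and walk backward to t = 0.
x = rng.standard_normal((8, 8))
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    # Subtract a scaled version of the predicted noise (the DDPM update).
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    x = (x - coef * eps) / np.sqrt(alphas[t])
    if t > 0:
        # Every step except the last re-injects a little fresh noise,
        # which keeps samples diverse rather than deterministic.
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
```

Each pass through the loop is the “small, tractable problem” described above; the finished image is just the value of `x` after the final step.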

When you type a prompt (“oil painting of a lighthouse on a cliff at sunset, moody atmosphere”), the model uses your words to guide each denoising step, nudging the emerging image toward what you described. The text doesn’t search a database. It steers the sculpting.
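One widely used mechanism for that steering is classifier-free guidance: at each denoising step the network makes two noise predictions, one conditioned on your prompt and one without it, and the sampler exaggerates the difference between them. A minimal sketch of the blend (the arrays here are fake placeholders, and `guidance_scale=7.5` is a common default in Stable Diffusion-style samplers, not a universal constant):

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: push the noise estimate further in the
    direction the prompt suggests. A scale of 1.0 just follows the
    prompt; larger values follow it more aggressively, trading away
    some variety."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy example: two (fake) noise predictions for a single denoising step.
eps_without_prompt = np.zeros((8, 8))
eps_with_prompt = np.ones((8, 8))
blended = guided_noise(eps_without_prompt, eps_with_prompt)
```

This is the sense in which the text steers rather than searches: the prompt never retrieves anything, it only tilts each step of the sculpting.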

Why This Matters for Your Writing Life

If you’ve ever typed a description into Midjourney and watched an image materialize, you’ve used a diffusion model. The technology powers virtually every AI image generator available to authors today.

Book covers are the obvious starting point. Self-published authors use Midjourney, DALL-E, and Stable Diffusion to generate cover concepts, mock up visual ideas before hiring a designer, or produce final cover art for genre fiction. What once required a $500 to $2,000 investment for every concept can now start with a text prompt and a few minutes of iteration.

Character and scene visualization is a natural next step. Authors generate portraits of their characters, concept art for settings, and mood boards for entire series. These images don’t replace the writing, but they can anchor it. Seeing your protagonist’s face, even an approximate version, changes how you write them.

Marketing materials round things out. Social media graphics, newsletter headers, promotional images, advertising visuals. The kind of polished visual content that once required either design skills or a design budget is now accessible through a text description.

Understanding that these tools work by sculpting noise into images, not by searching a database or collaging existing pictures, helps you use them more effectively. When Midjourney produces something close but not quite right, you know what’s happening: the model’s denoising process interpreted your words differently than you intended. The fix isn’t to search harder. It’s to describe more precisely, guiding the step-by-step sculpting process toward what you actually see in your mind.