Content Filter

In 1996, residents of Scunthorpe, England, discovered they couldn’t create AOL accounts. The signup form kept rejecting their hometown. The reason? AOL’s profanity filter had spotted a four-letter word hiding inside “Scunthorpe” and decided the whole thing was obscene. No appeal, no context, no human review. Just a pattern-matching algorithm doing exactly what it was told, with zero understanding of what it was actually looking at.

Nearly thirty years later, AI content filters have gotten vastly more sophisticated. They’re powered by neural networks instead of substring matching. But the fundamental problem hasn’t gone away: these are still systems that scan text without truly understanding it. And if you’ve ever had an AI writing tool refuse to let your villain threaten anyone, or watched a perfectly innocent kiss scene get flagged mid-sentence, you’ve met a modern descendant of the filter that defeated Scunthorpe.

What a Content Filter Actually Does

A content filter is an automated system that scans text (your input, the AI’s output, or both) and decides whether it crosses a line. If it does, the content gets blocked, modified, or flagged before anyone sees it.

If you’ve read about guardrails, you might be wondering how filters are different. Think of it this way: guardrails are the whole safety architecture surrounding an AI model, everything from system prompts to behavioral training to rate limits. Content filters are one specific mechanism within that architecture. If guardrails are a building’s security system, content filters are the metal detectors at the entrance. They don’t set the security policy; they enforce one particular part of it.

Most commercial AI tools run content filters at two checkpoints. Input filters scan your prompt before the model ever sees it. If your request trips a classifier, you get a refusal and the model never processes your words at all. Output filters wait until the model has generated a response, then scan it before showing it to you. This is why you’ll sometimes see an AI start writing a scene and then abruptly stop or pivot to a disclaimer. The model produced the content just fine. The output filter caught it on the way out.
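The two-checkpoint flow can be sketched in a few lines of Python. Everything here is illustrative: the keyword scorer stands in for a real trained classifier, and the threshold is arbitrary, but the control flow matches the input-then-output pattern described above.

```python
def classify(text: str) -> float:
    """Toy stand-in for a real classifier: returns a harm score in [0, 1].
    Real filters use trained models, not keyword lists like this one."""
    flagged_terms = {"bomb-making": 0.9, "threat": 0.6}
    return max(
        (score for term, score in flagged_terms.items() if term in text.lower()),
        default=0.0,
    )

THRESHOLD = 0.5  # illustrative cutoff; vendors tune this per category

def handle_request(prompt: str, model) -> str:
    # Input filter: a blocked prompt never reaches the model at all.
    if classify(prompt) >= THRESHOLD:
        return "[refused before generation]"
    response = model(prompt)
    # Output filter: the model already generated this text;
    # it gets scanned on the way out, before the user sees it.
    if classify(response) >= THRESHOLD:
        return "[generated, then blocked on the way out]"
    return response
```

The second branch is the one writers notice most: the model did its job, but the output filter intercepted the result, which is why a scene can cut off mid-generation.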

From Spam Folders to AI Studios

Content filtering has been around almost as long as the internet itself. AOL’s early email spam filters in the mid-1990s were simple keyword blocklists, and their failures (like poor Scunthorpe) became legendary. By the early 2000s, Bayesian filters used statistical probability to catch spam, a significant leap in sophistication. The Children’s Internet Protection Act of 2000 required libraries and schools receiving certain federal funding to install content filters, making the technology a matter of federal policy.

Social media scaled the problem enormously. Facebook, YouTube, and Instagram spent the 2010s building machine learning classifiers to moderate billions of posts, combining automated detection with armies of human reviewers. These systems got better at catching genuinely harmful content, but context remained their blind spot.

When OpenAI launched the GPT-3 API in June 2020, they included a dedicated content filter endpoint, one of the first filtering systems built specifically for generative AI. It classified text as “safe,” “sensitive,” or “unsafe.” OpenAI admitted they initially “set it to err on the side of caution,” which meant a high rate of false positives. By 2022, that endpoint had evolved into a separate Moderation API, now powered by a GPT-4o-based classifier that scores text across categories like hate, violence, self-harm, and sexual content.

Anthropic took a different path. Their Constitutional Classifiers system trains dedicated input and output classifiers on synthetic data generated from a “constitution,” a set of natural-language rules describing what content is and isn’t permitted, rather than relying on large sets of hand-labeled harmful examples. Their second-generation system achieves a false positive rate of just 0.05%, meaning only about one in two thousand innocent queries gets incorrectly blocked.

How the Scoring Works

Modern content filters don’t just check for banned words. They use classifier models (sometimes as small as 38 million parameters, sometimes as large as GPT-4o) to assign a numerical score across multiple categories. Microsoft’s Azure platform, for example, rates content across four standard categories (hate, sexual, violence, self-harm) at severity levels of safe, low, medium, and high. Developers can then set their own thresholds, choosing to block only “high” severity violence while allowing “low” and “medium” through.
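That developer-side threshold logic is straightforward to express. Below is a minimal sketch assuming Azure-style severity labels; the four category names match Azure’s standard categories, but the policy values, function name, and scores are hypothetical.

```python
# Azure-style severity ladder: each label maps to an ordinal rank.
SEVERITY = {"safe": 0, "low": 1, "medium": 2, "high": 3}

# A hypothetical per-app policy: block at or above each threshold.
# This app tolerates fictional violence up to "medium" but is strict
# about self-harm content.
policy = {"hate": "medium", "sexual": "medium", "violence": "high", "self_harm": "low"}

def is_blocked(scores: dict, policy: dict) -> bool:
    """scores: category -> severity label returned by the classifier.
    Returns True if any category meets or exceeds its policy threshold."""
    return any(
        SEVERITY[scores.get(category, "safe")] >= SEVERITY[threshold]
        for category, threshold in policy.items()
    )
```

Under this policy, a thriller scene scored `{"violence": "medium"}` passes, while the same scene scored `{"violence": "high"}` is blocked; swap in a stricter policy dict and the identical scores produce different outcomes, which is the mechanism behind the point in the next paragraph.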

This is why two different AI writing tools using the same underlying model can behave very differently. The model might be identical. The content filter thresholds are not.

Why This Matters for Your Writing Life

Content filters are the single most common source of friction for fiction writers using AI tools. The core problem is that filters evaluate text in isolation. They can’t distinguish between a character threatening violence in your thriller and an actual threat. A first kiss gets the same scrutiny as explicit content. A character picking a lock in a heist novel can trigger the same classifier that blocks actual lock-picking instructions.

The good news is that different tools set their thresholds in very different places.

ChatGPT applies relatively strict filtering that has tightened over time. Writers have reported that even moderate violence and mature themes standard in published literary fiction can trigger refusals. Claude handles mature themes with more nuance when given clear fictional framing, though it still has firm boundaries around certain categories. Sudowrite offers a pragmatic middle ground: its proprietary Muse model was built with fiction in mind and carries lighter filtering. NovelAI runs its own models with minimal content filtering, making it the go-to for horror, dark fantasy, and romance writers who need creative freedom.

Understanding that content filters are external to the model (not part of its intelligence) gives you a practical advantage. You can often work within them by providing context. Framing your request explicitly as fiction, naming the genre, specifying a character’s role in the story: these give the filter more information to work with. It won’t always help, but “Write a scene from a crime thriller where the detective confronts a suspect” will clear more filters than “Write a threatening confrontation.”

The residents of Scunthorpe eventually got their AOL accounts, after the company added their town to an exception list. AI content filtering is still working toward its own version of that fix: systems sophisticated enough to protect against genuine harm without catching fiction in the crossfire.