The Three-Stage Correction Pipeline

The Exact Workflow I Built to Make AI Stop Getting Ancient Rome Wrong

Ancient Roman blacksmith's workshop with workbench and sparks flying.

By L. M. Hawkes · HawkesAdventures.com

The first two articles in this series documented the problem: AI image generation fails at historical accuracy in nine specific, consistent, and predictable ways. If you haven’t read those pieces, the short version is this – AI doesn’t malfunction when it puts a katana in ancient Rome or lights a Roman interior with Victorian gas-bracket lanterns. It does exactly what it was trained to do. The failures are systematic, not random.

Systematic failures have systematic solutions.

After enough rejected images, a pattern became clear: the failures were documentable, which meant they were correctable – not by prompting harder, but by building a structured feedback loop between generation, audit, and re-generation. Over four months and 1,106 images, that feedback loop became a three-stage pipeline.

This article describes the pipeline. The next article delivers the annotated prompts that power it, with the full production-ready versions available at HawkesAdventures.com.

Why “Prompt Harder” Doesn’t Work

The instinctive response to AI accuracy failures is to add more instructions to the prompt. Be more specific. Use stronger language. Add more negatives.

This helps. It is not sufficient.

The problem is that a longer prompt is not the same as a smarter prompt. Adding “no katanas” to a prompt reduces katana appearances – it does not eliminate them, and it does nothing about the Victorian lanterns you didn’t think to exclude, or the Gothic arch in the background you didn’t notice until the third time you looked at the image, or the gladiator’s sword that is worn correctly on the hip but is clearly a medieval falchion rather than a gladius.

The failures are diverse, interacting, and context-dependent. A static prompt, however detailed, addresses the failures you anticipated. A correction pipeline addresses the failures that actually occurred.

The difference between prompting harder and building a pipeline is the difference between guessing and learning.

Stage 1 – Initial Generation

Every image begins with a structured prompt engineered for four specific qualities: photographic realism, period-accurate atmosphere, Roman specificity, and controlled randomness.

Photographic realism is stated as a requirement, not a suggestion. The instruction “must look like a photograph, not an illustration” suppresses the painterly, fantasy-art aesthetic that AI defaults to when generating historical content and pushes output toward documentary visual language.

Period-accurate atmosphere is established through palette and lighting direction rather than through historical instruction. Specifying a bronze and ochre palette and gritty atmospheric lighting produces imagery that feels Roman without requiring the model to independently reach for Roman visual references. The palette does the period work.

Roman specificity is asserted at the material level – brick and stone construction, hand-forged metal, ceramic and bronze lamp vessels, worn and aged surfaces. Vague period references produce generic ancient imagery. Material-level specificity produces Roman imagery.

Controlled randomness is managed through the chaos parameter. A low chaos setting – --chaos 5 – minimizes the model’s tendency to introduce unexpected elements that frequently manifest as anachronisms. Higher chaos settings produce more visually interesting variation but also more historically problematic output. For accuracy-constrained work, low chaos is the correct tradeoff.

The initial prompt also carries a core set of explicit exclusions – the failure categories documented in Article 2, translated into specific visual negative prompts. No glass-paneled lanterns. No pointed arches. No enclosed visors. No swords on backs. These exclusions travel with every prompt in the catalog, regardless of scene.
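A minimal sketch of how such a prompt could be assembled, assuming the exclusions are written inline as plain "no X" phrases. The `ScenePrompt` class, its field names, and the sample scene text are hypothetical; the quoted phrases and the `--chaos 5` setting come from the description above.

```python
from dataclasses import dataclass

# Core exclusions named in this article; they travel with every prompt.
CORE_EXCLUSIONS = (
    "no glass-paneled lanterns",
    "no pointed arches",
    "no enclosed visors",
    "no swords on backs",
)


@dataclass
class ScenePrompt:
    """Hypothetical container for the four Stage 1 qualities."""
    scene: str
    realism: str = "must look like a photograph, not an illustration"
    atmosphere: str = "bronze and ochre palette, gritty atmospheric lighting"
    materials: str = (
        "brick and stone construction, hand-forged metal, "
        "ceramic and bronze lamp vessels, worn and aged surfaces"
    )
    chaos: int = 5                  # low chaos for accuracy-constrained work
    extra_exclusions: tuple = ()    # scene-specific additions, if any

    def render(self) -> str:
        # Midjourney accepts chaos values from 0 to 100.
        if not 0 <= self.chaos <= 100:
            raise ValueError("chaos must be between 0 and 100")
        parts = [self.scene, self.realism, self.atmosphere, self.materials]
        parts.extend(CORE_EXCLUSIONS)
        parts.extend(self.extra_exclusions)
        return ", ".join(parts) + f" --chaos {self.chaos}"


if __name__ == "__main__":
    print(ScenePrompt("Roman blacksmith's workshop, sparks flying").render())
```

Keeping the core exclusions in one constant means a new scene cannot accidentally ship without them.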

Stage 2 – Structured Audit

Every generated image runs through a structured evaluation before it qualifies for the catalog. This is not a casual review. It is a systematic check against defined criteria, applied consistently across every image regardless of how good it looks at first glance.

The audit operates in layers, each designed to catch a different class of failure.

The Visual Anchoring Pass comes first. Before any historical evaluation happens, the evaluator identifies and lists the primary visual elements actually present in the image – people, clothing and armor, weapons and equipment, architecture, environment, lighting. This step exists to prevent a specific failure mode: evaluating elements that aren’t actually visible. An artifact that can’t be seen can’t be flagged. This pass grounds the entire evaluation in what is demonstrably present.

The Evidence Confirmation Pass follows. Every suspected issue must be confirmed against clearly visible image evidence before it can be flagged as a failure. Suspicion is not sufficient. If a potential problem cannot be confirmed from visible pixels, it is marked uncertain rather than assigned as a violation. This prevents false positives – the evaluator flagging things that look like they might be wrong without confirming that they actually are.
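The two passes can be sketched as a filter over suspected issues. This is an illustration under stated assumptions, not the production audit: `Suspicion`, `Status`, and `confirm` are hypothetical names, and "evidence" is reduced to a free-text note that is either present or empty.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    CONFIRMED = "confirmed"
    UNCERTAIN = "uncertain"


@dataclass(frozen=True)
class Suspicion:
    element: str         # which visual element the issue concerns
    description: str     # what the evaluator suspects is wrong
    evidence: str = ""   # visible evidence; empty if none can be cited


def confirm(suspicions, visible_elements):
    """Apply the anchoring pass, then the evidence confirmation pass.

    Suspicions about elements that were never listed as visible are
    dropped. The rest are confirmed only when evidence is recorded;
    otherwise they are marked uncertain rather than flagged.
    """
    visible = {e.lower() for e in visible_elements}
    results = []
    for s in suspicions:
        if s.element.lower() not in visible:
            continue  # anchoring: what cannot be seen cannot be flagged
        status = Status.CONFIRMED if s.evidence.strip() else Status.UNCERTAIN
        results.append((s, status))
    return results


if __name__ == "__main__":
    visible = ["lighting", "weapons", "architecture"]
    suspicions = [
        Suspicion("lighting", "Victorian lantern", "glass panels on bracket"),
        Suspicion("weapons", "possible falchion"),
        Suspicion("footwear", "modern boots"),
    ]
    for s, status in confirm(suspicions, visible):
        print(s.element, status.value)
```

In the usage example the footwear suspicion disappears entirely because footwear was never anchored as visible, and the weapons suspicion stays uncertain until someone records what they actually saw.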

The historical accuracy evaluation then checks each visible element against the Roman Historical Consistency Ruleset – the canonical set of period constraints covering lighting, clothing, footwear, armor, weapons, architecture, symbols, materials, and hairstyle. Each failure is assigned a severity level and documented with the specific visible evidence that triggered it.

The showcase-worthiness rating closes the audit. Images that pass all historical checks receive a 1–5 star rating for composition, visual impact, and period conviction. This rating is what drives catalog curation – a technically accurate image with weak composition doesn’t serve the catalog any better than an inaccurate one.
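One way to record an audit outcome and decide what happens next is sketched below. The severity names, the `AuditResult` class, and the routing rule (four stars or better enters the catalog, everything else is sent for correction) are assumptions made for illustration; the article specifies only that failures carry a severity and evidence, and that passing images are rated 1–5 stars.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

# Areas covered by the ruleset, as listed in the article.
RULESET_AREAS = (
    "lighting", "clothing", "footwear", "armor", "weapons",
    "architecture", "symbols", "materials", "hairstyle",
)


class Severity(Enum):  # hypothetical level names
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Failure:
    area: str
    severity: Severity
    evidence: str  # the visible evidence that triggered the failure

    def __post_init__(self):
        if self.area not in RULESET_AREAS:
            raise ValueError(f"unknown ruleset area: {self.area}")
        if not self.evidence.strip():
            raise ValueError("a failure must cite visible evidence")


@dataclass
class AuditResult:
    failures: list = field(default_factory=list)
    stars: Optional[int] = None  # rated only when no failures exist

    def route(self) -> str:
        """Return 'catalog' or 'correct' for this image."""
        if self.failures:
            return "correct"
        if self.stars is None or not 1 <= self.stars <= 5:
            raise ValueError("passing images need a 1-5 star rating")
        return "catalog" if self.stars >= 4 else "correct"
```

Requiring evidence in the constructor keeps an undocumented failure from ever entering the record.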

Across 1,106 images, roughly 7% achieved five-star showcase status and approximately 12% achieved four-star status. The rest were rejected or flagged for correction. That yield sounds low. It isn’t – it reflects the standard, not the failure rate. A curated catalog of 100 genuinely exceptional, historically accurate images is a fundamentally different product from a large catalog of adequate ones.
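As a quick check on what those percentages mean in image counts, the arithmetic works out as follows. Both rates are stated only approximately, so the results are estimates rather than exact catalog figures.

```python
TOTAL_IMAGES = 1106
FIVE_STAR_RATE = 0.07   # "roughly 7%"
FOUR_STAR_RATE = 0.12   # "approximately 12%"

five_star = round(TOTAL_IMAGES * FIVE_STAR_RATE)
four_star = round(TOTAL_IMAGES * FOUR_STAR_RATE)
top_tiers = five_star + four_star

print(five_star, four_star, top_tiers)  # prints: 77 133 210
```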

Stage 3 – Corrective Re-Prompting

This is the most powerful stage and the one most people skip.

Rejected images don’t get discarded. Their audit results feed directly into a refined Midjourney prompt that explicitly addresses the identified failures. The correction loop turns audit failures into prompt intelligence. Every rejected image makes the next generation smarter.

The corrective re-prompting process takes three inputs:

The original scene description
The specific failures identified in the audit, with their visible evidence
The core period-accuracy criteria

It outputs a refined Midjourney prompt with targeted negative specifications for exactly what went wrong in that specific image.

The specificity is what makes the difference. Vague exclusions don’t work. “No modern lighting” is not as effective as “Roman oil lamp on simple iron bracket, no glass panels, no pipe runs, no conduit, warm flickering flame only, no Victorian or post-Roman lighting elements.” The model responds to precise visual descriptions of exactly what you don’t want, in the visual language it was trained on.

A prompt that produced Victorian bracket lanterns with glass panels becomes a prompt that produces a ceramic oil lamp casting warm directional light across rough stone. The audit identified the failure. The correction addressed it precisely. The re-generation incorporates that precision.
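The three inputs and the refined output can be sketched as a single function. This is a simplified illustration, not the production template: `ConfirmedFailure`, `corrective_prompt`, and the sample scene text are hypothetical, and the phrases in the usage example are taken from the lighting example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfirmedFailure:
    evidence: str       # what was visibly wrong; kept for the record only
    replacement: str    # what should appear instead
    exclusions: tuple   # precise visual negatives for this failure


def corrective_prompt(scene, failures, core_criteria, chaos=5):
    """Build a refined prompt from the scene, audit failures, and core criteria."""
    phrases = [scene]
    for failure in failures:
        phrases.append(failure.replacement)
        phrases.extend(failure.exclusions)
    phrases.extend(core_criteria)
    # Drop repeated phrases while keeping first-seen order.
    unique = list(dict.fromkeys(phrases))
    return ", ".join(unique) + f" --chaos {chaos}"


if __name__ == "__main__":
    lantern = ConfirmedFailure(
        evidence="Victorian bracket lanterns with glass panels",
        replacement="Roman oil lamp on simple iron bracket",
        exclusions=(
            "no glass panels",
            "no pipe runs",
            "no conduit",
            "warm flickering flame only",
            "no Victorian or post-Roman lighting elements",
        ),
    )
    core = ("no pointed arches", "no enclosed visors", "no swords on backs")
    print(corrective_prompt("Roman workshop interior", [lantern], core))
```

The evidence string never reaches the model. It stays attached to the failure so the reason for each exclusion remains traceable.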

Borderline images are worth running through this stage too. A borderline image – one that passes historical checks but rates two or three stars for composition or atmosphere – frequently produces a four- or five-star version when given a targeted corrective re-prompt. The instinct to discard borderlines is wrong. Mine them instead.

What the Pipeline Builds

The three-stage pipeline is not just a quality control mechanism. It is a learning system.

Each cycle through the pipeline adds to the accumulated prompt intelligence for a given scene type, period, and accuracy constraint set. By the time you have processed several hundred images, your prompts carry the documented failure history of everything that went wrong before them. The failure rate decreases. The yield improves. The catalog that emerges is not just larger – it is structurally better than anything a static prompting approach could produce.
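The accumulation can be pictured as a small failure history keyed by scene type. This is a sketch under assumptions: the `FailureHistory` class and the `min_count` threshold for promoting an exclusion to a standing one are hypothetical and are not described in the article.

```python
from collections import Counter, defaultdict


class FailureHistory:
    """Counts how often each corrective exclusion was needed per scene type."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def record(self, scene_type, exclusions):
        """Log the exclusions that one rejected image required."""
        self._counts[scene_type].update(exclusions)

    def standing_exclusions(self, scene_type, min_count=2):
        """Exclusions seen at least min_count times, most frequent first."""
        counts = self._counts.get(scene_type, Counter())
        return [p for p, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    history = FailureHistory()
    history.record("interior", ["no glass panels", "no conduit"])
    history.record("interior", ["no glass panels"])
    history.record("arena", ["no enclosed visors"])
    print(history.standing_exclusions("interior"))
```

In the usage example only "no glass panels" is promoted for interior scenes, because it is the only exclusion that was needed more than once.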

The output of this pipeline is the Vault of Ages – a curated catalog of cinematic historical illustrations for ancient Roman gladiator culture, built to a historical accuracy standard that is documentable, defensible, and distinctive. Showcase and four-star tier images are available at HawkesAdventures.com under personal and commercial licenses, including derivative use for tabletop RPG supplements, game modules, interactive fiction, and digital and print publications.

The Prompts

The next article in this series delivers the annotated prompt architecture for both the audit stage and the corrective re-prompting stage – enough to implement the pipeline for your own catalog, adapted to your own historical period.

The full production-ready versions – the complete Roman Historical Consistency Ruleset, the structured issue handoff format, the canonical failure tag vocabulary, and the corrective re-prompting template – are available as a free resource at HawkesAdventures.com.

The methodology is period-agnostic. Swap ancient Rome for Viking Age Scandinavia, feudal Japan, or ancient Greece, adjust the specific accuracy constraints for the period, and the pipeline runs the same way. Future articles in this series will cover what AI gets systematically wrong about those periods specifically – and how the same three-stage approach addresses it.

L. M. Hawkes writes cinematic, historically grounded interactive gamebooks drawing from the warrior traditions of Rome, Greece, Japan, the Viking Age, and the great battles of antiquity. The Vault of Ages Art Pack Configurator – a curated catalog of historically accurate cinematic illustration – is available at HawkesAdventures.com under personal and commercial licenses.

This is Part 3 of a 6-part series.

Previously, Part 2: The White Marble Lie

Coming next week, Part 4: The Full Prompts – the annotated audit and corrective re-prompting architecture, with complete production-ready documents at HawkesAdventures.com.

Tags: Artificial Intelligence · Midjourney · History · Workflow · Prompt Engineering · Game Design · Ancient Rome · Historical Fiction
