When the Evaluator Became Part of the Generation System
Why One AI Was No Longer Enough for Historical Reconstruction

By L. M. Hawkes · HawkesAdventures.com
The first six articles in this series followed a progression that began with a simple problem: AI kept getting ancient Rome wrong.
At first, the failures looked like ordinary prompt failures. A Roman legionary carried a katana. A gladiator looked like a modern MMA fighter. A Roman interior was lit with Victorian gas-bracket lanterns. A city street looked too clean, too white, too new, too much like a museum reconstruction and not enough like an inhabited ancient place.
The obvious response was to write better prompts.
That helped. It did not solve the problem.
Over time, the workflow changed. It moved from generation, to audit, to corrective re-prompting, to metadata validation, to a transferable Historical Consistency Ruleset that could be adapted beyond Rome. By the end of Article 6, the pipeline had become period-agnostic: the structure stayed the same, while the ruleset changed for Rome, the Viking Age, feudal Japan, ancient Greece, or any other historical setting.
That completed the first arc.
But once the pipeline could transfer across periods, another question became unavoidable: what happens when the correction process itself outgrows a single AI system?
That is where the next stage begins.
Each step in that first arc made the system more structured. Each step also revealed the same larger truth: historically constrained AI image generation is not a single-model problem.
The breakthrough was not finding one AI system that understood ancient Rome.
The breakthrough was building a system where different AI systems could be used for different parts of the correction loop, with a structured evaluator converting visible historical failures into targeted repair intelligence.
That changed the work completely.
The Original Assumption
The earliest version of the workflow assumed a fairly direct relationship between prompt quality and image quality.
Write a better Midjourney prompt. Generate a better image. If the image fails, revise the prompt and try again.
That is a reasonable starting point. It is also the point where most AI image workflows stop. The user improves the prompt, the model produces another image, and the process continues until the result is acceptable or the user gives up.
For historically constrained work, that approach breaks down quickly.
The reason is not that the prompt is too short or that the model is too weak. The reason is that historical failures are not one thing. They are families of failures, and different failure families require different forms of correction.
A Gothic arch in a Roman street scene is not the same kind of problem as a gladiator with the wrong body composition. A flat, transparent glass window is not the same kind of problem as a crowd with no visible social hierarchy. A pristine marble courtyard is not the same kind of problem as a sword worn on the back.
They are all historical accuracy failures, but they do not repair the same way.
That was the first escalation. A general corrective prompt was no longer enough. The system needed to know what kind of failure had occurred before it could decide what kind of repair instruction would help.
Different Models Fail Differently
The next discovery was more operational: different AI systems fail differently.
That sounds obvious after you have seen it for a while, but it is not how most people think about AI tools. The common question is, “Which model is best?” That is the wrong question for this kind of work.
The better question is: which model is best at this specific correction class?
One system may be stronger at preserving scene continuity during a localized environmental repair. Another may be better at correcting human physique or body proportions without damaging the rest of the image. Another may be better at turning a structured audit into precise corrective language. Another may be better at enforcing a written ruleset and refusing to pass an image when the visible evidence does not support it.
In practice, that means the workflow stops being single-model generation and becomes model orchestration.
Midjourney may still produce the initial visual language. Nano Banana may be better for certain localized environmental repairs where the scene needs to remain intact but one failure needs to be corrected. Seedream may be better for some body-composition or physique corrections where the visible human form is the core issue. A text model may be better at translating audit failures into correction prompts that are specific, structured, and free of vague historical language.
The point is not that one tool is universally superior. The point is that the repair target determines the tool.
That distinction matters. If the question is “Which AI should I use?”, the workflow becomes tool-chasing. If the question is “What failure class am I repairing?”, the workflow becomes a production system.
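As a minimal sketch of "the repair target determines the tool," the routing decision can be written as a lookup from failure class to model. The class labels, the fallback, and the function name below are hypothetical, and the pairings only mirror the tentative examples above; they are not benchmark results.

```python
# Hypothetical failure-class labels mapped to the tools mentioned above.
# The pairings are illustrative assumptions, not measured rankings.
REPAIR_ROUTES = {
    "localized_environment": "Nano Banana",
    "body_composition": "Seedream",
    "audit_to_prompt": "text model",
}

# Assumed fallback: send unmapped failures back to the initial generator.
DEFAULT_ROUTE = "Midjourney"


def choose_repair_tool(failure_class: str) -> str:
    """Return the tool assigned to a failure class, or the fallback if unmapped."""
    return REPAIR_ROUTES.get(failure_class, DEFAULT_ROUTE)


print(choose_repair_tool("body_composition"))
print(choose_repair_tool("unmapped_class"))
```

The useful property is that the question asked of the table is always "what am I repairing?", never "which model is best?"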
The Evaluator Changed Roles
The most important shift happened inside the audit stage.
Originally, the evaluator functioned as a quality gate. It answered a binary production question: does this image pass or fail?
That was useful, but limited. A pass/fail system can protect the catalog from bad images. It cannot, by itself, improve the next image.
The evaluator became more powerful when its output stopped being merely judgmental and became operational.
A useful evaluator does not just say, “This image is historically wrong.” It identifies the visible evidence. It assigns the issue to a category. It distinguishes confirmed failures from uncertain visibility. It describes why the failure matters. It produces a structured handoff that another step in the workflow can use.
At that point, the evaluator is no longer a passive reviewer. It is a production intelligence layer.
That is the core architectural realization of this stage of the project.
The evaluator is not there only to reject images. It is there to explain what kind of repair the image needs.
A failure tagged as architecture_cross_era does not receive the same treatment as a failure tagged social_status_error. A surface_authenticity problem does not receive the same repair language as a population_realism problem. A lighting violation does not need the same model behavior as a body-composition correction.
Once failures are tagged, localized, and described, they stop being subjective complaints. They become machine-readable repair targets.
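One way to picture that handoff is a small record per finding. This is a sketch under assumptions: the field names, the free-text location field, and the rule that only confirmed failures block a pass are illustrative choices, not a description of the production evaluator.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Finding:
    tag: str         # failure category, e.g. "architecture_cross_era"
    evidence: str    # what is visibly wrong in the image
    location: str    # where in the frame (hypothetical free-text field)
    confirmed: bool  # False when visibility is too uncertain to call a failure
    rationale: str   # why the failure matters


def repair_targets(findings: list[Finding]) -> list[dict]:
    """Keep only confirmed failures and return them as plain dicts."""
    return [asdict(f) for f in findings if f.confirmed]


def passes(findings: list[Finding]) -> bool:
    """Assumed gate: an image passes only when no confirmed failure remains."""
    return not any(f.confirmed for f in findings)


findings = [
    Finding("architecture_cross_era", "pointed arch over doorway",
            "background left", True,
            "pointed arches belong to a later building tradition"),
    Finding("surface_authenticity", "possibly polished floor, partly occluded",
            "foreground", False, "visibility too low to confirm"),
]

print(passes(findings))
print(json.dumps(repair_targets(findings), indent=2))
```

The pass/fail answer is still there, but it is now derived from the findings rather than being the only output.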
From Audit Notes to Repair Targets
This changed the meaning of an audit failure.
In a casual workflow, an audit note might say, “The background architecture looks wrong.” That may be true, but it is not enough to drive a correction. Wrong how? Medieval? Renaissance? Too pristine? Too monumental? Too symmetrical? Built from the wrong material? Showing a pointed arch where a round Roman arch should be visible?
The correction depends on the answer.
The structured evaluator forces that answer into the output.
For example:
- architecture_cross_era: a building form belongs to a later period, such as a Gothic pointed arch, Renaissance facade, or medieval fortification element appearing in a Roman setting.
- social_status_error: the clothing, cleanliness, ornamentation, or bearing of a figure contradicts the role the scene assigns to that person.
- surface_authenticity: the materials look too new, too polished, too machined, too clean, or too much like a modern reconstruction rather than an inhabited ancient environment.
- population_realism: the crowd fails to show the social, economic, age, labor, and status variation that would make the environment feel like a functioning ancient city.
Each tag carries different repair implications.
An architecture_cross_era correction may need strong negative language against pointed arches, crenellations, Renaissance facades, or modern monumental references, paired with positive language for brick, concrete, tufa, round Roman arches, and period-appropriate construction.
A social_status_error correction may need role-specific garment quality, cleanliness, posture, and ornamentation. A senator should not look like a street laborer. A slave should not be dressed as though he has servants. A merchant, freedman, gladiator, lanista, soldier, matron, and senator should not collapse into one generic “Roman citizen” aesthetic.
A surface_authenticity correction may need wear, smoke staining, patched plaster, scuffed leather, hand-forged irregularity, cloudy glass if glass is visible at all, and environmental dirt that belongs to actual use rather than theatrical grime.
A population_realism correction may need age variation, occupational diversity, uneven garment quality, background activity, believable density, and the asymmetry of real urban life.
The evaluator makes these distinctions explicit. That is what allows the repair process to become targeted instead of generic.
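The four tags and their repair implications can be sketched as a table that pairs each tag with language to require and language to suppress. The vocabulary comes from the descriptions above; the dictionary layout, the function name, and the "positive terms first, then negations" ordering are assumptions made for illustration.

```python
# Tag -> repair vocabulary, taken from the descriptions above.
# The structure and the output format are illustrative assumptions.
REPAIR_LANGUAGE = {
    "architecture_cross_era": {
        "require": ["brick", "concrete", "tufa", "round Roman arches"],
        "avoid": ["pointed arches", "crenellations", "Renaissance facades"],
    },
    "social_status_error": {
        "require": ["role-specific garment quality", "role-appropriate ornamentation"],
        "avoid": ["generic Roman citizen look"],
    },
    "surface_authenticity": {
        "require": ["smoke staining", "patched plaster", "scuffed leather"],
        "avoid": ["polished surfaces", "machined edges"],
    },
    "population_realism": {
        "require": ["age variation", "occupational diversity", "uneven garment quality"],
        "avoid": ["theatrical crowds"],
    },
}


def build_correction(tag: str) -> str:
    """Compose a correction fragment for one tag; raises KeyError for unknown tags."""
    entry = REPAIR_LANGUAGE[tag]
    positive = ", ".join(entry["require"])
    negative = ", ".join(f"no {term}" for term in entry["avoid"])
    return f"{positive}, {negative}"


print(build_correction("architecture_cross_era"))
```

Letting an unknown tag raise an error is deliberate in this sketch: a failure the table cannot name is a signal that the taxonomy has a gap.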
Repair Prompts Became Families
Once the failure taxonomy became more granular, corrective prompts had to evolve with it.
Early corrective prompting was built around the specific visible failure in a rejected image. That was already a major improvement over generic retries. A prompt that says “no modern lighting” is weak. A prompt that says “Roman oil lamp on simple iron bracket, no glass panels, no pipe runs, no conduit, warm flickering flame only” is stronger because it describes the failure and the replacement in visual terms.
But as the catalog grew, a second pattern emerged. Similar failures recurred across different images, different scenes, and different settings.
That meant the correction language itself could be classified.
There were architecture repair prompts. Lighting repair prompts. Material repair prompts. Gladiator equipment repair prompts. Body-composition repair prompts. Social hierarchy repair prompts. Environmental realism repair prompts. Anti-sterility prompts. Population realism prompts.
Each family developed its own language because each family needed to fight a different AI default.
The architecture family fights cross-era contamination. The material family fights polished, machined, and museum-like surfaces. The body-composition family fights the modern fitness aesthetic. The population family fights the model’s tendency to produce theatrical crowds instead of socially legible communities. The social hierarchy family fights the flattening of Roman society into one visual class.
This is where the process stopped being “make a better prompt” and became something more like prompt engineering as repair logistics.
The system was no longer generating a correction from scratch every time. It was selecting and adapting from accumulated repair intelligence.
The Ruleset Started Learning From Failure
The Roman Historical Consistency Ruleset began as a guardrail: lighting, clothing, footwear, armor, weapons, architecture, symbols, materials, hairstyle, and other period constraints that could be checked against visible evidence.
That ruleset became more valuable when failures started feeding back into it.
Every newly discovered recurring failure exposed either a missing rule, an insufficiently specific rule, or a rule that needed better repair language. The evaluator found the weakness. The failure tag named it. The repair prompt addressed it. If the same failure recurred, the ruleset was tightened.
This is the recursive part of the system.
The evaluator improves the repair prompts. The repair prompts improve the generated images. The new generations expose new edge cases. Those edge cases improve the evaluator and the ruleset it enforces.
The system compounds because failure is not wasted.
That is a major difference between a manual review workflow and a structured correction workflow. In a manual review workflow, a rejected image is usually just a loss. In a structured correction workflow, a rejected image is data. It tells the system what the model still does not understand, what the prompt still fails to suppress, and where the ruleset needs to become more explicit.
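Treating rejections as data can be as simple as counting which tags keep coming back. The sketch below flags a tag for ruleset tightening once it has appeared in a set number of rejected images. The threshold of three and the function name are assumptions; the article does not specify how recurrence is measured.

```python
from collections import Counter

RECURRENCE_THRESHOLD = 3  # assumed cutoff, chosen only for illustration


def rules_to_tighten(rejections: list[list[str]],
                     threshold: int = RECURRENCE_THRESHOLD) -> list[str]:
    """Count each tag once per rejected image and return tags at or above the threshold."""
    counts = Counter(tag for tags in rejections for tag in set(tags))
    return sorted(tag for tag, n in counts.items() if n >= threshold)


# Each inner list holds the failure tags from one rejected image.
rejections = [
    ["architecture_cross_era", "surface_authenticity"],
    ["surface_authenticity"],
    ["surface_authenticity", "population_realism"],
    ["architecture_cross_era"],
]

print(rules_to_tighten(rejections))
```

Here only surface_authenticity crosses the threshold, so that is the rule that would be reviewed for more explicit language.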
The catalog improves because the correction system remembers.
Why This Matters Commercially
This level of structure would be excessive for casual image generation.
If the goal is a single atmospheric illustration for private use, a few retries may be enough. If the image looks good, it is good.
A commercial historical catalog has a different standard.
The images need to be searchable, licensable, reusable, and defensible. They need to survive scrutiny from people who know the subject. They need to support tabletop RPG designers, interactive fiction authors, game developers, educators, publishers, and historical content creators who do not want a plausible-looking image that collapses when someone notices the sword, the arch, the lamp, the glass, the body type, the crowd, or the class signal.
That standard cannot depend on vibes.
It needs a workflow that can identify visible failures, classify them, repair them, preserve the metadata trail, and improve over time.
This is why the evaluator matters commercially. It is not only an accuracy tool. It is a product infrastructure tool.
It protects the catalog from visible historical errors, but it also produces the structured intelligence that makes correction scalable. Without that, the catalog depends on the memory and attention of the person reviewing each image. With it, the workflow can enforce the same historical governance across hundreds of assets.
That is the difference between a folder of impressive AI images and a historically constrained visual archive.
The Broader Shift
Looking back across the series, the progression is clearer than it was while building it.
The first stage was discovery: AI historical failures are patterned.
The second was classification: those failures can be named and grouped.
The third was workflow: generation, audit, and corrective re-prompting can form a pipeline.
The fourth was architecture: prompts need structure, not just more adjectives.
The fifth was infrastructure: passing images need metadata, validation, and commercial organization.
The sixth was transferability: the same structure can be adapted to other historical periods by replacing the period-specific ruleset.
This follow-on stage adds another layer: orchestration.
The system no longer depends on one AI producing a perfect image in one pass. It depends on a coordinated correction loop in which each model has a role, each failure has a category, each category has repair implications, and each repair attempt produces more information for the next cycle.
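The coordinated loop can be sketched with each role passed in as a function, so any model can fill any slot. Everything here is a stand-in: the three stub functions, the "lighting" tag, and the round limit are hypothetical, and a string plays the part of an image.

```python
from typing import Callable


def correction_loop(
    generate: Callable[[str], str],
    evaluate: Callable[[str], list[str]],
    repair_prompt: Callable[[str, list[str]], str],
    prompt: str,
    max_rounds: int = 3,
) -> tuple[str, list[list[str]]]:
    """Generate, audit, and re-prompt until the audit is clean or rounds run out."""
    image = generate(prompt)
    tags = evaluate(image)
    history = [tags]  # one list of failure tags per audit
    rounds = 0
    while tags and rounds < max_rounds:
        prompt = repair_prompt(prompt, tags)
        image = generate(prompt)
        tags = evaluate(image)
        history.append(tags)
        rounds += 1
    return image, history


# Stand-in roles: a string represents the image, and the audit only checks one phrase.
def fake_generate(prompt: str) -> str:
    return f"image<{prompt}>"


def fake_evaluate(image: str) -> list[str]:
    return [] if "no glass panels" in image else ["lighting"]


def fake_repair(prompt: str, tags: list[str]) -> str:
    return prompt + ", Roman oil lamp on simple iron bracket, no glass panels"


image, history = correction_loop(fake_generate, fake_evaluate, fake_repair,
                                 "Roman interior at night")
print(history)
```

The history list is the part worth keeping: it records what each audit found, which is the information the next cycle builds on.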
That is a different kind of AI workflow.
It is less glamorous than the usual story about a single magical prompt. It is also much closer to how reliable production systems are built.
What This Built
The project did not begin as an attempt to build a multi-model historical correction system.
It began with a simpler frustration: ancient Rome kept coming out wrong.
But the failures were too consistent to ignore, and too specific to fix with generic prompting. The katana was only the most obvious symptom. The deeper problem was that generative AI does not naturally possess historical coherence. It possesses statistical associations, many of which are contaminated by later periods, fantasy media, museum reconstructions, tourism imagery, and modern visual expectations.
Historical coherence had to be imposed from outside the model.
That required a ruleset. Then an evaluator. Then structured issue tags. Then corrective prompt families. Then model-specific repair choices. Then a feedback loop where every failure strengthened the system.
The result is not simply a better way to generate Roman images.
It is a historically constrained visual reconstruction system: one that uses specialized AI models, structured historical evaluation, targeted correction prompts, and recursive governance to force coherence onto systems that do not naturally produce it.
That is the important lesson.
The future of serious AI historical illustration is not one perfect model and one perfect prompt. It is a correction architecture: visible evidence, structured failure, targeted repair, model specialization, and accumulated rules.
Build that, and the AI begins to do something more useful than generate images that look historical.
It begins to participate in a system that can be taught what history requires.
L. M. Hawkes writes cinematic, historically grounded interactive gamebooks drawing from the warrior traditions of Rome, Greece, Japan, the Viking Age, and the great battles of antiquity. The Vault of Ages Art Pack Configurator is a curated catalog of historically accurate cinematic illustration, available at HawkesAdventures.com under personal and commercial licenses.
This is a follow-on article to the original six-part series. The first six articles established the failure taxonomy, correction pipeline, prompt architecture, metadata infrastructure, and cross-period transferability of the workflow. This article describes the next evolution: evaluator-driven, multi-model historical correction.
