Hawkes Adventures

When Copyright Refusal Forced a Stronger Business Architecture

LMHawkes — Sat, 20 Jun 2026 14:13:00 +0000

When I first learned that copyright protection for AI-generated images was going to be more complicated than I expected, my immediate reaction was concern.

Like many people building with generative AI, I had unconsciously adopted a simple assumption:

The image was the product.

If the image was the product, then protecting the image seemed essential.

Without strong protection, what stopped competitors from generating similar images, copying successful ideas, or recreating years of work with a few prompts?

At first glance, it looked like a serious problem.

Looking back, it may have been one of the most useful strategic shocks my business ever received.

Once I started examining where the effort was actually going, a different picture emerged.

The images were visible. The infrastructure was consuming most of the time. Evaluation, classification, organization, correction, and system design were gradually absorbing more effort than image generation itself.

That forced a new question.

If most of the work was happening outside the image, where was the real value being created?

The Wrong Question

The question I was asking was:

“How do I protect the image?”

What I should have been asking was:

“What actually creates value here?”

Those are not the same question.

As I examined the problem more closely, I realized something uncomfortable.

The individual images were not where most of the effort was going.

The effort was going into everything surrounding the images.

The image was simply the visible output.

The Evaluator

Generating a Roman scene is relatively easy.

Generating a historically plausible Roman scene is much harder.

Over time I found myself spending more effort evaluating images than creating them.

I built increasingly strict review systems to identify:

anachronisms
incorrect equipment
architectural mistakes
implausible lighting
inconsistent social signals
material culture errors

The value was not merely producing images.

The value was learning how to recognize when images were wrong.

Every failure produced information that improved future work.

That realization changed how I viewed the project.

The Metadata

The next surprise came from organization.

A folder full of images has limited value.

A structured archive becomes more valuable with every addition.

As the collection grew, I found myself investing heavily in:

scene classification
historical categorization
commercial-use tagging
environmental classification
visual subject indexing
educational organization

Increasingly, the metadata became harder to build than the images themselves.

The archive was no longer a collection of files.

It was becoming a system for discovering, understanding, and reusing historical visual content.

The Ontology

Eventually another pattern emerged.

To evaluate and classify images consistently, I needed a model of the world being depicted.

Questions that seemed simple suddenly became surprisingly complex.

What makes an environment feel Roman?

How should status be visible?

What objects belong together?

How does architecture influence movement, lighting, and behavior?

What visual cues communicate military authority versus civilian life?

I gradually found myself constructing a world model rather than merely generating artwork.

The project was becoming less about images and more about relationships between people, objects, environments, and historical context.

The Archive

The archive itself became a strategic asset.

Not because it contained images.

Because it contained reviewed images.

Classified images.

Corrected images.

Organized images.

Searchable images.

Images connected to a growing body of knowledge.

The archive was accumulating structure.

And structure compounds.

The Infrastructure

At some point I realized that a competitor could theoretically recreate an image.

Recreating the surrounding infrastructure would be far more difficult.

The evaluator.

The metadata system.

The classification rules.

The archive organization.

The quality-control processes.

The educational framework.

The production workflows.

Those systems had taken considerable time and effort to develop.

More importantly, they reinforced one another.

Each improvement strengthened several other parts of the business simultaneously.

The moat was not a single asset.

The moat was the architecture.

The Intelligence Layer

The most important realization came last.

The business was no longer just producing images.

It was accumulating knowledge.

Every review improved future reviews.

Every correction improved future prompts.

Every classification improved future organization.

Every failure became a lesson.

The system was learning.

Not automatically.

Not magically.

But through deliberate iteration.

That accumulated intelligence became more valuable than any individual image could ever be.

What Copyright Refusal Taught Me

Initially, copyright uncertainty felt like a threat because I believed the image was the product.

It forced me to ask a deeper question.

What if the image is only one output of a much larger system?

Once I began examining the business through that lens, the answer became obvious.

The strongest assets were never the images themselves.

They were the evaluator.

The metadata.

The ontology.

The archive.

The accumulated intelligence.

The infrastructure that connected them all together.

What appeared to be a setback ended up exposing a more durable foundation.

The experience didn’t strengthen the business despite the challenge.

It strengthened the business because it forced me to discover where the real value had been accumulating all along.

L. M. Hawkes writes cinematic, historically grounded interactive gamebooks drawing from the warrior traditions of Rome, Greece, Japan, the Viking Age, and the great battles of antiquity. The Vault of Ages Art Pack Configurator – a curated catalog of historically accurate cinematic illustration – is available at HawkesAdventures.com under personal and commercial licenses.

This article continues the exploration of evaluator-driven historical image correction and the systems developed through the Vault of Ages and Spot the Anachronism projects. Both initiatives emerged from the same effort to transform historical accuracy from an artistic preference into a repeatable process.

Tags: AI Systems · Publishing Infrastructure · Evaluation · Metadata · Historical Reconstruction


When the Evaluator Became Part of the Generation System

LMHawkes — Sat, 13 Jun 2026 14:28:00 +0000

Why One AI Was No Longer Enough for Historical Reconstruction

By L. M. Hawkes · HawkesAdventures.com

The first six articles in this series followed a progression that began with a simple problem: AI kept getting ancient Rome wrong.

At first, the failures looked like ordinary prompt failures. A Roman legionary carried a katana. A gladiator looked like a modern MMA fighter. A Roman interior was lit with Victorian gas-bracket lanterns. A city street looked too clean, too white, too new, too much like a museum reconstruction and not enough like an inhabited ancient place.

The obvious response was to write better prompts.

That helped. It did not solve the problem.

Over time, the workflow changed. It moved from generation, to audit, to corrective re-prompting, to metadata validation, to a transferable Historical Consistency Ruleset that could be adapted beyond Rome. By the end of Article 6, the pipeline had become period-agnostic: the structure stayed the same, while the ruleset changed for Rome, the Viking Age, feudal Japan, ancient Greece, or any other historical setting.

That completed the first arc.

But once the pipeline could transfer across periods, another question became unavoidable: what happens when the correction process itself outgrows a single AI system?

That is where the next stage begins.

Each step in that first arc made the system more structured. Each step also revealed the same larger truth: historically constrained AI image generation is not a single-model problem.

The breakthrough was not finding one AI system that understood ancient Rome.

The breakthrough was building a system where different AI systems could be used for different parts of the correction loop, with a structured evaluator converting visible historical failures into targeted repair intelligence.

That changed the work completely.

The Original Assumption

The earliest version of the workflow assumed a fairly direct relationship between prompt quality and image quality.

Write a better Midjourney prompt. Generate a better image. If the image fails, revise the prompt and try again.

That is a reasonable starting point. It is also the point where most AI image workflows stop. The user improves the prompt, the model produces another image, and the process continues until the result is acceptable or the user gives up.

For historically constrained work, that approach breaks down quickly.

The reason is not that the prompt is too short or that the model is too weak. The reason is that historical failures are not one thing. They are families of failures, and different failure families require different forms of correction.

A Gothic arch in a Roman street scene is not the same kind of problem as a gladiator with the wrong body composition. A flat, transparent glass window is not the same kind of problem as a crowd with no visible social hierarchy. A pristine marble courtyard is not the same kind of problem as a sword worn on the back.

They are all historical accuracy failures, but they do not repair the same way.

That was the first escalation. A general corrective prompt was no longer enough. The system needed to know what kind of failure had occurred before it could decide what kind of repair instruction would help.

Different Models Fail Differently

The next discovery was more operational: different AI systems fail differently.

That sounds obvious after you have seen it for a while, but it is not how most people think about AI tools. The common question is, “Which model is best?” That is the wrong question for this kind of work.

The better question is: which model is best at this specific correction class?

One system may be stronger at preserving scene continuity during a localized environmental repair. Another may be better at correcting human physique or body proportions without damaging the rest of the image. Another may be better at turning a structured audit into precise corrective language. Another may be better at enforcing a written ruleset and refusing to pass an image when the visible evidence does not support it.

In practice, that means the workflow stops being single-model generation and becomes model orchestration.

Midjourney may still produce the initial visual language. Nano Banana may be better for certain localized environmental repairs where the scene needs to remain intact but one failure needs to be corrected. Seedream may be better for some body-composition or physique corrections where the visible human form is the core issue. A text model may be better at translating audit failures into correction prompts that are specific, structured, and free of vague historical language.

The point is not that one tool is universally superior. The point is that the repair target determines the tool.

That distinction matters. If the question is “Which AI should I use?”, the workflow becomes tool-chasing. If the question is “What failure class am I repairing?”, the workflow becomes a production system.
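A minimal sketch of "the repair target determines the tool", assuming a simple lookup table. The class names, the tool labels, and the `manual_review` fallback are hypothetical illustrations, not the actual configuration behind the workflow.

```python
from enum import Enum


class FailureClass(Enum):
    """Illustrative correction classes; the real taxonomy is more granular."""
    ENVIRONMENT_LOCAL = "environment_local"
    BODY_COMPOSITION = "body_composition"
    AUDIT_TO_PROMPT = "audit_to_prompt"
    UNCLASSIFIED = "unclassified"


# Hypothetical routing table: which tool is tried first for which class.
ROUTES = {
    FailureClass.ENVIRONMENT_LOCAL: "nano_banana",
    FailureClass.BODY_COMPOSITION: "seedream",
    FailureClass.AUDIT_TO_PROMPT: "text_model",
}


def choose_tool(failure_class: FailureClass) -> str:
    """Return the tool routed to a failure class, or fall back to human review."""
    return ROUTES.get(failure_class, "manual_review")


print(choose_tool(FailureClass.BODY_COMPOSITION))
```

The lookup is keyed by failure class rather than by model, so adding a new tool means editing one table entry instead of rewriting the workflow.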

The Evaluator Changed Roles

The most important shift happened inside the audit stage.

Originally, the evaluator functioned as a quality gate. It answered a binary production question: does this image pass or fail?

That was useful, but limited. A pass/fail system can protect the catalog from bad images. It cannot, by itself, improve the next image.

The evaluator became more powerful when its output stopped being merely judgmental and became operational.

A useful evaluator does not just say, “This image is historically wrong.” It identifies the visible evidence. It assigns the issue to a category. It distinguishes confirmed failures from uncertain visibility. It describes why the failure matters. It produces a structured handoff that another step in the workflow can use.

At that point, the evaluator is no longer a passive reviewer. It is a production intelligence layer.

That is the core architectural realization of this stage of the project.

The evaluator is not there only to reject images. It is there to explain what kind of repair the image needs.

A failure tagged as architecture_cross_era does not receive the same treatment as a failure tagged social_status_error. A surface_authenticity problem does not receive the same repair language as a population_realism problem. A lighting violation does not need the same model behavior as a body-composition correction.

Once failures are tagged, localized, and described, they stop being subjective complaints. They become machine-readable repair targets.
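One way to picture a machine-readable repair target is as a small record. This is a sketch under assumptions: the field names, the `confirmed`/`uncertain` status values, and the JSON handoff are hypothetical choices made for illustration, not the evaluator's real output format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditFinding:
    tag: str        # failure category, e.g. "architecture_cross_era"
    evidence: str   # what is visibly wrong in the image
    region: str     # where in the image the failure appears
    status: str     # "confirmed" or "uncertain" (assumed vocabulary)
    rationale: str  # why the failure matters historically


def repair_targets(findings):
    """Keep confirmed failures only; uncertain visibility is left for re-review."""
    return [asdict(f) for f in findings if f.status == "confirmed"]


findings = [
    AuditFinding("architecture_cross_era", "pointed arch over doorway",
                 "background left", "confirmed",
                 "Gothic form belongs to a later period"),
    AuditFinding("surface_authenticity", "possibly polished marble, partly hidden",
                 "foreground", "uncertain",
                 "may read as a modern reconstruction"),
]

# Structured handoff that a later repair step could consume.
handoff = json.dumps(repair_targets(findings), indent=2)
print(handoff)
```

Separating confirmed failures from uncertain ones keeps the repair step from "fixing" something the evaluator could not actually see.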

From Audit Notes to Repair Targets

This changed the meaning of an audit failure.

In a casual workflow, an audit note might say, “The background architecture looks wrong.” That may be true, but it is not enough to drive a correction. Wrong how? Medieval? Renaissance? Too pristine? Too monumental? Too symmetrical? Built from the wrong material? Showing a pointed arch where a round Roman arch should be visible?

The correction depends on the answer.

The structured evaluator forces that answer into the output.

For example:

architecture_cross_era: a building form belongs to a later period, such as a Gothic pointed arch, Renaissance facade, or medieval fortification element appearing in a Roman setting.
social_status_error: the clothing, cleanliness, ornamentation, or bearing of a figure contradicts the role the scene assigns to that person.
surface_authenticity: the materials look too new, too polished, too machined, too clean, or too much like a modern reconstruction rather than an inhabited ancient environment.
population_realism: the crowd fails to show the social, economic, age, labor, and status variation that would make the environment feel like a functioning ancient city.

Each tag carries different repair implications.

An architecture_cross_era correction may need strong negative language against pointed arches, crenellations, Renaissance facades, or modern monumental references, paired with positive language for brick, concrete, tufa, round Roman arches, and period-appropriate construction.

A social_status_error correction may need role-specific garment quality, cleanliness, posture, and ornamentation. A senator should not look like a street laborer. A slave should not be dressed as though he has servants. A merchant, freedman, gladiator, lanista, soldier, matron, and senator should not collapse into one generic “Roman citizen” aesthetic.

A surface_authenticity correction may need wear, smoke staining, patched plaster, scuffed leather, hand-forged irregularity, cloudy glass if glass is visible at all, and environmental dirt that belongs to actual use rather than theatrical grime.

A population_realism correction may need age variation, occupational diversity, uneven garment quality, background activity, believable density, and the asymmetry of real urban life.

The evaluator makes these distinctions explicit. That is what allows the repair process to become targeted instead of generic.
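The tag-specific repair implications above can be sketched as data. The dictionary layout and the `build_repair_prompt` helper are assumptions for illustration; the terms themselves are drawn from the examples in this article, and the "avoid" entries for surface_authenticity are my own paraphrase of that tag's definition.

```python
# Hypothetical repair vocabulary per failure tag.
REPAIR_LANGUAGE = {
    "architecture_cross_era": {
        "include": ["brick", "concrete", "tufa", "round Roman arches"],
        "avoid": ["pointed arches", "crenellations", "Renaissance facades"],
    },
    "surface_authenticity": {
        "include": ["smoke staining", "patched plaster", "scuffed leather"],
        "avoid": ["polished surfaces", "machined finish"],
    },
}


def build_repair_prompt(tag: str) -> str:
    """Join positive terms with explicit 'no ...' exclusions for one failure tag.

    Raises KeyError for a tag with no repair language yet, so that gaps in
    the vocabulary surface immediately instead of producing a generic prompt.
    """
    entry = REPAIR_LANGUAGE[tag]
    include = ", ".join(entry["include"])
    avoid = ", ".join(f"no {term}" for term in entry["avoid"])
    return f"{include}, {avoid}"


print(build_repair_prompt("architecture_cross_era"))
```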

Repair Prompts Became Families

Once the failure taxonomy became more granular, corrective prompts had to evolve with it.

Early corrective prompting was built around the specific visible failure in a rejected image. That was already a major improvement over generic retries. A prompt that says “no modern lighting” is weak. A prompt that says “Roman oil lamp on simple iron bracket, no glass panels, no pipe runs, no conduit, warm flickering flame only” is stronger because it describes the failure and the replacement in visual terms.

But as the catalog grew, a second pattern emerged. Similar failures recurred across different images, different scenes, and different settings.

That meant the correction language itself could be classified.

There were architecture repair prompts. Lighting repair prompts. Material repair prompts. Gladiator equipment repair prompts. Body-composition repair prompts. Social hierarchy repair prompts. Environmental realism repair prompts. Anti-sterility prompts. Population realism prompts.

Each family developed its own language because each family needed to fight a different AI default.

The architecture family fights cross-era contamination. The material family fights polished, machined, and museum-like surfaces. The body-composition family fights the modern fitness aesthetic. The population family fights the model’s tendency to produce theatrical crowds instead of socially legible communities. The social hierarchy family fights the flattening of Roman society into one visual class.

This is where the process stopped being “make a better prompt” and became something more like prompt engineering as repair logistics.

The system was no longer generating a correction from scratch every time. It was selecting and adapting from accumulated repair intelligence.

The Ruleset Started Learning From Failure

The Roman Historical Consistency Ruleset began as a guardrail: lighting, clothing, footwear, armor, weapons, architecture, symbols, materials, hairstyle, and other period constraints that could be checked against visible evidence.

That ruleset became more valuable when failures started feeding back into it.

Every newly discovered recurring failure exposed either a missing rule, an insufficiently specific rule, or a rule that needed better repair language. The evaluator found the weakness. The failure tag named it. The repair prompt addressed it. If the same failure recurred, the ruleset was tightened.

This is the recursive part of the system.

The evaluator improves the repair prompts. The repair prompts improve the generated images. The new generations expose new edge cases. Those edge cases improve the evaluator.

The system compounds because failure is not wasted.

That is a major difference between a manual review workflow and a structured correction workflow. In a manual review workflow, a rejected image is usually just a loss. In a structured correction workflow, a rejected image is data. It tells the system what the model still does not understand, what the prompt still fails to suppress, and where the ruleset needs to become more explicit.
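A rough sketch of how rejected images become data: count how often each failure tag recurs and flag the ones that cross a threshold. The function name, the input shape (one list of tags per rejected image), and the threshold of three are assumptions, not the rule actually used to tighten the ruleset.

```python
from collections import Counter


def rules_to_tighten(rejections, threshold=3):
    """Return failure tags seen in at least `threshold` rejected images.

    Each tag is counted once per image, so one image with many instances
    of the same failure does not dominate the tally. Results are ordered
    from most to least frequent.
    """
    counts = Counter(tag for tags in rejections for tag in set(tags))
    return [tag for tag, n in counts.most_common() if n >= threshold]


rejections = [
    ["lighting", "architecture_cross_era"],
    ["lighting"],
    ["lighting", "lighting"],
    ["architecture_cross_era"],
]
print(rules_to_tighten(rejections))
```

In this example only the lighting tag reaches three rejected images, so it is the one rule flagged for more explicit language.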

The catalog improves because the correction system remembers.

Why This Matters Commercially

This level of structure would be excessive for casual image generation.

If the goal is a single atmospheric illustration for private use, a few retries may be enough. If the image looks good, it is good.

A commercial historical catalog has a different standard.

The images need to be searchable, licensable, reusable, and defensible. They need to survive scrutiny from people who know the subject. They need to support tabletop RPG designers, interactive fiction authors, game developers, educators, publishers, and historical content creators who do not want a plausible-looking image that collapses when someone notices the sword, the arch, the lamp, the glass, the body type, the crowd, or the class signal.

That standard cannot depend on vibes.

It needs a workflow that can identify visible failures, classify them, repair them, preserve the metadata trail, and improve over time.

This is why the evaluator matters commercially. It is not only an accuracy tool. It is a product infrastructure tool.

It protects the catalog from visible historical errors, but it also produces the structured intelligence that makes correction scalable. Without that, the catalog depends on the memory and attention of the person reviewing each image. With it, the workflow can enforce the same historical governance across hundreds of assets.

That is the difference between a folder of impressive AI images and a historically constrained visual archive.

The Broader Shift

Looking back across the series, the progression is clearer than it was while building it.

The first stage was discovery: AI historical failures are patterned.

The second was classification: those failures can be named and grouped.

The third was workflow: generation, audit, and corrective re-prompting can form a pipeline.

The fourth was architecture: prompts need structure, not just more adjectives.

The fifth was infrastructure: passing images need metadata, validation, and commercial organization.

The sixth was transferability: the same structure can be adapted to other historical periods by replacing the period-specific ruleset.

This follow-on stage adds another layer: orchestration.

The system no longer depends on one AI producing a perfect image in one pass. It depends on a coordinated correction loop in which each model has a role, each failure has a category, each category has repair implications, and each repair attempt produces more information for the next cycle.
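The coordinated loop can be sketched in a few lines. Everything here is hypothetical scaffolding: `generate`, `audit`, and `repair` stand in for whatever models fill those roles, and the toy stand-ins at the bottom only simulate an image as a set of outstanding failure tags.

```python
def correction_loop(generate, audit, repair, prompt, max_repairs=3):
    """Generate once, then alternate audit and repair until the audit passes.

    Returns (image, history). `image` is None if the audit still fails
    after `max_repairs` repair attempts; `history` records every audit
    result so each cycle leaves information behind for the next one.
    """
    image = generate(prompt)
    history = []
    for attempt in range(max_repairs + 1):
        failures = audit(image)
        history.append(failures)
        if not failures:
            return image, history
        if attempt < max_repairs:
            image = repair(image, failures)
    return None, history


# Toy stand-ins: an "image" is just the set of failures it still contains.
def fake_generate(prompt):
    return {"lighting", "architecture_cross_era"}

def fake_audit(image):
    return sorted(image)

def fake_repair(image, failures):
    return image - {failures[0]}  # fix one failure per pass


image, history = correction_loop(fake_generate, fake_audit, fake_repair, "Roman street")
print(history)
```

The repair budget matters: without a cap, an image that a given model cannot fix would cycle forever instead of being rejected.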

That is a different kind of AI workflow.

It is less glamorous than the usual story about a single magical prompt. It is also much closer to how reliable production systems are built.

What This Built

The project did not begin as an attempt to build a multi-model historical correction system.

It began with a simpler frustration: ancient Rome kept coming out wrong.

But the failures were too consistent to ignore, and too specific to fix with generic prompting. The katana was only the most obvious symptom. The deeper problem was that generative AI does not naturally possess historical coherence. It possesses statistical associations, many of which are contaminated by later periods, fantasy media, museum reconstructions, tourism imagery, and modern visual expectations.

Historical coherence had to be imposed from outside the model.

That required a ruleset. Then an evaluator. Then structured issue tags. Then corrective prompt families. Then model-specific repair choices. Then a feedback loop where every failure strengthened the system.

The result is not simply a better way to generate Roman images.

It is a historically constrained visual reconstruction system: one that uses specialized AI models, structured historical evaluation, targeted correction prompts, and recursive governance to force coherence onto systems that do not naturally produce it.

That is the important lesson.

The future of serious AI historical illustration is not one perfect model and one perfect prompt. It is a correction architecture: visible evidence, structured failure, targeted repair, model specialization, and accumulated rules.

Build that, and the AI begins to do something more useful than generate images that look historical.

It begins to participate in a system that can be taught what history requires.

L. M. Hawkes writes cinematic, historically grounded interactive gamebooks drawing from the warrior traditions of Rome, Greece, Japan, the Viking Age, and the great battles of antiquity. The Vault of Ages Art Pack Configurator is a curated catalog of historically accurate cinematic illustration, available at HawkesAdventures.com under personal and commercial licenses.

This is a follow-on article to the original six-part series. The first six articles established the failure taxonomy, correction pipeline, prompt architecture, metadata infrastructure, and cross-period transferability of the workflow. This article describes the next evolution: evaluator-driven, multi-model historical correction.

Tags: Artificial Intelligence · History · Midjourney · Prompt Engineering · Workflow · Ancient Rome · Historical Fiction · Game Design · AI Art · Historical Reconstruction


The Pipeline Works for Any Historical Period

LMHawkes — Sat, 06 Jun 2026 14:05:00 +0000

What AI Gets Wrong About Vikings, Feudal Japan, and Ancient Greece – And How to Fix It

By L. M. Hawkes · HawkesAdventures.com

This is the sixth and final article in a series that began with a Roman legionary carrying a katana and ended with a production-grade metadata infrastructure supporting a commercially licensed historical image catalog. The journey between those two points covered nine failure categories, a three-stage correction pipeline, a layered prompt architecture, and a database system built to make accuracy-constrained AI illustration viable at scale.

The question this article addresses is the natural one for anyone who has followed along: does any of this transfer?

The answer is yes – almost entirely. The pipeline is period-agnostic. The prompt architecture is period-agnostic. The metadata system is period-agnostic. What changes when you move from ancient Rome to Viking Age Scandinavia, feudal Japan, or ancient Greece is the specific content of the accuracy constraints – the ruleset that defines what belongs in the period and what doesn’t. The structure that enforces those constraints stays the same.
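A minimal sketch of what "the structure stays the same, the ruleset changes" can look like in practice. The `Ruleset` class, the `audit` function, and the element labels are hypothetical, and the sketch assumes some earlier step has already listed the elements visible in an image.

```python
from dataclasses import dataclass, field


@dataclass
class Ruleset:
    period: str
    # Maps a visible element to the reason it does not belong in the period.
    forbidden: dict = field(default_factory=dict)


ROME = Ruleset("Ancient Rome", {
    "katana": "Japanese sword with no place in a Roman setting",
    "pointed arch": "Gothic form from a later period",
})

VIKING_AGE = Ruleset("Viking Age", {
    "horned helmet": "19th-century Romantic invention",
    "plate armor": "postdates the period",
})


def audit(observed_elements, ruleset):
    """Return (element, reason) for every observed element the ruleset forbids."""
    return [(e, ruleset.forbidden[e]) for e in observed_elements
            if e in ruleset.forbidden]


# The same audit function runs unchanged against either period.
print(audit(["horned helmet", "longhouse"], VIKING_AGE))
print(audit(["katana", "oil lamp"], ROME))
```

Only the data passed in differs between periods, which is the sense in which the pipeline is period-agnostic.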

But before the transfer mechanics, the failures. Because every historical period has its own version of the katana problem – its own set of systematic AI defaults that are predictable, documentable, and correctable once you know what to look for.

What AI Gets Wrong About the Viking Age

The Viking Age runs roughly from the late 8th century through the mid-11th century. AI’s failures in this period follow the same underlying logic as its Roman failures: the training data contains far more post-Viking imagery than Viking imagery, and the model reaches for the richest available pool.

The horned helmet. This is the Viking equivalent of the katana problem – the single most jarring and most persistent anachronism in AI-generated Viking content. Horned helmets on Viking warriors are an invention of 19th-century Romanticism, codified in theatrical costume and nationalist imagery and distributed so widely that they now dominate the popular visual imagination of the period. Actual Viking helmets were simple iron or leather caps, sometimes with a nasal guard, occasionally with spectacle-style eye protection. The one near-complete Viking Age helmet recovered archaeologically – the Gjermundbu helmet – has no horns.

The horned helmet is so embedded in AI training data that suppressing it requires explicit, persistent negative prompting. Even then, it reappears. It is the single most important thing to exclude when generating Viking Age content.

Medieval armor contamination. The same Medieval bleed that puts Gothic plate armor on Roman soldiers puts it on Viking warriors. Viking Age combatants wore mail – brynja – when they wore metal armor at all. Plate armor of any kind is an anachronism for this period. Chain mail is correct; the articulated plate gauntlets, enclosed visors, and Gothic pauldrons that AI defaults to are not.

The longship problem. AI-generated Viking longships are frequently too large, too symmetrical, and too elaborately decorated. Real Viking longships were lean, low, and functional – built for speed and shallow-water navigation. AI produces something closer to a fantasy galleon with Viking aesthetic detailing than a historically plausible drakkar.

Anachronistic settlement architecture. Viking Age settlements were timber construction – longhouses with turf or thatch roofs, built to regional Scandinavian patterns. AI frequently introduces stone fortification elements, Medieval tower construction, and European castle architecture that postdates the Viking Age by centuries. The correct Viking Age settlement looks nothing like a Medieval castle.

The berserker default. AI defaults to the berserker archetype – wild-eyed, barely clothed, chaotically armed – for Viking warriors generally. This conflates a specific and rare ritual warrior tradition with the entire culture. Most Viking Age fighters were organized, disciplined, and equipped in ways that reflect genuine military culture. The berserker aesthetic is the Viking equivalent of the gladiator spectacle default: it emphasizes the most dramatic and least representative version of the subject.

What AI Gets Wrong About Feudal Japan

Feudal Japan is, paradoxically, one of the historical periods most represented in AI training data and most consistently misrepresented in AI output. The volume of Japanese-inspired content in the training data is enormous – but much of it is fantasy, cinematic, and anachronistic rather than historically grounded.

Anachronistic armor styles across periods. Japanese armor evolved significantly between the Heian period and the Edo period – roughly 900 years of development. AI collapses this entirely, producing armor that mixes elements from incompatible periods freely. Ō-yoroi lamellar armor appropriate to the Heian and Kamakura periods appears on figures in Edo-period settings. Tosei-gusoku plate-and-mail hybrid armor from the Sengoku period appears in contexts that predate it by centuries. The mixing is systematic and invisible to anyone who doesn’t know the specific period being depicted.

The lone samurai aesthetic. AI defaults to the solitary, dramatically posed samurai warrior – a cinematic archetype rooted in 20th-century film rather than historical practice. Actual samurai were members of a complex bureaucratic and social institution, operating within hierarchical structures of loyalty and obligation. The lone warrior aesthetic erases that institutional context entirely, producing imagery that looks feudal Japanese without being historically grounded.

Weapon and period mismatches. The katana, which as noted causes problems even in Roman contexts, is AI’s default Japanese weapon regardless of period. Earlier Japanese weapon forms – the tachi, the chokutō, and the naginata as a primary weapon – are systematically underrepresented. Prompting for period-specific weapon forms requires explicit terminology and persistent negative prompting against the katana default.

Fantasy contamination. The volume of fantasy-Japanese content in AI training data – anime, video games, wuxia-adjacent imagery, mythological illustration – vastly exceeds the volume of historically grounded feudal Japanese content. The result is imagery that reads as Japanese-inspired rather than Japanese-accurate: impossible sword geometries, armor with fantasy ornamentation that has no historical basis, architectural elements drawn from cinematic interpretations rather than documented structures. Prompting for specific historical periods by name – Heian, Kamakura, Muromachi, Sengoku, Edo – and specifying documented material culture for each period suppresses this significantly.

Architecture defaults to the cinematic. AI-generated feudal Japanese architecture defaults to the multi-tiered castle pagoda aesthetic familiar from film and tourism imagery. Actual period-appropriate architecture varied enormously by era, function, and region. Early period structures, shinden-zukuri aristocratic residential compounds, and the modest built environment of farming and merchant communities are almost entirely absent from AI defaults. The castle is the default; everything else requires explicit prompting.

What AI Gets Wrong About Ancient Greece

Ancient Greece presents a challenge distinct from those of Rome, the Viking Age, and feudal Japan: Greece and Rome are so frequently conflated in popular imagery – and in AI training data – that generating specifically Greek content without Roman contamination requires deliberate effort.

The Rome-Greece conflation. AI training data contains enormous volumes of imagery that blends Greek and Roman visual vocabulary without distinction. The result is images that are generically “classical” rather than specifically Greek – Roman architectural forms in Greek settings, Roman military equipment on Greek warriors, Roman social aesthetics in Greek domestic scenes. The Parthenon and the Pantheon are separated by five centuries and a civilization. AI treats them as interchangeable backdrops.

Hoplite equipment errors. The Greek hoplite is among the most visually distinctive warriors in ancient history – the large round aspis shield, the linothorax linen cuirass or bronze thorax, the Corinthian or Attic helmet with its distinctive crest. AI conflates this with Roman legionary equipment constantly, producing figures that carry the right general aesthetic but the wrong specific gear. The aspis becomes a Roman scutum. The linothorax becomes lorica segmentata. The Corinthian helmet acquires a Roman neck guard. Each substitution is individually small and collectively corrosive to historical specificity.

The white marble problem, amplified. The white marble default described in Article 2 is even more pronounced for ancient Greece than for Rome – the Parthenon and its kin are among the most-reproduced images in Western cultural history, and they are universally depicted as white stone. The original structures were painted in vivid polychrome – red, blue, gold, and green on the sculptural friezes and architectural details. Prompting for painted surfaces, visible pigment on architectural elements, and polychrome decoration produces dramatically more accurate results and imagery that most viewers will find genuinely surprising.

The symposium and agora default. AI defaults to either battle scenes or monumental architecture when generating ancient Greek content, collapsing the enormous social and intellectual texture of Greek life into its most visually dramatic elements. The symposium, the agora, the gymnasium, the domestic oikos – the everyday settings where most Greek life occurred – require explicit prompting to generate and tend to drift toward Roman visual vocabulary without persistent correction.

Adapting the Pipeline

The transfer from ancient Rome to any of these periods requires changes to exactly one body of content: the Historical Consistency Ruleset in the audit prompt, together with its mirror in the corrective re-prompting prompt.

The structure stays identical. The Visual Anchoring Pass, the Evidence Confirmation Pass, the Negative Knowledge Gate, the tier validation model, the showcase-worthiness rating criteria, the structured issue handoff format – none of these change. They are architectural constants. The period-specific content slots in where the Roman rules currently sit.

For practical adaptation:

Replace the weapon constraints with period-accurate equivalents – Viking seax and ulfberht sword forms instead of gladius and spatha; Heian tachi instead of katana; Greek xiphos and dory spear instead of Roman short sword
Replace the armor constraints with documented period types – Viking brynja mail instead of lorica segmentata; Greek linothorax and thorax instead of Roman lorica
Replace the architectural constraints with period-specific forms – Norse longhouse timber construction instead of Roman round arches; Greek Doric and Ionic orders instead of Roman construction materials
Replace the lighting constraints only if the period requires it – open flame is the correct constraint for all pre-industrial periods, so this layer transfers without modification
Add period-specific anachronism categories – horned helmets for Viking content, katana defaults for any Japanese period, Rome-Greece conflation for Greek content

The corrective re-prompting structure transfers without modification. The feedback loop – audit failure becomes targeted negative prompt language – works identically regardless of the period being corrected.
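One way to picture the adaptation is as a data swap: the structure stays fixed and only the ruleset slots change. The sketch below is a minimal illustration, not the production prompt documents. The dictionary keys, the adapt_ruleset helper, and the shortened rule strings are assumptions made for the example, condensed from the lists above.

```python
# Hypothetical sketch: period rules as swappable data. Key names and rule
# wording are illustrative, condensed from the adaptation list above.
ROMAN_RULESET = {
    "lighting": "open flame only",
    "weapons": "gladius or spatha",
    "armor": "Roman armor types only",
    "architecture": "round Roman arches",
    "anachronisms": ["katana", "Gothic pointed arch"],
}

VIKING_OVERRIDES = {
    "weapons": "seax and ulfberht sword forms",
    "armor": "brynja mail",
    "architecture": "Norse longhouse timber construction",
    "anachronisms": ["horned helmet"],
}


def adapt_ruleset(base: dict, overrides: dict) -> dict:
    """Return a new ruleset with the period-specific slots replaced.

    Slots missing from overrides (here, lighting) transfer unchanged.
    Unknown slot names are rejected so a typo cannot add a new layer.
    """
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown ruleset slots: {sorted(unknown)}")
    return {**base, **overrides}


viking_ruleset = adapt_ruleset(ROMAN_RULESET, VIKING_OVERRIDES)
print(viking_ruleset["lighting"])  # the open-flame rule carries over
print(viking_ruleset["weapons"])
```

Rejecting unknown slots is a design choice for the sketch: it keeps the set of layers constant, so an adaptation can only change what a slot contains, never how many slots exist.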

What This Series Built

Six articles. One pipeline. A methodology that began with a Roman legionary carrying a katana and produced a commercially structured historical image catalog, a validated metadata architecture, and a transferable framework for accuracy-constrained AI illustration across any historical period where getting the details right matters.

The Vault of Ages catalog – cinematic, historically accurate illustration for ancient Roman gladiator culture – is available now at HawkesAdventures.com. Viking Age and ancient Greek catalogs are in development, built on the same pipeline and held to the same accuracy standard.

Future articles in the L. M. Hawkes series will cover those periods specifically as the catalogs develop – what the failures looked like, what the corrections produced, and what the community evidence shows about where AI’s historical imagination falls short for each civilization.

The complete production-ready prompt documents for the Roman pipeline – the full Historical Consistency Ruleset, the structured issue handoff format, the canonical failure tag vocabulary, and the corrective re-prompting template – are available as a free resource at HawkesAdventures.com.

Build the pipeline before you generate the first image. The rest follows.

This is Part 6 of 6.

Previously, Part 5: The Database Behind the Art

The complete series – along with the full prompt document library – is available at HawkesAdventures.com.

“Go build something accurate!”

Tags: Artificial Intelligence · History · Midjourney · Viking Age · Feudal Japan · Ancient Greece · Ancient Rome · Prompt Engineering · Game Design · Historical Fiction

The post The Pipeline Works for Any Historical Period appeared first on Hawkes Adventures.

The Database Behind the Art

LMHawkes — Sat, 30 May 2026 14:05:00 +0000

What Happens After an Image Passes the Audit

By L. M. Hawkes · HawkesAdventures.com

The previous four articles in this series covered the problem, the taxonomy, the pipeline, and the prompts. If you’ve followed along, you know how a historically accurate AI-generated image gets made – the structured generation, the layered audit, the corrective re-prompting loop that turns failures into intelligence.

This article covers what happens next.

Passing the audit is not the end of the process. It is the beginning of a different one. An image that has cleared the historical accuracy bar and earned a four or five star showcase rating still needs to be classified, described, cataloged, and made commercially useful. At catalog scale – hundreds of images across multiple series, historical periods, and use cases – that work cannot be done casually. It requires a system.

This article describes that system: the metadata architecture behind the Vault of Ages catalog. It is the least glamorous part of the workflow and the part most people would never build. It is also the part that makes the catalog a product rather than a folder of images.

Why Metadata Matters at Scale

A single well-labeled image is a convenience. A consistently labeled catalog of several hundred images is an asset.

The difference is findability, usability, and commercial clarity. A buyer licensing images for a tabletop RPG supplement needs to find images by scene type, subject, composition, and mood – not by scrolling through thumbnails. A publisher licensing for interactive fiction needs to know which images are portrait-eligible, which are print-eligible, and which carry commercial use rights. A developer building a configurator interface needs structured data, not filenames.

Without consistent metadata, a large catalog is difficult to search, difficult to license, and difficult to present. With it, the same catalog becomes a structured commercial archive that can be queried, filtered, configured, and delivered programmatically.

The metadata system behind the Vault of Ages was built to serve all of those use cases simultaneously.

The Database Architecture

Every image in the catalog resolves to a record in a central database table – wp_hawkes_images – with a unique image ID as the authoritative identifier. The image ID is the anchor for everything else. Filenames, variants, and derivative files all resolve back to the same database record. The database is the source of truth; the filename is just a pointer.

This design decision has a practical consequence that matters at scale: you can rename files, generate new variants, resize for different platforms, and create derivative formats without any of those operations affecting the underlying record. The image remains the same cataloged asset regardless of what happens to the files that represent it.

The image ID itself follows a strict format – alphanumeric only, no hyphens, no underscores. This is not aesthetic preference. It is a data integrity rule that prevents a category of lookup failures that occur when ID formats are inconsistent across a large dataset.
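A format rule like this is easy to enforce mechanically. The sketch below shows one possible check. It assumes "alphanumeric only" means ASCII letters and digits with at least one character; no length or case rule is stated here, so none is applied. The example IDs are invented for illustration.

```python
import re

# Assumption: ASCII letters and digits only, one or more characters.
IMAGE_ID_PATTERN = re.compile(r"[A-Za-z0-9]+")


def is_valid_image_id(image_id: str) -> bool:
    """Return True when the ID contains only ASCII letters and digits."""
    return IMAGE_ID_PATTERN.fullmatch(image_id) is not None


# Hypothetical IDs: only the first one satisfies the format rule.
for candidate in ("VOA00123", "VOA-00123", "VOA_00123", ""):
    print(repr(candidate), is_valid_image_id(candidate))
```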

What Gets Recorded

Every image record carries two categories of metadata: scalar fields and child table entries.

Scalar fields describe the image as a single entity – its primary subject, scene type, environment, composition, lighting style, shot distance, view angle, orientation, aspect ratio, and a museum-style factual description. They also carry accessibility data – alt text written to literal accessibility standards, not repurposed from the description field – and a set of commercial eligibility flags covering showcase status, print eligibility, portrait eligibility, marketing eligibility, and art pack inclusion.

The distinction between description and alt text is enforced as a system rule, not a suggestion. A description is factual prose about what the image depicts. Alt text is a literal accessibility statement of what a screen reader needs to convey. They serve different functions, they are written differently, and they are never interchangeable.

Child table entries capture the multi-value attributes that don’t fit cleanly into scalar fields: keywords, moods, historical elements, secondary subjects, notable objects, context tags, audience assignments, and use cases. Each of these lives in its own table, linked to the image ID, and each is validated against a canonical vocabulary of allowed values.
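The split between scalar fields and child tables can be sketched with a small in-memory database. This is an illustration only: it uses SQLite as a stand-in, and apart from the wp_hawkes_images table name every column name, the image_keywords table, and the sample row are assumptions, not the catalog's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforce the link between child rows and their parent image record.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- Scalar fields: one value per image. Column names are hypothetical.
    CREATE TABLE wp_hawkes_images (
        image_id        TEXT PRIMARY KEY,
        primary_subject TEXT NOT NULL,
        scene_type      TEXT NOT NULL,
        description     TEXT NOT NULL,   -- factual prose
        alt_text        TEXT NOT NULL,   -- accessibility text, kept separate
        print_eligible  INTEGER NOT NULL DEFAULT 0
    );
    -- Child table: many values per image, linked by image ID.
    CREATE TABLE image_keywords (
        image_id TEXT NOT NULL REFERENCES wp_hawkes_images(image_id),
        keyword  TEXT NOT NULL,
        PRIMARY KEY (image_id, keyword)
    );
    """
)

conn.execute(
    "INSERT INTO wp_hawkes_images VALUES (?, ?, ?, ?, ?, ?)",
    ("VOA00123", "gladiator", "arena",
     "A gladiator stands in an arena.",
     "Gladiator holding a short sword in a sunlit arena.", 1),
)
conn.executemany(
    "INSERT INTO image_keywords VALUES (?, ?)",
    [("VOA00123", "gladiator"), ("VOA00123", "arena")],
)

# A buyer-style query: print-eligible images tagged with a given keyword.
rows = conn.execute(
    """
    SELECT i.image_id
    FROM wp_hawkes_images AS i
    JOIN image_keywords AS k ON k.image_id = i.image_id
    WHERE k.keyword = ? AND i.print_eligible = 1
    """,
    ("arena",),
).fetchall()
print(rows)
```

The same pattern would repeat for moods, historical elements, and the other multi-value attributes, each in its own child table keyed by image ID.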

The Validation System

Metadata at catalog scale degrades without enforcement. Fields get left blank. Values get entered inconsistently. Enum fields accumulate variants – “gladiator,” “Gladiator,” “GLADIATOR,” “roman gladiator” – that are semantically identical but structurally incompatible with search and filtering.

The metadata system uses a three-tier validation model to prevent this.

Tier 1 – Direct visual validation. Fields that can be verified against the image itself – subject, composition, lighting style, environment, architecture, scene type – are validated by direct comparison between the metadata record and the actual image content. A record that says the primary subject is a Roman officer when the image clearly shows a gladiator triggers a proposed update rather than standing as an acceptable discrepancy.

Tier 2 – AI-suggested fields requiring human review. Commercial fields – audience assignment, showcase status, print eligibility, marketing eligibility – cannot be verified visually. The system generates proposed values for these fields and marks them explicitly as AI-suggested, requiring human confirmation before they enter the record. The system never auto-applies commercial metadata. It proposes; a human decides.

Tier 3 – Structural and database validation. Image ID format compliance, NULL detection, blank detection, enum validity – these are checked mechanically against schema rules. An image ID containing a hyphen fails format validation regardless of what the image shows. A required field containing NULL is flagged for resolution regardless of how good the image is.

The validation system produces two outputs: a structured JSON report of every field’s status with explicit reasoning, and a set of SQL statements for manual review. Nothing is auto-executed. Every proposed change is human-reviewed before it touches the database.
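The Tier 3 checks and the two outputs can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production validator: the required-field list, the small subject vocabulary, the report layout, and the validate_structure function are all hypothetical. The proposed SQL is returned as text plus parameters and is never executed.

```python
import json
import re

# Hypothetical canonical vocabulary and required-field list for illustration.
ALLOWED_SUBJECTS = {"gladiator", "roman officer", "UNSPECIFIED"}
REQUIRED_FIELDS = ("image_id", "primary_subject", "alt_text")


def validate_structure(record: dict) -> tuple[dict, list]:
    """Run Tier 3 checks and return (report, proposed SQL for human review)."""
    report, proposals = {}, []

    # NULL and blank detection on every required field.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None:
            report[field] = {"status": "fail", "reason": "value is NULL"}
        elif not str(value).strip():
            report[field] = {"status": "fail", "reason": "value is blank"}
        else:
            report[field] = {"status": "pass", "reason": "value present"}

    # Image ID format compliance: alphanumeric only.
    image_id = record.get("image_id")
    if report["image_id"]["status"] == "pass" and not re.fullmatch(
        r"[A-Za-z0-9]+", str(image_id)
    ):
        report["image_id"] = {"status": "fail", "reason": "ID is not alphanumeric"}

    # Enum validity against the canonical vocabulary.
    subject = record.get("primary_subject")
    if report["primary_subject"]["status"] == "pass" and subject not in ALLOWED_SUBJECTS:
        report["primary_subject"] = {
            "status": "fail",
            "reason": f"'{subject}' is not a canonical value",
        }
        canonical = str(subject).strip().lower()
        if canonical in ALLOWED_SUBJECTS:
            # Proposed only: returned for manual review, not run here.
            proposals.append((
                "UPDATE wp_hawkes_images SET primary_subject = ? WHERE image_id = ?",
                (canonical, image_id),
            ))
    return report, proposals


report, proposals = validate_structure(
    {"image_id": "VOA00123", "primary_subject": "Gladiator", "alt_text": None}
)
print(json.dumps(report, indent=2))
print(proposals)
```

Keeping the statement and its parameters separate means the reviewer sees exactly what would change, and nothing reaches the database until a person runs it.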

The Anti-Hallucination Design

The metadata system was built with one failure mode as the primary design constraint: the tendency of AI systems to invent values for fields they cannot assess.

An AI evaluator asked to fill in metadata for an image will produce values for every field – including fields where the image provides no evidence. Footwear not visible in the image becomes “Roman caligae” because that is the most plausible Roman footwear. Architecture partially out of frame becomes a specific architectural type because the evaluator inferred it from context. These inferences are frequently wrong, and at catalog scale, frequently wrong compounds into systematically wrong metadata.

The system prevents this through a strict missing-value protocol. If a value cannot be determined from visible image evidence, the correct entry is “UNSPECIFIED” – not a guess, not an inference, not the most plausible option. UNSPECIFIED is a valid, searchable state. An invented value that is wrong is not.

This same philosophy runs through every layer of the metadata system, just as it runs through the audit prompt. The principle is consistent: visible evidence overrides inference. Inference without visible evidence produces no entry.
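The missing-value protocol reduces to one rule that can be shown in code. The sketch below assumes the evaluator reports each value together with a flag saying whether it was actually visible; the Observation type, the field names, and the resolve_field helper are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional

UNSPECIFIED = "UNSPECIFIED"
FIELDS = ("footwear", "architecture", "lighting_style")  # hypothetical field names


@dataclass(frozen=True)
class Observation:
    value: str
    visible: bool  # True only when the element is clearly visible in the image


def resolve_field(observation: Optional[Observation]) -> str:
    """Accept a value only when it rests on visible evidence."""
    if observation is None or not observation.visible or not observation.value.strip():
        return UNSPECIFIED
    return observation.value


evaluator_output = {
    # Plausible guess, but the feet are cropped out of frame.
    "footwear": Observation("Roman caligae", visible=False),
    "lighting_style": Observation("torchlight", visible=True),
}
record = {field: resolve_field(evaluator_output.get(field)) for field in FIELDS}
print(record)
```

The inferred footwear value is discarded even though it is the most plausible option, and the field the evaluator never mentioned resolves to the same searchable state.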

Why This Level of Infrastructure

The honest answer is: because the catalog is a commercial product, not a creative project.

A creative project can live in a folder with descriptive filenames. A commercial catalog that needs to support licensing queries, configurator interfaces, search and filtering, audience targeting, and derivative use tracking cannot. The infrastructure is not the interesting part of this work – the images are, the history is, the stories the gamebooks tell are. But without the infrastructure, none of those things are commercially viable at scale.

The Vault of Ages Art Pack Configurator at HawkesAdventures.com is the interface that sits on top of this system. It is what makes the catalog navigable and licensable for the tabletop RPG designers, game developers, interactive fiction writers, and historical content creators who are the catalog’s primary commercial audience. The metadata is what makes the configurator possible.

The Broader Principle

Every creative project that generates assets at scale eventually faces the same problem: the assets outgrow the system used to manage them.

A folder of 50 images is manageable. A catalog of 500 images covering multiple historical periods, dozens of scene types, hundreds of subjects, and multiple licensing tiers is not – not without structured data behind it. The time to build that structure is before the catalog grows past the point where retroactive organization becomes the project itself.

Build the metadata system before you need it. The pipeline article said the same thing about the audit prompt: build it before you generate the first image, not after you’ve generated a thousand. The same principle applies here.

The final article in this series takes the pipeline, the prompts, and the infrastructure described across this series and asks the question that matters for anyone who wants to apply this work to their own historical period: what does AI get wrong about Vikings, feudal Japan, and ancient Greece – and how does the same three-stage approach fix it?

This is Part 5 of a 6-part series.

Previously, Part 4: The Full Prompts

Coming next week, Part 6: The Pipeline Works for Any Historical Period – adapting the workflow for Vikings, feudal Japan, and ancient Greece.

Tags: Artificial Intelligence · Database Design · History · Game Design · Digital Publishing · Workflow · Indie Creator · Historical Fiction

The post The Database Behind the Art appeared first on Hawkes Adventures.

The Full Prompts

LMHawkes — Sat, 23 May 2026 14:05:00 +0000

The Annotated Audit and Corrective Re-Prompting Architecture That Powers the Vault of Ages

By L. M. Hawkes · HawkesAdventures.com

The previous three articles in this series established the problem – nine systematic failure categories in AI-generated Roman imagery – and the three-stage pipeline built to correct them. This article delivers the prompt architecture itself.

What follows is the annotated version: enough to understand the design logic and implement the pipeline for your own catalog. The complete production-ready documents – the full Roman Historical Consistency Ruleset, the structured issue handoff format, the canonical failure tag vocabulary, and the corrective re-prompting template with all switches – are available as a free resource at HawkesAdventures.com.

If you haven’t read Articles 1 through 3, the short version is this: AI historical failures are patterned and systematic, which means they respond to structured, layered prompting rather than to longer or louder instructions. These prompts are the product of four months of iteration across 1,106 images. They are not theoretical. They were built by running them.

Why Prompt Architecture Matters

A prompt is not a wish. It is an instruction set, and instruction sets have architecture – structure that determines how reliably the instructions are followed and how gracefully the system handles edge cases.

Most AI image prompts are flat lists: a subject, a style, a few adjectives, some negative terms. That structure works for creative work where variation is acceptable. It does not work for accuracy-constrained work where specific failures need to be specifically excluded.

The audit prompt and the corrective re-prompting prompt described below are layered documents. Each layer has a distinct function. Understanding why each layer exists is as important as understanding what it says – because when you adapt these for a different historical period, you need to know which layers are period-specific and which are structural constants.

The Audit Prompt – Annotated

The audit prompt is submitted to Claude with a batch of up to 15 generated images attached. It runs every image through a structured evaluation sequence before any image qualifies for the catalog.

The prompt has five functional layers.

Layer 1: The Visual Anchoring Pass

Before any historical evaluation, the evaluator lists the primary visual elements actually present in each image – people, clothing and armor, weapons and equipment, architecture, environment, lighting.

Why this layer exists: It prevents evaluation of things that aren’t there. An AI evaluator will hallucinate violations against elements it assumes are present but cannot actually see. Forcing an explicit inventory of visible elements before evaluation begins grounds every subsequent judgment in observable evidence. You cannot flag what you cannot see.

What to keep when adapting: This layer is period-agnostic. It runs identically regardless of the historical setting.

Layer 2: The Evidence Confirmation Pass

Every suspected issue must be confirmed against clearly visible image evidence before it can be assigned as a failure. Suspicion is not sufficient. If a potential problem cannot be confirmed from visible pixels, it is marked uncertain rather than flagged.

Why this layer exists: It prevents false positives. Without this constraint, an evaluator will flag shadows as anachronistic lighting, obscured footwear as modern shoes, and blurred background architecture as Gothic arches. False positives waste correction cycles and erode trust in the evaluation system. This layer enforces the discipline: visible evidence overrides suspicion. Suspicion without visible evidence produces no rejection.

What to keep when adapting: This layer is structural, not period-specific. Keep it verbatim.

Layer 3: The Negative Knowledge Gate

The evaluator must distinguish between three states: visible violation, uncertain visibility, and not visible or not assessable. Elements that cannot be confidently assessed from visible pixels – footwear cropped out of frame, armor hidden by shadow, architecture partially obscured – are not flagged.

Why this layer exists: It prevents penalizing images for what they don’t show. An image with no visible footwear has not failed the footwear check – it simply has no footwear to evaluate. This distinction matters enormously at scale. Without it, the false positive rate rises to the point where the audit becomes an obstacle rather than a filter.

What to keep when adapting: Structural constant. Keep verbatim.
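The three states can be modeled as an explicit type so that only one of them can ever cause a rejection. This is a sketch of the idea, not the audit prompt itself. The Finding record, the apply_gate function, and the sample findings are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    VISIBLE_VIOLATION = "visible violation"
    UNCERTAIN = "uncertain visibility"
    NOT_ASSESSABLE = "not visible or not assessable"


@dataclass(frozen=True)
class Finding:
    element: str
    note: str
    state: Visibility


def apply_gate(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Split findings so that only confirmed, visible violations count as failures."""
    result: dict[str, list[Finding]] = {"failures": [], "uncertain": [], "skipped": []}
    for finding in findings:
        if finding.state is Visibility.VISIBLE_VIOLATION:
            result["failures"].append(finding)
        elif finding.state is Visibility.UNCERTAIN:
            result["uncertain"].append(finding)
        else:
            result["skipped"].append(finding)
    return result


gated = apply_gate([
    Finding("architecture", "pointed arch in background", Visibility.VISIBLE_VIOLATION),
    Finding("armor", "shoulder piece hidden by shadow", Visibility.UNCERTAIN),
    Finding("footwear", "feet cropped out of frame", Visibility.NOT_ASSESSABLE),
])
rejected = bool(gated["failures"])
print(rejected, [f.element for f in gated["failures"]])
```

An image with only uncertain or unassessable findings passes the gate; the cropped footwear is recorded but never penalized.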

Layer 4: The Historical Consistency Ruleset

This is the period-specific core of the audit. For ancient Rome, it covers:

Lighting: Open flame only – torches, oil lamps, braziers. No electric lighting, no glass-paneled fixtures, no pipe runs or conduit.
Clothing: Roman garments only. No zippers, elastic fabrics, or modern tailoring.
Footwear: Roman caligae or bare feet only.
Armor: Roman armor types only. No medieval plate, no enclosed visors, no Gothic pauldrons.
Weapons: Gladius or spatha only, worn on the hip. No katanas, longswords, rapiers, or non-Roman blades.
Architecture: Round Roman arches only. No Gothic pointed arches, no medieval fortifications, no Renaissance facades. No flat transparent glass.
Symbols: No crosses, no Christian iconography, no heraldry, no Arabic numerals.
Materials: No plastic, chrome finishes, synthetic fabrics, or modern machined hardware.
Hairstyle: Roman male styles only. No modern fades, undercuts, or gel-styled hair.

Why this layer exists: It is the accuracy standard against which every image is evaluated. Every rule corresponds to a documented failure category from the audit history.

What to adapt: This entire layer is period-specific. For Viking Age content, the ruleset changes entirely – different armor types, different architectural forms, different lighting constraints, different weapon categories. The structure of the layer stays the same; the content is replaced.

Layer 5: The Showcase-Worthiness Rating

Images that pass all historical checks receive a 1–5 star rating based on composition, atmospheric conviction, and visual impact. The rating criteria are calibrated:

Five stars: Exceptional. Cinematic composition, convincing atmosphere, no visible artifacts. Could function as cover art or promotional material.
Four stars: Strong. Clear subject, convincing Roman atmosphere, minor imperfections that don’t distract. Suitable for catalog publication.
Three stars: Usable but not standout. Historically accurate, compositionally ordinary.
Two stars: Weak composition or noticeable flaws. Technically passes but visually mediocre.
One star: Barely usable. Technically passes historical checks, visually ineffective.

Only images that pass historical evaluation receive a rating. Rejected images are not rated.

Why this layer exists: Accuracy is the threshold. Quality is the filter. An image that is historically accurate but compositionally weak serves the catalog no better than an inaccurate one. The rating system separates the threshold from the standard.
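The relationship between the threshold and the filter can be written as a small decision function. The sketch below is illustrative: the catalog_decision name and the return labels are hypothetical, and the cutoff of four stars is an assumption based on the four-star description ("Suitable for catalog publication").

```python
from typing import Optional

CATALOG_MIN_STARS = 4  # assumption: four stars is the catalog cutoff


def catalog_decision(passed_history: bool, stars: Optional[int] = None) -> str:
    """Apply accuracy as the threshold, then quality as the filter."""
    if not passed_history:
        # Rejected images are not rated at all.
        if stars is not None:
            raise ValueError("rejected images are not rated")
        return "rejected"
    if stars is None or not 1 <= stars <= 5:
        raise ValueError("passing images need a rating from 1 to 5")
    return "catalog" if stars >= CATALOG_MIN_STARS else "held back"


print(catalog_decision(False))     # fails the historical checks
print(catalog_decision(True, 3))   # accurate but compositionally ordinary
print(catalog_decision(True, 5))   # accurate and exceptional
```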

The Corrective Re-Prompting Prompt – Annotated

The corrective re-prompting prompt takes three inputs – the original scene description, the audit output for that image, and the period-accuracy criteria – and outputs a refined Midjourney prompt with targeted negative specifications for exactly what went wrong.

The prompt has four functional components.

Component 1: The Style and Atmosphere Frame

Every corrective prompt opens with the same style and atmosphere instruction: photographic realism, bronze and ochre palette, gritty atmospheric lighting, the specific Midjourney switches (--ar 2:3 --chaos 5 --v 6 --q 2).

Why this component exists: It anchors the output in the visual language established across the entire catalog. Consistency in style and atmosphere is what makes a multi-hundred-image catalog read as a coherent body of work rather than a collection of unrelated images.

What to keep when adapting: The palette and atmosphere frame is specific to the Vault of Ages Roman catalog. For a Viking Age catalog, the palette would shift – cooler tones, different atmospheric qualities. The structural role of this component stays the same; the specific values change.

Component 2: The Accuracy Criteria Block

The corrective prompt carries the same core accuracy criteria as the audit – the period-specific ruleset translated from evaluation criteria into generative constraints. What the audit checks for, the corrective prompt explicitly excludes.

Why this component exists: It ensures that corrections don’t introduce new failures while fixing the documented ones. A prompt that fixes the Gothic arch but drops the lighting constraints will fix the architecture and re-introduce the Victorian lanterns.

Component 3: The Scene Text

The original scene description, unchanged. The correction addresses what went wrong with the execution – not with the subject matter.

Component 4: The Targeted Corrections

This is where the audit output translates into specific negative prompt language. Each documented failure becomes a precise visual exclusion:

Victorian bracket lanterns with glass panels → “Roman oil lamp on simple iron bracket, no glass panels, no pipe runs, no conduit, warm flickering flame only, no Victorian or post-Roman lighting elements”
Sword worn on back → “gladius worn on hip at right side, not carried on back, hip scabbard only”
Gothic arch in background → “round Roman arch only, no pointed arches, no Gothic structural elements, semicircular vault”
Enclosed visor helmet → “open-face Roman galea with cheek guards, no enclosed visor, no full-face coverage”

The specificity is the mechanism. Vague exclusions tell the model what category to avoid. Precise visual descriptions tell the model exactly what the failure looks like and what the correct version looks like instead. The model responds to the second kind of instruction far more reliably than the first.
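The four components can be assembled mechanically once each audit failure maps to its corrective language. The sketch below is a minimal illustration, not the production template. The tag names, the shortened accuracy criteria string, and the build_corrective_prompt function are hypothetical; the correction strings and switches are copied from this article.

```python
# Component 1: style and atmosphere frame, plus the switches quoted above.
STYLE_FRAME = "photographic realism, bronze and ochre palette, gritty atmospheric lighting"
SWITCHES = "--ar 2:3 --chaos 5 --v 6 --q 2"

# Component 2: a shortened stand-in for the full accuracy criteria block.
ACCURACY_CRITERIA = "open flame lighting only, round Roman arches only, Roman armor types only"

# Component 4: hypothetical failure tags mapped to the corrective language above.
CORRECTIONS = {
    "sword_on_back": "gladius worn on hip at right side, not carried on back, hip scabbard only",
    "gothic_arch": "round Roman arch only, no pointed arches, no Gothic structural elements, semicircular vault",
    "enclosed_visor": "open-face Roman galea with cheek guards, no enclosed visor, no full-face coverage",
}


def build_corrective_prompt(scene: str, failure_tags: list[str]) -> str:
    """Combine frame, criteria, unchanged scene text, and targeted corrections."""
    unknown = [tag for tag in failure_tags if tag not in CORRECTIONS]
    if unknown:
        raise KeyError(f"no corrective language for tags: {unknown}")
    corrections = [CORRECTIONS[tag] for tag in failure_tags]
    # Component 3: the scene description is passed through untouched.
    body = ", ".join([STYLE_FRAME, ACCURACY_CRITERIA, scene, *corrections])
    return f"{body} {SWITCHES}"


prompt = build_corrective_prompt(
    "a gladiator waits in the arena tunnel", ["gothic_arch", "sword_on_back"]
)
print(prompt)
```

Failing loudly on an unknown tag is deliberate in this sketch: a failure with no written correction should prompt someone to write one, not silently produce an uncorrected prompt.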

Getting the Full Documents

The annotated versions above are sufficient to implement the pipeline. They contain the structural logic and the period-specific rules for ancient Rome.

The complete production-ready documents – with the full structured issue handoff format, the canonical failure tag vocabulary (sixteen tags covering every documented failure category), the complete evidence confirmation procedures, and the corrective re-prompting template formatted for direct use – are available as a free resource at HawkesAdventures.com.

These documents are the product of four months of iteration across 1,106 images. Use them, adapt them for your own historical period, and build the pipeline before you generate your first image – not after you’ve generated a thousand of them. That is the single most useful thing I can tell you from the other side of this process.

What the Pipeline Produces

The Vault of Ages catalog – cinematic, historically accurate illustration for ancient Roman gladiator culture – is available at HawkesAdventures.com under personal and commercial licenses. Commercial licenses permit derivative works: tabletop RPG supplements, game modules, interactive fiction, campaign settings, digital and print publications.

The pipeline that built it works for any historical period where accuracy matters. The next article in this series covers the infrastructure behind the catalog – the metadata architecture that turns a generation workflow into a structured, searchable, commercially viable image archive.

L. M. Hawkes writes cinematic, historically grounded interactive gamebooks drawing from the warrior traditions of Rome, Greece, Japan, the Viking Age, and the great battles of antiquity. The Vault of Ages Art Pack Configurator – a curated catalog of historically accurate cinematic illustration – is available at HawkesAdventures.com under personal and commercial licenses.

This is Part 4 of a 6-part series.

Previously, Part 3: The Three-Stage Correction Pipeline

Coming next week, Part 5: The Database Behind the Art – the metadata infrastructure that turns a generation workflow into a structured commercial archive.

Tags: Artificial Intelligence · Midjourney · Prompt Engineering · History · Ancient Rome · Game Design · Historical Fiction · Workflow

The post The Full Prompts appeared first on Hawkes Adventures.

The Three-Stage Correction Pipeline

LMHawkes — Sat, 16 May 2026 14:05:00 +0000

The Exact Workflow I Built to Make AI Stop Getting Ancient Rome Wrong

By L. M. Hawkes · HawkesAdventures.com

The first two articles in this series documented the problem: AI image generation fails at historical accuracy in nine specific, consistent, and predictable ways. If you haven’t read those pieces, the short version is this – AI doesn’t malfunction when it puts a katana in ancient Rome or lights a Roman interior with Victorian gas-bracket lanterns. It does exactly what it was trained to do. The failures are systematic, not random.

Systematic failures have systematic solutions.

After enough rejected images, a pattern became clear: the failures were documentable, which meant they were correctable – not by prompting harder, but by building a structured feedback loop between generation, audit, and re-generation. Over four months and 1,106 images, that feedback loop became a three-stage pipeline.

This article describes the pipeline. The next article delivers the annotated prompts that power it, with the full production-ready versions available at HawkesAdventures.com.

Why “Prompt Harder” Doesn’t Work

The instinctive response to AI accuracy failures is to add more instructions to the prompt. Be more specific. Use stronger language. Add more negatives.

This helps. It is not sufficient.

The problem is that a longer prompt is not the same as a smarter prompt. Adding “no katanas” to a prompt reduces katana appearances – it does not eliminate them, and it does nothing about the Victorian lanterns you didn’t think to exclude, or the Gothic arch in the background you didn’t notice until the third time you looked at the image, or the gladiator’s sword that is worn correctly on the hip but is clearly a medieval falchion rather than a gladius.

The failures are diverse, interacting, and context-dependent. A static prompt, however detailed, addresses the failures you anticipated. A correction pipeline addresses the failures that actually occurred.

The difference between prompting harder and building a pipeline is the difference between guessing and learning.
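The feedback loop can be sketched as a small control structure. This is an illustration of the idea only: the run_pipeline function, the toy stand-in functions, and the retry limit are assumptions, and a real run would call an image model and an evaluator where the stand-ins are.

```python
from typing import Callable, Optional


def run_pipeline(
    scene: str,
    generate: Callable[[str], dict],
    audit: Callable[[dict], list],
    correct: Callable[[str, list], str],
    max_rounds: int = 3,  # assumption: no retry limit is stated in this series
) -> tuple[Optional[dict], int]:
    """Generate, audit, and re-prompt until an image passes or rounds run out."""
    prompt = scene
    for round_number in range(1, max_rounds + 1):
        image = generate(prompt)
        failures = audit(image)
        if not failures:
            return image, round_number
        # Feed the failures that actually occurred back into the next prompt.
        prompt = correct(scene, failures)
    return None, max_rounds


# Toy stand-ins so the loop can run without an image model.
def fake_generate(prompt: str) -> dict:
    return {"prompt": prompt, "has_gothic_arch": "no pointed arches" not in prompt}


def fake_audit(image: dict) -> list:
    return ["gothic_arch"] if image["has_gothic_arch"] else []


def fake_correct(scene: str, failures: list) -> str:
    return f"{scene}, round Roman arch only, no pointed arches"


image, rounds = run_pipeline(
    "gladiator in arena tunnel", fake_generate, fake_audit, fake_correct
)
print(rounds)
```

In the toy run the first image fails on an arch nobody thought to exclude, and the second prompt carries the exclusion because the audit reported it, which is the learning step a static prompt lacks.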

Stage 1 – Initial Generation

Every image begins with a structured prompt engineered for four specific qualities: photographic realism, period-accurate atmosphere, Roman specificity, and controlled randomness.

Photographic realism is stated as a requirement, not a suggestion. The instruction “must look like a photograph, not an illustration” suppresses the painterly, fantasy-art aesthetic that AI defaults to when generating historical content and pushes output toward documentary visual language.

Period-accurate atmosphere is established through palette and lighting direction rather than through historical instruction. Specifying a bronze and ochre palette and gritty atmospheric lighting produces imagery that feels Roman without requiring the model to independently reach for Roman visual references. The palette does the period work.

Roman specificity is asserted at the material level – brick and stone construction, hand-forged metal, ceramic and bronze lamp vessels, worn and aged surfaces. Vague period references produce generic ancient imagery. Material-level specificity produces Roman imagery.

Controlled randomness is managed through the chaos parameter. A low chaos setting (--chaos 5) minimizes the model’s tendency to introduce unexpected elements, which frequently manifest as anachronisms. Higher chaos settings produce more visually interesting variation but also more historically problematic output. For accuracy-constrained work, low chaos is the correct tradeoff.

The initial prompt also carries a core set of explicit exclusions – the failure categories documented in Article 2, translated into specific visual negative prompts. No glass-paneled lanterns. No pointed arches. No enclosed visors. No swords on backs. These exclusions travel with every prompt in the catalog, regardless of scene.
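As a rough illustration of how those pieces could be combined into a single prompt string, here is a minimal Python sketch. The helper name, the example scene, and the exact wording of each fragment are assumptions made for illustration, not the production prompt; --chaos and --no are standard Midjourney parameters.

```python
# Hypothetical helper: assembles a Stage 1 prompt from the qualities described
# above plus the core exclusions. All wording here is illustrative only.

CORE_EXCLUSIONS = [
    "glass-paneled lanterns",
    "pointed arches",
    "enclosed visors",
    "swords on backs",
]


def build_initial_prompt(scene: str, chaos: int = 5) -> str:
    qualities = [
        # Photographic realism, stated as a requirement
        "must look like a photograph, not an illustration",
        # Period atmosphere carried by palette and lighting
        "bronze and ochre palette, gritty atmospheric lighting",
        # Roman specificity asserted at the material level
        "brick and stone construction, hand-forged metal, worn and aged surfaces",
    ]
    description = ", ".join([scene] + qualities)
    exclusions = ", ".join(CORE_EXCLUSIONS)
    # Low chaos limits unexpected elements; --no carries the exclusions
    return f"{description} --chaos {chaos} --no {exclusions}"


prompt = build_initial_prompt("gladiators training in a ludus courtyard")
print(prompt)
```

Keeping the exclusions in one shared list is what lets them travel with every prompt regardless of scene.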

Stage 2 – Structured Audit

Every generated image runs through a structured evaluation before it qualifies for the catalog. This is not a casual review. It is a systematic check against defined criteria, applied consistently across every image regardless of how good it looks at first glance.

The audit operates in layers, each designed to catch a different class of failure.

The Visual Anchoring Pass comes first. Before any historical evaluation happens, the evaluator identifies and lists the primary visual elements actually present in the image – people, clothing and armor, weapons and equipment, architecture, environment, lighting. This step exists to prevent a specific failure mode: evaluating elements that aren’t actually visible. An artifact that can’t be seen can’t be flagged. This pass grounds the entire evaluation in what is demonstrably present.

The Evidence Confirmation Pass follows. Every suspected issue must be confirmed against clearly visible image evidence before it can be flagged as a failure. Suspicion is not sufficient. If a potential problem cannot be confirmed from visible pixels, it is marked uncertain rather than assigned as a violation. This prevents false positives – the evaluator flagging things that look like they might be wrong without confirming that they actually are.

The historical and accuracy evaluation then checks each visible element against the Roman Historical Consistency Ruleset – the canonical set of period constraints covering lighting, clothing, footwear, armor, weapons, architecture, symbols, materials, and hairstyle. Each failure is assigned a severity level and documented with the specific visible evidence that triggered it.
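One way to picture the audit layers described so far is as a small record structure. The sketch below is an assumption about how such a record could look, not the actual audit format: the class names, field names, and severity labels are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class Status(Enum):
    CONFIRMED = "confirmed"
    UNCERTAIN = "uncertain"


@dataclass
class Issue:
    element: str             # which visible element the issue concerns
    description: str
    evidence: Optional[str]  # visible evidence, or None if it cannot be confirmed
    severity: str = "minor"  # hypothetical severity labels

    @property
    def status(self) -> Status:
        # Evidence confirmation: suspicion without visible evidence stays uncertain
        return Status.CONFIRMED if self.evidence else Status.UNCERTAIN


@dataclass
class Audit:
    visible_elements: Set[str]  # result of the visual anchoring pass
    issues: List[Issue] = field(default_factory=list)

    def flag(self, issue: Issue) -> None:
        # Visual anchoring: an element that was not listed as visible cannot be flagged
        if issue.element not in self.visible_elements:
            raise ValueError(f"{issue.element!r} was not listed as visible")
        self.issues.append(issue)

    def violations(self) -> List[Issue]:
        # Only confirmed issues count as failures
        return [i for i in self.issues if i.status is Status.CONFIRMED]


audit = Audit(visible_elements={"lighting", "architecture"})
audit.flag(Issue("lighting", "glass-paneled lantern",
                 evidence="glass panels visible on wall bracket", severity="major"))
audit.flag(Issue("architecture", "possible pointed arch", evidence=None))
print([i.description for i in audit.violations()])
```

Only the lantern counts as a violation here; the arch stays on record as uncertain because no visible evidence was attached.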

The showcase-worthiness rating closes the audit. Images that pass all historical checks receive a 1–5 star rating for composition, visual impact, and period conviction. This rating is what drives catalog curation – a technically accurate image with weak composition doesn’t serve the catalog any better than an inaccurate one.

Across 1,106 images, roughly 7% achieved five-star showcase status and approximately 12% achieved four-star status. The rest were rejected or flagged for correction. That yield sounds low. It isn’t – it reflects the standard, not the failure rate. A curated catalog of 100 genuinely exceptional, historically accurate images is a fundamentally different product than a large catalog of adequate ones.
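For a sense of scale, the percentages translate into approximate image counts as follows. Because the rates are stated as rough figures, the counts are estimates rather than exact tallies.

```python
total_images = 1106
five_star_rate = 0.07  # "roughly 7%" from the text
four_star_rate = 0.12  # "approximately 12%" from the text

five_star = round(total_images * five_star_rate)
four_star = round(total_images * four_star_rate)

# Approximate counts: five-star, four-star, and the two tiers combined
print(five_star, four_star, five_star + four_star)  # 77 133 210
```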

Stage 3 – Corrective Re-Prompting

This is the most powerful stage and the one most people skip.

Rejected images don’t get discarded. Their audit results feed directly into a refined Midjourney prompt that explicitly addresses the identified failures. The correction loop turns audit failures into prompt intelligence. Every rejected image makes the next generation smarter.

The corrective re-prompting process takes three inputs:

The original scene description
The specific failures identified in the audit, with their visible evidence
The core period-accuracy criteria

It outputs a refined Midjourney prompt with targeted negative specifications for exactly what went wrong in that specific image.
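The three inputs and one output can be sketched as a small function. This is a minimal illustration under stated assumptions: the Failure structure, the function name, and the example scene are hypothetical, and the real template may arrange its prompt differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Failure:
    evidence: str          # what the audit saw; kept for traceability, not sent to the model
    positive_fix: str      # precise description of what should appear instead
    exclusions: List[str]  # precise visual terms to exclude


def build_corrective_prompt(scene: str, failures: List[Failure],
                            core_criteria: List[str]) -> str:
    positives = [scene] + [f.positive_fix for f in failures] + core_criteria
    negatives: List[str] = []
    for failure in failures:
        for term in failure.exclusions:
            if term not in negatives:  # keep order, drop duplicates
                negatives.append(term)
    prompt = ", ".join(positives)
    if negatives:
        prompt += " --no " + ", ".join(negatives)
    return prompt


lantern = Failure(
    evidence="Victorian bracket lanterns with glass panels",
    positive_fix="Roman oil lamp on simple iron bracket, warm flickering flame only",
    exclusions=["glass panels", "pipe runs", "conduit",
                "Victorian or post-Roman lighting elements"],
)
refined = build_corrective_prompt(
    "Roman barracks interior at night",
    [lantern],
    ["brick and stone construction", "worn and aged surfaces"],
)
print(refined)
```

Each failure contributes both a positive description and its own exclusions, so the refined prompt targets only what went wrong in that image.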

The specificity is what makes the difference. Vague exclusions don’t work. “No modern lighting” is not as effective as “Roman oil lamp on simple iron bracket, no glass panels, no pipe runs, no conduit, warm flickering flame only, no Victorian or post-Roman lighting elements.” The model responds to precise visual descriptions of exactly what you don’t want, in the visual language it was trained on.

A prompt that produced Victorian bracket lanterns with glass panels becomes a prompt that produces a ceramic oil lamp casting warm directional light across rough stone. The audit identified the failure. The correction addressed it precisely. The re-generation incorporates that precision.

Borderline images are worth running through this stage too. A borderline image – one that passes historical checks but rates two or three stars for composition or atmosphere – frequently produces a four- or five-star version when given a targeted corrective re-prompt. The instinct to discard borderlines is wrong. Mine them instead.

What the Pipeline Builds

The three-stage pipeline is not just a quality control mechanism. It is a learning system.

Each cycle through the pipeline adds to the accumulated prompt intelligence for a given scene type, period, and accuracy constraint set. By the time you have processed several hundred images, your prompts carry the documented failure history of everything that went wrong before them. The failure rate decreases. The yield improves. The catalog that emerges is not just larger – it is structurally better than anything a static prompting approach could produce.
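The accumulation described here can be pictured as a per-scene-type memory of exclusions that grows with each audited failure. The class and method names below are hypothetical, and the sketch only tracks exclusions, which is a simplification of whatever the real pipeline stores.

```python
from collections import defaultdict
from typing import Dict, List


class ExclusionMemory:
    """Remembers, per scene type, every exclusion that past audits produced."""

    def __init__(self) -> None:
        self._by_scene: Dict[str, List[str]] = defaultdict(list)

    def record(self, scene_type: str, exclusions: List[str]) -> None:
        known = self._by_scene[scene_type]
        for term in exclusions:
            if term not in known:  # each lesson is stored once
                known.append(term)

    def exclusions_for(self, scene_type: str) -> List[str]:
        # Return a copy so callers cannot alter the stored history
        return list(self._by_scene.get(scene_type, []))


memory = ExclusionMemory()
memory.record("interior", ["glass panels", "pipe runs"])
memory.record("interior", ["pipe runs", "conduit"])  # later cycle, partly repeated
memory.record("arena", ["swords on backs"])
print(memory.exclusions_for("interior"))  # ['glass panels', 'pipe runs', 'conduit']
```

A new prompt for an interior scene would start from everything earlier interior failures taught, rather than from a blank list.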

The output of this pipeline is the Vault of Ages – a curated catalog of cinematic historical illustrations for ancient Roman gladiator culture, built to a historical accuracy standard that is documentable, defensible, and distinctive. Showcase and four-star tier images are available at HawkesAdventures.com under personal and commercial licenses, including derivative use for tabletop RPG supplements, game modules, interactive fiction, and digital and print publications.

The Prompts

The next article in this series delivers the annotated prompt architecture for both the audit stage and the corrective re-prompting stage – enough to implement the pipeline for your own catalog, adapted to your own historical period.

The full production-ready versions – the complete Roman Historical Consistency Ruleset, the structured issue handoff format, the canonical failure tag vocabulary, and the corrective re-prompting template – are available as a free resource at HawkesAdventures.com.

The methodology is period-agnostic. Swap ancient Rome for Viking Age Scandinavia, feudal Japan, or ancient Greece, adjust the specific accuracy constraints for the period, and the pipeline runs the same way. Future articles in this series will cover what AI gets systematically wrong about those periods specifically – and how the same three-stage approach addresses it.

This is Part 3 of a 6-part series.

Previously, Part 2: The White Marble Lie

Coming next week, Part 4: The Full Prompts – the annotated audit and corrective re-prompting architecture, with complete production-ready documents at HawkesAdventures.com.

Tags: Artificial Intelligence · Midjourney · History · Workflow · Prompt Engineering · Game Design · Ancient Rome · Historical Fiction


The White Marble Lie

LMHawkes — Sat, 09 May 2026 14:05:00 +0000

The White Marble Lie

Everything You Think Ancient Rome Looked Like Is Wrong – And AI Makes It Worse

By L. M. Hawkes · HawkesAdventures.com

In the first article in this series, I introduced the Katana Problem – AI’s tendency to arm Roman legionaries with weapons from feudal Japan – and argued that AI historical failures are not random. They are patterned, predictable, and systematic.

This article is the full taxonomy.

Across 1,106 AI-generated images audited for historical accuracy, I documented nine distinct failure categories. Some are immediately jarring. Some are subtle enough to slip past casual inspection. All of them are consistent enough to constitute a pattern, and all of them are actively noticed and documented by the Roman history community.

Understanding them matters whether you’re generating images, consuming them, or building anything that depends on historical credibility.

Failure 1: The White Marble Default

This is the single most-cited error in the Roman accuracy community, and it is baked so deeply into AI training data that it requires aggressive, specific prompting to suppress.

AI generates Rome as though every surface is white marble. Temples, forums, baths, private houses – all rendered in gleaming, pristine white stone. It looks authoritative. It looks classical. It is historically wrong.

The actual primary Roman building materials were brick, concrete (opus caementicium), and tufa. Marble facing was applied to specific, high-status surfaces – not plastered universally across an entire civilization. The all-white-marble aesthetic comes from how Roman ruins look today, after seventeen centuries of weathering have stripped away the color, the painted plaster, the decorative cladding, and the surface treatments that covered the stone underneath.

That is what ruins look like. It is not what inhabited Roman buildings looked like.

Real Roman urban environments were brick-red and earth-toned, smoke-darkened near cooking fires and lamps, worn and patched and layered with prior construction. Prompting specifically for brick and stone construction – and explicitly excluding “white marble” as a descriptor – produces dramatically more accurate and, frankly, more interesting results.

Failure 2: Architecture That Belongs to a Different Century

The pointed Gothic arch is arguably the most frequent architectural anachronism in AI-generated Roman imagery, and it is persistent enough to appear even when prompts explicitly specify Roman construction.

The Roman arch is round. The Gothic arch is pointed. They are separated by roughly 800 years of architectural history. AI conflates them constantly, because the training data contains far more Gothic and Medieval European architecture than Roman.

But the Gothic arch is only the most visible symptom of a broader architectural contamination. Also documented in community analysis:

Background buildings matching the Monument to Victor Emmanuel II – inaugurated in 1911
Colonnades and piazzas copying the layout of St. Peter’s Square, which is Renaissance and Baroque, not Roman
The Baths of Caracalla depicted in Baroque interior style, a movement that began nearly 1,400 years after the baths were built
Medieval fortification elements – battlements, drawbridge-style gates, crenellated walls – appearing in Imperial-era settings
Renaissance facade treatments on buildings that should predate the Renaissance by fifteen centuries

A University of Bordeaux art history professor formally documented several of these failures in widely shared AI Rome videos in December 2025. This is not academic nitpicking. It is a documented, public, growing body of evidence that AI-generated historical imagery systematically misrepresents its subject.

Failure 3: The Victorian Interior Problem

This one surprised me more than the others, because it is not an obvious failure category until you have seen enough examples to recognize the pattern.

Roman interior scenes are frequently generated with Victorian gas-bracket lanterns – the kind with glass panels and decorative metalwork that belong in a 19th-century London townhouse, not a 1st-century Roman domus. Alongside the lanterns: visible pipe runs and conduit along walls, and lighting with the quality and directionality of electric light in what should be torch-lit or oil-lamp-lit spaces.

The model has a “dramatic interior” template that pulls from 19th century references rather than ancient ones. Victorian interior design generated an enormous volume of illustration and photography that dominates the training data’s representation of “atmospheric, candlelit interiors.” The model reaches for what it knows.

The correct Roman lighting source is an oil lamp – a simple ceramic or bronze vessel with a wick, producing warm, directional, flickering light. Torches. Braziers. Nothing with glass panels. Nothing with pipe runs. Prompting for these specifically, with explicit exclusions of glass-enclosed fixtures and any visible conduit, produces dramatically better results.

Failure 4: The MMA Fighter Problem

AI renders gladiators as lean, muscular modern athletes. This is historically wrong in a way that is genuinely interesting once you understand the reason.

Historical gladiators deliberately carried significant body fat. This was not poor conditioning – it was tactical. A layer of subcutaneous fat over vital organs meant that sword wounds were more survivable. A cut that might kill a lean fighter would wound but not kill a heavier one. Gladiators were valuable investments for their lanistas. Keeping them alive was good economics.

The historically accurate gladiator looks more like a heavyweight wrestler than an MMA fighter. AI defaults to the lean, defined athletic physique because that is what “warrior” and “fighter” look like in the training data – in action films, in video games, in fantasy illustration. The historical reality runs counter to the aesthetic default.

Prompting explicitly for period-accurate body composition – substantial build, visible body fat over muscle – produces more historically accurate results and, frankly, more unusual and distinctive imagery than the default athletic figure.

Failure 5: The Equipment Mixing Problem

Roman gladiator types were distinct, formally categorized, and immediately recognizable to anyone familiar with the institution. A Retiarius fought with a net and trident and wore minimal armor. A Secutor wore a smooth, close-fitting helmet specifically designed to give the net nothing to catch. A Murmillo carried a large rectangular scutum and wore a helmet with a fish-crest.

AI mixes these freely and without apparent awareness that the categories exist.

A figure carrying a trident and net wearing a Secutor helmet. A Murmillo shield on the wrong body type. Helmet, weapon, and shield combinations that never existed in documented Roman practice. The enthusiast community on r/ancientrome and dedicated gladiator history forums detects these immediately – they are the visual equivalent of putting cavalry insignia on an infantry soldier. The errors are specific, the community is knowledgeable, and the failure to get this right reads as a lack of seriousness about the subject.

Specifying the gladiator type explicitly and naming the correct equipment set in the prompt reduces this significantly.
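The same idea can be applied on the audit side: given a declared gladiator type, check whether the image shows signature gear from a different type. The sketch below is a deliberately incomplete illustration built only from the three examples above; the table contents and function name are assumptions, and shields are left out because, in this sketch, a shield alone is not treated as identifying a type.

```python
from typing import Dict, Set

# Signature items drawn from the three examples in the text; not a full ruleset.
SIGNATURE_ITEMS: Dict[str, Set[str]] = {
    "retiarius": {"net", "trident"},
    "secutor": {"smooth close-fitting helmet"},
    "murmillo": {"fish-crest helmet"},
}


def mixed_equipment(declared_type: str, visible_items: Set[str]) -> Dict[str, Set[str]]:
    """Return signature items in the image that belong to a different gladiator type."""
    conflicts: Dict[str, Set[str]] = {}
    for other_type, items in SIGNATURE_ITEMS.items():
        if other_type == declared_type:
            continue
        overlap = items & visible_items
        if overlap:
            conflicts[other_type] = overlap
    return conflicts


# A net-and-trident fighter wearing a Secutor helmet is flagged as mixed.
print(mixed_equipment("retiarius", {"net", "trident", "smooth close-fitting helmet"}))
# {'secutor': {'smooth close-fitting helmet'}}
```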

Failure 6: Wrong Materials Throughout

AI defaults to materials that look impressive rather than materials that are historically correct.

Beyond the white marble problem, documented failures include:

Metal gates and grating that look too manufactured – machined-looking components where everything would have been hand-forged
Window glass that is flat and transparent – glass panes of that quality did not exist in the Roman period; Roman glass was thick, cloudy, imperfect, and used sparingly
Fabric that looks machine-woven rather than hand-loomed – togas and tunics with the wrong weave texture and wrong drape
Gilding in inappropriate contexts – decoration suited to dry environments appearing in scenes that would have been humid
Chrome and polished-metal finishes that could not have been produced with Roman metallurgical technology

The underlying failure is the same as the white marble problem: AI reaches for the most visually impressive version of a material rather than the historically accurate one.

Failure 7: Civilization Drift

A subtler failure, and arguably the most corrosive to historical credibility: AI’s training data contains far more imagery of ancient civilizations generally than ancient Rome specifically.

The result is images that are tonally “ancient” but not identifiably Roman. Greek architectural orders appear where Roman construction would be used. Egyptian visual vocabulary bleeds into crowd scenes. The image looks and feels historical, but it could be Rome, Athens, Persia, or a fantasy hybrid of all of them simultaneously.

One highly-upvoted comment in the AI art community put it plainly: the model shows what people think Rome looked like, not what it actually looked like. The training data reflects centuries of accumulated popular imagination about antiquity – not Roman-specific visual evidence.

The fix is asserting Roman specificity explicitly and repeatedly in the prompt – specifying construction materials, architectural forms, and social context that are Roman and not merely ancient. Vague period references produce generic ancient imagery. Specific Roman references produce Roman imagery.

Failure 8: Social Class Collapse

Roman visual culture was intensely hierarchical. A senator, an equestrian, a freedman, a slave, and a street merchant looked visibly different from each other – in garment quality, cleanliness, ornamentation, and bearing. The social structure of Roman life was legible on the body.

AI collapses this into one undifferentiated “Roman citizen” aesthetic. Everyone wears roughly similar clothing at roughly similar levels of cleanliness. The enormous diversity of Roman society – freedmen, slaves, provincial subjects, equestrians, senators, soldiers, merchants – is flattened into a single visual register.

A related failure: AI will sometimes dress a figure in fine embroidered garments and jewelry while simultaneously rendering them dirty and disheveled. These status signals are contradictory. A figure in senatorial dress should look like they have servants. Specifying role, social status, and the garment quality appropriate to that status – separately and explicitly – produces more coherent results.

Failure 9: Too Clean, Too New

This is the failure that cuts across every other category. AI generates Rome as if it were freshly built.

Pristine stone with no weathering. Armor with no wear. Textiles with no staining or repair. Environments that look like museum reconstructions rather than inhabited places.

Real ancient cities were worn, smoke-darkened, patched, and layered with construction from multiple eras. Surfaces that had been in use for decades or centuries showed it. The lived-in quality of a real Roman street or interior – the accumulated grime near cooking fires, the scuffed leather, the patina on bronze, the weathered wood – is almost entirely absent from default AI output.

Prompting specifically for aged, worn, authentic surfaces – and explicitly rejecting “pristine,” “new,” and “spotless” as descriptors – produces imagery that reads as genuinely inhabited rather than staged.

What This Adds Up To

Nine failure categories. All of them documented in active community discussion. All of them systematic. All of them correctable.

The community of people who notice these failures is not small. It is active on Reddit, in history forums, in tabletop RPG design communities, and in academic circles. The frustration with AI historical imagery that looks right but isn’t is a live, growing, monetizable gap.

Closing that gap requires more than better prompting. It requires a structured workflow – a correction loop that turns documented failures into prompt intelligence and applies that intelligence systematically across a large catalog.

That workflow is the subject of the next article.

This is Part 2 of a 6-part series.

Previously, Part 1: AI Keeps Putting Katanas in Ancient Rome

Coming next week, Part 3: The Three-Stage Correction Pipeline – the exact workflow I built to fix every failure category in this article.

Tags: Artificial Intelligence · History · Ancient Rome · Midjourney · Historical Fiction · Game Design · Worldbuilding · AI Art


AI Keeps Putting Katanas in Ancient Rome

LMHawkes — Sat, 02 May 2026 21:53:41 +0000

AI Keeps Putting Katanas in Ancient Rome

And It’s Not Random

By L. M. Hawkes · HawkesAdventures.com

Four months ago I started building a catalog of cinematic historical illustrations for a series of interactive gamebooks set in ancient Roman gladiator culture. The illustrations are AI-generated through Midjourney. The historical accuracy standard is not negotiable.

Those two facts turned out to be in significant tension.

Left to its own devices, Midjourney – and generative AI image tools generally – produces imagery that looks historical until you look carefully. Then you start noticing things. A legionary carrying a katana. A Roman forum ringed with Gothic pointed arches. An interior scene lit by Victorian gas-bracket lanterns with glass panels. Pipe runs along the upper walls of a barracks that should have been built in 50 AD.

Over the course of auditing 1,106 images – yes, eleven hundred – I documented exactly what goes wrong, how consistently it goes wrong, and what you can actually do about it. This is the first article in a series covering the complete workflow, including the exact prompts and the infrastructure behind the catalog.

But before the solutions, you need to understand the problem. And the most important thing to understand about AI historical failures is this:

They are not random. They are patterned, predictable, and systematic – which means they are fixable.

The Katana Problem

This is where the series gets its name, because nothing illustrates the core failure more vividly.

A Roman legionary in full lorica segmentata, standing in the Forum, holding a katana.

It happens. It happens more than once. It happened to me eleven times across 1,106 images, and every single occurrence followed the same logic: the model associated warrior and sword and reached for whatever sword archetype dominates its training data. For reasons that probably reflect the sheer volume of Japanese-inspired content in that training data, the answer is frequently a katana.

Prompting explicitly for gladius, for Roman short sword, for period-accurate weaponry reduces it. It does not eliminate it. The model has a default and it will return to that default unless you build specific, persistent constraints against it.

The katana is the most jarring example, but it points to a broader pattern: AI does not know the difference between ancient warrior and medieval warrior or fantasy warrior or feudal Japanese warrior unless you force it to. And even then, it forgets.

The Sword on the Back

Related to the katana problem, and arguably more pervasive: swords worn on the back rather than the hip.

Roman soldiers wore their gladius on the hip or thigh. This is documented, consistent, and not ambiguous. AI defaults to the dramatic over-the-shoulder carry – the drawn-from-behind-the-head position that looks good in fantasy art and action films and is historically wrong for virtually every ancient culture.

The back-carry is a cinematic invention. It persists in AI output because it persists in the training data – in fantasy illustration, in Hollywood imagery, in video game character design. The model has learned that dramatic warrior with sword means sword on back, and it takes explicit, targeted negative prompting to break that association.

This is a small detail. It is also an immediate tell to anyone who knows Roman history. And the community of people who know Roman history – and care about it – is larger, more vocal, and more active than most content creators realize.

Why the Failures Are Patterned

Understanding why AI produces these specific failures consistently is more useful than cataloging the failures themselves.

The model’s training data contains far more post-Roman content than Roman content. Medieval Europe, Renaissance Italy, and Victorian England generated enormous volumes of imagery – paintings, engravings, illustrations, photographs of architecture, museum collections – that dwarf the documentary record of ancient Rome. When the model reaches for warrior, fortress, interior scene, or dramatic lighting, it draws from the richest available pool. That pool skews late.

The result is a set of predictable anachronistic contaminations:

Weapons default to whatever sword archetype dominates the training data – frequently katanas, sometimes longswords or rapiers, almost never the historically correct gladius
Armor defaults to a “generic warrior” template that skews Medieval – enclosed visors, Gothic pauldrons, articulated gauntlets – rather than the open-faced Roman galea and lorica segmentata
Architecture defaults to pointed Gothic arches, which appear everywhere in the training data, rather than the round Roman arch that is architecturally correct and historically specific
Lighting defaults to a “dramatic interior” template drawn from 19th century references – Victorian gas-bracket lanterns with glass panels, visible pipe runs – rather than oil lamps and torch sconces

None of this is the model malfunctioning. It is the model doing exactly what it was trained to do: producing imagery that matches the statistical weight of its training data. The problem is that statistical weight does not map to historical accuracy.

The Stakes

You might reasonably ask: does this actually matter? Who notices a Gothic arch in the background of a Roman interior scene?

The answer, documented from active community monitoring: a lot of people notice, and they say so publicly.

In February 2026, a thread on r/aiArt titled “What makes these images of Ancient Rome historically inaccurate?” attracted immediate, detailed community response cataloging specific failures. A University of Bordeaux art history professor formally documented anachronisms in widely shared AI Rome videos in December 2025. Threads on r/ancientrome and dedicated gladiator history forums detect equipment mixing, armor anachronisms, and architectural errors in AI-generated content within hours of posting.

The community of people who care about Roman historical accuracy is not niche. It is active, literate, and increasingly frustrated by AI-generated content that looks historical without being historical. One highly-upvoted comment put it plainly: AI shows what people think Rome looked like, not what it actually looked like.

That frustration is a gap. The work I’ve been doing is designed to close it.

What Comes Next

Across 1,106 images, I documented nine specific failure categories – weapons and armor are two of them. The remaining seven cover gladiator body type, architectural accuracy, lighting sources, materials and construction, gladiator equipment mixing, civilization drift, and social class collapse. Each one follows the same pattern: a predictable AI default that is historically wrong, consistently generated, and specifically correctable.

The next article in this series covers the full taxonomy – everything AI gets wrong about ancient Rome, with the community evidence to prove it isn’t just my opinion.

After that: the three-stage correction pipeline I built to fix it, the full prompt architecture (with an annotated version here and the complete production-ready documents available at HawkesAdventures.com), and the database infrastructure that turned a generation workflow into a structured commercial catalog.

The AI isn’t broken. It just needs to be told, specifically and repeatedly, that Rome and feudal Japan are separated by thousands of miles and several centuries.

It will listen. Eventually.

This is Part 1 of a 6-part series.

Coming next week, Part 2: The White Marble Lie – the full taxonomy of what AI gets wrong about ancient Rome.

Tags: Artificial Intelligence · History · Midjourney · Game Design · Historical Fiction · Ancient Rome · Worldbuilding · Prompt Engineering

