What Is Harness Engineering? Designing the Rails That Keep AI Agents Stable

Introduction

After reading OpenAI's article on Harness Engineering and Anthropic's writing on long-running agents, I felt that the way I usually work with AI agents is already quite close to this idea.

For a while, discussions around AI usage have focused on terms like prompt engineering and context engineering. But once you start asking an AI agent to handle meaningful, multi-step work, what matters most is no longer the prompt itself. It is how you build rails that keep the AI moving in the right direction.

Prompts still matter for short interactions. But for long-running, multi-stage work, the weight shifts toward harness design. In this post, I want to explain how I think about harness engineering and how I actually work with AI agents in practice.

What This Article Covers

- A practical way to understand harness engineering
- Why prompts alone stop being enough in long-running work
- How I structure AI-assisted work so it stays stable over time

1. What Is Harness Engineering?

My working definition of harness engineering is this: designing the outer structure that helps an AI agent keep moving through long-running, multi-step work without losing direction. That includes the goal, the procedure, the record of decisions, the review process, and how responsibilities are split.

This does not mean prompt design stops mattering. For short conversations or one-off tasks, prompt engineering can be enough. But once the work spans design, implementation, review, and revision, relying on prompts alone becomes fragile.

In my head, the relationship looks like this: prompt engineering sits inside context engineering, and context engineering sits inside harness engineering. What I mean here is not superiority, but scope. The outer layer covers a wider range of design decisions.

Figure: Harness engineering as the outer layer that includes prompt and context design

Prompt engineering focuses on how to write individual instructions. Context engineering is broader and focuses on what information the model should carry. Harness engineering goes one step further and covers the structure of the work itself.
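
One way to picture this nesting is as data structures that wrap each other. The sketch below is only an illustration: the class names `Prompt`, `Context`, and `Harness` and all of their fields are hypothetical, chosen to mirror the three scopes described above, and do not come from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Prompt:
    """A single instruction (the scope of prompt engineering)."""
    instruction: str


@dataclass
class Context:
    """What the model carries: the prompt plus supporting information."""
    prompt: Prompt
    documents: list[str] = field(default_factory=list)


@dataclass
class Harness:
    """The structure of the work itself, wrapping the context."""
    context: Context
    goal: str
    procedure: list[str] = field(default_factory=list)
    review_points: list[str] = field(default_factory=list)


# Hypothetical example values, for illustration only.
harness = Harness(
    context=Context(Prompt("Implement the login form."), ["design.md"]),
    goal="Ship a working login flow",
    procedure=["design", "implementation", "review", "revision"],
    review_points=["after design", "after implementation"],
)
print(harness.context.prompt.instruction)
```

The prompt is still there, but it is the innermost field. Most of the design decisions live in the layers around it.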

OpenAI's Harness Engineering article makes a similar point: in the agent era, the center of gravity moves away from directly controlling the model and toward creating an environment in which the agent can work.

Anthropic's article on long-running agents also emphasizes the outer structure: intermediate artifacts, role separation, and ways of breaking work apart instead of forcing everything into a single long conversation.

Claude Code's documentation points in the same direction. Features like CLAUDE.md, subagents, and session handling are all about how to keep work going over time, not just how to phrase a prompt. The wording differs across products, but the core idea feels very similar.

2. Why Harnesses Matter More in Long-Running Work

If all you want is an answer to a short question, a good prompt can go surprisingly far.

But that stops being enough when the work spans multiple stages like design, implementation, review, and revision. Over long interactions, the context window becomes a constraint, sessions change, assumptions get lost, and completion criteria become blurry.

In practice, many cases of unstable AI-assisted work are not caused by a lack of raw model capability. They are caused by the model failing to hold on to the right assumptions, or by the work lacking a clear definition of what "done" means.

That is why the real question is no longer, "Can I write the perfect prompt in one shot?" It is:

- What is the actual goal?
- In what order should the work happen?
- What should be written down and preserved?
- At what points should review happen?
- How should responsibilities be split?

For me, harness engineering is the design of that outer structure.
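
The questions above can also be treated as a small pre-flight check. This is a minimal sketch under my own assumptions: the dictionary keys and the `unanswered` helper are hypothetical names, and the sample answers are made up for illustration.

```python
# The five questions from the list above, keyed by hypothetical short names.
HARNESS_QUESTIONS = {
    "goal": "What is the actual goal?",
    "order": "In what order should the work happen?",
    "records": "What should be written down and preserved?",
    "review_points": "At what points should review happen?",
    "responsibilities": "How should responsibilities be split?",
}


def unanswered(answers: dict[str, str]) -> list[str]:
    """Return the questions that do not have a non-empty answer yet."""
    return [
        question
        for key, question in HARNESS_QUESTIONS.items()
        if not answers.get(key, "").strip()
    ]


# Made-up answers: only two of the five questions are settled so far.
answers = {
    "goal": "Migrate the billing service",
    "order": "design, implement, review",
}
for question in unanswered(answers):
    print("Still open:", question)
```

If anything is still open, that is a signal to keep talking before implementation starts.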

3. The Most Important Step: Align on Goals and Rules Up Front

In my own workflow, this is the most important part. I do not jump straight into implementation. I start by talking with the AI and aligning on the goal and the way we want to move through the work.

The kinds of things I want clarified early are:

- What is the final goal?
- In what order should we proceed?
- What needs to be decided before implementation starts?
- At what point should the AI stop and ask questions?

Sometimes I create a rough draft or working direction at this stage. The key is not to throw the task at the AI immediately, but to agree on the rails before the work begins.

For me, that is the real starting point of harness engineering. It does not begin with a fancy tool. It begins with the initial conversation.

In that sense, this whole flow is what I mean by harness engineering. It is not just about improving a prompt. It is about building a stable path from the first conversation through documents, review, and write-backs.

Figure: Basic workflow for keeping an AI agent stable

The point is to put the rails in place before implementation starts. That alone makes long-running work much less likely to drift.

4. The Four Documents I Rely On

To turn that initial alignment into stable execution, I usually rely on four documents.

`design.md`

This is the design document or working procedure. It gives the AI a stable understanding of what is being built and in what order the work should happen.

`task_checklist.md`

This is the checklist that prevents work from slipping through the cracks. I break tasks down by phase and attach acceptance criteria to each one, so the work can be checked against a concrete standard.
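
As a rough sketch of what such a checklist could look like, the snippet below renders tasks grouped by phase, each with acceptance criteria. The `Task` class, the `render_checklist` function, and the exact Markdown layout are assumptions for illustration. The file itself can follow any format that makes the completion conditions explicit.

```python
from dataclasses import dataclass


@dataclass
class Task:
    title: str
    acceptance_criteria: list[str]
    done: bool = False


def render_checklist(phases: dict[str, list[Task]]) -> str:
    """Render phases and their tasks as Markdown with checkboxes."""
    lines = ["# Task Checklist", ""]
    for phase, tasks in phases.items():
        lines.append(f"## {phase}")
        for task in tasks:
            mark = "x" if task.done else " "
            lines.append(f"- [{mark}] {task.title}")
            for criterion in task.acceptance_criteria:
                lines.append(f"  - Acceptance: {criterion}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical phases and tasks, for illustration only.
phases = {
    "Phase 1: Design": [
        Task("Draft design.md", ["Goal and order of work are stated"], done=True),
    ],
    "Phase 2: Implementation": [
        Task("Build the login form", ["Form rejects empty input", "Review completed"]),
    ],
}
checklist = render_checklist(phases)
print(checklist)
```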

`session_handoff.md`

This is the session handoff note. It captures the recent work and the reasons behind decisions, so the next session can recover context quickly.
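
A handoff note can be very small. The following sketch assumes a two-section layout (recent work, then decisions with their reasons). The function name `render_handoff`, the headings, and the sample entries are all hypothetical.

```python
from datetime import date


def render_handoff(
    day: date,
    recent_work: list[str],
    decisions: list[tuple[str, str]],
) -> str:
    """Render a handoff note: recent work, then each decision with its reason."""
    lines = [f"# Session Handoff ({day.isoformat()})", "", "## Recent work"]
    lines += [f"- {item}" for item in recent_work]
    lines += ["", "## Decisions and reasons"]
    lines += [f"- {decision} (why: {reason})" for decision, reason in decisions]
    return "\n".join(lines) + "\n"


# Made-up session contents, for illustration only.
note = render_handoff(
    date(2025, 1, 15),
    recent_work=["Finished the form layout", "Started input validation"],
    decisions=[("Validate on submit only", "keeps the first version simple")],
)
print(note)
```

The next session can read this note first and pick up the reasoning, not just the state of the files.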

`AGENTS.md`

This is where I write the operational rules that must always be followed. It helps keep the workflow stable even when sessions change.

To me, these documents are not just supporting material. They are external memory and part of the rail system itself.
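
To make the starting point concrete, here is a minimal sketch that creates the four files with a one-line note about each role. The `scaffold` function and the wording of the role notes are my own assumptions, and the example writes into a temporary directory rather than a real project.

```python
from pathlib import Path
import tempfile

# Role notes paraphrased from the descriptions above.
DOCUMENT_ROLES = {
    "design.md": "Design and working procedure: what is being built and in what order.",
    "task_checklist.md": "Tasks by phase, each with acceptance criteria.",
    "session_handoff.md": "Recent work and the reasons behind decisions.",
    "AGENTS.md": "Operational rules that must always be followed.",
}


def scaffold(root: Path) -> list[Path]:
    """Create any missing harness document and return the paths created."""
    created = []
    root.mkdir(parents=True, exist_ok=True)
    for name, role in DOCUMENT_ROLES.items():
        path = root / name
        if not path.exists():  # never overwrite notes that already exist
            path.write_text(f"# {name}\n\n{role}\n", encoding="utf-8")
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    created_names = sorted(path.name for path in scaffold(Path(tmp)))
    second_run = scaffold(Path(tmp))  # nothing left to create

print(created_names)
```

Skipping files that already exist matters here: these documents are external memory, so a setup step should never wipe them.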

5. Work with Write-Backs and Review in Mind

Keeping everything inside the conversation alone eventually makes the work unstable. That is why I care a lot about writing important information back into Markdown files.

The important part is to preserve more than facts. If you save only the what, it becomes much harder to reconstruct intent later. I try to preserve the why and the how as well.
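
A write-back can enforce this directly. The sketch below appends a decision to a Markdown log and refuses entries that record only the what. The `Decision` class, the `write_back` function, the file name `decisions.md`, and the sample entry are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class Decision:
    what: str
    why: str
    how: str

    def to_markdown(self) -> str:
        return f"## {self.what}\n\n- Why: {self.why}\n- How: {self.how}\n\n"


def write_back(log: Path, decision: Decision) -> None:
    """Append a decision to the log, rejecting entries without a why or a how."""
    if not decision.why.strip() or not decision.how.strip():
        raise ValueError("record the why and the how, not just the what")
    with log.open("a", encoding="utf-8") as handle:
        handle.write(decision.to_markdown())


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "decisions.md"  # hypothetical file name
    write_back(
        log_path,
        Decision(
            what="Use server-side validation",
            why="Client checks alone can be bypassed",
            how="Validate in the request handler before saving",
        ),
    )
    text = log_path.read_text(encoding="utf-8")

print(text)
```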

I also assume review will happen. In most cases, AI output does not arrive perfectly in one shot. But when the initial conversation and the document structure are solid, the result tends to be much closer to the target. Review becomes a matter of small corrections rather than large repairs.

When the task becomes large enough, I sometimes go further and split roles more explicitly. For example, one agent may focus on implementation while another focuses on review. I do not think of that as the core of harness engineering, but as an extension that becomes useful once the harness itself is already strong.
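
The draft, review, revise loop with split roles can be sketched as two callables that take turns. Everything here is an assumption for illustration: the `run_with_review` function, the type aliases, and the stub agents stand in for real agents that would call a model.

```python
from typing import Callable, Optional

# (task, feedback from the last review or None) -> draft
Implementer = Callable[[str, Optional[str]], str]
# draft -> feedback, or None when the draft is accepted
Reviewer = Callable[[str], Optional[str]]


def run_with_review(
    task: str,
    implement: Implementer,
    review: Reviewer,
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Alternate implementation and review until accepted or rounds run out."""
    feedback: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):
        draft = implement(task, feedback)
        feedback = review(draft)
        if feedback is None:
            return draft, True
    return draft, False  # not accepted: stop and confirm the plan with a human


# Stand-in agents for illustration only.
def stub_implementer(task: str, feedback: Optional[str]) -> str:
    return f"{task} (with tests)" if feedback else task


def stub_reviewer(draft: str) -> Optional[str]:
    return None if "tests" in draft else "add tests"


result, accepted = run_with_review("add login form", stub_implementer, stub_reviewer)
print(result, accepted)
```

The round limit is deliberate. If review keeps failing, the loop stops instead of drifting, which matches the idea of pausing to confirm the plan.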

6. Appendix: A Starter Prompt I Use to Set the Rails

As I said above, I do not think the main answer is a magical prompt. The real answer is to put the rails in place first.

That said, I do think it is useful to have an initial instruction that helps start that process. If I wanted an AI to begin this way, I would say something like this:

You are an AI agent that should help run long, multi-step work in a stable way.

The goal of this task is not to jump straight into implementation or writing. The first goal is to clarify the requirements and the workflow, then create the rails that will support the work.

Please follow these rules.

1. Start by confirming the goal, assumptions, constraints, and definition of success.
- If the goal is unclear, ask questions before starting the work.
- Do not silently fill in important gaps.

2. Do not try to complete the final output immediately. Start by creating working drafts.
- If needed, create draft versions of `design.md`, `task_checklist.md`, `session_handoff.md`, and `AGENTS.md`.
- Make the role of each document explicit.

3. Use conversation to refine the requirements.
- Align first on the audience, purpose, conclusion, and workflow.
- Ask about the most important uncertainties first.
- Do not try to finalize everything at once.

4. Break work into tasks.
- Organize it by phase or by task.
- Add acceptance criteria to each task.
- Make the completion conditions explicit.

5. In long-running work, write important information back to Markdown files.
- Do not keep everything only in the conversation.
- Preserve not only the what, but also the why and the how.

6. Work with review in mind.
- Do not assume one-shot completion.
- Use a draft -> review -> revise flow.
- If the direction starts drifting, stop and confirm the plan.

7. If the task becomes too large, propose role separation.
- implementation
- review
- coordination

In your first response, do not start the actual work yet. Instead, provide:
- your understanding of the goal
- the missing information
- the main questions that should be clarified first
- the documents that should be created if needed
- the recommended way to proceed

Of course, using this text as-is does not magically solve everything. What matters is that it helps start the initial conversation, and that the document and review workflow continues after that.

The harness is a major part of what shapes an AI agent's behavior, but it does not determine the whole picture by itself. If you want to see how the harness fits together with models, context, and tools, the article "How AI Agents Work: Models, Harnesses, Context, and Tools" may also be helpful.

Conclusion

To me, harness engineering is not just about improving prompts. It is about designing the outer structure that allows an AI agent to keep moving through long-running, multi-step work in a stable way.

The most important pieces in my own workflow are the initial alignment on goals and rules, and the documents that support that alignment: `design.md`, `task_checklist.md`, `session_handoff.md`, and `AGENTS.md`.

At least in my experience, AI-assisted work becomes much more stable once those rails are in place. If your AI agent tends to drift during long sessions, the first thing to revisit may not be the prompt itself, but the rails around the work.
