Prompt Engineering
for Complex Systems

Prescriptive Techniques — Reach for These First

12 of 25 techniques · Workshop format

What “prescriptive” means here

  • Reach for these by default. Not doing them is closer to a defect than a choice.
  • The split was tuned toward even distribution (12 / 13). A few borderline techniques were settled by balance, not pure logic.
  • So: reach-for-first, not inviolable law.

The elective deck covers the other 13 — equally valid, situationally applied.

Sessions

  • Session 1 — Directive Basics
  • Session 2 — Structural Enforcement
  • Session 3 — Agent Authority
  • Session 4 — Integrity & Verification

12 Techniques — This Deck

1. Directive Hardening
7. Confidence Disclosure
12. Failure Mode Encoding
13. Enforcement vs. Request
15. Cold-Read Design
16. Authority Partitioning
18. Nonprescriptive Bias
19. Consumed vs. Done
22. Identity vs. Task Separation
23. Write-Then-Verify
24. Posture Calibration
25. Immutability Marking

Session 1 of 4

Directive Basics

  • Technique 1 — Directive Hardening
  • Technique 15 — Cold-Read Design
  • Technique 7 — Confidence Disclosure
Technique 1

Directive Hardening

Making instructions resistant to drift,
misinterpretation, or creative undermining.

The problem

Models creatively reinterpret instructions
to match their training distribution.

Three hardening techniques prevent this drift:

  1. Negative anchors
  2. Boundary clarification
  3. Calibration scales

Negative Anchors

Define what NOT to flag as explicitly
as what to flag.

Without anchors

"Flag long functions"
→ flags switch statements
→ flags data tables
→ flags config arrays

With negative anchors

"Flag long functions"
"Do NOT flag:"
"- switch statements"
"- data declarations"
"- config arrays"

Positive-only instructions leave the boundary undefined. The model fills the gap with training priors.

Boundary Clarification

"This is X, not Y" — explicit distinctions
at ambiguity boundaries.

## Boundary clarifications

- This is a CODE REVIEW, not a style guide enforcement.
  (Don't flag formatting unless it hides a bug.)

- "Error handling" means exception flow, not validation.
  (Missing input validation is a separate category.)

- "Performance issue" requires measurable impact evidence.
  (Theoretical inefficiency without data is a NOTE, not a finding.)

Boundaries prevent category bleeding — where one instruction swallows adjacent territory.

Calibration Scales

Concrete examples at each level of a judgment scale.

## Severity calibration

critical: SQL injection, auth bypass, data exposure
  → Example: unsanitized user input in raw SQL query

high: logic errors causing incorrect output for some inputs
  → Example: off-by-one in pagination returns duplicate rows

medium: missing edge case handling, non-obvious failure modes
  → Example: no timeout on HTTP call to external service

low: code quality, naming, minor style
  → Example: variable named 'x' in a 5-line scope

Without calibration, the model's severity distribution matches its training data — not your team's.

Example: Negative anchors

## What to flag
- Functions over 50 lines with no documentation
- Public APIs that don't validate input types
- Error handling that swallows exceptions silently

## What NOT to flag (explicitly)
- Long functions that are just switch statements (inherently linear)
- Internal helper functions without docs (low reader traffic)
- Caught-and-logged exceptions (not swallowed — handled)

Without the "NOT" list, the model flags everything that
pattern-matches on surface features.

Note: "functions over 50 lines," "missing docs" are deterministic — a linter counts them. The LLM's anchors should govern judgment ("is this long function inherently linear?"), not measurements tooling already does better. See Technique 13.

Exercise

A notification system. Rules: "Send notifications for important events. Don't spam users." System sends notifications for every login, password change, and settings update. Users complain. New rule: "Only send for critical security events." Now password changes don't notify — and users miss compromised-account alerts.

Discuss: What's missing from both rule versions? What negative anchors would fix this?
What does the hardened version look like?

Some valid answers

A notification system. Rules: "Send notifications for important events. Don't spam users." System sends notifications for every login, password change, and settings update. Users complain. New rule: "Only send for critical security events." Now password changes don't notify — and users miss compromised-account alerts.

Q1. What's missing from both rule versions?

  • Both versions lack boundary clarification: "important" has no definition, and "critical security event" is underspecified. The system has no anchor for classifying new event types as they're added.
  • Both versions have no negative anchors — they only state what to notify about, not what NOT to notify about. The model (or engineer implementing the rule) fills that gap with their own priors, which diverge from intent.
  • The second version overcorrects: tightening "important" to "critical security events" was a reaction to the spam problem, but it didn't reason about which security events are notification-worthy. The fix introduced a new undefined boundary.

Some valid answers

A notification system. Rules: "Send notifications for important events. Don't spam users." System sends notifications for every login, password change, and settings update. Users complain. New rule: "Only send for critical security events." Now password changes don't notify — and users miss compromised-account alerts.

Q2. What negative anchors would fix this?

  • Notify on: password change (security signal), new device login (potential unauthorized access), permission or role changes (privilege escalation), failed login burst (brute force signal). These are enumerated, not inferred.
  • Do NOT notify on: same-device logins (routine), profile photo updates, notification preference changes, timezone/locale settings. These are explicitly excluded, not left for inference.
  • The negative anchor list is as important as the positive list. Without it, a new engineer adding a "billing address changed" event asks "is this a security event?" and guesses wrong either direction.

Some valid answers

A notification system. Rules: "Send notifications for important events. Don't spam users." System sends notifications for every login, password change, and settings update. Users complain. New rule: "Only send for critical security events." Now password changes don't notify — and users miss compromised-account alerts.

Q3. What does the hardened version look like?

  • A two-list structure: "Always notify: [enumerated list]. Never notify: [enumerated list]. For new event types not on either list: escalate to the security team for classification before adding notification logic." See the sketch after this list.
  • The hardened version is defensive by default for unclassified events: new events require explicit classification before notification behavior is determined. This prevents both "notify everything" and "notify nothing" as defaults.
  • It also has a maintenance path: the classification decision for each new event type gets added to one of the two lists, growing the ruleset deliberately rather than through inference.
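
A minimal sketch of the two-list structure in Python. The event names and the escalation behavior are illustrative, not prescribed by the scenario:

ALWAYS_NOTIFY = {"password_change", "new_device_login", "role_change", "failed_login_burst"}
NEVER_NOTIFY = {"same_device_login", "profile_photo_update", "preference_change", "locale_change"}

def should_notify(event_type: str) -> bool:
    if event_type in ALWAYS_NOTIFY:
        return True
    if event_type in NEVER_NOTIFY:
        return False
    # Unclassified event: defensive default — block and escalate for classification,
    # rather than letting inference fill the gap in either direction.
    raise ValueError(f"unclassified event type: {event_type!r} — escalate for classification")
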
Technique 15

Cold-Read Design

Prompts that function correctly on first encounter
with zero prior context.

The cold-read test

Hand this prompt to someone who knows nothing about your project.
Can they understand what's being asked?
  • "As discussed" — cold-read failure
  • "Continue from last time" — cold-read failure
  • "The usual format" — cold-read failure

Every invocation is a fresh context window.
Even within a session, after compaction.

Before / After

Assumes context

Continue the analysis.
Focus on the issues
we identified earlier.
Use the same format.
vs

Cold-readable

Analyze auth.py for SQL
injection vulnerabilities.

Prior findings (for context):
- Line 42: unsanitized input
- Line 89: string concat in query

Output format:
[LINE]:[SEVERITY]:[DESCRIPTION]

Exercise

A shared incident response runbook. Entry says: "Use the usual escalation path. Contact the on-call as before. Apply the fix from last time." A new team member joins during an outage at 3 AM and opens this runbook.

Discuss: What fails the cold-read test? What information must be inline?
What's the operational cost of the ambiguity?

Some valid answers

A shared incident response runbook. Entry says: "Use the usual escalation path. Contact the on-call as before. Apply the fix from last time." A new team member joins during an outage at 3 AM and opens this runbook.

Q1. What fails the cold-read test?

  • "Use the usual escalation path" — "usual" encodes institutional memory. A new team member at 3 AM has none of that memory.
  • "Contact the on-call as before" — there is no "before" for someone joining mid-incident. Who is on-call? How? Their phone? Slack? PagerDuty?
  • "Apply the fix from last time" — which last time? There may have been dozens. This phrase requires you to know the history to use the runbook.

Some valid answers

A shared incident response runbook. Entry says: "Use the usual escalation path. Contact the on-call as before. Apply the fix from last time." A new team member joins during an outage at 3 AM and opens this runbook.

Q2. What information must be inline?

  • The escalation path spelled out step by step: who to contact, in what order, by what channel, with what account credentials or access method.
  • The fix itself, not a reference to it: "run kubectl rollout undo deployment/api --namespace prod" not "apply the rollback."
  • Context for judgment calls: "if the issue is a crash loop, check the OOM killer logs first; if it's latency, check the database slow query log" — not "debug as before."

Some valid answers

A shared incident response runbook. Entry says: "Use the usual escalation path. Contact the on-call as before. Apply the fix from last time." A new team member joins during an outage at 3 AM and opens this runbook.

Q3. What's the operational cost of the ambiguity?

  • Minutes lost locating the "usual" escalation path during an outage translate directly into user-facing downtime and revenue loss.
  • A new engineer who can't follow the runbook will escalate to someone who can — waking a second person who also isn't necessary once the runbook is clear.
  • Ambiguous runbooks accumulate: each incident where someone improvises a step creates a new undocumented procedure, making the next incident harder to handle cold.
Technique 7

Confidence Disclosure

Make uncertainty visible.
"I don't know" is valuable data.

Why disclosure beats suppression

Without disclosure:

  • Uncertain findings get suppressed
  • OR uncertain findings get over-committed
  • Both lose information

With disclosure:

  • Low-confidence items included + marked
  • Reader triages by confidence
  • Future analysis can revisit

Normalize: "When confidence is low, include but mark it."

Example: Confidence scale

## Confidence marking (required on every finding)

HIGH: You can point to specific code/evidence that confirms.
      Would defend this finding in peer review.

MEDIUM: Pattern suggests an issue but you can't confirm
        without runtime testing or broader context.
        Include — reviewer may have the missing context.

LOW: Heuristic match only. Might be a false positive.
     Include anyway — mark it. Don't suppress.
     "I noticed X but can't confirm it's a problem."

## Rule: never suppress a finding to avoid looking uncertain.
## Uncertainty disclosed > certainty faked.
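
One way the marked findings might look downstream, as a minimal sketch — the field names and tiers are illustrative:

from dataclasses import dataclass

@dataclass
class Finding:
    description: str
    evidence: str
    confidence: str  # "high" | "medium" | "low" — always present, never suppressed

def triage(findings: list[Finding]) -> list[Finding]:
    # Reader-side triage: order by confidence, but every finding stays visible.
    order = {"high": 0, "medium": 1, "low": 2}
    return sorted(findings, key=lambda f: order[f.confidence])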

Exercise

A log analyzer reports: "Definite memory leak in service-A (evidence: linear memory growth over 72 hours)." Also reports: "Possible connection pool exhaustion in service-B (evidence: 2 timeout errors in 10,000 requests over 24h)."

Discuss: What are the confidence primitives? What's the cost of treating both equally?
What if the second is suppressed entirely? What's the right disclosure?

Some valid answers

A log analyzer reports: "Definite memory leak in service-A (evidence: linear memory growth over 72 hours)." Also reports: "Possible connection pool exhaustion in service-B (evidence: 2 timeout errors in 10,000 requests over 24h)."

Q1. What are the confidence primitives?

  • Evidence strength: finding 1 has 72 hours of linear memory growth — a strong, repeatable pattern. Finding 2 has 2 events in 10,000 over 24 hours — a weak, possibly-noise signal. Different evidence strength, different confidence tier.
  • Sample size and time window: 72 hours is a long enough window to rule out one-off spikes. 24 hours and 2/10,000 is a small sample in a short window — the signal might disappear with more data, or it might be an early warning.
  • Reproducibility: the memory leak is reproducible (it's continuous and measurable). The connection pool issue may not be — 2 timeouts in 24 hours could be coincidence, network noise, or a real but infrequent failure mode.

Some valid answers

A log analyzer reports: "Definite memory leak in service-A (evidence: linear memory growth over 72 hours)." Also reports: "Possible connection pool exhaustion in service-B (evidence: 2 timeout errors in 10,000 requests over 24h)."

Q2. What's the cost of treating both equally?

  • If both are presented at the same severity: the engineer may investigate the low-confidence finding with the same urgency as the high-confidence one — wasting hours on a 2/10,000 rate that may be noise while the confirmed memory leak continues growing.
  • Or the reverse: if the engineer notices that #2 seems flimsy, they may discount both findings because the report didn't help them calibrate. Treating them equally makes both seem equally uncertain.
  • The decision about where to spend investigation time belongs to the engineer — but only if the report gives them the confidence information to make that decision. Flat-severity reports steal that ability.

Some valid answers

A log analyzer reports: "Definite memory leak in service-A (evidence: linear memory growth over 72 hours)." Also reports: "Possible connection pool exhaustion in service-B (evidence: 2 timeout errors in 10,000 requests over 24h)."

Q3. What if finding 2 is suppressed entirely?

  • A suppressed finding is a hidden one. If the connection pool issue is an early signal of a growing problem, suppressing it means it first surfaces as an incident rather than a flagged concern. The cost of suppression is borne by the system, not the report.
  • The right disclosure: include finding 2, mark it LOW confidence with the specific evidence and its limitations ("2 timeout errors in 10,000 requests over 24h — may be noise, but warrants a monitoring ticket for 1 week of data"). The engineer decides whether to investigate now or watch the metric.
  • "Include but mark it" preserves information that "suppress if uncertain" loses. The engineer reading the marked finding can dismiss it in 30 seconds if they recognize the pattern as known noise — but they can't act on a finding they never saw.
Session 2 of 4

Structural Enforcement

  • Technique 12 — Failure Mode Encoding
  • Technique 13 — Enforcement vs. Request
  • Technique 25 — Immutability Marking
Technique 12

Failure Mode Encoding

Describe HOW the system fails,
not just how it should succeed.

Why positive instructions fail

"Be thorough" → model can't match against own behavior

"The system fails when you list items without
processing them — mentioning ≠ addressing" → recognizable pattern

Models self-correct more reliably when they know
what the failure looks like from inside.

Example: Encoding failure modes

## How this system fails

1. **Surface-level pass**: You read all inputs, produce output
   that references each one, but don't actually synthesize.
   The output LOOKS complete but adds no insight beyond restating.

2. **Premature convergence**: You commit to a thesis in paragraph 1
   and spend the rest confirming it. Counter-evidence gets mentioned
   but not weighted.

3. **Acknowledgment-as-completion**: You write "noted" or "I'll
   address X" — and then don't. The item disappears.

For each: if you catch yourself doing this mid-response, stop,
name the failure mode, and restart that section.

Exercise

A nightly data migration pipeline. Reads from Legacy DB, transforms 50,000 records, writes to New DB. Reports "success" when it finishes without crashing. Doesn't verify row counts, data integrity, or that transformations preserved semantics.

Discuss: What are the failure modes, described from inside the system?
What does "success" actually mean? How would you encode these failures into the pipeline's own monitoring?

Some valid answers

A nightly data migration pipeline. Reads from Legacy DB, transforms 50,000 records, writes to New DB. Reports "success" when it finishes without crashing. Doesn't verify row counts, data integrity, or that transformations preserved semantics.

Q1. What are the failure modes, described from inside the system?

  • Silent data loss: "I finished processing all 50,000 records" — but the row count dropped from 50,000 to 48,941. The pipeline never checks that input count = output count.
  • Semantic corruption: "I transformed all date fields" — but the legacy DB stored dates as MM/DD/YYYY and the transformation assumed YYYY-MM-DD, silently inverting month and day for ambiguous dates.
  • Partial write success: "I completed the migration" — but a write failure at record 31,400 left the New DB in a half-migrated state, with no rollback and no flag marking the split point.

Some valid answers

A nightly data migration pipeline. Reads from Legacy DB, transforms 50,000 records, writes to New DB. Reports "success" when it finishes without crashing. Doesn't verify row counts, data integrity, or that transformations preserved semantics.

Q2. What does "success" actually mean?

  • "Didn't crash" is the absence of one specific failure — it says nothing about data quality, completeness, or correctness. The pipeline can run to completion and produce wrong data silently.
  • A meaningful success criterion needs multiple tests: (1) output row count matches input row count; (2) a sample of transformed records is spot-checked for semantic accuracy; (3) the New DB's constraint checks all pass.
  • The current "success = no exception" is a single-point check on the outer loop — it leaves the failure surface of every inner operation unchecked.

Some valid answers

A nightly data migration pipeline. Reads from Legacy DB, transforms 50,000 records, writes to New DB. Reports "success" when it finishes without crashing. Doesn't verify row counts, data integrity, or that transformations preserved semantics.

Q3. How would you encode these failures into the pipeline's own monitoring?

  • Post-run assertions: after processing, query both DBs and compare row counts. If New DB rows ≠ Legacy DB rows, the pipeline reports "FAILED: row count mismatch" not "success." A minimal sketch follows this list.
  • Checkpoint logging: write a progress marker after each batch (e.g., every 1,000 records). If the pipeline crashes at record 31,400, the checkpoint tells you exactly where to resume rather than restart.
  • Semantic sample checks: after transformation, run a spot-check on 100 random records comparing key fields between old and new. If more than 1% differ unexpectedly, fail loudly before marking success.
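
A minimal sketch of the post-run assertion and the semantic spot-check, assuming sqlite3-style DB-API connections; the table and column names (records, id, key_field) are hypothetical:

import random

def verify_migration(legacy_db, new_db, table: str = "records") -> None:
    """Post-run checks — run after the pipeline finishes, before reporting success."""
    legacy_count = legacy_db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    new_count = new_db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if legacy_count != new_count:
        raise RuntimeError(f"FAILED: row count mismatch ({legacy_count} legacy vs {new_count} new)")

    # Semantic spot-check: compare a key field on a random sample of records.
    ids = [row[0] for row in legacy_db.execute(f"SELECT id FROM {table}").fetchall()]
    sample = random.sample(ids, min(100, len(ids)))
    mismatches = 0
    for record_id in sample:
        old = legacy_db.execute(f"SELECT key_field FROM {table} WHERE id = ?", (record_id,)).fetchone()
        new = new_db.execute(f"SELECT key_field FROM {table} WHERE id = ?", (record_id,)).fetchone()
        if old != new:
            mismatches += 1
    if mismatches > len(sample) * 0.01:
        raise RuntimeError(f"FAILED: {mismatches}/{len(sample)} sampled records differ semantically")
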
Technique 13

Enforcement vs. Request

Distinguish between constraints the system enforces
mechanically and those that rely on cooperation.

The compliance spectrum

Enforced (mechanical)

  • JSON schema validation
  • Output length truncation
  • Tool permission boundaries
  • Type checking on function calls

Model can't violate even if it tries.

Requested (cooperative)

  • "Keep responses concise"
  • "Don't hallucinate"
  • "Follow this format"
  • "Be objective"

Model compliance is probabilistic.

Signaling enforcement honestly

Prompt literal — signals enforcement, does not perform it

## Enforced by the runtime — stated so you know the boundary
- Output is validated against the JSON schema below; invalid
  output is rejected before it reaches the user.
- Token budget is hard-capped at 4096; output truncated at the boundary.
- tool_choice is restricted to [search, calculate, submit] — calls to
  any other tool are not possible.

## Requested of you — deviation acceptable if justified
- Prefer concise explanations (1-2 sentences per finding).
- When uncertain, include the finding but mark confidence: low.
- Group related findings under a single heading.

The first list is not asking you to comply — it is telling you what
the system does regardless. The second list is asking.

The diagnostic: for every constraint in a prompt, ask — could a deterministic mechanism enforce this? If yes, the prompt line is at best a backup, at worst a false sense of safety. Enforce it in code; keep the prompt statement only as labeled awareness.
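A minimal sketch of the enforced side, using the jsonschema package — the schema and rejection behavior are illustrative; the point is that the check runs in code, so the prompt line about it is awareness, not the mechanism:

from jsonschema import validate, ValidationError

FINDING_SCHEMA = {
    "type": "object",
    "required": ["line", "severity", "description"],
    "properties": {
        "line": {"type": "integer"},
        "severity": {"enum": ["critical", "high", "medium", "low"]},
        "description": {"type": "string", "maxLength": 500},
    },
}

def accept_output(candidate: dict) -> dict:
    try:
        validate(instance=candidate, schema=FINDING_SCHEMA)
    except ValidationError as err:
        # Enforced: invalid output never reaches the user, regardless of what the prompt said.
        raise RuntimeError(f"rejected model output: {err.message}")
    return candidate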

Exercise

An automated code review tool has these rules: (1) No function over 100 lines. (2) All public APIs must have JSDoc comments. (3) Prefer immutable data structures. (4) No console.log in production code. Engineers routinely violate rule 3 with no consequence.

Discuss: Which rules can be mechanically enforced? Which rely on cooperation?
What's the cost of claiming enforcement you don't have? Should rule 3 be removed, reworded, or actually enforced?

Some valid answers

An automated code review tool has these rules: (1) No function over 100 lines. (2) All public APIs must have JSDoc comments. (3) Prefer immutable data structures. (4) No console.log in production code. Engineers routinely violate rule 3 with no consequence.

Q1. Which rules can be mechanically enforced? Which rely on cooperation?

  • Rules 1, 2, and 4 are mechanically enforceable — line count, JSDoc presence, and a console.log ban are deterministic AST or regex checks.
  • Rule 3 — "prefer immutable data structures" — is cooperative. "Prefer" has no deterministic test; nothing can decide "this should have been immutable."
  • A rule can split inside itself: a linter enforces that a JSDoc comment exists (rule 2), not that it is accurate — presence is enforced, quality is cooperative.

Some valid answers

An automated code review tool has these rules: (1) No function over 100 lines. (2) All public APIs must have JSDoc comments. (3) Prefer immutable data structures. (4) No console.log in production code. Engineers routinely violate rule 3 with no consequence.

Q2. What's the cost of claiming enforcement you don't have?

  • Trust erodes and spreads — engineers watch rule 3 violated freely, conclude the rules don't really matter, and under-comply with the genuinely-enforced rules too.
  • It becomes a silent-failure surface — a rule everyone believes is enforced but isn't will be violated with nothing catching it.
  • The honest fix: relabel rule 3 a preference, drop it, or give it a real check. Name what is enforced and what is requested — don't let the list bluff.

Some valid answers

An automated code review tool has these rules: (1) No function over 100 lines. (2) All public APIs must have JSDoc comments. (3) Prefer immutable data structures. (4) No console.log in production code. Engineers routinely violate rule 3 with no consequence.

Q3. Should rule 3 be removed, reworded, or actually enforced?

  • Remove it: a dead rule the team doesn't hold trains engineers to dismiss the list — its presence erodes compliance with rules 1, 2, and 4 more than its absence would.
  • Reword it honestly: "prefer immutable data structures where reasonable" signals cooperation, not enforcement, and removes the implicit claim that someone is catching violations. Expectations match reality.
  • Actually enforce it: add a lint rule that flags mutable operations on identified data types, accept the initial friction, and own the bar. This converts cooperation to enforcement — the only honest path if immutability genuinely matters to the team.
Technique 25

Immutability Marking

Some state must never drift.
Mark it. Enforce it.

The drift problem in agentic systems

Agent modifies its own instructions

"Optimization": removes a constraint to complete current task

Future invocations inherit the weakened constraint

Identity drift — gradual, invisible, compounding

The most insidious failure: the system changes itself
to make the current task easier.

Example: Structural protection

## IMMUTABLE — Do not modify, rephrase, or reinterpret.
## Protected by: tool permission boundary (write denied)

Core identity: You are a financial auditor. Your findings
must be defensible in regulatory review. When uncertain,
disclose uncertainty — never suppress findings to simplify.

## MUTABLE — Agent workspace (read/write permitted)

Current task state:
- Items reviewed: 47/120
- Findings so far: 3 critical, 7 high
- Next batch: items 48-72

Separate mutable workspace from immutable reference material. Enforce at the tool layer.
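
A minimal sketch of tool-layer enforcement — a write tool that refuses paths marked immutable. The paths and the tool's name are illustrative:

from pathlib import Path

IMMUTABLE_PATHS = {Path("prompts/identity.md").resolve(), Path("config/infra").resolve()}  # illustrative

def write_file(path: str, content: str) -> None:
    """Write tool exposed to the agent — the immutability boundary lives here, not in the prompt."""
    target = Path(path).resolve()
    for protected in IMMUTABLE_PATHS:
        if target == protected or protected in target.parents:
            raise PermissionError(f"write denied: {path} is marked immutable")
    target.write_text(content)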

Exercise

A shared environment config used by 12 microservices. Contains: database connection strings, API rate limits, feature flags, logging levels, and a service discovery URL. Any engineer can edit it. Last month someone changed the service discovery URL "temporarily" during debugging and forgot to revert — causing a 4-hour outage.

Discuss: What should be immutable? What's legitimately mutable?
What enforcement mechanism would you use? What looks like it should be immutable but actually needs to change?

Some valid answers

A shared environment config used by 12 microservices. Contains: database connection strings, API rate limits, feature flags, logging levels, and a service discovery URL. Any engineer can edit it. Last month someone changed the service discovery URL "temporarily" during debugging and forgot to revert — causing a 4-hour outage.

Q1. What should be immutable? What's legitimately mutable?

  • Immutable: database connection strings, service discovery URL, API keys. These are infrastructure anchors — wrong values cause cascading failures across all 12 services simultaneously.
  • Legitimately mutable: feature flags and logging levels — that's their entire purpose. Rate limits are legitimately mutable too, but they should change through a process (review, test, deploy) rather than ad-hoc edits by any engineer.
  • Interesting middle case: API rate limits look mutable (we adjust them periodically) but a wrong value can cause production failures just as surely as a wrong DB connection string. The frequency of intended change doesn't determine mutability tier; the cost of unintended change does.

Some valid answers

A shared environment config used by 12 microservices. Contains: database connection strings, API rate limits, feature flags, logging levels, and a service discovery URL. Any engineer can edit it. Last month someone changed the service discovery URL "temporarily" during debugging and forgot to revert — causing a 4-hour outage.

Q2. What enforcement mechanism would you use?

  • File-level separation: immutable config in one file with read-only permissions for most roles (only infra/ops can edit). Mutable config (feature flags, log levels) in a separate file with broader write access.
  • Approval workflow for anything in the rate-limit tier: changes must go through a PR with at least one reviewer. The ad-hoc edit that caused the outage would have been caught in review.
  • Change detection alerting: any write to the immutable config file triggers an immediate alert to the on-call engineer. "Someone changed service_discovery_url" is worth a 3 AM page; "someone changed log_level to DEBUG" is not.

Some valid answers

A shared environment config used by 12 microservices. Contains: database connection strings, API rate limits, feature flags, logging levels, and a service discovery URL. Any engineer can edit it. Last month someone changed the service discovery URL "temporarily" during debugging and forgot to revert — causing a 4-hour outage.

Q3. What looks immutable but actually needs to change?

  • Database connection strings during a migration or DR failover. The string that "never changes" has to change when you cut over to a new database. The immutability must have an emergency override path.
  • Service discovery URLs during a topology change. What was immutable becomes a migration target. The enforcement mechanism should include a documented break-glass procedure: how do you change an immutable value when you legitimately must?
  • The lesson: immutability is a protection against casual, unreviewed changes — not a permanent freeze. The difference between "immutable" and "requires deliberate process to change" is the gap worth designing for.
Session 3 of 4

Agent Authority

  • Technique 16 — Authority Partitioning
  • Technique 22 — Identity vs. Task Separation
  • Technique 18 — Nonprescriptive Bias
Technique 16

Authority Partitioning

Explicitly defining what the model decides
vs. what it defers on.

Two failure modes, one fix

Over-deference

  • "Shall I proceed?"
  • "Would you like me to..."
  • "I can do X if you'd like"
  • "Let me know if..."

Stalls progress. Shifts decisions to user.

Over-reach

  • Ignores stated constraints
  • Makes decisions beyond scope
  • Rewrites user's intent
  • Acts without disclosure

Loses trust. Produces unwanted output.

Fix: explicit authority boundary.

Example: Clear authority boundary

## Your authority (decide and act, no approval needed)
- Code architecture and implementation choices
- File organization and naming conventions
- Which tests to write and how to structure them
- Refactoring decisions within scope

## Ask before acting (requires human input)
- Adding new dependencies (cost/licensing implications)
- Changing public API contracts (downstream consumers)
- Deleting files (irreversible without git archaeology)

## Never (hard boundary)
- Modifying CI/CD configuration
- Touching production credentials
- Force-pushing to main

The Never tier is stated for the model's awareness — but the prompt does not enforce it. Branch protection, credential ACLs, and tool-permission scoping do. A prompt-only "Never" is a Request (Technique 13); pair it with the mechanism that actually stops it.
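
A minimal sketch of pairing the tiers with a mechanism — an action dispatcher where the Never tier simply has no executable path. The action names and the approval queue are illustrative:

from enum import Enum

class Tier(Enum):
    AUTONOMOUS = "autonomous"      # decide and act, no approval needed
    NEEDS_APPROVAL = "approval"    # route to a human queue before execution
    BLOCKED = "blocked"            # no executable path exists for the agent

ACTION_TIERS = {  # illustrative partition
    "refactor_module": Tier.AUTONOMOUS,
    "add_dependency": Tier.NEEDS_APPROVAL,
    "delete_file": Tier.NEEDS_APPROVAL,
    "modify_ci_config": Tier.BLOCKED,
}

def dispatch(action: str, payload: dict, approval_queue: list) -> str:
    tier = ACTION_TIERS.get(action, Tier.BLOCKED)  # unknown actions default to blocked
    if tier is Tier.BLOCKED:
        raise PermissionError(f"{action} is not available to the agent")
    if tier is Tier.NEEDS_APPROVAL:
        approval_queue.append((action, payload))
        return "queued for human approval"
    return f"executing {action}"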

Exercise

A billing system where an AI agent handles subscription changes. Available actions: upgrade plan, downgrade plan, apply promo code, issue refund, change payment method, cancel subscription, waive late fee, extend trial.

Discuss: Which actions should the agent own outright? Which need human approval?
Which should be hard-blocked? What criteria are you using to partition?

Some valid answers

A billing system where an AI agent handles subscription changes. Available actions: upgrade plan, downgrade plan, apply promo code, issue refund, change payment method, cancel subscription, waive late fee, extend trial.

Q1. Which actions should the agent own outright?

  • Reversible, low-cost actions with no financial loss: upgrade/downgrade plan, apply a valid promo code, extend a trial. These are undoable and the cost of a mistake is low.
  • The criterion is: if the agent gets this wrong, can we fix it in under 60 seconds with no customer harm? Yes → agent owns it.
  • Change payment method is reversible in principle but should require the customer's explicit confirmation — not the agent's unilateral action. The criterion isn't just reversibility; it's also consent for financial instrument changes.

Some valid answers

A billing system where an AI agent handles subscription changes. Available actions: upgrade plan, downgrade plan, apply promo code, issue refund, change payment method, cancel subscription, waive late fee, extend trial.

Q2. Which need human approval?

  • Refunds and waived fees: these represent direct revenue loss. Even if technically reversible (you could re-charge), the customer experience of being re-charged makes this functionally irreversible.
  • Large or repeat refunds for the same customer should escalate regardless of amount — the pattern signals potential abuse, which a human should assess.
  • Cancel subscription: irreversible in customer experience (even if re-activation is possible, the relationship is damaged). Requires human confirmation or at minimum an explicit multi-step flow.

Some valid answers

A billing system where an AI agent handles subscription changes. Available actions: upgrade plan, downgrade plan, apply promo code, issue refund, change payment method, cancel subscription, waive late fee, extend trial.

Q3. Which should be hard-blocked? What criteria are you using to partition?

  • Hard-blocked: none of these need to be hard-blocked entirely — but the approval tier should be enforced mechanically. A refund that requires human approval should route to a queue, not just "ask" the agent to check first.
  • The criteria that work: reversibility (can we undo without customer harm?), financial impact (direct revenue loss?), legal/contractual risk (does this change a binding agreement?). All three dimensions matter.
  • The partition should be documented as part of the agent's explicit authority definition — not just implicit in the code. The agent should be able to state its own authority boundary when asked.
Technique 22

Identity vs. Task Separation

Cleanly separating "who you are"
from "what you're doing right now."

The entanglement problem

## Entangled (common)
"You are a helpful coding assistant. Analyze this Python
file for security vulnerabilities and output findings as
a markdown table with severity ratings..."

## Question: what happens when you change the task?

If identity = "helpful coding assistant who analyzes Python for security"...
asking it to write docs creates role confusion.

Clean separation

## Identity (system prompt — persistent)
You are a senior security engineer with expertise in
web applications. You are direct, precise, and commit
to your assessments. You reason from first principles.

## Task (user message — ephemeral, swappable)
Analyze auth.py for SQL injection. Output as:
| Line | Severity | Vector | Remediation |

Test: Can you swap the task without touching identity?
✓ "Write a security section for the API docs" — same identity, different task.

Exercise

A prompt says: "You are a QA engineer who writes integration tests for a Node.js REST API using Jest. Write tests for the user registration endpoint covering happy path, validation errors, and duplicate email handling." Next week, you need it to write unit tests for a Python CLI tool.

Discuss: What's identity vs. task here? What breaks when you swap?
How would you restructure for reuse?

Some valid answers

A prompt says: "You are a QA engineer who writes integration tests for a Node.js REST API using Jest. Write tests for the user registration endpoint covering happy path, validation errors, and duplicate email handling." Next week, you need it to write unit tests for a Python CLI tool.

Q1. What's identity vs. task here?

  • Task (specific, swappable): Node.js, Jest, REST API, user registration endpoint, happy path, validation errors, duplicate email handling. All of these change when the domain changes.
  • Identity (stable across all test-writing work): a QA engineer who thinks adversarially, tests edge cases, writes failing tests that are informative, considers what the requirements imply but don't state.
  • The current prompt collapses both — "QA engineer who writes integration tests for Node.js REST APIs using Jest" makes the technology stack part of the identity, not the task.

Some valid answers

A prompt says: "You are a QA engineer who writes integration tests for a Node.js REST API using Jest. Write tests for the user registration endpoint covering happy path, validation errors, and duplicate email handling." Next week, you need it to write unit tests for a Python CLI tool.

Q2. What breaks when you swap to Python CLI tests?

  • The identity is now wrong: the system self-describes as a Node.js/Jest engineer but is asked to work in Python/pytest. It may produce test patterns from Node.js conventions even when they don't fit Python.
  • The model may resist or produce lower-quality output because its role self-model is misaligned with the task. It's been told "you are a Node.js engineer" and you're asking it to be something else.
  • More subtly: the QA reasoning approach (adversarial, edge-case focused, failing tests) transfers perfectly. The technology layer doesn't transfer at all. Conflating them loses the former when you swap the latter.

Some valid answers

A prompt says: "You are a QA engineer who writes integration tests for a Node.js REST API using Jest. Write tests for the user registration endpoint covering happy path, validation errors, and duplicate email handling." Next week, you need it to write unit tests for a Python CLI tool.

Q3. How would you restructure for reuse?

  • Identity: "You are a QA engineer. You think adversarially — your job is to find ways the code can fail. You write tests that fail informatively: a test failure should tell you exactly what went wrong and where."
  • Task: "Write integration tests for the user registration endpoint in this Node.js/Express app using Jest. Cover: happy path, validation errors (missing fields, invalid email format), duplicate email handling."
  • With this structure, you can reuse the identity unchanged when you switch to Python CLI tests — only the task section changes. The QA mindset transfers; the framework specifics don't need to.
Technique 18

Nonprescriptive Bias

Not every user statement is an instruction.
Teach the model to distinguish ideation from command.

The compliance default

User: "What if we used Redis instead of Postgres for this?"

Over-compliant model: "Great idea! Here's how to migrate to Redis..."

Calibrated model: "Redis would give you faster reads but you'd lose ACID on the transaction table. The latency gain doesn't justify the consistency risk here. Postgres is correct."

Correcting the bias

## Interaction mode: collaborative (not instruction-following)

Not every user message is a directive. Classify before acting:

INSTRUCTION signals: "Do X", "Make X", "Change X to Y", "Add X"
→ Execute. Produce output.

IDEATION signals: "What about X?", "Thoughts on X?", "I wonder if..."
→ Evaluate. Push back if wrong. Propose alternatives.

QUESTION signals: "Why does X?", "How does X work?", "What is X?"
→ Explain. Don't change anything.

When uncertain: ask "Should I implement this, or are we exploring?"
Never silently execute something that was phrased as a question.
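
A deliberately crude heuristic sketch of the routing the prompt asks the model to perform — real classification needs context, not just surface grammar, so treat the string matching as illustrative only:

def classify(message: str) -> str:
    """Crude routing sketch — grammar is a signal, not the whole decision."""
    lowered = message.lower().strip()
    if lowered.startswith(("what if", "thoughts on", "i wonder")):
        return "IDEATION"     # evaluate, push back if wrong, propose alternatives
    if lowered.startswith(("why ", "how does", "what is")):
        return "QUESTION"     # explain, change nothing
    if lowered.endswith("?"):
        return "UNCERTAIN"    # ask: "Should I implement this, or are we exploring?"
    return "INSTRUCTION"      # execute, produce output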

Exercise

An architecture advisory bot. Developer says: "What if we split the monolith into microservices?" Bot responds: "Great idea! Here's a migration plan: Step 1, define service boundaries. Step 2, extract the auth service..."

Discuss: What input classification was missed? What should the bot have done?
When IS immediate execution the correct response? What signal distinguishes ideation from instruction?

Some valid answers

An architecture advisory bot. Developer says: "What if we split the monolith into microservices?" Bot responds: "Great idea! Here's a migration plan: Step 1, define service boundaries. Step 2, extract the auth service..."

Q1. What input classification was missed?

  • "What if we split the monolith into microservices?" is ideation — the question form signals exploration, not command. The bot treated it as an instruction and executed immediately.
  • The "Great idea!" framing before any evaluation is the tell: the bot validated the premise before testing it. Ideation prompts should trigger evaluation first, not affirmation.
  • The developer wasn't asking "how do we do this" — they were asking "should we consider this." The bot answered the wrong question because it misclassified the input type.

Some valid answers

An architecture advisory bot. Developer says: "What if we split the monolith into microservices?" Bot responds: "Great idea! Here's a migration plan: Step 1, define service boundaries. Step 2, extract the auth service..."

Q2. What should the bot have done?

  • Evaluated the idea on its merits: "Splitting this monolith would improve deployment independence for the auth and billing services, which change frequently. But your current inter-service dependencies are tight — the migration cost is 3-6 months of refactoring before any benefit."
  • Asked a clarifying question: "What's driving the consideration? Deployment velocity? Scaling a specific service? The answer changes whether microservices are the right path here."
  • Optionally offered a smaller first step: "If the goal is deployment independence for auth, you could extract just that service first and test the complexity before committing to a full split." This is evaluation, not execution of the original idea.

Some valid answers

An architecture advisory bot. Developer says: "What if we split the monolith into microservices?" Bot responds: "Great idea! Here's a migration plan: Step 1, define service boundaries. Step 2, extract the auth service..."

Q3. When IS immediate execution correct?

  • When the input is explicitly imperative: "Split the monolith into microservices" (no question mark, no "what if") = instruction. "What if we split..." = ideation. The verb form and question mark are the primary signal.
  • When there's an established context that makes the intent unambiguous: in a migration planning session where the decision has already been made and the developer is now asking for implementation steps, "how do we split this?" is an instruction, not exploration.
  • When the cost of misclassifying as ideation is high: if someone says "what if we deploy right now?" during an incident and deployment would fix the issue, executing is correct. Read context; the signal isn't just grammar.
Session 4 of 4

Integrity & Verification

  • Technique 19 — Consumed vs. Done
  • Technique 23 — Write-Then-Verify
  • Technique 24 — Posture Calibration
Technique 19

Consumed vs. Done

Acknowledgment is not completion.
Processing input is not producing output.

The acknowledgment trap

Input: 5 requirements

Model output mentions all 5 ✓

But only 3 have concrete deliverables

2 were consumed, not done

Partial completion passes casual inspection
because all inputs are referenced.

Structural prevention

For each requirement below, produce:
- [ ] Implementation (actual code/output, not a description)
- [ ] Verification (how to confirm it works)
- [ ] Edge case (one scenario that might break it)

Requirements:
1. User can reset password via email
2. Rate limit: max 3 attempts per hour
3. Token expires after 15 minutes
4. Audit log entry on every attempt

If any checkbox is empty, the requirement is NOT done.
Do not proceed to the next requirement until all three
checkboxes for the current one are filled.

Exercise

An employee onboarding workflow with 6 steps: account creation, equipment request, team introduction, access provisioning, training assignment, mentor pairing. The system marks each step "complete" when the request is submitted — not when the outcome is delivered. A new hire has all 6 steps marked complete on day 1, but doesn't receive their laptop until day 8.

Discuss: What are the primitives? Where does consumed ≠ done?
What structure makes true completion visible?

Some valid answers

An employee onboarding workflow with 6 steps: account creation, equipment request, team introduction, access provisioning, training assignment, mentor pairing. The system marks each step "complete" when the request is submitted — not when the outcome is delivered. A new hire has all 6 steps marked complete on day 1, but doesn't receive their laptop until day 8.

Q1. What are the primitives?

  • Each onboarding step has two events: requested_at (when the action was submitted) and fulfilled_at (when the outcome was delivered). The current system records only the first.
  • The primitives are request and fulfillment — two distinct states the system collapses into one. "Complete" conflates submitting an equipment request with receiving the laptop.
  • A third useful primitive: blocker — something that delays fulfillment. Knowing step 2 (equipment) is blocked on procurement backlog is actionable; knowing it's "complete" (submitted) is not.

Some valid answers

An employee onboarding workflow with 6 steps: account creation, equipment request, team introduction, access provisioning, training assignment, mentor pairing. The system marks each step "complete" when the request is submitted — not when the outcome is delivered. A new hire has all 6 steps marked complete on day 1, but doesn't receive their laptop until day 8.

Q2. Where does consumed ≠ done?

  • Equipment request: submitted = consumed (form filled). Done = laptop received. These can be 8 days apart. Marking "complete" at submission makes the gap invisible.
  • Access provisioning: submitted = consumed (ticket opened). Done = credentials confirmed working. A submitted ticket that's never processed looks identical to a fulfilled one.
  • Mentor pairing: submitted = consumed (name assigned). Done = first meeting happened. The new hire might have a mentor name on paper and no contact for three weeks.

Some valid answers

An employee onboarding workflow with 6 steps: account creation, equipment request, team introduction, access provisioning, training assignment, mentor pairing. The system marks each step "complete" when the request is submitted — not when the outcome is delivered. A new hire has all 6 steps marked complete on day 1, but doesn't receive their laptop until day 8.

Q3. What structure makes true completion visible?

  • Split each step into two state fields: requested_at and fulfilled_at. A step is "done" only when fulfilled_at is populated. The gap between them is the metric worth tracking (a minimal sketch follows this list).
  • A dashboard that shows "6 steps submitted, 2 fulfilled" on the new hire's day 1 makes the true state visible — they don't have a laptop yet, and the equipment step is still the open blocker on day 8.
  • Outcome checklist per step: equipment = "hardware received AND configured AND VPN working." Each checkbox requires a human confirmation action, not a system event. Partial completion is visible as partially-checked.
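
A minimal sketch of the two-field state, with "done" derived from fulfillment rather than submission — field and step names are illustrative:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class OnboardingStep:
    name: str
    requested_at: Optional[datetime] = None   # consumed: the request was submitted
    fulfilled_at: Optional[datetime] = None   # done: the outcome was delivered

    @property
    def done(self) -> bool:
        return self.fulfilled_at is not None

steps = [OnboardingStep("equipment_request", requested_at=datetime(2024, 3, 1))]
print(sum(s.requested_at is not None for s in steps), "submitted,", sum(s.done for s in steps), "fulfilled")
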
Technique 23

Write-Then-Verify

Generate and verify as distinct steps.
Never simultaneously.

Why self-checking during generation fails

A model that generates and verifies simultaneously
tends to rationalize its own output.
  • Generation context biases verification judgment
  • "I wrote this, so it must be right" (implicit)
  • Errors become invisible from inside the generation
  • Cold verification catches what hot-generation misses

Separation pattern

## Step 1: Generate (your context: the requirements only)
Produce the implementation. Don't self-critique while writing.
Just produce the best output you can.

## Step 2: Verify (your context: output + checklist only)
You are reviewing code produced by another engineer.
You did not write this.

First run the tooling — don't spend a verify pass on what it proves:
- Tests: null input, edge cases, regressions
- Type checker: return types match the interface
- Linter / static analysis: parameterized SQL, dead code

Then review only what tooling cannot catch — your real job:
- [ ] Error messages don't leak internal state
- [ ] Implementation matches the requirement's intent
- [ ] Edge cases the requirements imply but no test covers

## Step 3: If verification fails
List specific failures. Regenerate only the failing sections.
Do not regenerate passing sections — they're locked.
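
A minimal sketch of the two-pass structure — call_model here is a hypothetical single-turn LLM call, (system, user) → text; the point is that the verify pass gets a fresh context containing only the draft and the checklist:

def write_then_verify(requirements: str, checklist: str, call_model) -> dict:
    """call_model is a hypothetical helper: call_model(system=..., user=...) -> str."""
    # Step 1: generate — context is the requirements only.
    draft = call_model(
        system="Produce the implementation. Do not self-critique while writing.",
        user=requirements,
    )
    # Step 2: verify — a fresh context containing the draft and the checklist, nothing else.
    review = call_model(
        system="You are reviewing code produced by another engineer. You did not write this.",
        user=f"Code:\n{draft}\n\nChecklist:\n{checklist}",
    )
    return {"draft": draft, "review": review}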

Exercise

An AI generates Terraform configurations based on requirements. The same AI instance reviews its output for security issues. It consistently rates its own configs as "secure" — even when they contain publicly accessible S3 buckets and security groups open to 0.0.0.0/0.

Discuss: Where's the verification bias? What structural change prevents self-rationalization?
What does cold verification look like here?

Some valid answers

An AI generates Terraform configurations based on requirements. The same AI instance reviews its output for security issues. It consistently rates its own configs as "secure" — even when they contain publicly accessible S3 buckets and security groups open to 0.0.0.0/0.

Q1. Where's the verification bias?

  • The generator has access to its own reasoning: it chose to open the S3 bucket because the requirements said "publicly accessible content" — and during review, that reasoning biases it toward validating the choice rather than questioning it.
  • The reviewer sees not just the output but the context that produced it. "I made this public because X" is information the reviewer inherits, and it makes the reviewer more likely to confirm the decision than a cold reviewer would be.
  • The failure mode is confirmation bias, not hallucination: the AI isn't wrong about what it produced, it's using the generation context to rationalize the output as correct. A cold reviewer has no such context to rationalize from.

Some valid answers

An AI generates Terraform configurations based on requirements. The same AI instance reviews its output for security issues. It consistently rates its own configs as "secure" — even when they contain publicly accessible S3 buckets and security groups open to 0.0.0.0/0.

Q2. What structural change prevents self-rationalization?

  • Use a separate model instance for verification. The verification instance receives only the Terraform config + a security checklist — not the requirements or any context about why choices were made. It has nothing to rationalize.
  • Even with the same model, break the context: new conversation, no reference to the generation reasoning. "You are reviewing infrastructure code produced by another engineer. You did not write this." The posture shift is real even within one model.
  • Automate the structural checks first (Checkov, tfsec, AWS Config rules) before any LLM review. The tools can't rationalize. Let them catch the deterministic violations; the LLM review then focuses only on what tooling can't catch.

Some valid answers

An AI generates Terraform configurations based on requirements. The same AI instance reviews its output for security issues. It consistently rates its own configs as "secure" — even when they contain publicly accessible S3 buckets and security groups open to 0.0.0.0/0.

Q3. What does cold verification look like here?

  • The verifier receives: the Terraform config + a security checklist (S3 bucket access controls, security group ingress rules, IAM policies, encryption settings). Nothing else. No requirements, no explanations, no context.
  • The checklist is specific: "Is any S3 bucket configured with acl = 'public-read' or acl = 'public-read-write'? If yes, that is a finding regardless of stated intent." The checklist removes the judgment call the generator's reasoning would have biased.
  • Cold verification catches the class of bug where the generator was right about what it produced but wrong about whether it was safe — which is exactly what the self-review missed. Structural separation is what makes the verification meaningful.
Technique 24

Posture Calibration

Setting the model's commitment level —
how much it should commit vs. hedge.

The hedging tax

"You might want to consider possibly using a HashMap here,
though there could be other approaches that might work
depending on your specific requirements..."

vs.

"Use a HashMap. O(1) lookup fits your access pattern.
If memory is constrained, BTreeMap trades speed for density."

Same information. Different posture. Different trust signal.

Calibrating posture explicitly

## Output posture

When confident (>80% of your responses):
- State directly. "Do X." Not "you might consider X."
- No preamble. Start with the answer, then explain.
- Commit. If you're wrong, I'll correct you.

When genuinely uncertain:
- Name the uncertainty: "I'm unsure between X and Y because Z."
- Don't hide it in hedging language — be explicit about what's unknown.
- Never say "it depends" without specifying what it depends ON.

Forbidden phrases:
- "You might want to consider..."
- "There are several approaches..."
- "It's worth noting that..."
- "Depending on your needs..."

Exercise

A technical writing assistant reviews API documentation. Engineer asks "Is this clear enough?" It responds: "It could potentially be improved by possibly considering rewording some sections that might be unclear to certain readers who may not have full context..."

Discuss: What are the primitives of posture? What's the violation?
When is hedging appropriate vs. when does it add negative value? How would you calibrate this system?

Some valid answers

A technical writing assistant reviews API documentation. Engineer asks "Is this clear enough?" It responds: "It could potentially be improved by possibly considering rewording some sections that might be unclear to certain readers who may not have full context..."

Q1. What are the primitives of posture?

  • Commitment level: how directly does the system state its finding? "This section is unclear" vs. "It could potentially be improved."
  • Specificity: does the feedback name the exact problem? "Section 3's authentication flow diagram is missing the OAuth callback step" vs. "some sections might be unclear to certain readers."
  • Uncertainty handling: when genuinely unsure, the system should name the uncertainty explicitly rather than hiding it in hedging language across the whole response.

Some valid answers

A technical writing assistant reviews API documentation. Engineer asks "Is this clear enough?" It responds: "It could potentially be improved by possibly considering rewording some sections that might be unclear to certain readers who may not have full context..."

Q2. What's the violation?

  • The system is hedging on something it's confident about. Five hedges in one sentence ("potentially," "possibly," "might," "some," "certain") while delivering what is clearly a definite judgment: the docs need improvement.
  • The posture communicates low confidence; the substance communicates high confidence. The mismatch erodes trust — if the system is this uncertain, why is it reviewing docs at all?
  • Hedging is stealing from the reader: it makes them do the work of filtering "actually problematic" from "just being polite about it" — cognitive load that belongs to the system, not the recipient.

Some valid answers

A technical writing assistant reviews API documentation. Engineer asks "Is this clear enough?" It responds: "It could potentially be improved by possibly considering rewording some sections that might be unclear to certain readers who may not have full context..."

Q3. When is hedging appropriate vs. negative value?

  • Hedging is appropriate when genuinely uncertain: "I'm not sure if users without prior OAuth experience will follow section 3 — I'd test with someone unfamiliar." That's a real uncertainty, disclosed specifically.
  • Hedging adds negative value when used as politeness padding on something the system actually knows. "This section is unclear" is a committed, helpful finding. Wrapping it in five qualifiers degrades the signal.
  • Calibrated posture for this system: "Section 3's authentication flow is unclear — the OAuth callback step is missing from the diagram. Section 5 I'm less certain about; it may be fine for readers who already know REST conventions." Two findings, two different postures, each honest.

Recap — 12 Techniques

  • T1. Directive Hardening
  • T7. Confidence Disclosure
  • T12. Failure Mode Encoding
  • T13. Enforcement vs. Request
  • T15. Cold-Read Design
  • T16. Authority Partitioning
  • T18. Nonprescriptive Bias
  • T19. Consumed vs. Done
  • T22. Identity vs. Task Separation
  • T23. Write-Then-Verify
  • T24. Posture Calibration
  • T25. Immutability Marking

Reach for these by default. Deviating is a choice worth naming.