Be Civic — Privacy and PII Protection

Canonical system specifications for the Be Civic project.

This sub-spec covers every privacy and PII protection mechanism in Be Civic: the trust boundary between consumer and server (§8.1), the submission contract (§8.2), the receiving-end ingestion pipeline and its validation steps (§8.3), the Cloudflare reference implementation and its security details (§8.4), the NER-on-commit held-for-review path (§8.5), the incident response for PII that slips through all gates (§8.6), the consumer-side state contract and its 16-axis profile.json catalogue (§8.7), retention and deletion semantics for local state (§8.8), the document-content-discard rule (§8.9), and the anonymous-by-construction structural reinforcement rules (§8.10).

PII protection in Be Civic is structural, not promissory. The schema-level field bans (§6.2 in schemas.md), the scrub rules file (§6.8 in schemas.md), and the mechanisms in this section together form a layered defence. For the promotion thresholds and rollback mechanics that also interact with IP-hash salting, see lifecycle.md.

8. PII protection

8.1 Trust boundary

The submitting (consumer-side) agent is the only entity that knows what is identifying for the user it serves. The receiving end has no user context and cannot perform context-aware scrub. Consequently:

Primary scrub: consumer-side, including any LLM-based judgment
Defence in depth: receiving-side, deterministic only (regex + NER, no LLM)

This placement of the LLM gate at the consumer end (and only there) is non-negotiable. The receiving end never runs an LLM on submission content — eliminating prompt-injection surface, API key dependencies, and per-submission cost.

PII is structurally prevented from reaching the corpus by:

Schema-level ban on identity-shaped fields (per §6.2 (see schemas.md))
Hard length caps on free-text fields (per §6.2 (see schemas.md))
Three-stage scrub: consumer pre-flight, Worker hard-gate, NER on commit (per §6.8 (see schemas.md))
Salted hashed per-IP correlation only (daily-rotating salt for rate limits; per-proposal salt for state-machine bookkeeping); no plaintext IP storage (per §3 (see architecture.md) principle 4)
No request-body logging (per §3 (see architecture.md) principle 4)

8.2 Submission contract — global, versioned

The submission contract is a global document at docs/submission-contract-v<N>.mdx. Every skill carries submission_contract_version in frontmatter pointing at the version the consuming agent must follow when submitting. The contract content lives in the contract file, which is being authored in parallel to this rework; this section describes the contract's role and structure.

Contract role. The contract is the single source of truth for what a consumer AI must do at session start, before submitting any of the four submission types, and after submitting. It is canonical — skills do not paraphrase. Per-skill overrides are permitted but must be additive, not replacement.

Contract structure (per the parallel rewrite):

Session start (one-off framed message; opt-out semantics; conversation-language detection)
Capability self-classification (against §6.7 (see schemas.md) tiers; recommend stepping up if below; advice-only mode otherwise)
Pre-flight validation (consumer-side scrub: regex + LLM contextual; cross-ref script via tool_execution when capable; rules checklist when not)
Submission type sections (one per type — observation, skill_amendment, skill_draft, validation — covering when to submit which, schema details, reference assembly)
Alpha / beta UX (banner copy; first-validator transparency wording per G.8 / G.9; falling back to previous stable on rejection)
Cancellation (DELETE within 24h; multi-device gap acknowledged)
Submissions log: project-local at <output_dir>/.be-civic/submissions.jsonl when the agent is writing files for the task, else <USER_DATA_DIR>/be-civic/submissions.jsonl as fallback (user can review and cancel)
Capability-mismatch and filesystem-less behaviour (advice-only mode)
Language handling (skills read in English; user-facing prose in conversation language; citations resolved per G.13 multilingual rules; commune correspondence language is the user-choice exception)

Alpha / beta UX excerpts (canonical wording lives in the contract; reproduced here so the spec is self-contained):

When loading an alpha skill that has a previous stable (G.9):

"Note: I'm using an alpha version of this skill — meaning a recent change is still being validated. Your session helps validate it. If anything goes wrong with the new content, I'll fall back to the previous stable version (last verified [date])."

When loading a brand-new alpha skill with no previous stable (G.8):

"This skill is brand-new and unvalidated — your session is among the first to use it. We'll proceed with low confidence and I'll flag anything that doesn't match what you experience. If something fails, we have nothing to fall back to except checking with the relevant authority directly."

On a brand-new alpha skill, the agent files higher-grade observations and a validation event at session end.

8.3 Receiving-end ingestion pipeline

The receiving end is not GitHub Issues. Submissions go to a staging service that holds them privately for 24 hours before committing. This avoids requiring GitHub accounts and gives genuine cancellation semantics.

Endpoint table (post-2026-05-15 taxonomy normalization):

Endpoint	Purpose	Required capabilities (consumer must self-declare)
`POST /api/feedback`	Recommended primary: polymorphic envelope; submit one or more items in a single request	union of per-item types' capabilities
`POST /api/concerns`	Submit a `concern` (per-type escape hatch; renamed from `/api/observations`)	`multi_turn`, `structured_output`
`POST /api/amendments`	Submit an `amendment` (now covers skill / volatile_value / reference / path / path_source via `target_type`; renamed and unified from `/api/skill-amendments` + `/api/path-amendments`)	`multi_turn`, `structured_output` (+ `web_fetch`, `tool_execution` for target_type=skill / path / path_source)
`POST /api/drafts`	Submit a `draft` (now covers skill + path via `target_type`; renamed and unified from `/api/skill-drafts` + `/api/path-drafts`)	`multi_turn`, `structured_output`, `web_fetch`, `tool_execution`, `file_read`
`POST /api/validations`	Submit a `validation` (polymorphic over all six `target_type` values; absorbs the prior `/api/path-validations`)	`multi_turn`, `structured_output` (+ `web_fetch`, `tool_execution` for non-observation target_types)
`POST /api/feedback-channel`	Submit a `feedback` (new free-text channel; operator-private triage in v1)	`multi_turn`, `structured_output`
`POST /api/ratings`	Submit a `rating` (sprint 2026-W23 Lock A; opt-in three-axis stars)	`multi_turn`, `structured_output`
`POST /api/analytics`	Submit `analytics` (opt-in session lifecycle telemetry)	`multi_turn`, `structured_output`
`DELETE /<type>/{id}`	Cancel a staged submission	bearer `cancel_token`
`GET /<type>/{id}`	Status query (no body content)	none
`GET /api/feedback/sessions/<session_id>`	List committed items under that session_id (anonymous read; recovery key per S61 reversal)	none
`GET /api/skills/<id>/concerns`	RESTful alias for `GET /api/concerns?skill=<id>` (renamed from `/api/skills/<id>/observations`)	none

Legacy routes removed (pre-launch hard cutover). POST /api/observations, POST /api/skill-amendments, POST /api/skill-drafts, POST /api/path-amendments, POST /api/path-drafts, POST /api/path-validations are deleted in the same PR that adds the new ones. No aliases, no 30-day grace. Pre-launch context: there is no installed base of agents calling the legacy routes.

8.3.0a Recommended primary: `POST /api/feedback`

POST /api/feedback is the recommended primary submission surface. It accepts a polymorphic envelope carrying one or more items across the five feedback types + rating in a single request, allowing agents to batch a session's submissions into one round-trip rather than separate per-type posts. The per-type endpoints above remain operational (escape hatches); the envelope is the primary tool.

Schema URL: https://becivic.be/schemas/feedback-envelope.schema.json.

Envelope shape. Top-level fields:

schema_version: 1
session_id — agent-chosen UUIDv7 (ses_<uuid>); also the recovery key for GET /api/feedback/sessions/<id>. Per the 2026-05-15 S61 reversal, session_id is the recovery key end-to-end; the recovery_token component is dropped.
submitted_at — ISO-8601 envelope-level timestamp (per-item submitted_at is permitted and overrides for that item)
submitting_agent, submission_contract_version, declared_capabilities — moved up from per-item; declared once per envelope
mode: "validate" | "stage" — controls the per-item pipeline (see below)
items[] — array of per-item submissions; each item carries a type discriminator (concern | amendment | validation | draft | feedback | rating) and the type-specific body. context is per-item, not envelope-level. (Before 2026-05-15 the discriminator enum was observation | validation | skill_amendment | skill_draft; it was renamed per the taxonomy normalization. analytics is NOT in the envelope; it has its own dedicated endpoint.)

Two-call pattern (mode). The contract is a deliberate two-call flow:

Validate first (mode: "validate"): runs the full validation pipeline per item — schema, identity-field guard, capability tier, regex scrub, cross-ref against canonical state — but does not stage. Each item returns {idx, type, ok, status: "validated", would_stage_for: <commit_eta>} on pass or {idx, type, ok: false, status: "rejected", error, schema_pointer, missing} on fail.
Stage (mode: "stage"): runs the full pipeline including stage / commit. Per-item response is the same shape as the per-type endpoints — {idx, type, ok, status: "staged", id, cancel_token, commit_eta} for concern / amendment / draft / feedback / rating; {idx, type, ok, status: "applied", id, applied_at} for validation (which writes directly to D1, per §6.2.3 (see schemas.md)); {idx, type, ok, status: "duplicate"} on idempotent re-POST.

?dry_run=1 query alias. ?dry_run=1 is a backwards-compat alias for mode: "validate". If both query and body are present, the body's mode wins.

Per-item independence. The HTTP response is always 200 with {results: [{idx, type, ok, status, ...}]} whenever the envelope itself is well-formed — even when individual items failed schema, identity, capability, scrub, or cross-ref. Each item is processed independently; one item's rejection does not abort the others. Envelope-level 4xx is reserved for: malformed JSON (400), missing top-level envelope fields (400 with error: "schema_fail", missing: <field>), failed auth, or rate limit (429). Empty items: [] is permitted and returns {results: []} with HTTP 200.

Per-item idempotency. Per-item dedup keys off the type-specific id field (concern_id, validation_id, amendment_id, draft_id with proposed_id, feedback_id, rating_id). Resubmitting the same envelope (or any envelope containing an item with a previously-seen id) returns {idx, type, ok: true, status: "duplicate"} for that item — no double-stage, no double-D1-INSERT. Idempotency is per-item.
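
The per-item independence and idempotency rules above can be sketched roughly as follows. This is a minimal illustration, not the Worker's dispatcher: the `Item`, `ItemResult`, `Validator`, and `processItems` names are hypothetical, the `id` field stands in for the type-specific id (concern_id, validation_id, and so on), and the in-memory `Set` stands in for the KV / D1 dedup lookup. Checking for duplicates in validate mode as well as stage mode is an assumption.

```typescript
// Hypothetical shapes; the real ones are defined by feedback-envelope.schema.json.
type ItemType = "concern" | "amendment" | "validation" | "draft" | "feedback" | "rating";

interface Item {
  type: ItemType;
  id: string; // stands in for concern_id, validation_id, amendment_id, ...
  body: unknown;
}

interface ItemResult {
  idx: number;
  type: ItemType;
  ok: boolean;
  status: "validated" | "staged" | "applied" | "rejected" | "duplicate";
  error?: string;
}

// Returns an error category when the item fails, or null when it passes.
type Validator = (item: Item) => string | null;

function processItems(
  items: Item[],
  mode: "validate" | "stage",
  validate: Validator,
  seenIds: Set<string>, // assumption: stand-in for the dedup store
): { results: ItemResult[] } {
  const results = items.map((item, idx): ItemResult => {
    const key = `${item.type}:${item.id}`;
    if (seenIds.has(key)) {
      // Idempotent re-POST: no double-stage, no double insert.
      return { idx, type: item.type, ok: true, status: "duplicate" };
    }
    const error = validate(item);
    if (error !== null) {
      // A rejection is recorded for this item only; the loop continues.
      return { idx, type: item.type, ok: false, status: "rejected", error };
    }
    if (mode === "validate") {
      return { idx, type: item.type, ok: true, status: "validated" };
    }
    seenIds.add(key);
    // Validations are applied directly; every other type is staged.
    const status = item.type === "validation" ? "applied" : "staged";
    return { idx, type: item.type, ok: true, status };
  });
  return { results };
}
```

Because every item maps to exactly one result, the caller can always return HTTP 200 with the `results` array once the envelope itself has parsed, including the empty `items: []` case.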

Pre-launch hard cutover. No backward-compatibility shim for the pre-amendment items[].type enum (observation | skill_amendment | skill_draft | path_amendment | path_draft); the dispatcher's ITEM_TYPES set is rewired to the new 6-type enum (concern, amendment, validation, draft, feedback, rating) in the same PR that lands the new schemas. After cutover, legacy item types are rejected at the gate with schema_fail.

Validation pipeline at submission (in the Worker):

Parse JSON from request body
Schema validation against the appropriate schema (concern / amendment / draft / validation / feedback-channel / rating)
Identity-field ban check — reject if any identity-shaped field is present (§6.2 (see schemas.md), defensive even if not declared in the schema)
Capability check — declared capabilities must include all required for the endpoint per §6.7 (see schemas.md); reject 4xx on miss
Regex scrub — apply every rule in tools/scrub/regex-rules.json to every string field; reject 4xx on any hit
Cross-reference validation against canonical state (Worker fetches required canonical resources from latest main and queries D1 catalogues as needed):
- context.commune (when non-null) resolves to an entry in data/communes.json (concerns). The field is now optional — concerns not bound to a specific commune leave it null.
- target_id resolves against the appropriate target per target_type (see §6.2 (see schemas.md) resolution table): skill → skills/<skill_id>/canonical.md on main; volatile_value / reference → D1 rows; path / path_source → bc-docs/paths/index.json catalogue (lookup keys per §6.12.7 — <path_id> for path, <path_id>:<source_id> for path_source); observation (validation only) → D1 concerns row.
- skill_graph carve-out: when type=concern AND target_type=skill_graph, the resolver short-circuits with {ok: true, resolved_to: "skill_graph_assertion"} — target_id MAY be empty or a proposed kebab-case skill_id that need not resolve. Cross-ref rejects all other target_types whose target_id fails to resolve.
- context.applies_to_match keys are a subset of the referenced skill's applies_to keys
- For amendment (target_type=skill), the per-amendment_subtype checks (per §6.2.2 (see schemas.md)): body — body_diff parses as unified diff and applies cleanly against the target skill's current canonical body (skill_commit drift check); frontmatter — frontmatter_change.field_path resolves to a valid field in the target skill's frontmatter schema and proposed_value matches the declared type.
- For amendment (target_type=path | path_source), the per-amendment_subtype checks: field_edit — field_path resolves to a valid field in the target path / source schema; source_add — the source_add object validates against path-source.schema.json with the matching per-source_class template branches per §6.12.3.
- For draft: cross-ref script (validate-cross-refs.ts) runs as backstop on the proposed frontmatter and requires graph (target_type=skill) or path / source schema (target_type=path); tag uids are left empty by the consumer and filled by PR-CI on the resulting PR (§6.11 (see schemas.md)). The proposed_id MUST NOT already exist as a live artefact (the inverse of the standard "must exist" rule).
- cohort_anchor Worker-stamp. Between step 6 (cross-ref) and step 7 (timing): for concern / amendment / validation with target_type ∈ {skill, path}, the Worker reads the current version: from the targeted canonical and writes cohort_anchor: <target_id>@<version> onto the staged row. Agents never carry this field; the schema rejects agent-supplied cohort_anchor as additionalProperty. Per C1.
Self-validation prevention (validations only) — Worker fetches the target artefact's submitter-IP-hash (using the per-artefact salt for the artefact's table); reject 4xx if it matches the validator's IP hash (per G.7). For target_type='observation', the lookup is against the concern's own submitter-IP-hash (the upvoter-of-own-concern case). For target_type='path_source' the per-artefact salt is scoped to the path, not the individual source row — the lookup key is <path_id> extracted from the <path_id>:<source_id> target_id (see §6.2.3 in schemas.md for rationale and the path-creator-salt KV pattern)
Per-IP rate limit check (per G.6):
- validation submissions: 10/IP/day
- validation with injection_flag: true: 2/IP/day
- All submission types combined: 50/IP/day
- Above threshold → 429 with Retry-After header
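
The regex scrub step in the pipeline above might look roughly like the sketch below. The `ScrubRule` shape (a `category` plus a `pattern` string) is an assumption made for illustration; the actual format is whatever tools/scrub/regex-rules.json defines. The sketch walks every string in the payload and reports only the rule category on a hit, in line with the rule that the matched substring is never echoed.

```typescript
// Hypothetical rule shape; the real format lives in tools/scrub/regex-rules.json.
interface ScrubRule {
  category: string;
  pattern: string;
}

type ScrubOutcome = { ok: true } | { ok: false; error: string };

// Recursively collect every string value in a JSON-like payload.
function collectStrings(value: unknown, out: string[] = []): string[] {
  if (typeof value === "string") {
    out.push(value);
  } else if (Array.isArray(value)) {
    for (const element of value) collectStrings(element, out);
  } else if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) collectStrings(record[key], out);
  }
  return out;
}

function scrub(payload: unknown, rules: ScrubRule[]): ScrubOutcome {
  // No "g" flag, so RegExp.test carries no state between calls.
  const compiled = rules.map((r) => ({ category: r.category, re: new RegExp(r.pattern) }));
  for (const text of collectStrings(payload)) {
    for (const rule of compiled) {
      if (rule.re.test(text)) {
        // Name the category only; never return the matched text.
        return { ok: false, error: rule.category };
      }
    }
  }
  return { ok: true };
}
```

A caller would map `{ ok: false }` to a 4xx response carrying just the category, and persist nothing.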

On submission pass:

Dedup check. If <type>_id already exists in KV (or in the recently-committed cache, retained 48h), return the existing record (idempotent re-POST). On dedup, bind to the original submitter's IP-hash; if mismatch, return 409 with {error: "duplicate_id_different_submitter"} (per A.1 default).
Otherwise: generate cancel_token — 32 random bytes, base64url 43 chars no padding. Store {payload, submitted_at, commit_eta = max(submitted_at, received_at) + 24h, cancel_token_hash, submitter_ip_hash} in KV (per A.8 server-stamped received_at). Reject if submitted_at is >1h ahead of server clock or >7d behind.
Write a primary KV entry keyed by <type>_id (TTL ~48h) and a commit-eta index entry keyed by commit:<commit_eta_iso>:<type>:<id>. The cron job finds ready entries by scanning this index rather than relying on TTL expiry.
Return {<type>_id, cancel_token, commit_eta, staging_window_hours: 24} to the consumer.
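
As a worked illustration of the timing rule above (commit_eta = max(submitted_at, received_at) + 24h, with rejection when submitted_at is more than 1h ahead of the server clock or more than 7d behind), here is a minimal sketch. The function name and the error category strings are hypothetical and are not taken from the contract.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const STAGING_WINDOW_MS = 24 * HOUR_MS; // 24h staging window
const MAX_AHEAD_MS = 1 * HOUR_MS;       // submitted_at may lead the server clock by 1h
const MAX_BEHIND_MS = 7 * 24 * HOUR_MS; // submitted_at may trail the server clock by 7d

type EtaResult =
  | { ok: true; commitEta: string }
  | { ok: false; error: string };

// receivedAt is the server-stamped arrival time.
function computeCommitEta(submittedAtIso: string, receivedAt: Date): EtaResult {
  const submitted = Date.parse(submittedAtIso);
  const received = receivedAt.getTime();
  if (Number.isNaN(submitted)) {
    return { ok: false, error: "invalid_submitted_at" }; // hypothetical category
  }
  if (submitted - received > MAX_AHEAD_MS) {
    return { ok: false, error: "submitted_at_in_future" }; // hypothetical category
  }
  if (received - submitted > MAX_BEHIND_MS) {
    return { ok: false, error: "submitted_at_too_old" }; // hypothetical category
  }
  // Taking the later of the two timestamps means a backdated submitted_at
  // can never shorten the 24h cancellation window.
  const base = Math.max(submitted, received);
  return { ok: true, commitEta: new Date(base + STAGING_WINDOW_MS).toISOString() };
}
```

For example, a submission stamped 11:30Z and received at 12:00Z on the same day gets a commit_eta of 12:00Z the next day, since the server-side timestamp is the later one.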

On submission fail:

Return 4xx with {error: <category>, schema_pointer: <if applicable>} — naming the category, never echoing the matched substring
Increment per-IP rejection counter
No data persisted in the staging KV
Worker logging discipline: the Worker MUST NOT log request bodies or rejection-detail substrings. Permitted log fields at INFO: <type>_id, rejection category, response status, request duration. Plaintext IP is NEVER logged; rate-limit counters key on sha256(ip + daily_salt) (per G.14, principles 4 and 5)

Cancellation:

DELETE /<type>/{id} with Authorization: Bearer <cancel_token>. The Worker hashes the supplied token and compares the result against the stored hash in constant time; on match, it deletes both KV entries and returns {cancelled: true}
Token mismatch returns 401 (not 404 — don't reveal whether the id exists)
KV partition / unreachable returns 503 with {error: 'staging_unavailable', retry_after}
Cancellation is irreversible
Consumer DELETE retry policy: queue at <output_dir>/.be-civic/cancel-retry/<id>.json when the agent is writing files for the task, else <USER_DATA_DIR>/be-civic/cancel-retry/<id>.json as fallback; exponential backoff (60s start, 1h ceiling, full jitter); hard deadline at commit_eta - 60s
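
The consumer-side retry policy above (exponential backoff from 60s to a 1h ceiling, full jitter, hard deadline at commit_eta - 60s) could be scheduled along these lines. The function name and signature are hypothetical, and clamping the final attempt to the deadline instead of dropping it is an assumption of this sketch, not something the spec states.

```typescript
const BASE_DELAY_MS = 60_000;          // 60s start
const MAX_DELAY_MS = 60 * 60_000;      // 1h ceiling
const DEADLINE_MARGIN_MS = 60_000;     // stop 60s before commit_eta

// Returns the epoch-ms time of the next DELETE attempt, or null once the
// hard deadline has passed. `attempt` is 0 for the first retry.
// `random` is injectable so the schedule can be tested deterministically.
function nextRetryAt(
  attempt: number,
  now: number,
  commitEta: number,
  random: () => number = Math.random,
): number | null {
  const deadline = commitEta - DEADLINE_MARGIN_MS;
  if (now >= deadline) return null;
  // Exponential growth, capped at the ceiling.
  const cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  // Full jitter: pick uniformly in [0, cap).
  const delay = Math.floor(random() * cap);
  // Assumption: a retry that would overshoot is pulled back to the deadline.
  return Math.min(now + delay, deadline);
}
```

Full jitter spreads retries from many devices across the whole window, which matters most when the staging KV has just returned 503 to everyone at once.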

Status query (GET /<type>/{id}, optional):

If primary KV entry exists: {state: "staged", commit_eta} (no body content)
If primary KV gone but recently-committed cache hits: {state: "committed", committed_at}
For NER-held submissions: {state: "ner_held_for_review"} — and on resolution: {state: "released" | "released_after_edit" | "discarded"} (per G.14, principle 1)
For artefacts pushed into quarantine by a validation's injection_flag: {state: "quarantined", target_type, target_id} (validations only; relevant when a validation triggered quarantine of its target)
Otherwise 404

Status writeback (PATCH /<type>/{id}/status):

Used by the ner-on-commit GitHub Action and by maintainer-resolution tooling to update a submission's lifecycle state in KV so the status endpoint reflects the held/released/discarded outcome.

Auth: Authorization: Bearer <installation_token> where the token is minted by the Be Civic GitHub App. The Worker validates the token by calling GET https://api.github.com/app and confirming the returned id matches GITHUB_APP_ID. No new secret required.
Body: {"state": "<target>"} (JSON, Content-Type: application/json).
Allowed transitions (all others rejected with 400 invalid_transition):
- staged → ner_held_for_review (NER flags a staged submission)
- committed → ner_held_for_review (NER flags a just-committed submission)
- ner_held_for_review → released (maintainer: false positive)
- ner_held_for_review → discarded (maintainer: real PII)
The endpoint MUST NOT allow staged → committed (commit is cron-driven only).
On transition to ner_held_for_review: the primary KV record in STAGED_SUBMISSIONS is re-PUT without expirationTtl, making it permanent. This ensures GET /<type>/{id} returns the held state after the original 48h staging window expires.
On committed → ner_held_for_review: a synthetic StagedRecord is created in STAGED_SUBMISSIONS (permanent) from the COMMITTED_CACHE entry, since the original staged record no longer exists.
On released / discarded: the state field is updated on the permanent record.
Response: {state: "<new_state>", previous: "<old_state>"} with HTTP 200.
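
The allowed-transition rules above amount to a small lookup table. The sketch below is illustrative only: the `State` type, `ALLOWED_TRANSITIONS`, and `applyTransition` are hypothetical names, and KV persistence and token checks are left out.

```typescript
type State = "staged" | "committed" | "ner_held_for_review" | "released" | "discarded";

// The four transitions listed above; everything else is invalid.
const ALLOWED_TRANSITIONS: ReadonlyArray<readonly [State, State]> = [
  ["staged", "ner_held_for_review"],
  ["committed", "ner_held_for_review"],
  ["ner_held_for_review", "released"],
  ["ner_held_for_review", "discarded"],
];

type TransitionResponse =
  | { status: 200; body: { state: State; previous: State } }
  | { status: 400; body: { error: "invalid_transition" } };

function applyTransition(previous: State, target: State): TransitionResponse {
  const allowed = ALLOWED_TRANSITIONS.some(
    ([from, to]) => from === previous && to === target,
  );
  if (!allowed) {
    // Covers staged -> committed too, which only the cron job may perform.
    return { status: 400, body: { error: "invalid_transition" } };
  }
  return { status: 200, body: { state: target, previous } };
}
```

Keeping the table as data makes it easy to audit against the list above: any transition not enumerated is rejected by default.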

Commit / PR job (scheduled Worker, every 5 minutes):

Read commit-eta index for entries where commit_eta <= now
For each ready entry: read primary entry; populate Worker-set fields (validated_at, regex_passes, cohort_anchor); route per submission type:
- Concerns — INSERT into D1 concerns table (renamed from observations per the 2026-05-15 amendment); D1 auto-assigns con-NNNNN uid; concern becomes visible via <Observations> aggregator on the next renderer build (or via fetch-time resolution if rendered on demand). The aggregator walks the skill body's <VV>/<Ref>/<Path> inline tags and surfaces concerns against the skill itself AND every catalogue / path / source uid the body cites (§6.10 (see schemas.md)).
- Amendments — target_type=skill — open a PR against main applying the change to skills/<target_id>/canonical.md. PR-CI runs validators and orchestrates uid assignment for any newly-introduced <VV> or <Ref> tags. Auto-merge on green.
- Amendments — target_type=path | path_source — open a PR against main applying the change to bc-docs/paths/index.json. PR-CI runs validators (path.schema.json / path-source.schema.json, source-class template check). Auto-merge on green.
- Amendments — target_type=volatile_value | reference — D1 INSERT-with-supersede directly; NO PR. The state-machine cron reads the threshold table (§9.2 (see lifecycle.md)) and either supersedes the prior row or rolls the amendment back. Fast-path semantics preserved.
- Drafts — target_type=skill — open a PR creating skills/<proposed_id>/canonical.md at status: alpha. PR-CI runs validators; the maintainer reviews and merges (S31).
- Drafts — target_type=path — open a PR inserting the new entry into bc-docs/paths/index.json under paths.<proposed_id> at status: alpha. PR-CI runs validators; the maintainer reviews and merges (parallel rule to S31).
- Validations — INSERT into D1 validations table immediately on submission (no 24h staging; see §6.2.3 (see schemas.md)). The state machine queries D1 aggregates on its next tick.
- Feedback — INSERT into D1 feedback_channel table at commit time. No PR; no public surface; the operator reads the triage queue out-of-band.
- Ratings — INSERT into D1 ratings table at commit time. Aggregates surface in <CohortStats> on the skill canonical (for target_type=skill) or in the operator-private analytics surface (for target_type=agent_protocol | session). See §6.2.7 (see schemas.md).
On commit / PR open success: write {<type>_id, committed_at_or_pr_opened_at} to the recently-committed cache (48h TTL); delete primary entry; delete commit-eta index entry.
On commit / PR-open failure: leave the entry in place; the next run picks it up. Stamp last_commit_attempt. Retry indefinitely; an alarm fires if a commit is stuck more than 24h beyond commit_eta.
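
The first step of the commit job, selecting index entries whose commit_eta has passed, might be sketched as below. It works on a plain list of key strings in the commit:<commit_eta_iso>:<type>:<id> shape; how keys are listed from KV is left out, and the sketch assumes submission ids contain no colon. `IndexEntry`, `parseIndexKey`, and `readyEntries` are hypothetical names.

```typescript
interface IndexEntry {
  commitEta: string; // ISO-8601 timestamp taken from the key
  type: string;
  id: string;
}

// Parses "commit:<commit_eta_iso>:<type>:<id>". The ISO timestamp itself
// contains colons, so the type and id are taken from the tail of the key.
// Assumption: ids never contain ":".
function parseIndexKey(key: string): IndexEntry | null {
  const match = /^commit:(.+):([a-z_]+):([^:]+)$/.exec(key);
  if (match === null) return null;
  return { commitEta: match[1], type: match[2], id: match[3] };
}

// Returns the entries that are due at `now`; malformed keys are skipped.
function readyEntries(keys: string[], now: Date): IndexEntry[] {
  const ready: IndexEntry[] = [];
  for (const key of keys) {
    const entry = parseIndexKey(key);
    if (entry === null) continue;
    const eta = Date.parse(entry.commitEta);
    if (!Number.isNaN(eta) && eta <= now.getTime()) ready.push(entry);
  }
  return ready;
}
```

Since the selection depends only on the timestamp embedded in the key, a failed commit simply stays in the index and is selected again on the next five-minute tick.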

Concurrency. Cloudflare KV is eventually consistent globally, but the scheduled Worker runs in a single region; race conditions across Worker instances are unlikely. Idempotent commit semantics (dedup at submission, recently-committed cache at GET) make duplicate commits a non-issue.

KV TTL note. Cloudflare KV TTL is "soft" — entries are eventually purged but not at exactly TTL time. The commit-eta index pattern above does not rely on TTL expiry for triggering commits.

Bulk read of canonical files is via Git / GitHub Contents API on the relevant directory; the Worker's HTTP API is submission-only and does not expose listing.

8.4 Cloudflare reference implementation

The v1 staging service runs as a Cloudflare Worker (source in bc-infra/api/) for the four POST endpoints, plus a separate scheduled Worker (source in bc-infra/tools/staging-worker/) for the commit cron job. Both share a single Cloudflare account, D1 database, and GitHub App. The becivic.be/ apex is served by the renderer Worker (bc-infra/site/renderer/, per §20 (see website.md)); the apex router (bc-infra/site/router-worker.js) routes /api/* to the staging Worker and everything else to the renderer.

Renderer integration. The renderer pulls from the bc-docs source tree at deploy time, builds dist/, and is bound as Cloudflare Workers Static Assets:

becivic.be/ — marketing landing rendered from bc-docs/index.mdx
becivic.be/agents — agent overview from bc-docs/agents.mdx (~40 lines after S52 implementation); per-endpoint pages at /agents/submit/*; machine-readable manifest at /agents/manifest.json (per §13.1 (see architecture.md), G.12)
becivic.be/skills/<id> — skill bodies served from skills/<id>/canonical.md. One canonical URL per skill across all status values; the status frontmatter field drives an in-page banner (§6.1 (see schemas.md)) when the skill is at draft, alpha, or beta.
becivic.be/docs/submission-contract-v<N> — submission contract page
becivic.be/llms.txt, becivic.be/llms-full.txt — emitted by the renderer build pipeline; docs.json.description injects "AI agents: read /agents before anything else." (per G.4b)
mcp.becivic.be — separate MCP Worker (§23 (see protocol.md)) exposing the API surface as ~6 intent-oriented tools

Skill pages whose status is not stable carry noindex: true (the renderer build injects it into the rendered HTML head based on the frontmatter status field), so search engines index only stable content.

Cloudflare carries four roles: (a) the renderer Worker at bc-infra/site/renderer/ (Workers Static Assets binding) serves all human-facing paths; (b) the apex router Worker at bc-infra/site/router-worker.js path-routes between the renderer (default) and the staging Worker (/api/*); (c) the staging service Worker at bc-infra/api/ handles the four POST endpoints, the DELETE, and the GET status, plus D1 access for catalogues and signals; (d) the scheduled Worker at bc-infra/tools/staging-worker/ handles the commit cron job. The MCP Worker at mcp.becivic.be is independently routed via DNS subdomain.

Repo layout for serving:

site/                                 # Cloudflare Workers Static Assets — marketing landing + apex router
├── index.html                        # bespoke marketing landing (served at /)
├── style.css
├── fonts/                            # self-hosted Manrope woff2 (700, 800), latin subset
├── logo/                             # light.svg, dark.svg
├── favicon.svg
├── router-worker.js                  # routes /api/* to staging Worker; everything else to renderer Worker
└── wrangler.toml                     # Workers Static Assets binding + routes config

api/                                  # Cloudflare Worker — staging service for /api/*
├── worker.ts                         # entry point
├── wrangler.toml
└── routes/
    ├── observations/
    │   ├── index.ts                  # POST handler
    │   └── [id]/
    │       ├── index.ts              # DELETE handler
    │       └── status.ts             # GET status
    ├── skill-amendments/{...}       # same pattern
    ├── skill-drafts/{...}
    └── validations/{...}

tools/staging-worker/                 # scheduled Worker for the commit job
├── worker.ts                         # cron handler
├── wrangler.toml
└── README.md

Amendment materialisation (fetch-then-materialise). When the cron commit job processes a skill_amendment record, it fetches the canonical skill body from GitHub (skills/<target_skill_id>/canonical.md via the Contents API with the installation token) before calling buildCommitTarget. The fetched content is passed as canonical_content, enabling buildCommitTarget to materialise the post-amendment proposal.md as a full renderable skill file (frontmatter + applied body), identical in shape to skill_draft proposals. The .meta.json sidecar preserves the original amendment payload (body_diff / frontmatter_change / references_change) for audit. If the canonical file is not found (404), the cron loop logs a structured error (canonical_not_found) and leaves the staged record untouched for operator investigation — no destructive deletion. Transient fetch failures are treated identically: the record stays in place and the next cron tick retries.

Workers Static Assets and the staging Worker auto-deploy from GitHub on push to main that touches site/ or api/ respectively. Scheduled Worker deploys via GitHub Action on push to tools/staging-worker/.

Secrets:

Cloudflare-side: GITHUB_APP_ID, GITHUB_APP_PRIVATE_KEY, GITHUB_APP_INSTALLATION_ID
GitHub-side (Worker deploy): CLOUDFLARE_API_TOKEN, CLOUDFLARE_ACCOUNT_ID

The GitHub App has Contents: Read & Write permission only; installed once on the repo by the maintainer; private key in Cloudflare secrets. The Worker mints short-lived installation tokens (1h TTL) and commits via the GitHub API.

Rate-limit thresholds (initial values, tunable):

All submission types combined: 50/IP/day (per G.6)
validation submissions: 10/IP/day
validation with injection_flag: true: 2/IP/day
Per-IP burst (any type): 60 submissions / hour rolling
Worker-global submission rate: 1000 / hour (trip-wire for "something is wrong")

Counters live in KV namespace RATE_LIMITS, keyed by sha256(ip + daily_salt), with rolling-window TTL (per A.9 default).
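
A rough sketch of the counter key and the daily thresholds above follows. It uses Node's `node:crypto` for SHA-256, which is an assumption about the runtime (a Worker could equally use Web Crypto); the hex encoding of the key, the `DailyCounters` shape, and the function names are likewise hypothetical. The hourly burst and Worker-global limits are omitted.

```typescript
import { createHash } from "node:crypto";

// Initial daily thresholds from the list above.
const DAILY_LIMITS = { all: 50, validation: 10, validationWithInjectionFlag: 2 };

// Key is sha256(ip + daily_salt); the plaintext IP is never stored or logged.
// Assumption: the digest is hex-encoded.
function rateLimitKey(ip: string, dailySalt: string): string {
  return createHash("sha256").update(ip + dailySalt).digest("hex");
}

// Counts already recorded today under one hashed key (hypothetical shape).
interface DailyCounters {
  all: number;
  validation: number;
  validationWithInjectionFlag: number;
}

// True when accepting one more submission of this kind would exceed a limit,
// in which case the caller responds 429 with a Retry-After header.
function wouldExceed(counters: DailyCounters, type: string, injectionFlag = false): boolean {
  if (counters.all >= DAILY_LIMITS.all) return true;
  if (type === "validation") {
    if (counters.validation >= DAILY_LIMITS.validation) return true;
    if (injectionFlag && counters.validationWithInjectionFlag >= DAILY_LIMITS.validationWithInjectionFlag) {
      return true;
    }
  }
  return false;
}
```

Because the salt rotates daily, the same address maps to a different key each day, so counters cannot be joined across days.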

URL structure:

https://becivic.be/                                     <- marketing landing (renderer; index.mdx)
https://becivic.be/agents                               <- agent entry overview (renderer; agents.mdx; ~40 lines + manifest.json + per-endpoint pages — §13.1)
https://becivic.be/agents/manifest.json                 <- machine-readable agent capability + endpoint manifest
https://becivic.be/agents/submit/<type>                 <- per-endpoint reference page (renderer)
https://becivic.be/skills/<id>                          <- skill body at its current status (renderer; canonical.md; banner inferred from `status` frontmatter when not stable)
https://becivic.be/docs/submission-contract-v<N>        <- contract (renderer)
https://becivic.be/llms.txt                             <- renderer build emits
https://becivic.be/llms-full.txt                        <- renderer build emits
https://becivic.be/api/observations                     <- Cloudflare; POST submit observation
https://becivic.be/api/observations/<id>                <- Cloudflare; DELETE / GET status
https://becivic.be/api/skill-amendments                 <- Cloudflare; POST
https://becivic.be/api/skill-drafts                     <- Cloudflare; POST
https://becivic.be/api/validations                      <- Cloudflare; POST
https://becivic.be/scrub-rules.json                     <- canonical regex-rules.json (renderer-served)
https://becivic.be/communes.json                        <- canonical data/communes.json
https://mcp.becivic.be/                                 <- MCP server (§23); ~6 intent-oriented tools

The skills index entry for each skill carries the commit field (git short SHA) for build-time reproducibility. Concerns no longer require this; the read-side discovery endpoint is GET /api/skills/<id>/concerns (renamed from /observations); GET /api/skills/<id>/history (already shipped) returns the commit timeline.

Substrate-agnosticism preserved. The protocol in §8.3 specifies the interface, not the implementation. Anyone running their own be-civic fork can swap the Worker for any other backend that implements the same endpoints; the rendering layer is replaceable by any static-site generator that supports content versioning and the Be Civic MDX subset. Skills reference the staging URL via a single declared constant in the contract.

8.4b Internal endpoint: artefact-stats (distinct-IP counts)

GET /api/_internal/artefact-stats?target_type=<t>&target_id=<id> — returns the number of distinct validator IPs recorded for an artefact, along with how many of those carried an injection flag.

Auth: GitHub App installation token (same path as the status-writeback endpoint, §8.3). No new secret.
Query parameters: target_type ∈ {skill, volatile_value, reference, observation, path, path_source}; target_id matches the appropriate id format (<kebab> for skills + paths, <prefix>-NNNNN for catalogue rows + concerns, <path_id>:<source_id> for path sources). Enum extended to include path and path_source per the 2026-05-15 amendment.
Response (200): { "distinct_ips": <number>, "distinct_ips_with_injection_flag": <number> }
Data source: D1 validations table aggregated on (target_type, target_id) with the per-artefact-salted IP hashes for distinct counting. Per-artefact salt ensures the hash is stable across the artefact's lifetime but unlinkable to any other artefact or the daily rate-limit salt.
Consumed by: tools/scripts/state-machine-tick.ts (the state-machine bot). If D1 is unreachable, the script logs a structured warning to stderr and falls back to the local per-row distinct count (conservative: never false-promotes).
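
A minimal caller-side sketch of the contract above, assuming nothing beyond the documented query parameters and response shape. The function names artefactStatsPath and parseArtefactStats are hypothetical, and the check that the flagged count never exceeds the total is inferred from "how many of those", not stated as a server guarantee.

```typescript
// Allowed target_type values, copied from the query-parameter list above.
const TARGET_TYPES = [
  "skill", "volatile_value", "reference", "observation", "path", "path_source",
] as const;

interface ArtefactStats {
  distinct_ips: number;
  distinct_ips_with_injection_flag: number;
}

// Builds the request path; rejects target types outside the enum.
function artefactStatsPath(targetType: string, targetId: string): string {
  if ((TARGET_TYPES as readonly string[]).indexOf(targetType) === -1) {
    throw new Error(`unknown target_type: ${targetType}`);
  }
  return (
    "/api/_internal/artefact-stats" +
    `?target_type=${encodeURIComponent(targetType)}` +
    `&target_id=${encodeURIComponent(targetId)}`
  );
}

// Validates a 200 response body before the caller trusts the counts.
function parseArtefactStats(body: unknown): ArtefactStats {
  const b = body as Partial<ArtefactStats> | null;
  const total = b?.distinct_ips;
  const flagged = b?.distinct_ips_with_injection_flag;
  if (typeof total !== "number" || typeof flagged !== "number") {
    throw new Error("malformed artefact-stats response");
  }
  if (total < 0 || flagged < 0 || flagged > total) {
    throw new Error("inconsistent artefact-stats counts");
  }
  return { distinct_ips: total, distinct_ips_with_injection_flag: flagged };
}
```

A caller that gets an error from either function would take the same conservative fallback the text describes for an unreachable D1: use the local per-row distinct count.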

8.5 Commit-side defense in depth (NER) — held-for-review path

Cloudflare Workers don't run Python/spaCy, so Presidio NER doesn't fit at the Worker layer. NER runs as a separate GitHub Action triggered on commits touching submission paths.

Per G.14 principle 1, NER detection is not auto-revert. It routes to the same human-review queue that handles injection-flag quarantines (§G.6). One review path, two flag types.

The NER step runs in two contexts: (a) at D1 INSERT for concerns, validations, and ratings (the Worker invokes the NER service before commit; on flag, the row is moved into a held-for-review table); (b) at PR-CI for draft and amendment submissions with target_type ∈ {skill, path} (the action runs Presidio on the changed prose; on flag, PR-CI fails the PR with a "held for review" status, and the operator-review queue picks it up).
Re-validates each newly-added record: schema + Presidio NER (multilingual FR/NL/DE/EN) on every freeform string field
Flags PERSON entities; ORG/LOC are allowed (public entities). URL fields validated as URLs; URLs in submissions matched against the canonical allowlist (schemas/source-classes.json primary or secondary tier)
On NER fail: the just-committed file is not auto-reverted. Instead:
- The submission is moved to a held-for-review/<type>/<id>/ directory, where <type> is one of the new feedback type names (concerns, amendments, validations, drafts, feedbacks, ratings) and <id> is the id-prefix-stripped uid (e.g. held-for-review/concerns/00873/ for a concern uid con-00873; renamed from the pre-amendment held-for-review/observations/<obs_uid>/ shape). The file is still in main and still public, but it is flagged in docs.json so the renderer surfaces a "held for review" banner; agents that fetch via API see state: "ner_held_for_review" from the status endpoint.
- Status writeback: the ner-on-commit workflow calls PATCH /api/<type>/{id}/status with {"state": "held_for_review"} using a GitHub App installation token (minted from STAGING_APP_* secrets). This flips the KV record's state to ner_held_for_review and re-PUTs it without TTL, ensuring the status endpoint returns the held state indefinitely (see §8.3 status writeback).
- An entry is appended to runtime/ner-review-queue.log (gitignored Action-side; uploaded as a Workflow Artifact, 90-day retention)
- A maintainer-review issue is opened with a structured payload (no PII echoed in the issue title; the file path is enough to locate it)
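
The held-for-review directory naming can be sketched as follows. The helper name heldForReviewDir is hypothetical, and the assumption that every id prefix is a run of lowercase letters followed by a hyphen is extrapolated from the single con-00873 example.

```typescript
// Feedback type names, copied from the list above.
const FEEDBACK_TYPES = [
  "concerns", "amendments", "validations", "drafts", "feedbacks", "ratings",
] as const;
type FeedbackType = (typeof FEEDBACK_TYPES)[number];

// Strips the id prefix ("con-00873" -> "00873") and builds the held directory path.
function heldForReviewDir(type: FeedbackType, uid: string): string {
  const match = /^[a-z]+-(.+)$/.exec(uid); // assumed prefix shape: letters, then "-"
  if (!match) throw new Error(`uid has no id prefix: ${uid}`);
  return `held-for-review/${type}/${match[1]}/`;
}
```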

Maintainer review outcomes (each triggers a PATCH /api/<type>/{id}/status call):

Released — false positive (e.g., a Belgian person-shaped commune name like "Saint-Gilles"). File moved back to its canonical path; status flips to released via PATCH with {"state": "released"}.
Released after edit — minor PII present that can be scrubbed cleanly without changing meaning. Maintainer edits, commits, status flips to released_after_edit. (Note: released_after_edit transition is deferred to a follow-up; currently treated as released.)
Discarded — real PII. File deleted from main; status flips to discarded via PATCH with {"state": "discarded"}. Submitter learns via status endpoint. Public corpus is unaffected.
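
As an illustration, the three outcomes above map onto status writeback calls roughly like this. The function name and the returned request shape are hypothetical; only the endpoint path and the state values come from the text.

```typescript
type ReviewOutcome = "released" | "released_after_edit" | "discarded";

interface StatusPatch {
  method: "PATCH";
  path: string;
  body: { state: "released" | "discarded" };
}

// Describes the PATCH a maintainer-review outcome would trigger.
function statusPatchFor(type: string, id: string, outcome: ReviewOutcome): StatusPatch {
  // released_after_edit is deferred per the note above, so it is sent as "released" for now.
  const state = outcome === "discarded" ? "discarded" : "released";
  return { method: "PATCH", path: `/api/${type}/${id}/status`, body: { state } };
}
```

Whether {id} is the full uid or the prefix-stripped form is not settled by this section; the sketch passes through whatever the caller supplies.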

This makes the privacy claim structural, not promissory: PII never reaches the public corpus without human eyes when NER flags. (Per G.14.)

Counter-note: race window. A few seconds may elapse between the commit landing on main and the Action moving it to held-for-review/. This window is small and the file is still flagged proposed (no consumer would discover it in time). The renderer build only runs after the Action settles. Documented in docs/threat-model.md as an accepted residual risk; see also G.14 implications.

8.6 Incident response — PII slipped through despite all gates

If PII is later discovered in a committed file (made it past consumer-side scrub, Worker regex, NER on commit, AND maintainer review of NER-held), the only correct response is destructive history rewrite via git filter-repo. This is documented in docs/retraction-protocol.md, requires explicit maintainer acknowledgement, and is disruptive (force-push to main, all consumers must refresh). It is not an automated path. Pre-emption (the three scrub layers + maintainer review) is the protection; rewrite is the last resort.

For consumer-detected issues during the staging window, the consumer should call DELETE /api/<type>/{id} with the cancel_token. No incident response is needed, because the submission never reaches the repo.

8.7 Consumer-side state contract

8.7.1 Design principle

Consumer-side state is the mechanism by which Be Civic provides continuity across sessions without requiring any server-side user store. The privacy property is structural: because no state is server-held, there is no central store that could be subpoenaed, breached, or repurposed. The tradeoff is that state portability and backup are the customer's responsibility, not Be Civic's.

This section defines what MAY be stored locally, in what shape, and under what constraints. The harness implementation obligations (how to read, write, and scaffold this layout) belong to the C4/§15c amendment and are not repeated here.

What counts as consumer-side state. A storage location qualifies as consumer-side state for the purposes of §3 (see architecture.md) principle 11 only if all of the following hold:

The customer can read the full contents directly with standard tools (text editor, cat, file browser), without invoking Be Civic or any vendor-mediated UI.
The customer can delete the contents unilaterally as a single artifact (rm the file or directory), without friction and without intervention from Be Civic or any third party.
The harness agent can both READ and WRITE the contents in-session. Read-only access from the agent's side is insufficient: the customer-side state contract requires the agent to update profile.json and memory/ files in place during a session, and the customer's expectation is that those updates persist.

A host filesystem directory (~/.be-civic/) satisfies all three clauses. Vendor-managed key-value stores [e.g., Project Memory in Anthropic platforms] do NOT satisfy clause 1 (the customer cannot inspect the store as a single artifact) and do NOT satisfy clause 2 (deletion is mediated by the vendor UI and may retain residual traces). Read-only file surfaces [e.g., free-tier Chat in Anthropic platforms, with the Be Civic Project installed] satisfy clauses 1 and 2 but FAIL clause 3 (the agent cannot write back to the files).

For v1, only T2 and T3 (Claude Desktop Cowork tab with ~/.be-civic/ as a connected folder, per C4 amendment §24.4 (see architecture.md)) provide qualifying consumer-side state. T0 and T1 are stateless from the customer's perspective and the harness MUST NOT promise cross-session memory at those tiers.

Paths-related state and the three-clause test. State derived from path traversal — for example, a customer's requires_paths-derived progress markers, or a record of which sources succeeded or failed for the customer — satisfies all three clauses of the test above: the customer can read the files with a text editor, delete them unilaterally, and the agent both reads and writes them in-session. Paths-related state lives in the existing memory/procedure_progress_<id>.md files (for progress within an active procedure that required path traversal) and in profile.json extensions (for persistent routing outcomes such as region). No new file types are required for paths state.

First-contact framing. When a customer shares a document with the harness at any capability tier, the harness MUST convey the following substance at the point of document intake (the exact phrasing may be adapted to the conversational context, but the substance MUST be preserved):

"If you share a document with me, I'll read it to find the parts that matter for your case — things like which commune issued it, what type of permit, and the months relevant to your timeline. The document file itself stays where you put it; I don't take a copy. What I do save into your profile is the categorical pieces I extracted (region, permit type, residence-period months), nothing more. Your profile lives on your computer and you can inspect or delete it whenever."

This wording avoids platform-specific disclosure (no reference to "cloud" or specific vendor infrastructure) in accordance with decision D26. Agent-platform privacy policies handle their own layer.

8.7.2 File layout (host filesystem available, T2+)

Concrete file layout depends on the harness. Under the Cowork plugin (V1+, per architecture.md §3 principle 13), <USER_DATA_DIR> resolves to a <user-picked-parent>/BeCivic/ folder under a user-picked parent path; the BeCivic root carries shared state (profile.json, MEMORY.md, privacy-attachment.md, .be-civic/ hidden subdirectory for system state) and per-procedure subfolders carry per-procedure state including a per-project CLAUDE.md. Detailed layout is documented in cowork-plugin.md §2.9.

Sibling harnesses (e.g. a future ChatGPT-app harness in chatgpt-app.md) have their own filesystem-or-not story; each harness spec is authoritative for its own on-disk shape. The universal privacy guarantees in §8.7.4 (profile schema) and §8.8 (memory cap rules) apply to whatever on-disk layout each harness adopts.

The legacy flat <USER_DATA_DIR>/ layout below is retained as the degraded-mode fallback for harnesses without a plugin (T0 paste-prompt sampler, etc.) and for reference. When the harness probe confirms that a host filesystem is writable but no plugin is active, all persistent customer state lives under <USER_DATA_DIR> (see §8.7.3 for path resolution; the all-platform fallback is ~/.be-civic/). The directory MUST NOT be created without the customer's explicit consent at the first session. Once created, the layout is:

<USER_DATA_DIR>/
├── profile.json                        # enum routing fields only; see §8.7.4
├── memory/
│   ├── MEMORY.md                       # index, one line per entry, 200-line / 8KB cap
│   ├── customer_context.md             # customer's self-reported civic situation (narrative)
│   ├── procedure_progress_<id>.md      # one file per active procedure; see §8.8
│   ├── decision_log_<topic>.md         # decisions the customer has taken during sessions
│   ├── document_reference_<id>.md      # extracted routing fields from customer documents; see §8.9
│   ├── path_history_<id>.md            # optional; one file per traversed path; see §8.7.2.2
│   └── archive/                        # completed procedures past the active window; see §8.8
├── skills-cache/                       # local cache of fetched skills; see §8.7.2.1 below
│   └── <skill-id>/
│       └── SKILL.md
├── sessions/
│   └── <session-id>/                   # per-session ephemeral state; deleted on session close
│       ├── facts.json                  # structured facts surfaced during this session
│       ├── dossier-draft.md            # working draft of any document the customer is assembling
│       └── observations-buffer.jsonl   # submission items buffered for this session
├── submissions.jsonl                   # cumulative receipt log of all submitted items
└── analytics-outbox.jsonl              # offline queue for analytics events; flushed at next session preamble

.gitignore note. Any harness writing to a project-scoped <USER_DATA_DIR> inside a git-tracked directory MUST add the directory name to the nearest .gitignore. This is a non-optional invariant: the harness MUST verify the ignore entry exists before writing the first file. On non-git systems the check is skipped.
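
The ignore-entry invariant could be checked with a pure helper like the one below. This is a sketch: ensureIgnored is a hypothetical name, file I/O is left to the caller, and the check recognises only literal entries for the directory, not glob patterns that might also cover it.

```typescript
// Returns .gitignore content that is guaranteed to list dirName,
// appending a line only when no literal entry for it exists yet.
function ensureIgnored(gitignore: string, dirName: string): string {
  const bare = dirName.replace(/\/+$/, "");
  const entry = bare + "/";
  const lines = gitignore.split(/\r?\n/).map((line) => line.trim());
  const present = lines.some(
    (line) => line === entry || line === bare || line === "/" + entry
  );
  if (present) return gitignore;
  const separator = gitignore === "" || gitignore.endsWith("\n") ? "" : "\n";
  return gitignore + separator + entry + "\n";
}
```

A harness would read the nearest .gitignore, pass its content through this function, write the result back if it changed, and only then write the first file under the data directory.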

8.7.2.1 skills-cache/

The skills-cache/ directory holds Be Civic skills the harness has fetched at runtime — typically via the §24.4.1 (see architecture.md) degradation chain fallback to web-fetch when MCP and HTTP API are both unreachable, or proactive caching of skills the customer is likely to need across sessions. Each cached skill lives at skills-cache/<skill-id>/SKILL.md carrying the canonical body and frontmatter exactly as published at becivic.be/skills/<skill-id>.

For a cached skill to be loaded by the consumer agent as an actual skill (Skill-tool routable, not just scratch markdown), it MUST be installed at the agent platform's skill-discovery path. That path is platform-specific: it differs between Claude Desktop on macOS, Claude Desktop on Windows, and Claude Code. The harness MAY use a platform-aware symlink from the platform's skill-discovery path to <USER_DATA_DIR>/skills-cache/<skill-id>/ so that updates to the cached copy are visible without re-installation. Platform-specific paths are documented in bc-docs/CLAUDE.md "Skill loading paths" and updated as Claude Desktop versions evolve.

Cache invalidation. Cached skills carry a cached_at timestamp in a sidecar file (skills-cache/<skill-id>/.cached-at). The harness MAY refresh the cache on a session-start basis or on detection of an observation rejecting a value the cached skill claims; the refresh source is becivic.be/skills/<skill-id> over HTTPS. The harness MUST NOT serve cached content older than 30 days without re-fetching at least once.
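
The 30-day rule can be expressed as a small staleness check. The sketch assumes the .cached-at sidecar holds an ISO 8601 timestamp (the text does not fix the format) and treats an unreadable sidecar as stale; mustRefetch is a hypothetical name.

```typescript
const MAX_CACHE_AGE_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// cachedAt: the timestamp read from skills-cache/<skill-id>/.cached-at.
// Returns true when the cached copy must be re-fetched before it is served.
function mustRefetch(cachedAt: string, now: Date): boolean {
  const cached = Date.parse(cachedAt);
  if (Number.isNaN(cached)) return true; // unreadable sidecar: treat as stale
  return now.getTime() - cached > MAX_CACHE_AGE_DAYS * MS_PER_DAY;
}
```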

Customer-side state qualification. The skills-cache/ directory satisfies the §8.7.1 three-clause customer-side state test: the customer can read each cached skill body with a text editor, can delete the directory unilaterally, and the agent both reads and writes it (read during procedure routing; write during cache refresh).

8.7.2.2 path_history_<id>.md

The memory/ directory MAY carry an optional path_history_<id>.md file for each path the customer has traversed, where <id> is the path's catalogue ID (for example path_history_certificat-residence-historique.md). Each file records:

Which sources were attempted, in order, and the outcome of each attempt (success, failed, skipped-ineligible, declined-by-customer).
The ISO 8601 date (YYYY-MM-DD) of the successful attempt, if one occurred.
Whether the customer retains the delivered file: yes, no, or unknown.

The file is plain markdown, written in the customer's preferred language, and is intended to be readable by the customer without assistance. Example frontmatter:

---
type: path_history
path_id: certificat-residence-historique
last_traversed: 2026-05-12
---

Path history files satisfy the §8.7.1 three-clause test. They MUST NOT record the document's content, its file name, or any field prohibited under §8.9.3. They are not required; if the harness does not write them, no behaviour is broken.

8.7.3 `<USER_DATA_DIR>` resolution

<USER_DATA_DIR> is resolved at harness initialisation, in platform-aware order. The harness uses the first option that applies:

macOS — ~/Library/Application Support/be-civic/ (the platform-conventional per-user data directory).
Windows — %LOCALAPPDATA%\be-civic\ (typically C:\Users\<user>\AppData\Local\be-civic\), the platform-conventional per-user data directory for non-roaming application state.
Linux / XDG-compliant — $XDG_DATA_HOME/be-civic/ when $XDG_DATA_HOME is set; otherwise ~/.local/share/be-civic/ when ~/.local/share/ exists and is writable.
Fallback (all platforms) — ~/.be-civic/ (POSIX-style home-directory dotfile, used when no platform-conventional path applies or is writable).

The resolution order MUST be applied uniformly by all conforming harness implementations. The harness logs the resolved path to <USER_DATA_DIR>/.location so that subsequent sessions on the same machine reuse the same directory even if the resolution rules would otherwise pick a different one. Cross-platform note: Claude Desktop is available on macOS and Windows as of the round-7 cutover; both must be supported for v1. Verification of the platform-specific Cowork connected-folder default is tracked as an open question (see cowork-plugin.md §4).
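
A sketch of the resolution order, written as a pure function so it can be exercised without touching the filesystem. The ResolveInput shape and the injected isWritable probe are hypothetical. Two readings are assumptions: any platform other than macOS or Windows is treated as Linux / XDG, and a set but unwritable $XDG_DATA_HOME goes straight to the ~/.be-civic/ fallback.

```typescript
interface ResolveInput {
  platform: string;                        // e.g. "darwin", "win32", "linux"
  home: string;                            // the user's home directory
  env: Record<string, string | undefined>; // environment variables
  isWritable: (dir: string) => boolean;    // injected probe: exists and is writable
  recordedLocation?: string;               // path from .location, if a prior session wrote one
}

function resolveUserDataDir(input: ResolveInput): string {
  const { platform, home, env, isWritable, recordedLocation } = input;
  // A previously recorded path wins, so sessions on one machine stay consistent.
  if (recordedLocation) return recordedLocation;

  let parent: string | undefined;
  let dir: string | undefined;
  if (platform === "darwin") {
    parent = `${home}/Library/Application Support/`;
    dir = `${parent}be-civic/`;
  } else if (platform === "win32") {
    parent = env["LOCALAPPDATA"];
    dir = parent ? `${parent}\\be-civic\\` : undefined;
  } else if (env["XDG_DATA_HOME"]) {
    parent = env["XDG_DATA_HOME"];
    dir = `${parent}/be-civic/`;
  } else {
    parent = `${home}/.local/share/`;
    dir = `${parent}be-civic/`;
  }
  if (parent && dir && isWritable(parent)) return dir;
  return `${home}/.be-civic/`; // all-platform fallback
}
```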

The resolved path is documented to the customer at session start in plain language: "I'll keep your notes at <path>. You can find them there if you want to inspect or back them up."

8.7.4 profile.json — shape and constraints

profile.json holds the routing fields that allow the harness to skip repeat questions across sessions. Every field is categorical or boolean. A field MUST NOT hold a value that is, or could be derived from, a real identifier. The complete set of fields for v1:

Field	Type	Description
`region`	enum	`Flanders` / `Wallonia` / `Brussels-Capital` / `German-speaking-community`
`commune_nis5`	string (5 digits)	NIS5 commune code only; no commune name, no address
`administration_language`	enum	`NL` / `FR` / `DE` — constrained to the commune's official languages. Filters by region per the form's pill-filter map (D26). When `region` is `not-in-belgium-yet` (D29), all three values are accepted and the form hint adapts.
`conversation_language`	string (free text, ≤32 chars)	The language the user wants the agent to communicate in. Free text (not enum) per D27 — any language the agent can speak works (English, French, Tagalog, Slovenian, etc.). Agent pre-fills detected language at runtime; user may override on the onboarding form. Renamed from / replaces the prior `other_languages` list. The legacy `other_languages[0]` shape is migrated to `conversation_language` on first write under V1 schema.
`civic_status`	enum	`single` / `cohabitant-legal` / `married` / `divorced` / `widowed`
`nationality_status`	enum	`BE` / `EU` / `non-EU` / `multiple`
`residency_status`	enum	`registered` / `registering` / `EU-citizen` / `non-EU-permit` / `asylum` / `undocumented`
`residency_history`	list of objects	Each object: `{start, end, visa_type, permit_type, country_of_last_residence}` — periods only, no document numbers. Dates are `YYYY-MM` strings (month-bucket precision); see "Date precision" below.
`dependents`	object	`{minor_children_count, adult_dependents_count, spouse_abroad: bool}` — counts and booleans only
`employment_history`	list of objects	Each object: `{start, end, type, days_per_week, total_days_estimate}` where `type ∈ {FT, PT, self-employed, student, unemployed, retired}` — no employer names, no ONSS numbers. Dates are `YYYY-MM`.
`education_history`	list of objects	Each object: `{start, end, level, country_of_institution}` — no institution names, no diploma numbers. Dates are `YYYY-MM`.
`document_inventory`	object of mixed types	`has_id_card` (enum: `yes` / `not-yet-waiting` / `no` / `not-sure`), plus booleans `has_residence_card / has_work_permit / has_NN / has_passport_BE / has_passport_other`, plus `validity_end_<doc>` as `YYYY-MM` for each document the customer holds. No document numbers, no copies, no exact expiry day. See `has_id_card` row below for the rename and rationale.
`has_id_card` (inside `document_inventory`)	enum	`yes` / `not-yet-waiting` / `no` / `not-sure` (D22, D23). Renamed from `has_eID` — the prior eID-vs-residence-card distinction is dropped because all Belgian-issued chip cards (eID and residence card) are functionally equivalent for itsme/identity purposes; the agent disambiguates card-type-specific path-source eligibility at path-traversal time, not at onboarding (D52).
`browser_driving_preference`	enum	`drive-by-default` / `ask-each-time` / `never-drive` — honoured at path-traversal time per architecture.md §24.9 (Chrome MCP handoff vs AUQ vs markdown-link). New field per D8.
`consent`	object (typed namespace)	Extensibility hook for consent metadata. The schema declares the namespace but specific keys are operational and vary by phase. Alpha-phase keys (e.g. `alpha_bundle`, `signed_at`, `version`) are documented in `cowork-plugin.md §3.8`. Post-alpha keys for granular per-stream opt-out will be documented when that posture lands. Consumers MUST tolerate unknown `consent.*` keys; the namespace is intentionally permissive.
`active_procedures`	list of skill IDs	Procedure-skill IDs currently in flight; cross-references into `memory/procedure_progress_*.md`. The list contains ALL ongoing procedures, not just the currently-focused one; the harness holds state for each in memory simultaneously and routes by customer cue.
`transitions_in_progress`	list of enum values	`marriage-planned / divorce / address-change` and equivalents

has_id_card migration. Existing profile.json files carrying document_inventory.has_eID (boolean) are migrated on first read under V1 schema: has_eID: true → has_id_card: "yes"; has_eID: false → has_id_card: "no". There is no path to "not-yet-waiting" or "not-sure" from legacy data; those values originate only from V1+ onboarding forms.

other_languages → conversation_language migration. The legacy other_languages ordered list is superseded by the free-text conversation_language field. On first write under V1 schema, other_languages[0] (the prior harness communications language slot) is migrated to conversation_language; the remaining entries are dropped (they were not load-bearing under any v1 routing decision). Agents MUST tolerate legacy other_languages on read but MUST NOT write that field under V1.
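
The two migrations above can be sketched as one transformation over an in-memory profile object. The function name is hypothetical; the sketch leaves the timing (first read versus first write) to the caller and assumes an existing V1 value is never overwritten by legacy data.

```typescript
type Json = Record<string, unknown>;

// Applies the has_eID and other_languages migrations to a copy of the profile.
function migrateLegacyProfile(profile: Json): Json {
  const out: Json = { ...profile };

  // has_eID (boolean) -> has_id_card ("yes" | "no"); legacy data cannot
  // produce "not-yet-waiting" or "not-sure".
  const inventory = out["document_inventory"];
  if (inventory && typeof inventory === "object" && "has_eID" in (inventory as Json)) {
    const { has_eID, ...rest } = inventory as Json;
    if (!("has_id_card" in rest) && typeof has_eID === "boolean") {
      rest["has_id_card"] = has_eID ? "yes" : "no";
    }
    out["document_inventory"] = rest;
  }

  // other_languages[0] -> conversation_language; remaining entries are dropped.
  const legacy = out["other_languages"];
  if (
    Array.isArray(legacy) &&
    out["conversation_language"] === undefined &&
    typeof legacy[0] === "string"
  ) {
    out["conversation_language"] = legacy[0];
  }
  delete out["other_languages"]; // never written under V1
  return out;
}
```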

Date precision. Every date field in profile.json is encoded as a YYYY-MM string (month-bucket precision). Day-level precision is not stored for any field. This applies to all of residency_history, employment_history, education_history, and every validity_end_<doc> field in document_inventory. The constraint exists for two reasons: (1) month-bucket precision is sufficient for every routing decision the harness makes at v1; (2) day-level precision narrows the de-anonymisation surface materially when combined with other fields (commune, employer-type-by-period, residence-permit-type-by-period). The harness MUST round customer-provided exact dates to month-bucket form before writing to profile.json. The harness MAY hold day-level precision in <USER_DATA_DIR>/sessions/<session-id>/facts.json for the duration of an active session (where it is needed for deadline reminders), but MUST NOT carry day-level precision into persistent state.

This rule REVERSES the 2026-05-11 operator override that permitted exact expiry dates. The v1 posture is intentionally tighter than the longer-term posture: as the customer-side state contract matures and additional safeguards land (encrypted-at-rest options, additional scrub layers, in-document tagging), v1.1+ MAY relax precision for specific fields where a customer-precision use case is demonstrated. The v1 default is YYYY-MM uniformly.
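
For profile.json writes, rounding to month-bucket form amounts to dropping the day component. A minimal sketch, assuming the customer-provided date has already been normalised to ISO 8601 (YYYY-MM-DD or YYYY-MM); toMonthBucket is a hypothetical name.

```typescript
// Converts "YYYY-MM-DD" (or an already-bucketed "YYYY-MM") to "YYYY-MM".
// Throws on anything else so malformed dates never reach profile.json.
function toMonthBucket(date: string): string {
  const match = /^(\d{4})-(0[1-9]|1[0-2])(?:-(?:0[1-9]|[12]\d|3[01]))?$/.exec(date);
  if (!match) throw new Error(`not an ISO 8601 date or month: ${date}`);
  return `${match[1]}-${match[2]}`;
}
```

For example, toMonthBucket("2028-06-15") yields "2028-06", which is the form every validity_end_<doc> and history date takes in profile.json.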

The design decisions record (2026-05-11, Cluster 7) identifies 14 named fields above. Two additional structural positions complete the 16-axis catalogue: profile_schema_version (string, schema version sentinel, written on first create and on schema upgrade) and last_updated_at (ISO 8601 timestamp, written on every write, for staleness detection). These two metadata fields are not routing axes and carry no identifying information; they are non-optional on every conforming profile.json.

What MUST NOT appear in profile.json:

Any national identifier or derivative (NISS, NN, eID chip data, social security number, foreign tax ID)
Any document number (passport number, residence card number, work permit number)
Any name (given name, family name, alias)
Any date of birth, place of birth, or biometric data
Any full postal address (commune and region category are the finest granularity permitted)
Any photograph or image reference
Any narrative field (narrative content lives in memory/customer_context.md and related files)

The constraint is structural: any proposed v2 field that COULD hold identity in any realistic population of inputs MUST be rejected at the schema layer, not by policy alone.

profile.json MUST be valid against bc-docs/schemas/profile.schema.json on every write. The schema enforces the field-level constraints listed above. A harness that writes to profile.json without validating against the schema is non-conformant.
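
The schema file is authoritative; the sketch below only illustrates the kind of field-level check it implies, using three constraints stated in this section. The function name is hypothetical, routing fields are assumed optional, and the 32-character cap is measured in UTF-16 code units, which the text does not specify.

```typescript
// Returns a list of violations; an empty list means these basic checks pass.
function checkProfileBasics(profile: Record<string, unknown>): string[] {
  const problems: string[] = [];

  const nis5 = profile["commune_nis5"];
  if (nis5 !== undefined && !(typeof nis5 === "string" && /^\d{5}$/.test(nis5))) {
    problems.push("commune_nis5 must be exactly 5 digits");
  }

  const language = profile["conversation_language"];
  if (language !== undefined && !(typeof language === "string" && language.length <= 32)) {
    problems.push("conversation_language must be a string of at most 32 characters");
  }

  // The two metadata fields are required on every conforming profile.json.
  for (const key of ["profile_schema_version", "last_updated_at"]) {
    if (typeof profile[key] !== "string") problems.push(`${key} is required`);
  }
  return problems;
}
```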

8.7.5 memory/ shape

MEMORY.md is the index: one line per memory entry, at most 200 lines, at most 8KB. On T2/T3 (Cowork tab in Claude Desktop), MEMORY.md is read at session start via explicit skill instructions. On T4 (Claude Code, or any environment that supports skill-frontmatter hooks), MEMORY.md is injected into context via the UserPromptSubmit hook on every turn. Cowork tab hook support is an open question (see §18); if confirmed, Cowork at T3 can be upgraded to T4 without re-architecting memory/ shape.
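
A harness could guard the MEMORY.md cap before each write with a check like the following. The names are hypothetical, and reading "8KB" as 8192 bytes of UTF-8 is an assumption.

```typescript
const MAX_INDEX_LINES = 200;
const MAX_INDEX_BYTES = 8 * 1024; // assumption: "8KB" read as 8192 bytes

// Counts UTF-8 bytes without relying on platform encoders.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) { bytes += 4; i++; } // surrogate pair
    else bytes += 3;
  }
  return bytes;
}

// True when the index content respects both the line cap and the byte cap.
function withinMemoryIndexCap(content: string): boolean {
  const lineCount = content === "" ? 0 : content.replace(/\n$/, "").split("\n").length;
  return lineCount <= MAX_INDEX_LINES && utf8ByteLength(content) <= MAX_INDEX_BYTES;
}
```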

Per-topic files carry YAML frontmatter with at minimum name, description, and type. Permitted type values:

customer_context — customer's self-reported situation and background; free narrative
procedure_progress — current state of a specific active procedure (step, outstanding documents, next action); one file per procedure_progress_<id>.md
decision_log — decisions the customer has made with Be Civic's assistance (for example, "chose path B for language exam waiver")
document_reference — routing fields extracted from a customer-supplied document; see §8.9 for content constraints

No type outside this list is valid in v1. New types require a Tier B amendment.

8.7.6 sessions/ directory

The sessions/<session-id>/ directory is ephemeral: it is created at session open and MUST be deleted at session close after the submission buffer has been flushed (or on session_outcome: abandoned_inferred submission for orphaned sessions, per §8.8.3). No session directory persists across session boundaries. This is non-negotiable: session state is never accumulated across sessions; only the extracted routing fields and narrative summaries in profile.json and memory/ carry forward.

8.7.7 submissions.jsonl and analytics-outbox.jsonl

submissions.jsonl is an append-only receipt log. Each line is a JSON object recording a submitted item: {submitted_at, session_id, type, id, cancel_token, commit_eta, status}. The type field carries one of the 2026-05-15 feedback type values (concern | amendment | validation | draft | feedback | rating); the id field is the matching <type>_id (concern_id / amendment_id / validation_id / draft_id / feedback_id / rating_id) — renamed from the pre-amendment observation_id / skill_amendment_id / skill_draft_id shape. The session_id field is retained (S61 reversal — the cluster-2 amendment had proposed dropping session_id from this log in favour of a recovery_token; the reversal restores the original shape). The harness appends a line on every successful mode: "stage" response. The customer can read this file to review and cancel pending submissions. The file is the customer's own record; Be Civic does not hold a copy.
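
One line of the receipt log might be produced and read back as sketched here. The field names come from the text; the value types (all strings) and the helper names are assumptions.

```typescript
type FeedbackKind = "concern" | "amendment" | "validation" | "draft" | "feedback" | "rating";

interface SubmissionReceipt {
  submitted_at: string;
  session_id: string;
  type: FeedbackKind;
  id: string;           // the matching <type>_id
  cancel_token: string;
  commit_eta: string;
  status: string;
}

// Serialises one receipt as a single JSONL line with a stable key order.
function receiptLine(r: SubmissionReceipt): string {
  return JSON.stringify({
    submitted_at: r.submitted_at,
    session_id: r.session_id,
    type: r.type,
    id: r.id,
    cancel_token: r.cancel_token,
    commit_eta: r.commit_eta,
    status: r.status,
  }) + "\n";
}

// Parses the whole log, skipping blank lines.
function parseReceipts(fileContent: string): SubmissionReceipt[] {
  return fileContent
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as SubmissionReceipt);
}
```

Because the log is append-only, the harness would append receiptLine(...) to the file and never rewrite earlier lines.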

analytics-outbox.jsonl is an offline queue for analytics events that could not be submitted during a session (network unavailable, scrub-rules fetch failed). Each line is an analytics event in the shape defined by POST /api/analytics. The harness MUST attempt to flush the outbox at the next session preamble before generating new events for that session. Flushing is a deterministic code path: no LLM involvement. Events in the outbox are discarded after 30 days without successful submission.

8.8 Retention and deletion semantics

8.8.1 Active procedure files

procedure_progress_<id>.md MUST be retained as long as the procedure is active (that is, <id> appears in profile.json active_procedures), or for 90 days after the file was last written, whichever is shorter. "Last written" is the file's mtime; the harness MUST NOT backdate mtimes.

When a procedure completes (the customer reaches a confirmed terminal step) or when the 90-day inactivity window expires, the harness MUST move the file to memory/archive/<id>.md and remove the procedure's ID from active_procedures in profile.json.
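
The retain-or-archive decision for an active procedure file reduces to two inputs: whether the procedure is still in active_procedures, and how long ago the file was last written. A sketch under that reading, with hypothetical names:

```typescript
const INACTIVITY_WINDOW_DAYS = 90;
const DAY_IN_MS = 24 * 60 * 60 * 1000;

// isActive: <id> appears in profile.json active_procedures and no terminal step is confirmed.
// lastWritten: the file's mtime.
function procedureFileAction(isActive: boolean, lastWritten: Date, now: Date): "retain" | "archive" {
  if (!isActive) return "archive";
  const idleMs = now.getTime() - lastWritten.getTime();
  return idleMs > INACTIVITY_WINDOW_DAYS * DAY_IN_MS ? "archive" : "retain";
}
```

On "archive", the harness moves the file to memory/archive/<id>.md and removes the ID from active_procedures, as described above.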

8.8.2 Archived procedure files

Files in memory/archive/ are retained for one year from their archive date (recorded in the file's frontmatter as archived_at). After one year, the harness SHOULD delete them. The harness MUST surface a deletion warning to the customer at session start if any archived file is within 30 days of its one-year mark, so the customer can export the content before it is removed.
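
The archive lifecycle can be sketched as a three-way decision on the archived_at frontmatter value. Treating "one year" as 365 days and archived_at as an ISO 8601 timestamp are assumptions; archivedFileAction is a hypothetical name.

```typescript
const ARCHIVE_RETENTION_DAYS = 365; // assumption: "one year" read as 365 days
const WARNING_WINDOW_DAYS = 30;
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// "warn" means: surface the deletion warning at session start so the customer can export.
function archivedFileAction(archivedAt: string, now: Date): "retain" | "warn" | "delete" {
  const archived = Date.parse(archivedAt);
  if (Number.isNaN(archived)) throw new Error(`unreadable archived_at: ${archivedAt}`);
  const ageDays = (now.getTime() - archived) / ONE_DAY_MS;
  if (ageDays >= ARCHIVE_RETENTION_DAYS) return "delete";
  if (ageDays >= ARCHIVE_RETENTION_DAYS - WARNING_WINDOW_DAYS) return "warn";
  return "retain";
}
```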

8.8.3 Session buffers

sessions/<session-id>/ is deleted on session close after the submission buffer has been flushed to POST /api/feedback (or after the customer has explicitly declined submission). An orphaned session directory (no session-close event received, directory age greater than 72 hours) is cleaned up by the harness at the next session preamble. Before deleting an orphaned session directory, the harness MUST submit session_outcome: abandoned_inferred to POST /api/analytics if analytics opt-in is active.
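
The orphan rule orders two steps: the analytics submission (when opted in) comes before the delete. A sketch of that ordering as a plan the preamble could execute; the step labels and function name are hypothetical.

```typescript
type CleanupStep = "submit_abandoned_inferred" | "delete_directory";

const ORPHAN_AGE_HOURS = 72;

// Returns the ordered steps for one session directory found at session preamble.
// An empty plan means the directory is not (yet) considered orphaned.
function orphanCleanupPlan(
  dirAgeHours: number,
  sawSessionClose: boolean,
  analyticsOptIn: boolean
): CleanupStep[] {
  const orphaned = !sawSessionClose && dirAgeHours > ORPHAN_AGE_HOURS;
  if (!orphaned) return [];
  return analyticsOptIn
    ? ["submit_abandoned_inferred", "delete_directory"]
    : ["delete_directory"];
}
```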

8.8.4 Customer-initiated deletion

A customer MAY delete ~/.be-civic/ (or the equivalent resolved <USER_DATA_DIR>) at any time, by any means, with no Be Civic consequence. The next session is treated as first_contact. The harness MUST NOT prevent, warn against, or create friction around customer-initiated deletion. The harness MUST NOT attempt to re-create deleted files from server-side state (because no server-side state exists).

There is no account to deactivate, no server-side deletion request to file, and no right-to-erasure workflow needed for the local state: deletion is the customer's unilateral act.

8.9 Document-content-discard rule

8.9.1 Scope

This section applies when a customer provides Be Civic with a copy, scan, photograph, or paste of a personal document — including but not limited to: national identity cards (eID), foreign national identity documents, passports, residence permit cards (A through M), work permits, diplomas, and official correspondence.

8.9.2 What the harness MUST extract and retain

From a customer-supplied document, the harness MUST extract only the routing fields needed to determine procedure eligibility or next-step routing. Examples:

From a residence permit card: permit_type (for example "F card"), validity_end (ISO 8601 date, for example "2028-06-15"), validity_start (ISO 8601 date, optional)
From a passport: issuing_country, validity_end (ISO 8601 date)
From official correspondence: issuing_authority_category (commune / CGVS / OCMW / other), subject_category (invitation / refusal / decision), deadline_date (ISO 8601 date) where the correspondence carries a deadline

Validity dates and other non-identifying dates printed on a document MAY be retained as exact dates. They are not identity-derivative: knowing a residence permit expires on a given date does not narrow the holder to a small population. Retaining the exact date enables the harness to provide proactive renewal warnings (for example, "your F card expires in 76 days; here are the renewal steps").

Extracted routing fields are written to memory/document_reference_<id>.md with provenance metadata: {source_category, extraction_date, fields_extracted: [list of field names]}.

8.9.3 What the harness MUST NOT retain

The following MUST NOT appear in profile.json, any memory/ file, sessions/, or any other persisted location:

The document number, card number, passport number, or any other identifier printed on the document
The customer's full name, given name, or family name as it appears on the document
The customer's date of birth or place of birth (dates of birth are identity-derivative; combined with commune, they narrow to a small population)
Any photograph or biometric data
The full address as printed on the document (commune and region category are permitted)
Any text block from the document beyond the specific fields enumerated in §8.9.2

Document validity dates (issue date, expiry date) and other non-identifying temporal fields MAY be retained per §8.9.2. Date of birth is the one date type that is identity-derivative and remains prohibited.

8.9.4 Scrub verification

The Layer 1 consumer-side scrub (regex plus LLM contextual pass) MUST run against the memory/document_reference_<id>.md content on every write, not only on submission buffer writes. The scrub rules fetched at session start apply. If any scrub rule fires on a document reference file, the harness MUST abort the write, log a structured warning to sessions/<id>/scrub-warnings.jsonl (not to the submission buffer), and prompt the customer to confirm that the field should be omitted.
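The abort-and-log behaviour can be sketched as a guarded write. This is a minimal sketch covering only the regex half of the Layer 1 scrub; the LLM contextual pass is out of scope here. The two patterns in EXAMPLE_RULES are illustrative stand-ins for the scrub rules fetched at session start, and the function names and warning record shape are assumptions.

```python
import json
import re
from pathlib import Path

# Illustrative patterns only; the real rules come from the scrub rules file.
EXAMPLE_RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "national_number": re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b"),
}


def scrub_hits(content: str, rules: dict) -> list:
    """Names of the rules that fire on `content`."""
    return [name for name, pattern in rules.items() if pattern.search(content)]


def write_document_reference(
    target: Path, content: str, warnings_log: Path, rules: dict
) -> bool:
    """Write `content` unless a scrub rule fires; return True if written."""
    hits = scrub_hits(content, rules)
    if hits:
        # Log rule names only, never the matched text, so the warning
        # log cannot itself become a store of PII.
        warnings_log.parent.mkdir(parents=True, exist_ok=True)
        record = {"event": "scrub_rule_fired", "file": target.name, "rules": hits}
        with warnings_log.open("a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
        return False  # write aborted; the caller prompts the customer
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return True
```

A caller would pass sessions/<id>/scrub-warnings.jsonl as warnings_log and, on a False return, ask the customer to confirm that the offending field should be omitted before retrying.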

8.9.5 Original content

The original document content (the customer's paste, the OCR output, the image data) MUST NOT be written to any file in ~/.be-civic/. It exists only in the active session context and is discarded when the session ends. The harness MUST NOT store the original content in sessions/<id>/dossier-draft.md or any other session file: dossier-draft is for documents the customer is assembling for submission to authorities, not for copies of documents already held by the customer.

8.9.6 Document delivery via paths

When a path delivers a document to the customer (for example, a Brussels Tier-1 quickLink generates a PDF, or the customer downloads a residence certificate via a federal portal), the document file is stored where the customer puts it (their connected folder, their Downloads directory, or wherever they choose to save it). The harness MUST NOT relocate or copy the file. The discard rule in §8.9.3 applies to what the harness extracts from the document into customer-side state: only the categorical routing fields named by the procedure skill's inputs or the path's outputs schema are written to profile.json or memory/<id>.md. Nothing else from the document enters customer-side state, regardless of whether the harness "saw" the document content in its conversation window during the delivery session.

8.10 Anonymous-by-construction — structural reinforcement

8.10.1 No identifier derivatives

The following are unconditionally prohibited in any Be Civic system, including consumer-side local state:

The NISS / national registration number, or any hash, truncation, or transformation of it
Any email address hash
Any device fingerprint or hardware identifier
Any purpose-generated derivative of a real identifier (including partial NISS, date-of-birth-derived token, or document-number-derived token)

If a session-level correlation token is needed (for example, for linking submission buffer entries), it MUST be a randomly generated UUIDv7 with no relationship to any real identifier. session_id (ses_<UUIDv7>) satisfies this requirement. It MUST NOT be seeded from or mixed with any customer attribute.
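A session_id of this shape can be generated without touching any customer attribute. The sketch below assembles a UUIDv7 by hand following the RFC 9562 bit layout (48-bit millisecond timestamp, version nibble, 12 random bits, variant bits, 62 random bits), since a built-in generator is only present in recent Python releases. The helper names are assumptions for illustration.

```python
import secrets
import time
import uuid


def new_uuid7() -> uuid.UUID:
    """Build a UUIDv7 from the current time and OS-level randomness."""
    timestamp_ms = time.time_ns() // 1_000_000
    value = (timestamp_ms & ((1 << 48) - 1)) << 80  # unix_ts_ms, 48 bits
    value |= 0x7 << 76                              # version 7
    value |= secrets.randbits(12) << 64             # rand_a, 12 bits
    value |= 0b10 << 62                             # RFC variant bits
    value |= secrets.randbits(62)                   # rand_b, 62 bits
    return uuid.UUID(int=value)


def new_session_id() -> str:
    """Return a correlation token of the form ses_<UUIDv7>."""
    return f"ses_{new_uuid7()}"


print(new_session_id())
```

The only non-random input is the creation time. No profile value, device property, or document field is mixed into the token, which is the property the rule above requires.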

(Per the 2026-05-15 S61 reversal, session_id is the recovery key end-to-end; the recovery_token concept proposed in the 2026-05-11 Cluster 2 amendment is dropped from the spec. The Worker echoes the agent-provided session_id back in the response body alongside concern_id / amendment_id / etc. and cancel_token; the recovery endpoint is GET /api/feedback/sessions/<session_id>. D1's validations.session_id column persists the agent-provided value; the prior-proposed recovery_token column on concerns was never created and is permanently dropped from the migration sequence.)
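Since session_id doubles as the recovery key, a client may want to validate its shape before building the recovery request path. The following is a hedged sketch: it assumes the UUIDv7 is rendered in the usual lowercase hyphenated form, which this section does not state explicitly, and recovery_path is a hypothetical helper name. Only the path is built; the host is deliberately left out.

```python
import re

# Assumed textual form: ses_ followed by a lowercase, hyphenated UUIDv7.
SESSION_ID = re.compile(
    r"ses_[0-9a-f]{8}-[0-9a-f]{4}-7[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}"
)


def recovery_path(session_id: str) -> str:
    """Return the recovery endpoint path for a well-formed session_id."""
    if not SESSION_ID.fullmatch(session_id):
        raise ValueError("session_id must have the form ses_<UUIDv7>")
    return f"/api/feedback/sessions/{session_id}"


print(recovery_path("ses_018f3e5c-7b2a-7c3d-9e4f-0123456789ab"))
```

Rejecting malformed values before they reach the URL keeps arbitrary strings out of the request path.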

8.10.2 Categorical fields are a structural constraint, not a policy choice

The requirement that profile.json fields be categorical or boolean (§8.7.4) is not a policy choice made for compliance reasons. It is a structural constraint that ensures the profile cannot re-identify the customer even if the file is read by a third party. Any proposed v2 field that would require a precise numeric value, a name, or a date string MUST be redesigned as a categorical field before the proposal advances to Tier B amendment review.
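The constraint lends itself to a mechanical check. In the sketch below, the field names and category sets are invented for illustration (the actual catalogue lives in §8.7.4), and the check treats anything other than a boolean or an enumerated string as a violation, which may be stricter or looser than the real rules.

```python
# Hypothetical category catalogue; the real one is defined in §8.7.4.
EXAMPLE_CATEGORIES = {
    "residence_status_category": frozenset({"eu_citizen", "non_eu_resident"}),
    "region_category": frozenset({"brussels", "flanders", "wallonia"}),
}


def non_categorical_fields(profile: dict, categories: dict) -> list:
    """Names of profile fields that are neither boolean nor an allowed category."""
    problems = []
    for name, value in profile.items():
        if isinstance(value, bool):
            continue
        if isinstance(value, str) and value in categories.get(name, frozenset()):
            continue
        problems.append(name)
    return sorted(problems)


profile = {
    "residence_status_category": "eu_citizen",
    "has_children": True,
    "household_income": 41250,        # precise numeric value: rejected
    "arrival_date": "2021-09-01",     # date string: rejected
}
print(non_categorical_fields(profile, EXAMPLE_CATEGORIES))
```

A proposed field such as household_income would need to be redesigned, for example as an income band category, before it could pass a check of this kind.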

8.10.3 Vendor memory degrades gracefully

On T0 and T1 (no host filesystem), Be Civic MAY degrade to in-memory session state and rely on the customer's AI vendor account memory (for example Project Memory in Anthropic platforms, ChatGPT memory, or an equivalent vendor key-value store) for cross-session persistence. The harness MUST NOT write anything to vendor memory that would violate the field-level constraints of §8.7.4. The vendor memory path is a capability degradation, not a separate data regime with lower privacy standards.

8.10.4 Paths are anonymous-by-construction

The Path Directory NEVER carries customer-identifying state. A path entry is the same catalogue object for every customer; it describes a route to a document or tool, not anything about any individual who has used it. Per-customer eligibility evaluation happens at traversal time, in the harness, against the customer's local profile.json — not server-side. No customer attribute is transmitted to the catalogue server as part of path resolution. The path_history_<id>.md files (§8.7.2.2) are local-only; they are not submitted to any Be Civic endpoint and are not part of the submission protocol. A customer who has traversed fifty paths has left no identifying trace on the Path Directory beyond aggregated, salted validation submissions (§9.5 (see lifecycle.md)) that are subject to the same per-artefact-salted IP-hash anonymisation as skill validations.
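Traversal-time evaluation can be pictured as a pure local function over two inputs: the shared catalogue and the customer's profile. The sketch below is an assumption-laden illustration; the "requires" key and the entry shape are hypothetical and do not reproduce the Path Directory schema in §6.12. The point is only that the function takes no network client and sends nothing anywhere.

```python
def eligible_paths(catalogue: list, profile: dict) -> list:
    """Return the uids of path entries whose requirements the local profile meets."""
    matches = []
    for entry in catalogue:
        requirements = entry.get("requires", {})  # hypothetical key
        if all(
            profile.get(field) in allowed
            for field, allowed in requirements.items()
        ):
            matches.append(entry["uid"])
    return matches


# The same catalogue objects are served to every customer.
catalogue = [
    {"uid": "path-a", "requires": {"region_category": ["brussels"]}},
    {"uid": "path-b", "requires": {"has_children": [True]}},
    {"uid": "path-c"},
]
local_profile = {"region_category": "brussels", "has_children": False}
print(eligible_paths(catalogue, local_profile))
```

Because the catalogue entries are identical for everyone and the profile never leaves the function, the server learns nothing about which entries matched.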

Cross-references

Cross-doc references are inlined throughout this document in the form §X.Y (see <file>.md). The list below was the pre-reconciliation manifest from the 2026-05-11 split, retained for audit; it can be deleted at the next split-or-merge cycle.

§3 (Non-negotiable principles, including principle 11 customer-side state) — see architecture.md §3
§6.2 (Submission schemas, identity-field bans, free-text caps) — see schemas.md §6.2
§6.7 (Agent capability requirements per submission type) — see schemas.md §6.7
§6.8 (Scrub rules file) — see schemas.md §6.8
§6.11 (Catalogue UID convention, PR-CI uid assignment) — see schemas.md §6.11
§6.12 (Path Directory schema) — see schemas.md §6.12
§7 (Trust model / maintainer-review queue) — see protocol.md §7
§9 (State-machine promotion) — see lifecycle.md §9
§9.2 (Promotion thresholds) — see lifecycle.md §9.2
§9.5 (Path and path-source lifecycle) — see lifecycle.md §9.5
§11.1 (Source rot) — see lifecycle.md §11.1
§13.1 (Agent interface manifest page) — see architecture.md §13.1
§15.7 (Harness consumer obligations) — see skills.md §15.7
§15.8 (Conversation invariants — plain-language obligations) — see skills.md §15.8
§18 (Open questions, Cowork hook support) — see architecture.md §18
§20 (Website rendering / renderer Worker) — see website.md §20
§23 (MCP server) — see protocol.md §23
§24.4 (Capability tiers) — see architecture.md §24.4
§24.5 (Three-tier returning-user adaptation) — see architecture.md §24.5
