Be Civic — Internal Build-Tools Specification

Canonical system specifications for the Be Civic project.

This sub-spec covers internal operations tooling used by the operator to author and maintain the Be Civic corpus. It is distinct from the product spec (the other sub-spec docs), which describes the protocol, schemas, lifecycle, and runtime behaviour that customers' agents see.

Internal build-tool content lives here because spec changes still need to be coordinated with build-tool changes. When the product spec moves, the build tools must move with it. This file is the bridge: a single place to read what the build tools produce, what artefact shapes they emit, and which product-spec sections they support.

The canonical authoring references for bc-corpus-creator live at bc-skills/bc-corpus-creator/references/ (research-report shape, evals shape, voice and style, canonical shape, rubric, schema descriptors). This file documents the artefact shapes those references produce and the product-corpus integration points (what files ship in bc-docs/skills/<id>/, how the renderer treats them, how PR-CI gates them).

1. Tools in scope

| Tool | Role | Source location | Canonical references |
| --- | --- | --- | --- |
| bc-corpus-creator | Operator-driven walks of corpus skills end-to-end; produces canonical.md, research-report.md, and optionally evals.json | bc-skills/bc-corpus-creator/ | bc-skills/bc-corpus-creator/references/ |

Future build tools (catalogue authoring, path-directory enrichment, harness-build helpers, etc.) are added to this table as they emerge.

2. Shipped artefacts per skill

Each skill in bc-docs/skills/<id>/ ships three artefacts:

  • canonical.md — the customer-facing skill body. Product spec: schemas.md §6.1.
  • research-report.md — the durable evidence record produced by the walking-procedure. Spec: §3 of this document.
  • evals.json — the test prompts and per-prompt expectations driving the opt-in benchmark mode. Spec: §4 of this document.

The skill-PR classifier path filter at ^skills/[^/]+/(canonical|research-report|evals)\.(md|json)$ (lifecycle.md §10.1) covers all three. The renderer surfaces canonical.md to customer agents; research-report.md and evals.json are operator-internal artefacts but ship with the corpus for audit and re-rendering purposes.
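
To make the filter's reach concrete, the following is a minimal sketch that applies the pattern quoted above to a few candidate paths. The function name is_skill_artefact and the sample paths are illustrative assumptions, as is the idea that paths are given relative to the bc-docs root; the real classifier may be implemented differently.

```python
import re

# Pattern copied verbatim from the text above.
# Assumption: paths are relative to the bc-docs repository root.
SKILL_PR_FILTER = re.compile(
    r"^skills/[^/]+/(canonical|research-report|evals)\.(md|json)$"
)


def is_skill_artefact(path: str) -> bool:
    """Return True when the path is matched by the skill-PR path filter."""
    return SKILL_PR_FILTER.match(path) is not None


# Hypothetical paths, chosen only to exercise the pattern.
for candidate in [
    "skills/example-skill/canonical.md",
    "skills/example-skill/research-report.md",
    "skills/example-skill/evals.json",
    "skills/example-skill/notes.md",          # not one of the three artefact names
    "skills/example-skill/sub/canonical.md",  # nested one directory too deep
]:
    print(candidate, is_skill_artefact(candidate))
```

Because the name and extension alternations are independent, the pattern as written also matches combinations such as skills/<id>/canonical.json or skills/<id>/evals.md.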

3. research-report.md schema

Every walked skill ships a research-report.md alongside its canonical.md at bc-docs/skills/<id>/research-report.md. The research-report is the durable evidence record produced by the walking-procedure (skills.md §15.1). It carries the source base, structural decisions, catalogue-extraction proposals, and a §11 failure-modes catalog that the canonical body never fully surfaces. The translator reads this file to render canonical.md; the migrator re-reads it on schema-collapse to re-render without re-walking. The artefact is shipped with the corpus and covered by the skill-PR classifier path filter (lifecycle.md §10.1) at ^skills/[^/]+/(canonical|research-report|evals)\.(md|json)$.

Frontmatter. YAML, all fields present unless marked optional:

---
title: Research report for <skill-id>
type: research-report
status: draft | complete
walked: YYYY-MM-DD                         # walk date
walker: bc-corpus-creator v<semver>        # tooling identifier
canonical_version: <semver>                # mirrors `version` on canonical.md
                                           # (same artifact; same versioning scheme — §6.1 cohort reset
                                           # applies). A research-report backs exactly one canonical
                                           # version at a time; canonical-version bumps require either
                                           # translator re-render or research-report update + retranslate.
spec_round: round-<N>                      # spec generation this report was structured against
                                           # (distinct from the skill `schema_version: 3` integer
                                           # field — different namespace, different semantics).
                                           # Resolves against
                                           # references/schema-descriptors/<spec_round>.md
research_complete: true | false            # true ⇒ all §3–§8 sources gathered to researcher
                                           # self-assessment; false ⇒ reverse-translated draft or
                                           # incomplete walk. Migrate refuses to proceed unless true.
last_updated: YYYY-MM-DD                   # bumped on any post-walk research-update; canonical_version
                                           # may or may not bump in parallel (research-report can carry
                                           # findings not yet promoted into canonical)
sources_count: <integer>                   # total §3 rows
sources_by_usage:                          # rolls up the `usage` column on §3
  citation: <integer>
  corroboration: <integer>
  failure_mode_context: <integer>
  comparative: <integer>
volatile_values_proposed: <integer>        # §4 row count
references_proposed: <integer>             # §5 row count
authorities_proposed: <integer>            # §6 row count
failure_modes_documented: <integer>        # §11 row count
new_enum_values:                           # optional; proposed additions to schemas/types.json enums
  - <enum>:<value>                         # e.g. sponsor_type:au-treaty-secondment
researcher_adequacy_note: |                # required; free-form 4–8 sentence self-assessment
  Named source-class coverage achieved, depth of probe, what was deliberately
  skipped (and why), confidence level on hand-off to translator. Operator
  reads this at the Phase 3→4 acceptance gate and decides whether the
  coverage is sufficient. No hard coverage gate — thoroughness is enforced
  by researcher.md instructions and operator judgment, not deterministic
  rubric items.
---
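
Several of the frontmatter fields above constrain one another, so a consistency check can be sketched. The example below assumes the frontmatter has already been parsed into a Python dict, and that every §3 row carries exactly one usage value (so the four counters sum to sources_count). The function names, the error strings, and all values in the sample are hypothetical placeholders, not part of bc-corpus-creator.

```python
from typing import Any

# Fields the frontmatter lists without an "optional" marker.
REQUIRED_FIELDS = (
    "title", "type", "status", "walked", "walker", "canonical_version",
    "spec_round", "research_complete", "last_updated", "sources_count",
    "sources_by_usage", "volatile_values_proposed", "references_proposed",
    "authorities_proposed", "failure_modes_documented",
    "researcher_adequacy_note",
)
USAGE_KEYS = {"citation", "corroboration", "failure_mode_context", "comparative"}


def check_frontmatter(fm: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in fm]
    if fm.get("type") != "research-report":
        problems.append("type must be research-report")
    if fm.get("status") not in ("draft", "complete"):
        problems.append("status must be draft or complete")
    if not isinstance(fm.get("research_complete"), bool):
        problems.append("research_complete must be true or false")
    by_usage = fm.get("sources_by_usage") or {}
    if set(by_usage) != USAGE_KEYS:
        problems.append("sources_by_usage must carry exactly the four usage counters")
    elif sum(by_usage.values()) != fm.get("sources_count"):
        # Assumption: one usage value per §3 row, so the counters sum to the total.
        problems.append("sources_count does not equal the sum of sources_by_usage")
    return problems


def migrate_allowed(fm: dict[str, Any]) -> bool:
    """Mirror of the rule that migrate refuses to proceed unless research_complete is true."""
    return fm.get("research_complete") is True


# Placeholder values for illustration only.
example = {
    "title": "Research report for example-skill",
    "type": "research-report",
    "status": "complete",
    "walked": "2026-01-01",
    "walker": "bc-corpus-creator v0.0.0",
    "canonical_version": "0.0.0",
    "spec_round": "round-1",
    "research_complete": True,
    "last_updated": "2026-01-01",
    "sources_count": 7,
    "sources_by_usage": {
        "citation": 4, "corroboration": 1,
        "failure_mode_context": 2, "comparative": 0,
    },
    "volatile_values_proposed": 0,
    "references_proposed": 0,
    "authorities_proposed": 0,
    "failure_modes_documented": 1,
    "researcher_adequacy_note": "placeholder",
}

print(check_frontmatter(example))  # []
print(migrate_allowed(example))    # True
```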

Body sections (fixed 11-section schema; the migrator depends on this being stable):

  1. Reconnaissance — skeleton scan, sibling drafts, stop nodes.

  2. Structural decision — absorption / extraction / stop / deferral, with reasoning.

  3. Sources consulted — table per source: URL, fetch date, citation_grade (statutory / federal / regional / consular / professional / origin / secondary), usage enum, quoted passages.

    usage enum — distinguishes how each source feeds the corpus, orthogonal to citation_grade:

    • citation — cited in canonical body via <Ref> tag. MUST resolve to a citation-grade source (classes 1–6).
    • corroboration — confirms a canonical claim but is NOT cited. Used when a non-canonical-grade source (law-firm post, news coverage, forum thread) independently validates statute or agency guidance. Builds researcher confidence; never surfaces to the user-agent.
    • failure-mode-context — source documents a procedural failure mode captured in §11. Anything from forum thread to news investigation to academic paper.
    • comparative — other-jurisdiction analog or comparison point that informed structural decisions (§2) but isn't load-bearing for this skill's body. Optional; v1 walks may carry zero.

    The canonical body's <Ref> tags resolve only to usage: citation rows; §11 catalog rows draw their evidence_sources[] only from usage: failure-mode-context rows. The reviewer rubric's Tag-trace item gates on this invariant.

  4. Volatile values surfaced — table: name, value, unit, source_ref, last_verified, indexation_date. Field set matches the volatile-values catalogue row schema (§6.3).

  5. References surfaced — table: name, title, url, last_verified. Field set matches the references catalogue (§6.10, §6.11).

  6. Authorities surfaced — table: id, type, name, url, NIS5 (if commune). Field set matches data/authorities.json. No free-form "contact pointer" — align to the catalogue.

  7. Catalogue extraction proposals — new rows vs existing rows referenced.

  8. Type-enum additions — new sponsor_type / route / region / relationship_type values proposed for schemas/types.json.

  9. Open questions — not pinned, needs operator review. Doubles as the concern-promotion watchlist post-launch: when a runtime concern submission (renamed from observation per the 2026-05-15 taxonomy normalization) matches an open question semantically, that is the trigger to promote the resolution into the canonical body and close the entry.

  10. Out of scope — flagged for adjacent walks.

  11. Common failure modes & procedural pitfalls — what other sources document about how this procedure goes wrong in practice. Distinct from walker-level pitfalls (which concern the walk, not the procedure). Row schema:

    • pattern — one-line summary of the failure mode (e.g., "Commune rejects birth certificate without apostille on first submission").
    • evidence_sources[] — pointers to §3 rows carrying usage: failure-mode-context.
    • severity — blocks-procedure / delays-procedure / cosmetic.
    • observed_by — forum / news / law-firm-blog / academic / government-report / operator-experience.
    • predicted_concern_keywords[] — phrases an incoming user-agent concern submission might use to describe this failure (seeds the future concern-match mechanism for the deferred promote mode). Field name renamed from predicted_observation_keywords[] per the 2026-05-15 taxonomy normalization.
    • canonical_anchor — pointer to where this failure mode is reflected in the canonical body (e.g., process#step-4-fee-payment, required-documents#birth-certificate, or known-surprises per the schemas.md §6.1 SG1 reversal). Empty string when the failure mode is not yet surfaced in canonical. The deferred promote mode reads this field to know where in the body to insert the surfaced text.
    • promotion_state — proposed (not yet validated by a runtime concern) / promoted (a real concern confirmed; surfaced in canonical) / deprecated (promoted but later removed; policy changed) / contradicted (new evidence says this failure mode is wrong).

    Inclusion rule (per skills.md §15.3): a failure mode can be included only if (a) it is backed by ≥3 independent reports across signal/secondary sources, OR (b) it is cross-referenced against a primary source. The researcher MUST enforce this. The length of evidence_sources[] is a necessary check on independence, not a sufficient one: its usage: failure-mode-context rows must point to different sources, not the same source three times.

    The catalog is the watchlist — rows at promotion_state: proposed are candidates for canonical promotion when a real concern submission confirms them.
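
The §11 inclusion rule and the evidence invariant from §3 lend themselves to a mechanical check. The sketch below is one possible shape, not the researcher's or reviewer's actual implementation. SourceRow, its row_id and is_primary fields, and the use of distinct URLs as a proxy for independence are all assumptions made for illustration; the spec does not say how a primary source is recorded on a §3 row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRow:
    row_id: str       # hypothetical identifier for a §3 row
    url: str
    usage: str        # citation / corroboration / failure-mode-context / comparative
    is_primary: bool  # hypothetical flag marking a primary source


def inclusion_problems(evidence_sources: list[str],
                       sources: dict[str, SourceRow]) -> list[str]:
    """Check one §11 row's evidence_sources[] against the stated rules."""
    problems = []
    rows = []
    for row_id in evidence_sources:
        row = sources.get(row_id)
        if row is None:
            problems.append(f"{row_id}: no such §3 row")
        elif row.usage != "failure-mode-context":
            problems.append(f"{row_id}: usage is {row.usage}, expected failure-mode-context")
        else:
            rows.append(row)
    # Assumption: two rows are independent only when their URLs differ.
    distinct_urls = {row.url for row in rows}
    has_primary = any(row.is_primary for row in rows)
    if len(distinct_urls) < 3 and not has_primary:
        problems.append("needs >=3 distinct sources or a primary-source cross-reference")
    return problems


# Made-up rows: s1 and s2 point at the same forum thread.
sources = {
    "s1": SourceRow("s1", "https://example.org/forum-thread", "failure-mode-context", False),
    "s2": SourceRow("s2", "https://example.org/forum-thread", "failure-mode-context", False),
    "s3": SourceRow("s3", "https://example.org/news-report", "failure-mode-context", False),
    "s4": SourceRow("s4", "https://example.org/agency-guidance", "failure-mode-context", True),
    "s5": SourceRow("s5", "https://example.org/statute", "citation", True),
}

print(inclusion_problems(["s1", "s2", "s3"], sources))  # only two distinct sources
print(inclusion_problems(["s1", "s4"], sources))        # [] (primary cross-reference)
print(inclusion_problems(["s5"], sources))              # wrong usage, and too few sources
```

The first call fails even though evidence_sources[] has three entries, which is the case the inclusion rule warns about: the same source counted more than once.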

Versioning and lifecycle. The research-report is a living document: last_updated bumps on every post-walk research-update mode run, and canonical_version mirrors the canonical it currently backs. Catalogue rows the researcher proposes flow into D1 via the catalogue-extractor (separate from the research-report file itself); §11 rows live only in the research-report until the deferred promote mode surfaces them into canonical.

Translator contract. The translator reads research-report §1–§10 plus the voice / canonical-shape / schemas references and produces canonical.md. If the schema collapses or the canonical shape changes, the migrator re-reads §1–§10 and re-renders canonical conformant to the new schema, without re-running research.

4. evals.json schema

evals.json is the third shipped artefact per skill, alongside canonical.md (schemas.md §6.1) and research-report.md (§3 of this document). It carries the test prompts and per-prompt expectations that drive the opt-in benchmark mode of the walking-procedure tooling. It is authored by the operator (or seeded by the walker on first walk completion) and committed at bc-docs/skills/<id>/evals.json. The skill-PR classifier path filter (lifecycle.md §10.1) covers it via the amended pattern ^skills/[^/]+/(canonical|research-report|evals)\.(md|json)$.

Shape:

{
  "skill_id": "<kebab-case-id>",
  "evals_version": "<semver>",
  "prompts": [
    {
      "id": "<prompt-id>",
      "prompt": "<user-shaped prompt the runner sends to a fresh subagent loaded with the canonical>",
      "expectations": [
        "<LLM-judged assertion the response must satisfy>",
        "..."
      ],
      "tool_call_expectations": [
        "<LLM-judged assertion about the runner's tool-call pattern>",
        "..."
      ]
    }
  ]
}

Field semantics:

  • skill_id — kebab-case skill id; MUST match the folder name (skills/<skill_id>/evals.json).
  • evals_version — semver of the evals.json shape. Independent of canonical_version and schema_version; bumps when the evals harness's contract changes (new field shapes, new expectation types). v1 baseline: 0.1.
  • prompts[] — array of test prompts. Each prompt is independently runnable; the benchmark mode iterates the array and grades each in isolation.
  • prompts[].id — stable identifier per prompt (kebab-case, deterministic). Allows the benchmark-grader to track per-prompt pass/fail over time and tie regressions to specific scenarios.
  • prompts[].prompt — the user-shaped prompt the runner sends to a fresh subagent loaded with the skill's canonical. Should reflect realistic agent invocations rather than synthetic edge probes.
  • prompts[].expectations[] — LLM-judged assertions the response must satisfy. The benchmark-grader scores each one with pass/fail plus evidence quoted from the response.
  • prompts[].tool_call_expectations[] — LLM-judged assertions about the runner's tool-call pattern (e.g., "should call read_skill('X') at least once", "should NOT redirect mid-flow"). Scored alongside expectations[].
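
The field semantics above can be turned into a small structural check. This is a sketch under stated assumptions rather than the harness's own validator: the kebab-case regular expression, the requirement that prompt ids be unique within a file, and the function name check_evals are interpretations introduced here, and the sample document is invented.

```python
import json
import re
from pathlib import PurePosixPath

# Assumption: kebab-case means lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def check_evals(evals: dict, evals_path: str) -> list[str]:
    """Return a list of problems found in a parsed evals.json document."""
    problems = []
    folder = PurePosixPath(evals_path).parent.name
    skill_id = evals.get("skill_id", "")
    if not KEBAB_CASE.match(skill_id):
        problems.append(f"skill_id is not kebab-case: {skill_id!r}")
    if skill_id != folder:
        problems.append(f"skill_id {skill_id!r} does not match folder {folder!r}")
    seen_ids = set()
    for index, prompt in enumerate(evals.get("prompts", [])):
        prompt_id = prompt.get("id", "")
        if not KEBAB_CASE.match(prompt_id):
            problems.append(f"prompts[{index}].id is not kebab-case: {prompt_id!r}")
        if prompt_id in seen_ids:
            # Assumption: a stable identifier implies uniqueness within the file.
            problems.append(f"prompts[{index}].id repeats {prompt_id!r}")
        seen_ids.add(prompt_id)
        if not prompt.get("prompt"):
            problems.append(f"prompts[{index}].prompt is empty")
        for key in ("expectations", "tool_call_expectations"):
            items = prompt.get(key, [])
            if not all(isinstance(item, str) and item for item in items):
                problems.append(f"prompts[{index}].{key} must contain non-empty strings")
    return problems


# Invented sample document.
sample = json.loads("""
{
  "skill_id": "example-skill",
  "evals_version": "0.1",
  "prompts": [
    {
      "id": "first-visit",
      "prompt": "Which documents do I need for my first appointment?",
      "expectations": ["Lists the required documents"],
      "tool_call_expectations": ["should call read_skill('X') at least once"]
    }
  ]
}
""")

print(check_evals(sample, "skills/example-skill/evals.json"))  # []
print(check_evals(sample, "skills/other-skill/evals.json"))    # folder mismatch
```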

Lifecycle. The walker seeds evals.json on first walk completion with a minimal baseline prompt set; the operator extends it manually as new edge cases surface. evals.json is versioned with the canonical (same git history; same PR landings). It is NOT a generated artefact — it ships as authored.

Out of scope for v1. evals.json does not carry runtime metadata (latency budgets, tokens-in/tokens-out targets) — those are runner-side concerns recorded in the benchmark output, not in the eval definitions themselves. Per-prompt baseline-without-canonical comparison is the runner's mode (baseline: true); evals.json does not encode the comparison expectation directly.
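
One way to picture this split is that the eval definition stays a plain prompt record, while the runner's output record carries the baseline flag and any runtime measurements. The sketch below is hypothetical throughout: PromptRun, ExpectationResult, grade_prompt, the latency_ms field, and the stub respond and judge callables stand in for the real runner and LLM judge, whose interfaces this document does not specify. Scoring of tool_call_expectations[] is omitted for brevity.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExpectationResult:
    expectation: str
    passed: bool
    evidence: str  # text quoted from the response to support the verdict


@dataclass
class PromptRun:
    prompt_id: str
    baseline: bool     # True when the subagent ran without the canonical loaded
    latency_ms: float  # runner-side measurement; never stored in evals.json
    results: list[ExpectationResult] = field(default_factory=list)


def grade_prompt(prompt: dict,
                 respond: Callable[[str, bool], str],
                 judge: Callable[[str, str], tuple[bool, str]],
                 baseline: bool = False) -> PromptRun:
    """Run one prompt in isolation and grade each expectation pass/fail."""
    start = time.perf_counter()
    response = respond(prompt["prompt"], baseline)
    latency_ms = (time.perf_counter() - start) * 1000
    run = PromptRun(prompt["id"], baseline, latency_ms)
    for expectation in prompt.get("expectations", []):
        passed, evidence = judge(expectation, response)
        run.results.append(ExpectationResult(expectation, passed, evidence))
    return run


# Stubs standing in for the subagent and the LLM judge.
def fake_respond(prompt_text: str, baseline: bool) -> str:
    return "generic answer" if baseline else "answer grounded in the canonical"


def fake_judge(expectation: str, response: str) -> tuple[bool, str]:
    return ("canonical" in response, response)


prompt = {"id": "first-visit", "prompt": "What do I bring?",
          "expectations": ["Answer is grounded in the skill body"]}

with_canonical = grade_prompt(prompt, fake_respond, fake_judge)
without_canonical = grade_prompt(prompt, fake_respond, fake_judge, baseline=True)
print(with_canonical.results[0].passed, without_canonical.results[0].passed)  # True False
```

The same prompt record drives both runs; only the runner's output distinguishes the baseline run from the canonical-loaded one.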

5. Product-spec integration points

The build-tool artefacts described above interact with the product spec at the following points. When the product spec changes, the corresponding build-tool reference at bc-skills/bc-corpus-creator/references/ must be updated to match.

| Build-tool artefact | Product-spec reference | Notes |
| --- | --- | --- |
| canonical.md shape | schemas.md §6.1 (skill schema) | The product spec is authoritative; bc-corpus-creator's references/canonical-shape.md is the build-time mirror. |
| Walking-procedure steps | skills.md §15.1 | Manual walks per skills.md §15.1 remain valid. bc-corpus-creator operationalises the same steps. |
| Source classes | skills.md §15.2 | The closed enum + grade values are in the product spec. The usage enum on each source row (citation / corroboration / failure-mode-context / comparative) is internal to research-report and not surfaced in the product corpus. |
| Failure-mode inclusion rules | skills.md §15.3 | research-report §11 carries the catalog; canonical body surfaces what it must. |
| Skill-PR classifier path filter | lifecycle.md §10.1 | Covers `canonical |

6. Process for build-tool spec updates

Changes to research-report.md or evals.json schemas follow the same amendment-proposal workflow as the product spec (amendment-proposals/README.md). A change to the build-tool artefact shape MAY require a corresponding update to the product spec when:

  • The artefact's product-spec integration points (§5 above) shift
  • The skill-PR classifier path filter needs to change
  • A new artefact is introduced that ships in bc-docs/skills/<id>/

Build-tool-only changes (rubric refinement, voice-and-style updates, internal agent prompts) do not require product-spec amendments; they are committed directly to bc-skills/bc-corpus-creator/references/ with operator review.