feat: Phase 2 complete — 13 Phases of COBOL type classification and test benchmark

P0.6: gcov infrastructure P1: extract_structure output expansion (11 new feature fields) P2: Confusion group rule engine (8 pairs + contradiction + backtrack) P3: 4-factor confidence calculation + quality gate update P4: 33+2 COBOL program type test samples (22 files, 7 categories) P5: parametrized/ test data generation engine P6: japanese_data.py lookup tables P7-10: Type-specific test suites (~159 parametrized tests) P11: Full classification pipeline (classify_program) + orchestrator integration P12: Documentation (module-interfaces, test-plan v3.0, coverage-matrix) Architecture decisions: - classification_pipeline/ merged to hina/pipeline/ - parametrized/ as independent module - japanese_data.py as root-level file - hina/__all__ only exports classify_program() Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-19 23:51:55 +08:00
parent 63b5284715
commit bc1d56d1a4
129 changed files with 19378 additions and 261 deletions
@@ -0,0 +1,780 @@
+# COBOL-Java 迁移验证平台 — 模块接口规范
+
+> 目的：明确定义每个模块的边界、公开 API、数据契约，实现多人并行开发。
+> 每个模块可以由不同开发者独立开发，只要遵循接口契约即可集成。
+
+---
+
+## 一、模块分层架构
+
+```
+┌──────────────────────────────────────────────────────────────────────────┐
+│                              Layer 4: 管道集成                           │
+│                                                                          │
+│  orchestrator.py  — 管道导演，编排全流程                                  │
+│  web/              — FastAPI + Worker 网络层                             │
+│  jcl/executor.py   — JCL 执行器                                          │
+└───────────────────────────────────┬──────────────────────────────────────┘
+                                    │ 调用
+                                    ▼
+┌──────────────────────────────────────────────────────────────────────────┐
+│                      Layer 3: 业务引擎                                    │
+│                                                                          │
+│  agents/    — LLM 智能体（解析/设计/诊断）                                 │
+│  hina/      — 程序分类（关键字/规则/LLM）                                  │
+│  comparator/— 对比引擎（对齐/比较/舍入）                                    │
+│  runners/   — 编译运行引擎（COBOL/Java/Spark）                             │
+└───────────────────────────────────┬──────────────────────────────────────┘
+                                    │ 调用
+                                    ▼
+┌──────────────────────────────────────────────────────────────────────────┐
+│                      Layer 2: 核心引擎                                    │
+│                                                                          │
+│  cobol_testgen/  — COBOL 解析 + 测试数据生成                              │
+│  report/         — 报告生成器                                             │
+│  jcl/parser.py   — JCL 解析器                                            │
+│  config/         — 配置管理                                               │
+│  quality/        — 质量验证                                               │
+│  preprocessor.py — COPYBOOK 预处理                                        │
+└───────────────────────────────────┬──────────────────────────────────────┘
+                                    │ 使用
+                                    ▼
+┌──────────────────────────────────────────────────────────────────────────┐
+│                      Layer 1: 数据模型 + 存储                             │
+│                                                                          │
+│  data/     — 核心数据模型（所有层共享）                                     │
+│  storage/  — 持久化存储（缓存/报告库）                                     │
+└──────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 二、数据模型层 (Layer 1) — 所有层的契约
+
+### `data/field_tree.py` — 字段树
+
+```python
+@dataclass
+class Field:
+    name: str
+    level: int
+    pic: str
+    usage: str = "DISPLAY"          # COMP / COMP-3 / DISPLAY / ...
+    offset: int = 0
+    length: int = 0
+    decimal: int = 0
+    signed: bool = False
+    sign_separate: bool = False
+    occurs: Optional[int] = None
+    occurs_max: Optional[int] = None
+    redefines: Optional[str] = None
+    redefines_variant: Optional[str] = None
+    conditions: list[dict] = field(default_factory=list)
+    children: list["Field"] = field(default_factory=list)
+
+@dataclass
+class FieldTree:
+    fields: list[Field] = field(default_factory=list)
+    copybook_name: str = ""
+    sha256: str = ""
+    def flatten(self) -> dict[str, Field]: ...
+    def get_by_name(self, name: str) -> Optional[Field]: ...
+    @classmethod
+    def from_list(cls, fields, name="") -> "FieldTree": ...
+```
+
+### `data/test_case.py` — 测试用例
+
+```python
+@dataclass
+class TestCase:
+    id: str
+    fields: dict = field(default_factory=dict)      # {字段名: 值}
+    coverage_targets: list[str] = field(default_factory=list)
+
+@dataclass
+class TestSuite:
+    test_cases: list[TestCase] = field(default_factory=list)
+    spark_config: Optional[SparkConfig] = None
+    @property
+    def has_spark(self) -> bool: ...
+
+@dataclass
+class SparkConfig:
+    num_records: int = 100
+    replication: str = "key_varied"
+    key_field: str = ""
+    edge_cases: list[str] = field(default_factory=list)
+```
+
+### `data/diff_result.py` — 对比结果
+
+```python
+@dataclass
+class FieldResult:
+    field_name: str = ""
+    status: str = "PASS"            # PASS / TOLERATED / MISMATCH / NOT_SET
+    cobol_value: str = ""
+    java_value: str = ""
+    tolerance_applied: float = 0.0
+    rounding_detected: str = ""
+    suggestion: str = ""
+
+@dataclass
+class VerificationRun:
+    program: str = ""
+    timestamp: str = ""
+    status: str = "PASS"            # PASS / MISMATCH / BLOCKED / ERROR / FATAL
+    exit_code: int = 0
+    duration_s: float = 0.0
+    fields_matched: int = 0
+    fields_mismatched: int = 0
+    field_results: list[FieldResult] = field(default_factory=list)
+    runner: str = "native"          # native / spark
+    branch_rate: float = 0.0
+    paragraph_rate: float = 0.0
+    decision_rate: float = 0.0
+    hina_type: str = ""
+    hina_confidence: float = 0.0
+    quality_score: float = 0.0
+    quality_warn: str = ""
+    heal_retry: int = 0
+    simple_retry: int = 0
+    total_retry: int = 0
+    llm_cost: float = 0.0
+    report_path: str = ""
+    debug: dict = field(default_factory=dict)
+    @property
+    def total_fields(self) -> int: ...
+    def verdict(self) -> str: ...
+```
+
+---
+
+## 三、核心引擎层 (Layer 2) — 接口规范
+
+### 模块 2-1: `cobol_testgen`（COBOL 解析 + 数据生成）
+
+**负责人**: A
+**依赖**: data/ (Field, FieldTree, PicInfo, FieldDef, BrSeq, ...)
+
+```
+公开函数:
+┌─────────────────────────────────────────────────────────────────────────┐
+│ extract_structure(cobol_source: str, source_dir: str = None) → dict    │
+│                                                                         │
+│  入力: COBOL 源码文本、可选的 COPYBOOK 搜索路径                          │
+│  出力: {                                                                │
+│    paragraphs: list[str],          ← 段落名列表                         │
+│    total_paragraphs: int,          ← 段落总数                          │
+│    decision_points: list[dict],    ← [{id, kind, label, branches}, ...]│
+│    total_branches: int,            ← 分支总数                          │
+│    branch_tree: BrSeq,             ← 控制流树                           │
+│    file_count: int,                ← SELECT 文件数                      │
+│    open_directions: dict,          ← {文件名: INPUT/OUTPUT/I-O}        │
+│    has_search_all: bool,           ← 是否有 SEARCH ALL                  │
+│    has_evaluate: bool,             ← 是否有 EVALUATE                    │
+│    has_call: bool,                 ← 是否有 CALL                        │
+│    has_break: bool,                ← 是否有 key 中断                    │
+│    branch_tree_obj: BrSeq,         ← 原始分支树对象                     │
+│  }                                                                     │
+├─────────────────────────────────────────────────────────────────────────┤
+│ generate_data(cobol_source: str, structure: dict,                      │
+│               source_dir: str = None) → list[dict]                     │
+│                                                                         │
+│  入力: COBOL源码, extract_structure 的输出, 搜索路径                    │
+│  出力: [{字段名: 值, ...}, ...]  ← 每条记录覆盖一条分支路径              │
+├─────────────────────────────────────────────────────────────────────────┤
+│ incremental_supplement(branch_tree, decision_gaps: list[int])          │
+│   → list[dict]                                                         │
+│                                                                         │
+│  入力: 分支树对象, 未覆盖决策点的 ID 列表                                │
+│  出力: 补充的新测试记录                                                  │
+└─────────────────────────────────────────────────────────────────────────┘
+
+子模块职责:
+  read.py    — 预处理 + DATA DIVISION 解析 + PIC 解析 → FieldDef[]
+  core.py    — PROCEDURE DIVISION 解析 → BrSeq 树 + assignments
+  cond.py    — 条件表达式解析 + MC/DC 枚举 → CondLeaf/And/Or/Not
+  design.py  — 路径枚举 + 值生成 + 约束应用 → generate_records()
+  coverage.py— 决策点收集 + 标记 + HTML报告 → check_coverage()
+  output.py  — JSON/文件输出 → output_json/output_input_files
+  models.py  — 数据模型 (共享)
+```
+
+---
+
+### 模块 2-2: `config`（配置管理）
+
+**负责人**: B
+**依赖**: 无内部依赖
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ Config (dataclass)                                                      │
+│                                                                         │
+│  字段:                                                                   │
+│    project_name: str = ""                                               │
+│    copybook_paths: list = ["./copybooks"]                               │
+│    dialect: str = "ibm"                 # cobc -std 参数               │
+│    llm_model: str = "gpt-4o-mini"       # LLM 模型                     │
+│    llm_timeout: int = 15                                                 │
+│    llm_cache_dir: str = ".cache/llm"                                     │
+│    coverage_default: str = "boundary"                                    │
+│    rounding_mode: str = "TRUNCATE"                                       │
+│    tolerance: float = 0.01               # 比较容忍度                   │
+│    runner_mode: str = "native"           # native / spark               │
+│    spark_master: str = "local[*]"                                        │
+│    num_records: int = 1000                                               │
+│    branch_pass: float = 0.80             # 覆盖率通过阈值               │
+│    max_llm_cost: float = 0.50                                           │
+│    quality_gate_mode: str = "warn"       # off / warn / strict          │
+│    quality_gate_decision_threshold: float = 0.90                        │
+│    quality_gate_paragraph_threshold: float = 1.0                        │
+│    gcov_enabled: bool = False                                           │
+│    max_quality_retries: int = 4                                         │
+│                                                                         │
+│  类方法:                                                                 │
+│    @classmethod from_toml(path="aurak.toml") → Config                   │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+### 模块 2-3: `preprocessor`（COPYBOOK 预处理）
+
+**负责人**: B
+**依赖**: 无
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ CopybookPreprocessor                                                     │
+│                                                                         │
+│  __init__(paths: list = ["./copybooks"])                                │
+│  expand(text: str) → str            # COPY 语句展开后的源码             │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+### 模块 2-4: `quality`（质量验证）
+
+**负责人**: C
+**依赖**: data/field_tree.py
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ L1OffsetValidator                                                        │
+│   validate(tree: FieldTree, cpath: str) → dict {score, mismatches}     │
+│                                                                         │
+│ L2RoundtripValidator                                                     │
+│   validate(tree: FieldTree) → dict {pass, results}                     │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+### 模块 2-5: `jcl/parser.py`（JCL 解析）
+
+**负责人**: C
+**依赖**: 无
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ parse_jcl(filepath: str) → Optional[Job]                                │
+│                                                                         │
+│ Job { job_name: str, steps: list[JobStep] }                             │
+│ JobStep { step_name: str, program: str,                                 │
+│           dd_entries: list[DDEntry], cond: Optional[CondParam] }        │
+│ DDEntry { dd_name: str, dsn: Optional[str], disp: Optional[str],        │
+│           sysout: Optional[str], inline_data: list[str] }               │
+│ CondParam { code: int, operator: str, step_name: Optional[str] }        │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+### 模块 2-6: `report`（报告生成）
+
+**负责人**: B
+**依赖**: data/diff_result.py
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ ReportGenerator                                                          │
+│   generate_json(vr: VerificationRun, path: Path)                        │
+│   generate_html(vr: VerificationRun, path: Path)                        │
+│   generate_machine_json(vr: VerificationRun, path: Path)                │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+### 模块 2-7: `parametrized`（测试数据生成器）
+
+**负责人**: I (新增)
+**依赖**: 无
+
+```
+公开函数（8 个）:
+┌─────────────────────────────────────────────────────────────────────────┐
+│ matching.py — 匹配系数据生成                                             │
+│                                                                         │
+│  generate_matching_data(matching_type="1:1",                            │
+│                         record_count_r01=10,                            │
+│                         record_count_r02=10,                            │
+│                         key_match_ratio=1.0) → tuple[list, list]       │
+│    出力: (主文件记录列表, 从文件记录列表)                                  │
+│    匹配模式: "1:1" / "1:N" / "N:1"                                       │
+├─────────────────────────────────────────────────────────────────────────┤
+│ matching.py — KEY 切中断数据生成                                         │
+│                                                                         │
+│  generate_keybreak_data(group_count=3,                                  │
+│                         records_per_group=2,                            │
+│                         sum_type="accumulate") → list[dict]            │
+│    出力: [{KEY, FIELD, GROUP, SEQ}, ...]                                │
+│    sum_type: "accumulate" / "aggregate" / "mark"                        │
+├─────────────────────────────────────────────────────────────────────────┤
+│ division.py — 分割系数据生成                                             │
+│                                                                         │
+│  generate_division_data(division_type=50,                               │
+│                         record_count=50) → list[list[dict]]             │
+│    出力: [[文件1记录], [文件2记录], ...]                                  │
+│    division_type: 50(对半) / 25(四等分) / 100(全量)                    │
+├─────────────────────────────────────────────────────────────────────────┤
+│ common.py — 通用数据生成工具                                              │
+│                                                                         │
+│  generate_zero_byte_file(path: str) → None                              │
+│    写入 0 字节空文件                                                      │
+│                                                                         │
+│  generate_minimal_records(fields: list[dict]) → list[dict]              │
+│    出力: 1 条类型合理默认值的记录                                         │
+│                                                                         │
+│  generate_sorted_records(record_count=10, key_field="KEY") → list[dict] │
+│    出力: 已按 KEY 升序排列的记录列表                                      │
+│                                                                         │
+│  generate_duplicate_keys(records: list[dict], key_field="KEY")          │
+│      → list[dict]                                                       │
+│    出力: 原记录 + 同键值重复记录（用于 SORT MERGE / 去重测试）             │
+│                                                                         │
+│  generate_boundary_values(pic: str) → dict                              │
+│    出力: {max, min, overflow, zero, pic_info}                           │
+│    从 PIC 子句解析出最大值 / 最小值 / 溢出值                               │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+### 模块 2-8: `japanese_data.py`（日文测试数据生成）
+
+**负责人**: I (新增)
+**依赖**: 无
+
+```
+公开函数（8 个生成函数 + 常量表）:
+┌─────────────────────────────────────────────────────────────────────────┐
+│ 查找表常量                                                               │
+│  FULLWIDTH_KATAKANA    — 全角片假名字符串                                │
+│  FULLWIDTH_HIRAGANA    — 全角平假名字符串                                │
+│  FULLWIDTH_DIGITS      — 全角数字                                        │
+│  FULLWIDTH_ALPHA       — 全角字母                                        │
+│  HALFWIDTH_KATAKANA    — 半角片假名字符串                                │
+│  SJIS_5C_PROBLEM       — Shift-JIS 第2字节 0x5C 问题文字                │
+│  SJIS_7C_PROBLEM       — Shift-JIS 第2字节 0x7C 问题文字                │
+│  WAREKI_BOUNDARIES     — 和历边界对照表                                  │
+├─────────────────────────────────────────────────────────────────────────┤
+│ 生成函数                                                                 │
+│                                                                         │
+│  generate_fullwidth_text(field: dict) → str                             │
+│    全角片假名填充 PIC N 字段                                              │
+│                                                                         │
+│  generate_halfwidth_katakana(field: dict) → str                         │
+│    半角片假名填充 PIC X 字段                                              │
+│                                                                         │
+│  generate_sjis_5c_problem(field: dict) → str                            │
+│    含 Shift-JIS 0x5C 问题文字的字符串                                    │
+│                                                                         │
+│  generate_sjis_7c_problem(field: dict) → str                            │
+│    含 Shift-JIS 0x7C 问题文字的字符串                                    │
+│                                                                         │
+│  generate_wareki_date(wareki_type="R") → str                            │
+│    和历日期字符串（格式: R050101）                                        │
+│                                                                         │
+│  generate_wareki_boundary(era="平成") → tuple[str, str]                  │
+│    和历边界日期对（前代末日, 新代初日）                                    │
+│                                                                         │
+│  generate_encoding_test_data(from_enc="shift_jis", to_enc="utf-8")     │
+│      → tuple[bytes, bytes]                                              │
+│    Shift-JIS ↔ UTF-8 编码回环验证数据                                    │
+│                                                                         │
+│  select_data_type(field: dict) → str                                    │
+│    字段类型判断: "japanese" / "numeric" / "halfwidth"                     │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+### 模块 2-9: `coverage/compare_coverage.py`（覆盖率比较）
+
+**负责人**: D
+**依赖**: 无
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ compare_coverage(program_name: str,                                     │
+│                  static: dict,                                          │
+│                  dynamic: dict) → dict                                  │
+│                                                                         │
+│  入力:                                                                   │
+│    program_name — 程序名称                                               │
+│    static — 静态覆盖率: {branch_rate, paragraph_rate, ...}               │
+│    dynamic — 动态覆盖率: {gcov_cov, covered_branches, ...}              │
+│                                                                         │
+│  出力: {                                                                │
+│    program: str,               ← 程序名称                               │
+│    static: {branch_rate, paragraph_rate},                               │
+│    dynamic: {gcov_cov},                                                 │
+│    gap: float,                 ← static - dynamic 差异                  │
+│    misleading_branches: list,  ← 静态覆盖但动态未覆盖的分支              │
+│  }                                                                      │
+│                                                                         │
+│  用途: 识别 gcov 实际运行与静态分析之间的偏离                              │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 四、业务引擎层 (Layer 3) — 接口规范
+
+### 模块 3-1: `hina`（程序分类 + 质量门禁 + 类型判定管道）
+
+**负责人**: D
+**依赖**: data/diff_result.py (VerificationRun)
+
+```
+├── 公开 API ────────────────────────────────────────────────────────────┤
+│ pipeline/（类型判定管道 — 唯一入口）                                    │
+│                                                                         │
+│  classify_program(cobol_source: str) → dict                            │
+│    出力: {category, confidence, subtype, strategy_params,              │
+│            resolved_type, needs_review, ...}                            │
+│                                                                         │
+│    流程:                                                                 │
+│    1. 并行执行关键字识别 + 结构提取                                      │
+│    2. 确信度 ≥ 90% → 直接输出                                          │
+│    3. 确信度 50-89% → 混淆组判定 + 4因子确信度 + 矛盾检测 + 回溯       │
+│    4. 确信度 < 50% → 标记需人工处理                                     │
+├─────────────────────────────────────────────────────────────────────────┤
+│ confidence.py                                                           │
+│                                                                         │
+│  compute_confidence_v2(keyword_result, structure_features,             │
+│                        contradictions=None, resolution=None) → dict    │
+│    出力: {confidence, base, context_factor, consistency_factor,        │
+│            structure_factor, judgment, needs_review}                    │
+│                                                                         │
+│    4 因子确信度公式:                                                     │
+│      confidence = base × context_factor × consistency_factor           │
+│                    × structure_factor                                   │
+│                                                                         │
+│    判定标准:                                                             │
+│      >= 0.90 auto       — 自动通过                                     │
+│      0.70-0.89 review    — 需要人工审核                                 │
+│      0.50-0.69 manual    — 需要人工介入                                 │
+│      < 0.50   impossible — 无法判定                                    │
+├─────────────────────────────────────────────────────────────────────────┤
+│ classifier.py                                                           │
+│                                                                         │
+│  L1_RULES: list[tuple[str, list[str], float]]    ← 11类关键字规则      │
+│                                                                         │
+│  detect_keyword(source: str) → list[tuple[str, float, str]]            │
+│    出力: [(分类名, 确信度, 匹配关键字), ...]                             │
+├─────────────────────────────────────────────────────────────────────────┤
+│ hina_agent.py                                                           │
+│                                                                         │
+│  classify_with_llm(structure: dict, llm) → dict                        │
+│    出力: {category, subtype, confidence, strategy_params}               │
+│                                                                         │
+│  _parse_llm_response(raw: str) → dict                                   │
+│  _validate_result(parsed: dict) → dict                                  │
+│  _fallback_classification(structure: dict) → dict  ← 7混淆组规则      │
+├─────────────────────────────────────────────────────────────────────────┤
+│ gate.py                                                                 │
+│  compute_quality_score(branch_rate, paragraph_rate) → float             │
+│  check(tests, hina_result, coverage, thresholds...) → dict              │
+│    出力: {passed: bool, score: float, issues: dict}                    │
+├─────────────────────────────────────────────────────────────────────────┤
+│ strategy.py                                                             │
+│  get_strategy(hina_type: str) → dict    ← 5类型策略模板                 │
+│  supplement(base_tests, hina_result) → list[dict]                      │
+│  supplement_only(base_tests, gaps) → list[dict]                        │
+├─────────────────────────────────────────────────────────────────────────┤
+│ retry.py                                                                │
+│  RetryHandler(max_heal=2, max_simple=3)                                 │
+│    run(pipeline_fn: Callable[[], VerificationRun]) → VerificationRun    │
+├─────────────────────────────────────────────────────────────────────────┤
+│ gcov_collector.py                                                       │
+│  collect_gcov(cobol_src: Path, work_dir: Path) → dict                  │
+│    出力: {available, line_rate, total_lines, executed_lines}           │
+├─────────────────────────────────────────────────────────────────────────┤
+├── 内部实现（不公开）───────────────────────────────────────────────────┤
+│ rule_engine/（混淆组规则引擎 — 非公开，由 pipeline 内部调用）           │
+│                                                                         │
+│  confusion_groups.py — 8 个混淆组判定函数                               │
+│    resolve_matching_vs_keybreak(features) → dict                        │
+│    resolve_dedup_vs_nodedup(features) → dict                            │
+│    resolve_validation_vs_keybreak(features) → dict                      │
+│    resolve_csv_merge_vs_split(features) → dict                          │
+│    resolve_simple_vs_two_stage(features) → dict                         │
+│    resolve_pure_vs_mixed(features) → dict                               │
+│    resolve_division_50_25_100(features) → dict                          │
+│    resolve_mn_output_mode(features) → dict                              │
+│                                                                         │
+│  contradiction.py — 矛盾检测与解决                                       │
+│    CONTRADICTION_PAIRS: list[tuple[str, str]]                           │
+│    detect_contradictions(types: list[str]) → list[dict]                 │
+│    resolve_contradiction(type_a, type_b, features) → str                │
+│                                                                         │
+│  backtrack.py — 多轮回溯判定                                             │
+│    BacktrackResolver(max_iterations=3, fallback_type="unknown")         │
+│      resolve(features, initial_types, contradictions) → dict            │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+### 模块 3-2: `agents`（LLM 智能体）
+
+**负责人**: E
+**依赖**: data/field_tree.py, data/test_case.py, data/diff_result.py
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ LLMClient(model="gpt-4o-mini", timeout=15, cache_dir=".cache/llm")      │
+│   call(messages: list[dict], retries=1) → str                           │
+│                                                                         │
+│   通信契约: POST {base}/chat/completions                                │
+│      Header: Authorization: Bearer $LLM_API_KEY                         │
+│      Body: {model, messages}                                            │
+│      成功: {choices: [{message: {content: "..."}}]}                     │
+│                                                                         │
+│   缓存: SHA256(消息)→ 磁盘文件 .cache/llm/{hash}.json                   │
+├─────────────────────────────────────────────────────────────────────────┤
+│ Agent1Parser(llm: LLMClient)                                            │
+│   parse(text: str) → FieldTree     ← COPYBOOK 文本 → 字段树             │
+│                                                                         │
+│   提示词: 解析 COBOL COPYBOOK → JSON {fields: [...]}                   │
+├─────────────────────────────────────────────────────────────────────────┤
+│ Agent2Data(llm: LLMClient)                                              │
+│   design(tree: FieldTree, target="boundary", spark_mode=False)          │
+│     → TestSuite                                                         │
+│                                                                         │
+│   提示词: 根据 FieldTree 设计测试用例 → JSON {test_cases: [...]}       │
+├─────────────────────────────────────────────────────────────────────────┤
+│ Agent3Diagnostic(llm: LLMClient)                                        │
+│   analyze(fr: FieldResult) → str    ← 差异诊断 → 建议文本              │
+│                                                                         │
+│   提示词: 分析 COBOL-Java 字段差异原因 → JSON {issue_type, suggestion}│
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+### 模块 3-3: `comparator`（对比引擎）
+
+**负责人**: F
+**依赖**: data/field_tree.py, data/diff_result.py
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ align_records(cobol_records, java_records, key_field) → list[tuple]    │
+│  入力: COBOL记录列表, Java记录列表, 键字段名                             │
+│  出力: [(cobol_dict, java_dict, 'MATCHED'), ...]                       │
+│         (cobol_dict, None, 'MISSING_IN_SPARK')                        │
+│         (None, java_dict, 'EXTRA_IN_SPARK')                           │
+├─────────────────────────────────────────────────────────────────────────┤
+│ compare_field(name, c, j, field_type='decimal', tolerance=0.01)        │
+│     → FieldResult                                                      │
+│                                                                         │
+│  field_type 取值: 'decimal' / 'string' / 'date'                        │
+│  status 取值: PASS / TOLERATED / MISMATCH                              │
+├─────────────────────────────────────────────────────────────────────────┤
+│ CobolBinaryReader                                                        │
+│   read(binary_path: str, tree: FieldTree) → list[dict]                 │
+│     按 FieldTree 的 offset/length 解析二进制 → [{字段: 值}]            │
+├─────────────────────────────────────────────────────────────────────────┤
+│ Normalizer                                                               │
+│   normalize_comp3(data: bytes) → str     ← COMP-3 解码                 │
+├─────────────────────────────────────────────────────────────────────────┤
+│ detect_rounding(c: str, j: str) → RoundingResult                       │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+### 模块 3-4: `runners`（编译运行引擎）
+
+**负责人**: G
+**依赖**: data/test_case.py, data/diff_result.py
+
+```python
+@dataclass
+class BuildResult:
+    success: bool
+    artifact_path: str = ""
+    log: str = ""
+
+@dataclass
+class RunResult:
+    success: bool
+    records: list[dict] = field(default_factory=list)
+    log: str = ""
+    coverage_exec: str = ""
+
+@dataclass
+class CoverageReport:
+    branch_rate: float = 0.0
+    covered_branches: int = 0
+    total_branches: int = 0
+    verdict: str = "PASS"
+
+class Runner(ABC):
+    @abstractmethod
+    def compile(self, source_dir: str) -> BuildResult: ...
+    @abstractmethod
+    def run(self, artifact: str, input_path: str, output_path: str) -> RunResult: ...
+    @abstractmethod
+    def get_coverage(self, artifact: str, run_id: str) -> CoverageReport: ...
+
+class CobolRunner:
+    def compile(self, src: str, dialect="ibm", gcov=False) -> BuildResult: ...
+    def run(self, binary: str, input_path: str, output_path: str) -> RunResult: ...
+
+class NativeJavaRunner(Runner): ...    # mvn + java -jar
+class SparkJavaRunner(Runner): ...     # spark-submit
+
+class DataWriter:
+    def write_cobol_binary(tests: list[TestCase], path: Path): ...
+    def write_native_json(tests: list[TestCase], path: Path): ...
+    def write_spark_json(tests: list[TestCase], config: SparkConfig, outdir: Path): ...
+```
+
+---
+
+## 五、管道集成层 (Layer 4) — 接口规范
+
+### 模块 4-1: `orchestrator`（管道导演）
+
+**负责人**: H (@所有人 集成)
+**依赖**: Layer 2 + Layer 3 的所有模块
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ run_pipeline(cfg: Config,                                             │
+│              cpath: str,          ← copybook 路径                      │
+│              cbl: str,            ← COBOL 源码路径                     │
+│              java: str,           ← Java 源码路径                      │
+│              map_path: str)       ← mapping 路径                       │
+│   → VerificationRun                                                     │
+│                                                                         │
+│  内部流程（各步骤可独立替换）:                                           │
+│                                                                         │
+│  Step 1:  Agent1Parser(llm).parse(cpath)         → FieldTree           │
+│  Step 2:  extract_structure(cbl)                  → structure dict     │
+│  Step 3:  generate_data(cbl, structure)           → TestCase[]         │
+│  Step 4:  compute_confidence(cbl, structure)       → HINA result       │
+│           classify_with_llm(structure, llm)                            │
+│  Step 5:  strategy_supplement(tests, hina)         → 补充 TestCase[]   │
+│  Step 6:  gate_check(tests, hina, cov, ...)        → 质量门禁          │
+│  Step 7:  Agent2Data(llm).design(tree)             → TestSuite         │
+│  Step 8:  DataWriter -> cobol_binary + json        → 输入文件          │
+│  Step 9:  CobolRunner.compile(cbl)                  → BuildResult      │
+│  Step 10: CobolRunner.run(binary, input)            → RunResult        │
+│  Step 11: Runner.compile(java)                      → BuildResult      │
+│  Step 12: Runner.run(jar, input)                    → RunResult        │
+│  Step 13: CobolBinaryReader.read(co_out, tree)      → COBOL records    │
+│  Step 14: align_records(cobol, java, key)           → aligned tuples   │
+│  Step 15: compare_field(field, c, j, type, tol)     → FieldResult[]    │
+│  Step 16: Agent3Diagnostic.analyze(mismatch)        → suggestions      │
+│  Step 17: ReportGenerator → JSON + HTML            → 报告文件          │
+│                                                                         │
+│  返回: VerificationRun (全结果聚合)                                     │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 六、模块耦合关系矩阵
+
+```
+           depends_on →  L1     L2-1   L2-2   L2-3   L2-4   L2-5   L2-6   L2-7   L2-8   L2-9   L3-1   L3-2   L3-3   L3-4
+module     ↓               data   cbl_t  conf   prepr  qual   jcl    rpt    para   jp     cov    hina   agnt   comp   runr
+───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+L2-1 cobol_testgen       ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L2-2 config              ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L2-3 preprocessor        ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L2-4 quality             ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L2-5 jcl/parser          ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L2-6 report              ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L2-7 parametrized        ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L2-8 japanese_data       ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L2-9 coverage            ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ●      ─      ─      ─
+L3-1 hina                ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L3-2 agents              ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L3-3 comparator          ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L3-4 runners             ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
+L4 orchestrator          ●      ●      ●      ─      ─      ─      ●      ─      ─      ─      ●      ●      ●      ●
+L4 web/L4 jcl exec       ●      ─      ●      ─      ─      ●      ─      ─      ─      ─      ─      ─      ─      ●
+───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+```
+
+---
+
+## 七、模块依赖图（协作视图）
+
+```
+                      ┌───────────┐
+                      │  data/    │ ← 所有模块共享的数据契约
+                      └─────┬─────┘
+                            │
+       ┌────┬────┬────┬────┼────┬────┬────┬────┬────┬────┬────┐
+       │    │    │    │    │    │    │    │    │    │    │    │
+       ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼
+     cobol_t para  jp   cov  conf  qual  hina agnt comp runr rpt
+       │    │     │    │    │     │     │    │    │    │    │
+       └────┴─────┴────┴────┴─────┴─────┴────┴────┴────┴────┘
+                         │
+                         ▼
+                    orchestrator.py
+                         │
+                    ┌────┴────┐
+                    │         │
+                    ▼         ▼
+                  web/       jcl/exec
+```
+
+---
+
+## 八、多人协作分工方案
+
+| 开发者 | 模块 | 需知接口 | 独立程度 |
+|:-------|:-----|:---------|:---------|
+| **A** | `cobol_testgen/` (read/core/cond/design/coverage/output) | data/ 数据模型 | ✅ 完全独立 |
+| **B** | `config/` + `preprocessor/` + `report/` | data/diff_result.py | ✅ 完全独立 |
+| **C** | `quality/` + `jcl/parser/` | data/field_tree.py | ✅ 完全独立 |
+| **D** | `hina/` (pipeline/classifier/gate/agent/strategy/retry/gcov/rule_engine) + `coverage/` | data/diff_result.py | ✅ 完全独立 |
+| **E** | `agents/` (LLM/parser/data/diagnostic) | data/ 全部 3 个模型 | ✅ 完全独立 |
+| **F** | `comparator/` (align/compare/reader/normalize/round) | data/全部 | ✅ 完全独立 |
+| **G** | `runners/` (cobol/java/spark/datawriter) | data/test_case.py | ✅ 完全独立 |
+| **I** | `parametrized/` + `japanese_data.py` | 无 | ✅ 完全独立 |
+| **H** | `orchestrator.py` (集成)+ `web/` + `jcl/exec` | 所有模块 API | ⛓️ 需要所有人 |
+
+**各 Layer 3 模块只有 1 个统一约束**: 接收的输入必须是 data/ 中的数据类实例，返回的也必须是 data/ 中的数据类实例。只要遵守这个契约，模块开发者不需要知道其他模块的内部实现。
+
+---
+
+## 九、当前系统问题 & 改进项
+
+| 问题 | 影响 | 解决方案 |
+|:-----|:-----|:---------|
+| **`cobol_testgen/__init__.py` 混用公开和私有符号** (如 `_add_subscript`/`_init_child_names`) | 外部不清楚哪些是稳定接口 | 添加 `__all__` 明确定义公开 API |
+| **多数模块没有 `__all__`** | 无法区分公开/内部函数 | 每个模块根文件添加 `__all__` |
+| **orchestrator 直接 import 内部子模块** (如 `cobol_testgen.coverage.check_coverage`) | Layer 越界，管道依赖了引擎内部 | orchestrator 只应 import 各模块的顶层公开函数 |
+| **Config 字段没有验证/文档** | 修改 Config 可能破坏其他模块 | 添加字段校验 + 注释 |
+| **函数签名缺少类型注解** (部分历史代码) | 接口不明确 | 补全所有公开函数的类型注解 |
+| **没有模块版本号/变更记录** | 无法追踪接口变更 | 添加 `__version__` 到每个模块 |