Files

T

hangshuo652 bc1d56d1a4 feat: Phase 2 complete — 13 Phases of COBOL type classification and test benchmark

P0.6: gcov infrastructure
P1: extract_structure output expansion (11 new feature fields)
P2: Confusion group rule engine (8 pairs + contradiction + backtrack)
P3: 4-factor confidence calculation + quality gate update
P4: 33+2 COBOL program type test samples (22 files, 7 categories)
P5: parametrized/ test data generation engine
P6: japanese_data.py lookup tables
P7-10: Type-specific test suites (~159 parametrized tests)
P11: Full classification pipeline (classify_program) + orchestrator integration
P12: Documentation (module-interfaces, test-plan v3.0, coverage-matrix)

Architecture decisions:
- classification_pipeline/ merged to hina/pipeline/
- parametrized/ as independent module
- japanese_data.py as root-level file
- hina/__all__ only exports classify_program()

Co-Authored-By: Claude <noreply@anthropic.com>

2026-06-19 23:51:55 +08:00

53 KiB

Raw Blame History

COBOL-Java 迁移验证平台 — 模块接口规范

目的：明确定义每个模块的边界、公开 API、数据契约，实现多人并行开发。每个模块可以由不同开发者独立开发，只要遵循接口契约即可集成。

一、模块分层架构

┌──────────────────────────────────────────────────────────────────────────┐
│                              Layer 4: 管道集成                           │
│                                                                          │
│  orchestrator.py  — 管道导演，编排全流程                                  │
│  web/              — FastAPI + Worker 网络层                             │
│  jcl/executor.py   — JCL 执行器                                          │
└───────────────────────────────────┬──────────────────────────────────────┘
                                    │ 调用
                                    ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                      Layer 3: 业务引擎                                    │
│                                                                          │
│  agents/    — LLM 智能体（解析/设计/诊断）                                 │
│  hina/      — 程序分类（关键字/规则/LLM）                                  │
│  comparator/— 对比引擎（对齐/比较/舍入）                                    │
│  runners/   — 编译运行引擎（COBOL/Java/Spark）                             │
└───────────────────────────────────┬──────────────────────────────────────┘
                                    │ 调用
                                    ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                      Layer 2: 核心引擎                                    │
│                                                                          │
│  cobol_testgen/  — COBOL 解析 + 测试数据生成                              │
│  report/         — 报告生成器                                             │
│  jcl/parser.py   — JCL 解析器                                            │
│  config/         — 配置管理                                               │
│  quality/        — 质量验证                                               │
│  preprocessor.py — COPYBOOK 预处理                                        │
└───────────────────────────────────┬──────────────────────────────────────┘
                                    │ 使用
                                    ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                      Layer 1: 数据模型 + 存储                             │
│                                                                          │
│  data/     — 核心数据模型（所有层共享）                                     │
│  storage/  — 持久化存储（缓存/报告库）                                     │
└──────────────────────────────────────────────────────────────────────────┘

二、数据模型层 (Layer 1) — 所有层的契约

`data/field_tree.py` — 字段树

@dataclass
class Field:
    name: str
    level: int
    pic: str
    usage: str = "DISPLAY"          # COMP / COMP-3 / DISPLAY / ...
    offset: int = 0
    length: int = 0
    decimal: int = 0
    signed: bool = False
    sign_separate: bool = False
    occurs: Optional[int] = None
    occurs_max: Optional[int] = None
    redefines: Optional[str] = None
    redefines_variant: Optional[str] = None
    conditions: list[dict] = field(default_factory=list)
    children: list["Field"] = field(default_factory=list)

@dataclass
class FieldTree:
    fields: list[Field] = field(default_factory=list)
    copybook_name: str = ""
    sha256: str = ""
    def flatten(self) -> dict[str, Field]: ...
    def get_by_name(self, name: str) -> Optional[Field]: ...
    @classmethod
    def from_list(cls, fields, name="") -> "FieldTree": ...

`data/test_case.py` — 测试用例

@dataclass
class TestCase:
    id: str
    fields: dict = field(default_factory=dict)      # {字段名: 值}
    coverage_targets: list[str] = field(default_factory=list)

@dataclass
class TestSuite:
    test_cases: list[TestCase] = field(default_factory=list)
    spark_config: Optional[SparkConfig] = None
    @property
    def has_spark(self) -> bool: ...

@dataclass
class SparkConfig:
    num_records: int = 100
    replication: str = "key_varied"
    key_field: str = ""
    edge_cases: list[str] = field(default_factory=list)

`data/diff_result.py` — 对比结果

@dataclass
class FieldResult:
    field_name: str = ""
    status: str = "PASS"            # PASS / TOLERATED / MISMATCH / NOT_SET
    cobol_value: str = ""
    java_value: str = ""
    tolerance_applied: float = 0.0
    rounding_detected: str = ""
    suggestion: str = ""

@dataclass
class VerificationRun:
    program: str = ""
    timestamp: str = ""
    status: str = "PASS"            # PASS / MISMATCH / BLOCKED / ERROR / FATAL
    exit_code: int = 0
    duration_s: float = 0.0
    fields_matched: int = 0
    fields_mismatched: int = 0
    field_results: list[FieldResult] = field(default_factory=list)
    runner: str = "native"          # native / spark
    branch_rate: float = 0.0
    paragraph_rate: float = 0.0
    decision_rate: float = 0.0
    hina_type: str = ""
    hina_confidence: float = 0.0
    quality_score: float = 0.0
    quality_warn: str = ""
    heal_retry: int = 0
    simple_retry: int = 0
    total_retry: int = 0
    llm_cost: float = 0.0
    report_path: str = ""
    debug: dict = field(default_factory=dict)
    @property
    def total_fields(self) -> int: ...
    def verdict(self) -> str: ...

三、核心引擎层 (Layer 2) — 接口规范

模块 2-1: `cobol_testgen`（COBOL 解析 + 数据生成）

负责人: A 依赖: data/ (Field, FieldTree, PicInfo, FieldDef, BrSeq, ...)

公开函数:
┌─────────────────────────────────────────────────────────────────────────┐
│ extract_structure(cobol_source: str, source_dir: str = None) → dict    │
│                                                                         │
│  入力: COBOL 源码文本、可选的 COPYBOOK 搜索路径                          │
│  出力: {                                                                │
│    paragraphs: list[str],          ← 段落名列表                         │
│    total_paragraphs: int,          ← 段落总数                          │
│    decision_points: list[dict],    ← [{id, kind, label, branches}, ...]│
│    total_branches: int,            ← 分支总数                          │
│    branch_tree: BrSeq,             ← 控制流树                           │
│    file_count: int,                ← SELECT 文件数                      │
│    open_directions: dict,          ← {文件名: INPUT/OUTPUT/I-O}        │
│    has_search_all: bool,           ← 是否有 SEARCH ALL                  │
│    has_evaluate: bool,             ← 是否有 EVALUATE                    │
│    has_call: bool,                 ← 是否有 CALL                        │
│    has_break: bool,                ← 是否有 key 中断                    │
│    branch_tree_obj: BrSeq,         ← 原始分支树对象                     │
│  }                                                                     │
├─────────────────────────────────────────────────────────────────────────┤
│ generate_data(cobol_source: str, structure: dict,                      │
│               source_dir: str = None) → list[dict]                     │
│                                                                         │
│  入力: COBOL源码, extract_structure 的输出, 搜索路径                    │
│  出力: [{字段名: 值, ...}, ...]  ← 每条记录覆盖一条分支路径              │
├─────────────────────────────────────────────────────────────────────────┤
│ incremental_supplement(branch_tree, decision_gaps: list[int])          │
│   → list[dict]                                                         │
│                                                                         │
│  入力: 分支树对象, 未覆盖决策点的 ID 列表                                │
│  出力: 补充的新测试记录                                                  │
└─────────────────────────────────────────────────────────────────────────┘

子模块职责:
  read.py    — 预处理 + DATA DIVISION 解析 + PIC 解析 → FieldDef[]
  core.py    — PROCEDURE DIVISION 解析 → BrSeq 树 + assignments
  cond.py    — 条件表达式解析 + MC/DC 枚举 → CondLeaf/And/Or/Not
  design.py  — 路径枚举 + 值生成 + 约束应用 → generate_records()
  coverage.py— 决策点收集 + 标记 + HTML报告 → check_coverage()
  output.py  — JSON/文件输出 → output_json/output_input_files
  models.py  — 数据模型 (共享)

模块 2-2: `config`（配置管理）

负责人: B 依赖: 无内部依赖

┌─────────────────────────────────────────────────────────────────────────┐
│ Config (dataclass)                                                      │
│                                                                         │
│  字段:                                                                   │
│    project_name: str = ""                                               │
│    copybook_paths: list = ["./copybooks"]                               │
│    dialect: str = "ibm"                 # cobc -std 参数               │
│    llm_model: str = "gpt-4o-mini"       # LLM 模型                     │
│    llm_timeout: int = 15                                                 │
│    llm_cache_dir: str = ".cache/llm"                                     │
│    coverage_default: str = "boundary"                                    │
│    rounding_mode: str = "TRUNCATE"                                       │
│    tolerance: float = 0.01               # 比较容忍度                   │
│    runner_mode: str = "native"           # native / spark               │
│    spark_master: str = "local[*]"                                        │
│    num_records: int = 1000                                               │
│    branch_pass: float = 0.80             # 覆盖率通过阈值               │
│    max_llm_cost: float = 0.50                                           │
│    quality_gate_mode: str = "warn"       # off / warn / strict          │
│    quality_gate_decision_threshold: float = 0.90                        │
│    quality_gate_paragraph_threshold: float = 1.0                        │
│    gcov_enabled: bool = False                                           │
│    max_quality_retries: int = 4                                         │
│                                                                         │
│  类方法:                                                                 │
│    @classmethod from_toml(path="aurak.toml") → Config                   │
└─────────────────────────────────────────────────────────────────────────┘

模块 2-3: `preprocessor`（COPYBOOK 预处理）

负责人: B 依赖: 无

┌─────────────────────────────────────────────────────────────────────────┐
│ CopybookPreprocessor                                                     │
│                                                                         │
│  __init__(paths: list = ["./copybooks"])                                │
│  expand(text: str) → str            # COPY 语句展开后的源码             │
└─────────────────────────────────────────────────────────────────────────┘

模块 2-4: `quality`（质量验证）

负责人: C 依赖: data/field_tree.py

┌─────────────────────────────────────────────────────────────────────────┐
│ L1OffsetValidator                                                        │
│   validate(tree: FieldTree, cpath: str) → dict {score, mismatches}     │
│                                                                         │
│ L2RoundtripValidator                                                     │
│   validate(tree: FieldTree) → dict {pass, results}                     │
└─────────────────────────────────────────────────────────────────────────┘

模块 2-5: `jcl/parser.py`（JCL 解析）

负责人: C 依赖: 无

┌─────────────────────────────────────────────────────────────────────────┐
│ parse_jcl(filepath: str) → Optional[Job]                                │
│                                                                         │
│ Job { job_name: str, steps: list[JobStep] }                             │
│ JobStep { step_name: str, program: str,                                 │
│           dd_entries: list[DDEntry], cond: Optional[CondParam] }        │
│ DDEntry { dd_name: str, dsn: Optional[str], disp: Optional[str],        │
│           sysout: Optional[str], inline_data: list[str] }               │
│ CondParam { code: int, operator: str, step_name: Optional[str] }        │
└─────────────────────────────────────────────────────────────────────────┘

模块 2-6: `report`（报告生成）

负责人: B 依赖: data/diff_result.py

┌─────────────────────────────────────────────────────────────────────────┐
│ ReportGenerator                                                          │
│   generate_json(vr: VerificationRun, path: Path)                        │
│   generate_html(vr: VerificationRun, path: Path)                        │
│   generate_machine_json(vr: VerificationRun, path: Path)                │
└─────────────────────────────────────────────────────────────────────────┘

模块 2-7: `parametrized`（测试数据生成器）

负责人: I (新增) 依赖: 无

公开函数（8 个）:
┌─────────────────────────────────────────────────────────────────────────┐
│ matching.py — 匹配系数据生成                                             │
│                                                                         │
│  generate_matching_data(matching_type="1:1",                            │
│                         record_count_r01=10,                            │
│                         record_count_r02=10,                            │
│                         key_match_ratio=1.0) → tuple[list, list]       │
│    出力: (主文件记录列表, 从文件记录列表)                                  │
│    匹配模式: "1:1" / "1:N" / "N:1"                                       │
├─────────────────────────────────────────────────────────────────────────┤
│ matching.py — KEY 切中断数据生成                                         │
│                                                                         │
│  generate_keybreak_data(group_count=3,                                  │
│                         records_per_group=2,                            │
│                         sum_type="accumulate") → list[dict]            │
│    出力: [{KEY, FIELD, GROUP, SEQ}, ...]                                │
│    sum_type: "accumulate" / "aggregate" / "mark"                        │
├─────────────────────────────────────────────────────────────────────────┤
│ division.py — 分割系数据生成                                             │
│                                                                         │
│  generate_division_data(division_type=50,                               │
│                         record_count=50) → list[list[dict]]             │
│    出力: [[文件1记录], [文件2记录], ...]                                  │
│    division_type: 50(对半) / 25(四等分) / 100(全量)                    │
├─────────────────────────────────────────────────────────────────────────┤
│ common.py — 通用数据生成工具                                              │
│                                                                         │
│  generate_zero_byte_file(path: str) → None                              │
│    写入 0 字节空文件                                                      │
│                                                                         │
│  generate_minimal_records(fields: list[dict]) → list[dict]              │
│    出力: 1 条类型合理默认值的记录                                         │
│                                                                         │
│  generate_sorted_records(record_count=10, key_field="KEY") → list[dict] │
│    出力: 已按 KEY 升序排列的记录列表                                      │
│                                                                         │
│  generate_duplicate_keys(records: list[dict], key_field="KEY")          │
│      → list[dict]                                                       │
│    出力: 原记录 + 同键值重复记录（用于 SORT MERGE / 去重测试）             │
│                                                                         │
│  generate_boundary_values(pic: str) → dict                              │
│    出力: {max, min, overflow, zero, pic_info}                           │
│    从 PIC 子句解析出最大值 / 最小值 / 溢出值                               │
└─────────────────────────────────────────────────────────────────────────┘

模块 2-8: `japanese_data.py`（日文测试数据生成）

负责人: I (新增) 依赖: 无

公开函数（8 个生成函数 + 常量表）:
┌─────────────────────────────────────────────────────────────────────────┐
│ 查找表常量                                                               │
│  FULLWIDTH_KATAKANA    — 全角片假名字符串                                │
│  FULLWIDTH_HIRAGANA    — 全角平假名字符串                                │
│  FULLWIDTH_DIGITS      — 全角数字                                        │
│  FULLWIDTH_ALPHA       — 全角字母                                        │
│  HALFWIDTH_KATAKANA    — 半角片假名字符串                                │
│  SJIS_5C_PROBLEM       — Shift-JIS 第2字节 0x5C 问题文字                │
│  SJIS_7C_PROBLEM       — Shift-JIS 第2字节 0x7C 问题文字                │
│  WAREKI_BOUNDARIES     — 和历边界对照表                                  │
├─────────────────────────────────────────────────────────────────────────┤
│ 生成函数                                                                 │
│                                                                         │
│  generate_fullwidth_text(field: dict) → str                             │
│    全角片假名填充 PIC N 字段                                              │
│                                                                         │
│  generate_halfwidth_katakana(field: dict) → str                         │
│    半角片假名填充 PIC X 字段                                              │
│                                                                         │
│  generate_sjis_5c_problem(field: dict) → str                            │
│    含 Shift-JIS 0x5C 问题文字的字符串                                    │
│                                                                         │
│  generate_sjis_7c_problem(field: dict) → str                            │
│    含 Shift-JIS 0x7C 问题文字的字符串                                    │
│                                                                         │
│  generate_wareki_date(wareki_type="R") → str                            │
│    和历日期字符串（格式: R050101）                                        │
│                                                                         │
│  generate_wareki_boundary(era="平成") → tuple[str, str]                  │
│    和历边界日期对（前代末日, 新代初日）                                    │
│                                                                         │
│  generate_encoding_test_data(from_enc="shift_jis", to_enc="utf-8")     │
│      → tuple[bytes, bytes]                                              │
│    Shift-JIS ↔ UTF-8 编码回环验证数据                                    │
│                                                                         │
│  select_data_type(field: dict) → str                                    │
│    字段类型判断: "japanese" / "numeric" / "halfwidth"                     │
└─────────────────────────────────────────────────────────────────────────┘

模块 2-9: `coverage/compare_coverage.py`（覆盖率比较）

负责人: D 依赖: 无

┌─────────────────────────────────────────────────────────────────────────┐
│ compare_coverage(program_name: str,                                     │
│                  static: dict,                                          │
│                  dynamic: dict) → dict                                  │
│                                                                         │
│  入力:                                                                   │
│    program_name — 程序名称                                               │
│    static — 静态覆盖率: {branch_rate, paragraph_rate, ...}               │
│    dynamic — 动态覆盖率: {gcov_cov, covered_branches, ...}              │
│                                                                         │
│  出力: {                                                                │
│    program: str,               ← 程序名称                               │
│    static: {branch_rate, paragraph_rate},                               │
│    dynamic: {gcov_cov},                                                 │
│    gap: float,                 ← static - dynamic 差异                  │
│    misleading_branches: list,  ← 静态覆盖但动态未覆盖的分支              │
│  }                                                                      │
│                                                                         │
│  用途: 识别 gcov 实际运行与静态分析之间的偏离                              │
└─────────────────────────────────────────────────────────────────────────┘

四、业务引擎层 (Layer 3) — 接口规范

模块 3-1: `hina`（程序分类 + 质量门禁 + 类型判定管道）

负责人: D 依赖: data/diff_result.py (VerificationRun)

├── 公开 API ────────────────────────────────────────────────────────────┤
│ pipeline/（类型判定管道 — 唯一入口）                                    │
│                                                                         │
│  classify_program(cobol_source: str) → dict                            │
│    出力: {category, confidence, subtype, strategy_params,              │
│            resolved_type, needs_review, ...}                            │
│                                                                         │
│    流程:                                                                 │
│    1. 并行执行关键字识别 + 结构提取                                      │
│    2. 确信度 ≥ 90% → 直接输出                                          │
│    3. 确信度 50-89% → 混淆组判定 + 4因子确信度 + 矛盾检测 + 回溯       │
│    4. 确信度 < 50% → 标记需人工处理                                     │
├─────────────────────────────────────────────────────────────────────────┤
│ confidence.py                                                           │
│                                                                         │
│  compute_confidence_v2(keyword_result, structure_features,             │
│                        contradictions=None, resolution=None) → dict    │
│    出力: {confidence, base, context_factor, consistency_factor,        │
│            structure_factor, judgment, needs_review}                    │
│                                                                         │
│    4 因子确信度公式:                                                     │
│      confidence = base × context_factor × consistency_factor           │
│                    × structure_factor                                   │
│                                                                         │
│    判定标准:                                                             │
│      >= 0.90 auto       — 自动通过                                     │
│      0.70-0.89 review    — 需要人工审核                                 │
│      0.50-0.69 manual    — 需要人工介入                                 │
│      < 0.50   impossible — 无法判定                                    │
├─────────────────────────────────────────────────────────────────────────┤
│ classifier.py                                                           │
│                                                                         │
│  L1_RULES: list[tuple[str, list[str], float]]    ← 11类关键字规则      │
│                                                                         │
│  detect_keyword(source: str) → list[tuple[str, float, str]]            │
│    出力: [(分类名, 确信度, 匹配关键字), ...]                             │
├─────────────────────────────────────────────────────────────────────────┤
│ hina_agent.py                                                           │
│                                                                         │
│  classify_with_llm(structure: dict, llm) → dict                        │
│    出力: {category, subtype, confidence, strategy_params}               │
│                                                                         │
│  _parse_llm_response(raw: str) → dict                                   │
│  _validate_result(parsed: dict) → dict                                  │
│  _fallback_classification(structure: dict) → dict  ← 7混淆组规则      │
├─────────────────────────────────────────────────────────────────────────┤
│ gate.py                                                                 │
│  compute_quality_score(branch_rate, paragraph_rate) → float             │
│  check(tests, hina_result, coverage, thresholds...) → dict              │
│    出力: {passed: bool, score: float, issues: dict}                    │
├─────────────────────────────────────────────────────────────────────────┤
│ strategy.py                                                             │
│  get_strategy(hina_type: str) → dict    ← 5类型策略模板                 │
│  supplement(base_tests, hina_result) → list[dict]                      │
│  supplement_only(base_tests, gaps) → list[dict]                        │
├─────────────────────────────────────────────────────────────────────────┤
│ retry.py                                                                │
│  RetryHandler(max_heal=2, max_simple=3)                                 │
│    run(pipeline_fn: Callable[[], VerificationRun]) → VerificationRun    │
├─────────────────────────────────────────────────────────────────────────┤
│ gcov_collector.py                                                       │
│  collect_gcov(cobol_src: Path, work_dir: Path) → dict                  │
│    出力: {available, line_rate, total_lines, executed_lines}           │
├─────────────────────────────────────────────────────────────────────────┤
├── 内部实现（不公开）───────────────────────────────────────────────────┤
│ rule_engine/（混淆组规则引擎 — 非公开，由 pipeline 内部调用）           │
│                                                                         │
│  confusion_groups.py — 8 个混淆组判定函数                               │
│    resolve_matching_vs_keybreak(features) → dict                        │
│    resolve_dedup_vs_nodedup(features) → dict                            │
│    resolve_validation_vs_keybreak(features) → dict                      │
│    resolve_csv_merge_vs_split(features) → dict                          │
│    resolve_simple_vs_two_stage(features) → dict                         │
│    resolve_pure_vs_mixed(features) → dict                               │
│    resolve_division_50_25_100(features) → dict                          │
│    resolve_mn_output_mode(features) → dict                              │
│                                                                         │
│  contradiction.py — 矛盾检测与解决                                       │
│    CONTRADICTION_PAIRS: list[tuple[str, str]]                           │
│    detect_contradictions(types: list[str]) → list[dict]                 │
│    resolve_contradiction(type_a, type_b, features) → str                │
│                                                                         │
│  backtrack.py — 多轮回溯判定                                             │
│    BacktrackResolver(max_iterations=3, fallback_type="unknown")         │
│      resolve(features, initial_types, contradictions) → dict            │
└─────────────────────────────────────────────────────────────────────────┘

模块 3-2: `agents`（LLM 智能体）

负责人: E 依赖: data/field_tree.py, data/test_case.py, data/diff_result.py

┌─────────────────────────────────────────────────────────────────────────┐
│ LLMClient(model="gpt-4o-mini", timeout=15, cache_dir=".cache/llm")      │
│   call(messages: list[dict], retries=1) → str                           │
│                                                                         │
│   通信契约: POST {base}/chat/completions                                │
│      Header: Authorization: Bearer $LLM_API_KEY                         │
│      Body: {model, messages}                                            │
│      成功: {choices: [{message: {content: "..."}}]}                     │
│                                                                         │
│   缓存: SHA256(消息)→ 磁盘文件 .cache/llm/{hash}.json                   │
├─────────────────────────────────────────────────────────────────────────┤
│ Agent1Parser(llm: LLMClient)                                            │
│   parse(text: str) → FieldTree     ← COPYBOOK 文本 → 字段树             │
│                                                                         │
│   提示词: 解析 COBOL COPYBOOK → JSON {fields: [...]}                   │
├─────────────────────────────────────────────────────────────────────────┤
│ Agent2Data(llm: LLMClient)                                              │
│   design(tree: FieldTree, target="boundary", spark_mode=False)          │
│     → TestSuite                                                         │
│                                                                         │
│   提示词: 根据 FieldTree 设计测试用例 → JSON {test_cases: [...]}       │
├─────────────────────────────────────────────────────────────────────────┤
│ Agent3Diagnostic(llm: LLMClient)                                        │
│   analyze(fr: FieldResult) → str    ← 差异诊断 → 建议文本              │
│                                                                         │
│   提示词: 分析 COBOL-Java 字段差异原因 → JSON {issue_type, suggestion}│
└─────────────────────────────────────────────────────────────────────────┘

模块 3-3: `comparator`（对比引擎）

负责人: F 依赖: data/field_tree.py, data/diff_result.py

┌─────────────────────────────────────────────────────────────────────────┐
│ align_records(cobol_records, java_records, key_field) → list[tuple]    │
│  入力: COBOL记录列表, Java记录列表, 键字段名                             │
│  出力: [(cobol_dict, java_dict, 'MATCHED'), ...]                       │
│         (cobol_dict, None, 'MISSING_IN_SPARK')                        │
│         (None, java_dict, 'EXTRA_IN_SPARK')                           │
├─────────────────────────────────────────────────────────────────────────┤
│ compare_field(name, c, j, field_type='decimal', tolerance=0.01)        │
│     → FieldResult                                                      │
│                                                                         │
│  field_type 取值: 'decimal' / 'string' / 'date'                        │
│  status 取值: PASS / TOLERATED / MISMATCH                              │
├─────────────────────────────────────────────────────────────────────────┤
│ CobolBinaryReader                                                        │
│   read(binary_path: str, tree: FieldTree) → list[dict]                 │
│     按 FieldTree 的 offset/length 解析二进制 → [{字段: 值}]            │
├─────────────────────────────────────────────────────────────────────────┤
│ Normalizer                                                               │
│   normalize_comp3(data: bytes) → str     ← COMP-3 解码                 │
├─────────────────────────────────────────────────────────────────────────┤
│ detect_rounding(c: str, j: str) → RoundingResult                       │
└─────────────────────────────────────────────────────────────────────────┘

模块 3-4: `runners`（编译运行引擎）

负责人: G 依赖: data/test_case.py, data/diff_result.py

@dataclass
class BuildResult:
    success: bool
    artifact_path: str = ""
    log: str = ""

@dataclass
class RunResult:
    success: bool
    records: list[dict] = field(default_factory=list)
    log: str = ""
    coverage_exec: str = ""

@dataclass
class CoverageReport:
    branch_rate: float = 0.0
    covered_branches: int = 0
    total_branches: int = 0
    verdict: str = "PASS"

class Runner(ABC):
    @abstractmethod
    def compile(self, source_dir: str) -> BuildResult: ...
    @abstractmethod
    def run(self, artifact: str, input_path: str, output_path: str) -> RunResult: ...
    @abstractmethod
    def get_coverage(self, artifact: str, run_id: str) -> CoverageReport: ...

class CobolRunner:
    def compile(self, src: str, dialect="ibm", gcov=False) -> BuildResult: ...
    def run(self, binary: str, input_path: str, output_path: str) -> RunResult: ...

class NativeJavaRunner(Runner): ...    # mvn + java -jar
class SparkJavaRunner(Runner): ...     # spark-submit

class DataWriter:
    def write_cobol_binary(tests: list[TestCase], path: Path): ...
    def write_native_json(tests: list[TestCase], path: Path): ...
    def write_spark_json(tests: list[TestCase], config: SparkConfig, outdir: Path): ...

五、管道集成层 (Layer 4) — 接口规范

模块 4-1: `orchestrator`（管道导演）

负责人: H (@所有人集成) 依赖: Layer 2 + Layer 3 的所有模块

┌─────────────────────────────────────────────────────────────────────────┐
│ run_pipeline(cfg: Config,                                             │
│              cpath: str,          ← copybook 路径                      │
│              cbl: str,            ← COBOL 源码路径                     │
│              java: str,           ← Java 源码路径                      │
│              map_path: str)       ← mapping 路径                       │
│   → VerificationRun                                                     │
│                                                                         │
│  内部流程（各步骤可独立替换）:                                           │
│                                                                         │
│  Step 1:  Agent1Parser(llm).parse(cpath)         → FieldTree           │
│  Step 2:  extract_structure(cbl)                  → structure dict     │
│  Step 3:  generate_data(cbl, structure)           → TestCase[]         │
│  Step 4:  compute_confidence(cbl, structure)       → HINA result       │
│           classify_with_llm(structure, llm)                            │
│  Step 5:  strategy_supplement(tests, hina)         → 补充 TestCase[]   │
│  Step 6:  gate_check(tests, hina, cov, ...)        → 质量门禁          │
│  Step 7:  Agent2Data(llm).design(tree)             → TestSuite         │
│  Step 8:  DataWriter -> cobol_binary + json        → 输入文件          │
│  Step 9:  CobolRunner.compile(cbl)                  → BuildResult      │
│  Step 10: CobolRunner.run(binary, input)            → RunResult        │
│  Step 11: Runner.compile(java)                      → BuildResult      │
│  Step 12: Runner.run(jar, input)                    → RunResult        │
│  Step 13: CobolBinaryReader.read(co_out, tree)      → COBOL records    │
│  Step 14: align_records(cobol, java, key)           → aligned tuples   │
│  Step 15: compare_field(field, c, j, type, tol)     → FieldResult[]    │
│  Step 16: Agent3Diagnostic.analyze(mismatch)        → suggestions      │
│  Step 17: ReportGenerator → JSON + HTML            → 报告文件          │
│                                                                         │
│  返回: VerificationRun (全结果聚合)                                     │
└─────────────────────────────────────────────────────────────────────────┘

六、模块耦合关系矩阵

           depends_on →  L1     L2-1   L2-2   L2-3   L2-4   L2-5   L2-6   L2-7   L2-8   L2-9   L3-1   L3-2   L3-3   L3-4
module     ↓               data   cbl_t  conf   prepr  qual   jcl    rpt    para   jp     cov    hina   agnt   comp   runr
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
L2-1 cobol_testgen       ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L2-2 config              ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L2-3 preprocessor        ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L2-4 quality             ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L2-5 jcl/parser          ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L2-6 report              ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L2-7 parametrized        ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L2-8 japanese_data       ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L2-9 coverage            ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ●      ─      ─      ─
L3-1 hina                ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L3-2 agents              ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L3-3 comparator          ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L3-4 runners             ●      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─      ─
L4 orchestrator          ●      ●      ●      ─      ─      ─      ●      ─      ─      ─      ●      ●      ●      ●
L4 web/L4 jcl exec       ●      ─      ●      ─      ─      ●      ─      ─      ─      ─      ─      ─      ─      ●
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

七、模块依赖图（协作视图）

                      ┌───────────┐
                      │  data/    │ ← 所有模块共享的数据契约
                      └─────┬─────┘
                            │
       ┌────┬────┬────┬────┼────┬────┬────┬────┬────┬────┬────┐
       │    │    │    │    │    │    │    │    │    │    │    │
       ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼
     cobol_t para  jp   cov  conf  qual  hina agnt comp runr rpt
       │    │     │    │    │     │     │    │    │    │    │
       └────┴─────┴────┴────┴─────┴─────┴────┴────┴────┴────┘
                         │
                         ▼
                    orchestrator.py
                         │
                    ┌────┴────┐
                    │         │
                    ▼         ▼
                  web/       jcl/exec

八、多人协作分工方案

开发者	模块	需知接口	独立程度
A	`cobol_testgen/` (read/core/cond/design/coverage/output)	data/ 数据模型	✅ 完全独立
B	`config/` + `preprocessor/` + `report/`	data/diff_result.py	✅ 完全独立
C	`quality/` + `jcl/parser/`	data/field_tree.py	✅ 完全独立
D	`hina/` (pipeline/classifier/gate/agent/strategy/retry/gcov/rule_engine) + `coverage/`	data/diff_result.py	✅ 完全独立
E	`agents/` (LLM/parser/data/diagnostic)	data/ 全部 3 个模型	✅ 完全独立
F	`comparator/` (align/compare/reader/normalize/round)	data/全部	✅ 完全独立
G	`runners/` (cobol/java/spark/datawriter)	data/test_case.py	✅ 完全独立
I	`parametrized/` + `japanese_data.py`	无	✅ 完全独立
H	`orchestrator.py` (集成)+ `web/` + `jcl/exec`	所有模块 API	⛓️ 需要所有人

各 Layer 3 模块只有 1 个统一约束: 接收的输入必须是 data/ 中的数据类实例，返回的也必须是 data/ 中的数据类实例。只要遵守这个契约，模块开发者不需要知道其他模块的内部实现。

九、当前系统问题 & 改进项

问题	影响	解决方案
`cobol_testgen/__init__.py` 混用公开和私有符号 (如 `_add_subscript`/`_init_child_names`)	外部不清楚哪些是稳定接口	添加 `__all__` 明确定义公开 API
多数模块没有 `__all__`	无法区分公开/内部函数	每个模块根文件添加 `__all__`
orchestrator 直接 import 内部子模块 (如 `cobol_testgen.coverage.check_coverage`)	Layer 越界，管道依赖了引擎内部	orchestrator 只应 import 各模块的顶层公开函数
Config 字段没有验证/文档	修改 Config 可能破坏其他模块	添加字段校验 + 注释
函数签名缺少类型注解 (部分历史代码)	接口不明确	补全所有公开函数的类型注解
没有模块版本号/变更记录	无法追踪接口变更	添加 `__version__` 到每个模块

53 KiB Raw Blame History Unescape Escape

COBOL-Java 迁移验证平台 — 模块接口规范

一、模块分层架构

二、数据模型层 (Layer 1) — 所有层的契约

data/field_tree.py — 字段树

data/test_case.py — 测试用例

data/diff_result.py — 对比结果

三、核心引擎层 (Layer 2) — 接口规范

模块 2-1: cobol_testgen（COBOL 解析 + 数据生成）

模块 2-2: config（配置管理）

模块 2-3: preprocessor（COPYBOOK 预处理）

模块 2-4: quality（质量验证）

模块 2-5: jcl/parser.py（JCL 解析）

模块 2-6: report（报告生成）

模块 2-7: parametrized（测试数据生成器）

模块 2-8: japanese_data.py（日文测试数据生成）

模块 2-9: coverage/compare_coverage.py（覆盖率比较）