hangshuo652 bc1d56d1a4 feat: Phase 2 complete — 13 Phases of COBOL type classification and test benchmark

P0.6: gcov infrastructure
P1: extract_structure output expansion (11 new feature fields)
P2: Confusion group rule engine (8 pairs + contradiction + backtrack)
P3: 4-factor confidence calculation + quality gate update
P4: 33+2 COBOL program type test samples (22 files, 7 categories)
P5: parametrized/ test data generation engine
P6: japanese_data.py lookup tables
P7-10: Type-specific test suites (~159 parametrized tests)
P11: Full classification pipeline (classify_program) + orchestrator integration
P12: Documentation (module-interfaces, test-plan v3.0, coverage-matrix)

Architecture decisions:
- classification_pipeline/ merged to hina/pipeline/
- parametrized/ as independent module
- japanese_data.py as root-level file
- hina/__all__ only exports classify_program()

Co-Authored-By: Claude <noreply@anthropic.com>

2026-06-19 23:51:55 +08:00

6.7 KiB


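Contributing Guide: Modular Development Rules

This document defines the rules for multi-developer collaboration on this project. All developers must follow them.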

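1. Module layering

Layer 1: data/          ← core data models (shared by everyone)
Layer 2: cobol_testgen/ config/ jcl/parser/ quality/ report/ preprocessor/
                        ← business engines (each developer owns 1-2)
Layer 3: hina/ agents/ comparator/ runners/
                        ← advanced engines (each developer owns 1)
Layer 4: orchestrator/ web/ jcl/executor/
                        ← pipeline integration (owned by H)

Rule: a lower layer must not depend on a higher layer. For example, Layer 2 must not import from Layer 3.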
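
The layering rule can be checked mechanically. The following is a minimal sketch, assuming the layer assignments shown in the diagram above; the names `LAYERS`, `layer_of`, and `find_upward_imports` are hypothetical and not part of the project.

```python
import ast
from typing import Optional

# Layer numbers copied from the layer diagram; lower number = lower layer.
# Subpackages such as jcl/parser are keyed by their dotted import path.
LAYERS = {
    "data": 1,
    "cobol_testgen": 2, "config": 2, "jcl.parser": 2,
    "quality": 2, "report": 2, "preprocessor": 2,
    "hina": 3, "agents": 3, "comparator": 3, "runners": 3,
    "orchestrator": 4, "web": 4, "jcl.executor": 4,
}


def layer_of(dotted_name: str) -> Optional[int]:
    """Return the layer of the longest matching prefix, or None if unknown."""
    parts = dotted_name.split(".")
    for end in range(len(parts), 0, -1):
        layer = LAYERS.get(".".join(parts[:end]))
        if layer is not None:
            return layer
    return None


def find_upward_imports(module_name: str, source: str) -> list[str]:
    """List imported modules that sit in a higher layer than module_name."""
    own_layer = layer_of(module_name)
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue
        for target in targets:
            target_layer = layer_of(target)
            if own_layer is not None and target_layer is not None and target_layer > own_layer:
                violations.append(target)
    return violations
```

For instance, `find_upward_imports("cobol_testgen", "from hina import gate_check")` reports `hina`, because a Layer 2 module would be importing from Layer 3. Modules that do not appear in the map are ignored rather than guessed at.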

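2. `__all__` rules

Every module's __init__.py must define __all__.

In `__init__.py`:
  ✅ Names in __all__ = public API (others may depend on them)
  ❌ Names outside __all__ = internal implementation (may change at any time)

Only the functions and classes listed in __all__ are stable interfaces. Changing code outside __all__ requires no notification. Changing code inside __all__ requires all of the following:

- Update docs/module-interfaces.md
- Mark the PR with [BREAKING]
- Notify all users of the API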
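
Since `__all__` is the contract, it helps to confirm that every name it lists really exists. The sketch below is one possible check; `missing_exports` and the in-memory `demo_pkg` module are hypothetical stand-ins, not project code.

```python
import types


def missing_exports(module: types.ModuleType) -> list[str]:
    """Return names listed in module.__all__ that the module does not define."""
    exported = getattr(module, "__all__", None)
    if exported is None:
        raise ValueError(f"{module.__name__} has no __all__")
    return [name for name in exported if not hasattr(module, name)]


# Demo with an in-memory stand-in for a package's __init__.py.
demo = types.ModuleType("demo_pkg")
demo.extract_structure = lambda cobol_source: {}
demo._parse_internal = lambda text: text        # internal: not in __all__
demo.__all__ = ["extract_structure", "check_coverage"]

print(missing_exports(demo))  # check_coverage is listed but never defined
```

In a real repository the same function could be applied to each imported package, for example inside a test under `tests/`.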

3. Import rules

✅ Correct: import only from the module's top level

# ✅ Layer 2 → Layer 1
from data import Field, FieldTree, VerificationRun

# ✅ Layer 3 → Layer 1
from data import TestCase, TestSuite

# ✅ Layer 3 → Layer 2
from cobol_testgen import extract_structure

# ✅ Layer 4 → any layer
from hina import gate_check, compute_confidence
from comparator import compare_field
from runners import CobolRunner

❌ Wrong: reaching into a module's internals

# ❌ Do not import directly from submodules; your code would depend on internal structure
from cobol_testgen.coverage import check_coverage    # ❌
from hina.gate import check                          # ❌
from runners.cobol_runner import CobolRunner         # ❌
from agents.agent1_parser import Agent1Parser        # ❌

Why: if a submodule is refactored (renamed, split, or merged), every import that reaches into it breaks. The top-level __all__ provides a stable abstraction layer.
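
Imports that reach into another module's internals can be flagged automatically. This is a rough sketch under stated assumptions: `PROJECT_PACKAGES` is taken from the module names used in this guide, and `find_deep_imports` is a hypothetical helper.

```python
import ast

# Top-level packages named in this guide; anything else (stdlib, third party)
# is ignored by the check.
PROJECT_PACKAGES = {
    "data", "cobol_testgen", "config", "jcl", "quality", "report",
    "preprocessor", "hina", "agents", "comparator", "runners",
    "storage", "orchestrator", "web",
}


def find_deep_imports(source: str, own_package: str = "") -> list[str]:
    """Return dotted project imports that bypass another package's top level.

    Imports inside own_package are left alone, because a package may
    organise its own submodules freely.
    """
    deep = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            targets = [node.module]
        elif isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        else:
            continue
        for target in targets:
            top_level = target.split(".")[0]
            if "." in target and top_level in PROJECT_PACKAGES and top_level != own_package:
                deep.append(target)
    return deep
```

Passing `own_package` matters: `hina/__init__.py` importing from `hina.gate` is how the public API gets assembled, so only cross-package deep imports are reported.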

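4. Data model contract

4.1 All modules share the three groups of classes under data/

Class	Purpose	Owned by
`Field`, `FieldTree`	Field tree	cobol_testgen → → comparator
`TestCase`, `TestSuite`, `SparkConfig`	Test data	cobol_testgen → → runners
`FieldResult`, `VerificationRun`	Pipeline results	orchestrator → → comparator → → report

4.2 Rules for changing data/

- Required field: add it directly to the dataclass (backward compatible if it has a default value)
- Optional field: use Optional or a "" default
- Removing a field: must be marked [BREAKING]; check every usage
- Changing a field's meaning: must update the doc comments in data/__init__.py

Any change to data/ must be announced to all developers (WeChat group, or a PR marked [DATA-CHANGE]).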
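
To illustrate why a default value keeps a new field backward compatible, here is a small sketch. The `Field` class below is a hypothetical stand-in; the real `data.Field` may define different attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    # Hypothetical stand-in; the real data.Field may have different fields.
    name: str
    pic: str
    # Added later: defaults keep existing Field("AMT", "9(5)") calls working.
    usage: str = ""
    redefines: Optional[str] = None


old_style = Field("AMT", "9(5)")                   # call site written before the change
new_style = Field("AMT", "9(5)", usage="COMP-3")   # call site using the new field
print(old_style.usage == "", new_style.usage)
```

A new field without a default would force every existing call site to change, and Python also rejects a non-default dataclass field declared after a defaulted one, which is one more reason to give new fields a default.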

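5. Function signature rules

5.1 Type annotations are required

from typing import Optional

# ✅ Correct
def extract_structure(cobol_source: str, source_dir: Optional[str] = None) -> dict:
    ...

# ❌ Wrong: callers cannot tell the parameter types
def extract_structure(cobol_source, source_dir=None):
    ...

5.2 Return values must conform to the `data/` models

- A function returns several values → wrap them in a dataclass
- A function may return nothing → Optional[...]
- A function returns a collection → list[...] with an explicit element type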
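
The three return-value conventions above can be sketched together. `CoverageSummary`, `summarize`, and `first_missing` are hypothetical names used only for illustration; they are not part of `data/`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CoverageSummary:
    # Hypothetical result type, used here only for illustration.
    covered: int
    total: int
    missing_branches: list[str] = field(default_factory=list)


def summarize(branch_hits: dict[str, bool]) -> CoverageSummary:
    """Several return values travel together in one dataclass."""
    missing = [name for name, hit in branch_hits.items() if not hit]
    return CoverageSummary(
        covered=len(branch_hits) - len(missing),
        total=len(branch_hits),
        missing_branches=missing,
    )


def first_missing(summary: CoverageSummary) -> Optional[str]:
    """Optional[...] tells callers that None is a possible result."""
    return summary.missing_branches[0] if summary.missing_branches else None
```

Callers then read `summary.covered` instead of unpacking a tuple by position, and a type checker can warn when the `None` case of `first_missing` is not handled.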

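5.3 Do not use dict as an "implicit interface"

# ❌ Returns an untyped dict: callers must read the implementation to learn the key names
def run() -> dict:
    return {"status": "ok", "matched": 5}

# ✅ Returns a typed object
def run() -> VerificationRun:
    return vr

Complete data classes such as VerificationRun already exist; prefer reusing them.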

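6. Testing rules

6.1 Test file locations

tests/
├── cobol_testgen/    ← A's tests
├── hina/             ← D's tests
├── agents/           ← E's tests
├── config/           ← B's tests
├── comparator/       ← F's tests
├── runners/          ← G's tests
├── data/             ← data model tests (everyone)
├── nonfunctional/    ← performance / concurrency / security
└── test_*.py         ← cross-module integration tests (owned by H)

6.2 Test case naming

# Recommended pattern: test_[module]_[feature]  ✅
def test_extract_structure_simple_if():
    ...

6.3 Test coverage requirements

- New features must come with tests
- Bug fixes must come with a regression test
- Coverage targets: core pipeline ≥ 85%, each module ≥ 70%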
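
As an illustration of the naming pattern and the regression-test requirement, consider the sketch below. `count_if_statements` is a hypothetical helper invented for this example and does not exist in `cobol_testgen`.

```python
def count_if_statements(cobol_source: str) -> int:
    """Hypothetical helper: count IF statements, ignoring END-IF."""
    tokens = cobol_source.upper().replace(".", " ").split()
    return sum(1 for token in tokens if token == "IF")


def test_cobol_testgen_count_if_simple():
    # New feature: one plain IF block is counted once.
    source = "IF A > B MOVE 1 TO C END-IF."
    assert count_if_statements(source) == 1


def test_cobol_testgen_count_if_ignores_end_if():
    # Regression test: END-IF must not be counted as a second IF.
    source = "IF A > B MOVE 1 TO C END-IF. IF X = Y MOVE 2 TO Z END-IF."
    assert count_if_statements(source) == 2
```

The second test pins down the behaviour a bug fix would have restored, so the same defect cannot return unnoticed.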

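7. Config change rules

The Config class in config/__init__.py is the configuration shared by all modules.

Rules for changing Config:

- Adding a field: add it to the dataclass (with a default value)
- Removing a field: must be marked [BREAKING]; check every user
- Changing a field's meaning: update docs/module-interfaces.md

Example of the Config field format:

runner_mode: str = "native"
# Values: "native" | "spark"
# Owner: G (runners group)
# Default note: prefers local Java execution, with no Spark dependency

Every Config field must document its allowed values and its owning group.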
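
The guide documents allowed values and owners in comments. One possible variation, shown here purely as an assumption and not as the project's actual `Config`, is to store the same information in dataclass field metadata so it can be validated at runtime.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Config:
    # Hypothetical sketch; the real config.Config may differ.
    runner_mode: str = field(
        default="native",
        metadata={"choices": ("native", "spark"), "owner": "G"},
    )

    def __post_init__(self) -> None:
        # Reject values outside the documented range as early as possible.
        for f in fields(self):
            choices = f.metadata.get("choices")
            value = getattr(self, f.name)
            if choices is not None and value not in choices:
                raise ValueError(f"{f.name}={value!r} not in {choices}")


def undocumented_fields(config_cls) -> list[str]:
    """Return field names that lack an owner in their metadata."""
    return [f.name for f in fields(config_cls) if "owner" not in f.metadata]
```

With this layout, `Config(runner_mode="spark")` is accepted, an unknown mode raises `ValueError`, and `undocumented_fields(Config)` could back a test that enforces the documentation rule.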

8. Module owners at a glance

Module	Owner	`__all__` public API
`cobol_testgen/`	A	extract_structure, generate_data, incremental_supplement, check_coverage
`config/`	B	Config
`preprocessor/`	B	CopybookPreprocessor
`report/`	B	ReportGenerator
`quality/`	C	L1OffsetValidator, L2RoundtripValidator
`jcl/`	C	parse_jcl, JclExecutor, Job, JobStep, DDEntry, CondParam
`hina/`	D	gate_check, compute_confidence, classify_with_llm, supplement, RetryHandler, collect_gcov
`agents/`	E	LLMClient, Agent1Parser, Agent2Data, Agent3Diagnostic
`comparator/`	F	align_records, compare_field, CobolBinaryReader, Normalizer, detect_rounding
`runners/`	G	CobolRunner, NativeJavaRunner, SparkJavaRunner, DataWriter
`storage/`	G	DiskCache, ReportStore, TestDataBundle
`data/`	Everyone	Field, FieldTree, TestCase, TestSuite, SparkConfig, FieldResult, VerificationRun
`orchestrator/`	H	run_pipeline
`web/`	H	FastAPI app, Worker

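9. Rules for adding a module

Checklist when adding a new module:

- Create the directory and its __init__.py
- Make sure __init__.py defines __all__
- Add the interface description to docs/module-interfaces.md
- Confirm the dependency direction is correct (no reverse dependencies)
- Write the test files
- Add the creator to the module owner table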
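
The first two checklist items lend themselves to automation. The sketch below is a hedged example; `check_new_module` is a hypothetical helper, and it only recognises a plain top-level `__all__ = [...]` assignment, not an annotated one.

```python
import ast
from pathlib import Path


def check_new_module(module_dir: Path) -> list[str]:
    """Return the checklist items that module_dir does not yet satisfy."""
    init_file = module_dir / "__init__.py"
    if not init_file.is_file():
        return ["missing __init__.py"]
    tree = ast.parse(init_file.read_text(encoding="utf-8"))
    for node in tree.body:
        # Look for a top-level assignment of the form: __all__ = [...]
        if isinstance(node, ast.Assign) and any(
            isinstance(target, ast.Name) and target.id == "__all__"
            for target in node.targets
        ):
            return []
    return ["__init__.py does not define __all__"]
```

Because the file is parsed rather than imported, the check has no side effects and works even when the new module's dependencies are not installed.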

10. Quick reference

# Verify that all imports resolve
python -c "
from cobol_testgen import extract_structure
from hina import compute_confidence
from agents import LLMClient
from comparator import compare_field
from runners import CobolRunner
from data import Field, FieldTree, VerificationRun
"

# Run the tests
pytest tests/ -x --tb=short

# Check that every module's __init__.py defines __all__
grep -r "^__all__" */__init__.py
