Commit Graph

79 Commits

Author SHA1 Message Date
NB-076 a5939e6722 fix: subtype resolver + comprehensive matching program test
Fix 4 remaining defects found by adversarial testing:
1. MT03 N:1 → subtype corrected to N:1 (key suffix -M/-T heuristic)
2. MT32 混合 (mixed) → subtype added (項目チェック (item check) programs with WS-PREV-KEY)
3. MT33 混合异键 (mixed, different keys) → WS-ALT-KEY detection → 混合(异键)
4. MT18/MT19 → subtype M:N (correct: static analysis cannot distinguish M:N→M vs M:N→N)

Also expand subtype resolver scope: it now also processes programs
classified as 項目チェック (item check) that show matching-like
characteristics (WS-PREV-KEY), not just マッチング (matching) programs.

New test: test_matching_programs.py — 10 parametrized tests covering
all 4 dimensions (category, subtype, branches, files) for every
matching program. Known limitation documented: MT18 vs MT19
requires runtime data for M:N→M vs M:N→N distinction.

Regression: 755 passed (10 new, 0 failures).
2026-06-21 13:40:58 +08:00
NB-076 6b3f526b80 feat: agent-driven matching subtype discrimination
Refactor _resolve_matching_subtype to use an LLM agent for ambiguous
cases instead of pure static rules:

Architecture (3 layers):
1. Static deterministic rules: M:N→MxN, 1:N (WS-MAST/TRAN-KEY),
   二段階 (two-stage), 混合 (mixed): high confidence, no LLM needed
2. LLM agent: ambiguous cases (N:1 vs 1:1, M:N→M vs M:N→N)
   - _MATCHING_SUBTYPE_AGENT_PROMPT with 5 subtypes
   - Calls existing hina.hina_agent._parse_llm_response for parsing
   - Minimum confidence threshold 0.4 to gate low-quality LLM output
3. Fallback: conservative defaults (M:N or 1:1) when LLM unavailable

This follows the original architecture design: agent handles the
hard classification problems that static analysis alone can't resolve.
Regression: 745 passed (unchanged).
2026-06-21 13:36:57 +08:00
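The three-layer flow described in commit 6b3f526b80 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `resolve_subtype`, `AgentFn`, and the `(subtype, confidence)` return shape are hypothetical names, and treating the 0.4 threshold as inclusive is an assumption. Only the threshold value and the layer order come from the commit message.

```python
from typing import Callable, Optional, Tuple

MIN_CONFIDENCE = 0.4  # threshold quoted in the commit message

# Hypothetical agent signature: returns (subtype, confidence), or None when unavailable.
AgentFn = Callable[[str], Optional[Tuple[str, float]]]


def resolve_subtype(static_result: Optional[str], source: str,
                    agent: Optional[AgentFn], fallback: str = "1:1") -> Tuple[str, str]:
    """Return (subtype, layer), where layer names the layer that decided."""
    # Layer 1: a deterministic static rule already produced an answer.
    if static_result is not None:
        return static_result, "static"
    # Layer 2: ask the agent, but accept only sufficiently confident answers.
    if agent is not None:
        try:
            answer = agent(source)
        except Exception:
            answer = None  # treat any agent failure as "LLM unavailable"
        if answer is not None and answer[1] >= MIN_CONFIDENCE:
            return answer[0], "agent"
    # Layer 3: conservative default.
    return fallback, "fallback"
```

Returning the deciding layer alongside the subtype makes it easy to audit how often the fallback is hit, which is one plausible reason to keep the layers explicit.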
NB-076 7d5c82e0e2 feat: matching program subtype discrimination (1:1/1:N/M:N/MxN)
Add _resolve_matching_subtype post-processing step in classify_program()
that distinguishes matching program subtypes based on key variable naming
patterns and file/structural features:

Rules (in priority order):
1. 二段階 (two-stage) → 二段階 (already handled by rule engine)
2. 3 files + WS-SAVE-KEY → M:N→MxN (MT20)
3. WS-PREV-KEY present → 混合 (mixed; already handled, MT32)
4. WS-MAST-KEY + WS-TRAN-KEY → 1:N (MT02)
5. >=3 KEY vars + >=2 files → M:N (MT33)
6. Otherwise → 1:1 (MT01, MT03, MT18, MT19)

Results:
  MT01→1:1, MT02→1:N, MT03→1:1, MT16/17→二段階,
  MT18/19→1:1, MT20→M:N→MxN, MT33→M:N

Also fix double-backslash regex bug in classifier.py and pipeline.py
(r'[-\w]' should be r'[\w-]' for word character class).
Regression: 745 passed (unchanged).
2026-06-21 13:33:25 +08:00
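The priority-ordered rules listed in commit 7d5c82e0e2 amount to a first-match-wins chain. The sketch below illustrates that idea only: `ProgramFeatures` and its fields are hypothetical stand-ins for whatever `extract_structure` actually returns, and reading "3 files" as exactly three is an assumption.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ProgramFeatures:
    """Hypothetical feature bundle; the real structure output may differ."""
    key_vars: Set[str] = field(default_factory=set)
    file_count: int = 0
    is_two_stage: bool = False


def resolve_matching_subtype(f: ProgramFeatures) -> str:
    """Apply the six rules in priority order; the first match wins."""
    if f.is_two_stage:
        return "二段階"
    if f.file_count == 3 and "WS-SAVE-KEY" in f.key_vars:  # assumption: exactly 3 files
        return "M:N→MxN"
    if "WS-PREV-KEY" in f.key_vars:
        return "混合"
    if {"WS-MAST-KEY", "WS-TRAN-KEY"} <= f.key_vars:
        return "1:N"
    if len(f.key_vars) >= 3 and f.file_count >= 2:
        return "M:N"
    return "1:1"
```

Because the order encodes priority, a program with WS-MAST-KEY, WS-TRAN-KEY, and a third key variable still resolves to 1:N rather than M:N.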
NB-076 65e9919933 feat: matching program full recognition — L1 regex keyword + confidence consensus
Three-part fix for matching program classification:
1. L1 regex keyword WS-[-\w]*KEY (confidence 0.65):
   - Captures WS-KEY, WS-MAST-KEY, WS-TRAN-KEY, WS-PREV-KEY etc.
   - Matches ALL 10 matching programs including MT02 (which uses
     WS-MAST-KEY/WS-TRAN-KEY that literal 'WS-KEY' missed)
   - False positives (ST-SEARCH-ALL, VL01) overridden by rule engine
     or higher-confidence ORGANIZATION IS keyword
   - detect_keyword() extended with 're:' prefix for regex patterns

2. Consensus bonus in compute_confidence_v2:
   - When L1 keyword category matches rule engine's final category,
     context_factor boosted by +0.15
   - Pushes matching programs from manual (0.50-0.69) toward
     review (0.70-0.89) range

3. Confidence calibration for confusion groups (previous commit):
   - dedup_vs_nodedup: 0.85→0.50 for negative detection
   - validation_vs_keybreak: 0.80→0.55 for has_counter
   - simple_vs_two_stage: 0.80→0.50 for sequential OPEN

Results - matching programs:
  MT01: 0.38→0.75, MT02: 0.30→0.60, MT03: 0.30→0.60,
  MT16: 0.45→0.81, MT17: 0.36→0.65, MT18: 0.60→0.60,
  MT19: 0.30→0.60, MT20: 0.30→0.65, MT33: 0.30→0.60
  All are now classified by rule_engine (not fallback), with no false negatives.

Subtype discrimination remains for future work: all matching
programs are classified as マッチング (matching) without a 1:1/1:N/N:1 subtype.
2026-06-21 13:25:39 +08:00
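The 're:' prefix convention from commit 65e9919933 can be illustrated with a small sketch. The function name `detect_keyword` comes from the commit message, but the signature, the case-sensitive matching, and the plain substring test for literals are assumptions made for this example.

```python
import re


def detect_keyword(source: str, pattern: str) -> bool:
    """Return True if pattern occurs in source.

    A pattern starting with 're:' is treated as a regular expression;
    anything else is matched as a literal substring (an assumption).
    """
    if pattern.startswith("re:"):
        return re.search(pattern[3:], source) is not None
    return pattern in source


# Regex keyword quoted in the commit message, with the 're:' prefix applied.
KEY_PATTERN = r"re:WS-[-\w]*KEY"
```

With this pattern, a line declaring WS-MAST-KEY matches, while the literal keyword 'WS-KEY' does not occur in it, which is the gap the commit describes for MT02.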
NB-076 958b12e9a9 fix: confusion group confidence calibration — false positive detection inflation
Issues found through matching program classification analysis:
1. dedup_vs_nodedup: 0.85→0.50 for negative detection (the absence of
   WS-PREV-KEY is not strong evidence for '含まず', "duplicates not included")
2. validation_vs_keybreak: 0.80→0.55 for has_counter (counter is a
   generic pattern, not specific to key-break)
3. simple_vs_two_stage: 0.80→0.50 for non-open-close-open pattern
   (sequential OPEN is the default for most programs)

Result: matching programs are now correctly classified:
- MT01-03/18/20 → マッチング (matching; was 項目チェック, item check)
- MT16-17 → 二段階マッチング (two-stage matching; unchanged)
- MT32 → 項目チェック(重複含む) (item check, duplicates included; correct: has WS-PREV-KEY)
- VL01 → 項目チェック(重複含む)  (correct)
- CSV → CSV合并 (CSV merge; correct)
Regression: 745 passed (3 test expectation bounds updated)
2026-06-21 13:17:31 +08:00
NB-076 0b0a013f51 fix: 3 critical parsing bugs found through statement benchmark testing
Bug 1: ELSE IF breaks IF false_seq parsing (core.py)
- _parse_if checked self.clean() == 'ELSE' which fails on 'ELSE IF ...'
- Fix: use startswith('ELSE'), reinsert IF portion for recursive parse
- Impact: ALL ELSE IF chains were silently dropped (huge branch loss)

Bug 2: READ skip loop greedily consumes subsequent statements (core.py)
- READ's AT END / NOT AT END skip loop used bare advance() with no
  statement boundary detection
- Fix: add _stmt_boundary regex that stops on IF/PERFORM/READ/etc.
- Impact: everything after first READ was consumed as 'AT END' lines

Bug 3: _walk() in extract_structure doesn't descend into BrPerform (__init__.py)
- Branch counting _walk() only handled BrIf/BrEval/BrSeq
- IF statements inside PERFORM bodies were never counted
- Fix: add BrPerform.body_seq and BrSearch descent

Combined impact: matching programs (MT01-33) now correctly report
their branches instead of 0. Full regression: 749 passed (unchanged).
2026-06-21 12:52:04 +08:00
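Bug 3 in commit 0b0a013f51 is a tree walk that skips one node type. The sketch below shows the shape of the fix under simplified assumptions: the class names `BrSeq`, `BrIf`, and `BrPerform` and the `body_seq` field appear in the commit message, while the other fields (`items`, `true_seq`, `false_seq`) and the `count_ifs` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BrSeq:
    items: List[object] = field(default_factory=list)


@dataclass
class BrIf:
    true_seq: BrSeq = field(default_factory=BrSeq)
    false_seq: BrSeq = field(default_factory=BrSeq)


@dataclass
class BrPerform:
    body_seq: BrSeq = field(default_factory=BrSeq)


def count_ifs(node: object) -> int:
    """Count IF nodes, descending into sequences, IF arms, and PERFORM bodies."""
    if isinstance(node, BrSeq):
        return sum(count_ifs(child) for child in node.items)
    if isinstance(node, BrIf):
        return 1 + count_ifs(node.true_seq) + count_ifs(node.false_seq)
    if isinstance(node, BrPerform):
        # The descent that Bug 3 describes as missing.
        return count_ifs(node.body_seq)
    return 0
```

Without the `BrPerform` case the walk returns 0 for any program whose IF statements all sit inside PERFORM bodies, which is consistent with the "branches = 0" symptom reported for the matching programs.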
NB-076 dbee3b7251 fix: Lark grammar + parse_file_section SD/ASCENDING KEY support
Bug fixes found through statement benchmark testing:
1. grammar.lark: Add ASCENDING/DESCENDING KEY IS + INDEXED BY to
   occurs_clause — fixes HINA024 (SEARCH ALL) parsing crash
2. grammar.lark: Add SD (Sort Description) entry type to file_section
   — fixes HINA034 (SORT), ST01, ST02 parsing crashes
3. read.py parse_file_section(): Handle SD blocks alongside FD blocks
   — enables SORT/MERGE file structure extraction

4 previously crashing files now parse successfully:
- HINA024.cbl (SEARCH ALL): paras=3, files=0
- HINA034.cbl (SORT): paras=1, files=3
- ST01_SORT.cbl: paras=2, files=3
- ST02_MERGE.cbl: paras=1, files=4

Regression: 749 passed (unchanged — classify_program internally caught
the crashes, so tests already 'passed'; real improvement is in data
quality: structure extraction now works for these programs)
2026-06-21 12:21:36 +08:00
NB-076 d12a305dc4 test: add L1 data generation + L2 classifier validation (58 tests)
Phase C-D complete:
- test_l1_data_generation.py — 8 tests verifying generate_data across all P0 groups
- test_l2_classifier.py — 16 existing + 34 P0 classification verification tests
- hina/pipeline/__init__.py — export classify_program for cleaner imports

Key findings:
- Classifier correctly detects: CALL→子程序调用 (subprogram call), CICS→online,
  DB→DB操作 (DB operation), ORGANIZATION IS→文件编成 (file organization),
  DIVIDE→DIVIDE_50.0, ASCII/EBCDIC→编码转换 (encoding conversion) (keyword match)
- Rule engine provides baseline 項目チェック(重複含まず) (item check, duplicates
  not included) for programs without L1 keyword matches
- SD keyword (SORT/MERGE sort-file) breaks Lark parser (known limitation)
- Full regression: 749 passed (0 new failures)
2026-06-21 12:16:12 +08:00
NB-076 fbaad010ab test: add L0 statement benchmark tests (34 parametrized tests)
6 test files covering 7 statement groups:
- test_arithmetic_statements (9 samples)
- test_control_statements (6 samples)
- test_file_statements (6 samples)
- test_inspect_statements (3 samples)
- test_move_statements (5 samples)
- test_perform_statements (3 samples)
- test_search_statements (2 samples)

All 34/34 pass. Full regression: 691 passed (0 new failures).
2026-06-21 12:05:07 +08:00
NB-076 8c1f9114f6 feat: add COBOL statement benchmark plan and 34 P0 sample programs
- docs/cobol-statement-benchmark-plan.md — full coverage matrix and gap analysis
- 34 P0 COBOL samples: arithmetic(9), move(5), file(6), control(6),
  inspect(3), search(2), perform(3)
- test-data/validate_statements.py — automatic validation script
- Validation: 34/34 samples pass preprocess + extract_structure
2026-06-21 12:02:25 +08:00
NB-076 a6c454692a fix: resolve 3 MEDIUM code review findings
M1: Cache confusion-pair confidences in Path B (eliminate redundant
    resolve_confusion_pair re-calls in _path_rule_engine)
M2: Resolve contradictions in Path C instead of hardcoding
    resolved_count=0 in _path_llm_assisted
M4: Add DIVIDE_25 to contradiction pair coverage (50-25, 100-25)
    and update test_contradiction_pairs_defined to verify all 3 variants
2026-06-21 11:25:59 +08:00
hangshuo652 bc1d56d1a4 feat: Phase 2 complete — 13 Phases of COBOL type classification and test benchmark
P0.6: gcov infrastructure
P1: extract_structure output expansion (11 new feature fields)
P2: Confusion group rule engine (8 pairs + contradiction + backtrack)
P3: 4-factor confidence calculation + quality gate update
P4: 33+2 COBOL program type test samples (22 files, 7 categories)
P5: parametrized/ test data generation engine
P6: japanese_data.py lookup tables
P7-10: Type-specific test suites (~159 parametrized tests)
P11: Full classification pipeline (classify_program) + orchestrator integration
P12: Documentation (module-interfaces, test-plan v3.0, coverage-matrix)

Architecture decisions:
- classification_pipeline/ merged to hina/pipeline/
- parametrized/ as independent module
- japanese_data.py as root-level file
- hina/__all__ only exports classify_program()

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-19 23:51:55 +08:00
hangshuo652 63b5284715 fix: _parse_llm_response now handles empty/invalid JSON gracefully
test: add gap coverage tests (hina_agent/JCL/quality gate edge cases)
2026-06-18 17:31:16 +08:00
hangshuo652 b5e76306c3 test: add AI Agent v6 node compliance validation (6 nodes, 24/24) 2026-06-18 17:27:19 +08:00
hangshuo652 e530f6980d test: add deep validation suite (real COBOL/HINA/QG/retry/report/perf - 28/28) 2026-06-18 17:21:12 +08:00
hangshuo652 6ac9861c84 test: add master validation suite (Pipeline/HINA/Benchmark/QG/Retry/Report - 30/30) 2026-06-18 17:17:11 +08:00
hangshuo652 ecc5599b48 test: add platform user story tests (43/43, 4 categories) 2026-06-18 17:10:40 +08:00
hangshuo652 2662c6c0ac test: add comprehensive test plan and auto test runner (20/20 passed, 100%) 2026-06-18 17:05:51 +08:00
hangshuo652 9ad0e88a1a test: add HINA type-specific COBOL test data suite (10 programs, 8/10 pass) 2026-06-18 16:55:43 +08:00
hangshuo652 2e64f208ea fix: P1 - complete_tests now feeds DataWriter; P2 - loop syncs complete_tests; P5 - machine_json gets coverage fields 2026-06-18 16:47:21 +08:00
hangshuo652 c93104e6bf feat: Phase 3+4 - gcov support + enhanced report 2026-06-18 16:31:54 +08:00
hangshuo652 e2486db510 fix: 3 issues found during real COBOL validation 2026-06-18 16:26:44 +08:00
hangshuo652 de506d9c31 feat: Phase 2 - HINA Agent + Strategy Agent + classifier 2026-06-18 16:10:38 +08:00
hangshuo652 c021dfe01e feat: Phase 1 - orchestrator quality gate loop + hina/gate + main CLI args 2026-06-18 16:02:38 +08:00
hangshuo652 097530b036 feat: Phase 1 - cobol_testgen API + quality fields + retry handler 2026-06-18 15:47:35 +08:00
hangshuo652 7fcdb41a85 init: cobol-java migration verification platform v3 (42 tests, JCL module) 2026-05-27 08:42:41 +08:00
hangshuo652 faeedbc77b test: add edge case tests 2026-05-24 13:01:31 +08:00
hangshuo652 331b38eac1 feat: add web layer (FastAPI + worker) 2026-05-24 12:52:20 +08:00
hangshuo652 818e81269c v3: generated by gstack-code-gen 2026-05-24 12:36:44 +08:00