feat: matching program full recognition — L1 regex keyword + confidence consensus

Three-part fix for matching program classification:

1. L1 regex keyword WS-[-\w]*KEY (confidence 0.65):
   - Captures WS-KEY, WS-MAST-KEY, WS-TRAN-KEY, WS-PREV-KEY, etc.
   - Matches ALL 10 matching programs, including MT02 (which uses
     WS-MAST-KEY/WS-TRAN-KEY, which the literal 'WS-KEY' missed)
   - False positives (ST-SEARCH-ALL, VL01) are overridden by the rule engine
     or by the higher-confidence ORGANIZATION IS keyword
   - detect_keyword() extended with an 're:' prefix for regex patterns

2. Consensus bonus in compute_confidence_v2:
   - When the L1 keyword category matches the rule engine's final category,
     context_factor is boosted by +0.15
   - Pushes matching programs from the manual range (0.50-0.69) toward the
     review range (0.70-0.89)

3. Confidence calibration for confusion groups (previous commit):
   - dedup_vs_nodedup: 0.85→0.50 for negative detection
   - validation_vs_keybreak: 0.80→0.55 for has_counter
   - simple_vs_two_stage: 0.80→0.50 for sequential OPEN

Results for matching programs:
   MT01: 0.38→0.75, MT02: 0.30→0.60, MT03: 0.30→0.60,
   MT16: 0.45→0.81, MT17: 0.36→0.65, MT18: 0.60→0.60,
   MT19: 0.30→0.60, MT20: 0.30→0.65, MT33: 0.30→0.60

All are now classified by rule_engine (not fallback), with no false negatives.

Subtype discrimination remains for future work: all matching programs are
classified as マッチング without a 1:1/1:N/N:1 subtype.
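
The consensus bonus in part 2 can be pictured with a minimal sketch. The body of compute_confidence_v2 is not part of the hunks shown here, so the helper name `apply_consensus_bonus`, its signature, and the cap at 1.0 are all assumptions made for illustration; only the +0.15 boost and the "L1 category equals final category" condition come from the commit message.

```python
from __future__ import annotations

# Boost value stated in the commit message.
CONSENSUS_BONUS = 0.15


def apply_consensus_bonus(
    context_factor: float,
    l1_category: str | None,
    final_category: str,
) -> float:
    """Hypothetical helper: raise context_factor when L1 and the rule engine agree.

    l1_category is the category reported by the L1 keyword scan (None if no
    keyword matched); final_category is the rule engine's final decision.
    """
    if l1_category is not None and l1_category == final_category:
        # Capping at 1.0 is an assumption, not something the commit states.
        return min(1.0, context_factor + CONSENSUS_BONUS)
    return context_factor
```

How context_factor then feeds into the final confidence score is left out, since the commit does not show it.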
2026-06-21 13:25:39 +08:00
parent 958b12e9a9
commit 65e9919933
3 changed files with 32 additions and 5 deletions
@@ -6,6 +6,7 @@ HINA program classifier: L1 keyword rules + confidence calculation.

 from __future__ import annotations

+import re
 from typing import Any

 # ── L1 rules ─────────────────────────────────────────────────────────────
@@ -22,6 +23,7 @@ L1_RULES: list[tuple[str, list[str], float]] = [
    ("编辑输出", ["WRITE AFTER", "WRITE BEFORE"], 0.80),
    ("文件编成", ["ORGANIZATION IS"], 0.99),
    ("替代索引", ["ALTERNATE RECORD KEY"], 0.99),
+    ("マッチング", ["re:WS-[-\\w]*KEY"], 0.65),
 ]

 # ── Conflict resolution rules ────────────────────────────────────────────
@@ -39,6 +41,8 @@ CONFLICT_RULES: dict[tuple[str, str], str] = {
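 def detect_keyword(source: str) -> list[tuple[str, float, str]]:
    """Search the COBOL source for the keywords defined in L1_RULES and return the matches.

+    A keyword prefixed with "re:" is matched as a regular expression (e.g. "re:WS-[-\\w]*KEY" matches WS-MAST-KEY).
+
    Args:
        source: COBOL program source text.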

@@ -50,10 +54,19 @@ def detect_keyword(source: str) -> list[tuple[str, float, str]]:
    source_upper = source.upper()

    for category, keywords, confidence in L1_RULES:
+        matched = False
        for kw in keywords:
-            if kw in source_upper:
-                results.append((category, confidence, kw))
-                break  # record each category only once
+            if kw.startswith("re:"):
+                pattern = kw[3:]
+                if re.search(pattern, source_upper):
+                    results.append((category, confidence, kw))
+                    matched = True
+                    break
+            else:
+                if kw in source_upper:
+                    results.append((category, confidence, kw))
+                    matched = True
+                    break

    return results
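
To see why the character class in the new rule includes a hyphen, here is a small self-contained sketch. It mirrors only the regex branch of detect_keyword (upper-case the source, then `re.search`); the helper `found`, the comparison pattern `WORD_ONLY`, and the sample COBOL line are invented for this example and are not part of the commit.

```python
import re

# Pattern taken from the new L1_RULES entry (without the "re:" prefix).
HYPHEN_AWARE = r"WS-[-\w]*KEY"
# Narrower variant for comparison only: \w does not include "-".
WORD_ONLY = r"WS-\w*KEY"


def found(pattern: str, source: str) -> bool:
    """Upper-case the source, then search it, as the regex branch does."""
    return re.search(pattern, source.upper()) is not None


# Hypothetical COBOL data-division line, made up for illustration.
SAMPLE = "05  ws-mast-key  pic x(8)."

results = {
    "hyphen_aware": found(HYPHEN_AWARE, SAMPLE),  # spans the inner hyphen
    "word_only": found(WORD_ONLY, SAMPLE),        # stops at "MAST-"
    "literal": "WS-KEY" in SAMPLE.upper(),        # plain substring test
}
```

Because `re.search` is unanchored and the pattern has no word boundary, longer names that merely contain a match (for example a field ending in KEYS) would also hit, which fits the rule's relatively low confidence of 0.65.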