fix: MC options display, question selection, timeout handling, and grading prompts

2026-06-03 20:58:19 +08:00
parent a71bde3452
commit 6d9acd7252
12 changed files with 408 additions and 157 deletions
@@ -56,7 +56,12 @@ const scoreSummary = Object.entries(scores)
 1. **你必须使用以下语言生成报告：中文 (Simplified Chinese)**。
 2. **严禁夹杂日文**。即使对话记录中包含日文，报告内容也必须全中文。
 3. 报告的第一行必须严格遵守此格式："LEVEL: [Novice/Proficient/Advanced/Expert]"。
-4. 必须保持客观。如果用户没有提供有效的回答或得分为 0，你必须将其识别为 'Novice'，并明确指出他们尚未证明其掌握程度。
+4. **等级判定必须遵循以下分数阈值**：
+   - 总体平均分 >= 9 → Expert（专家）
+   - 总体平均分 >= 7 → Advanced（高级）
+   - 已通过（有有效回答且得分 > 0）→ Proficient（熟练）
+   - 未通过（无有效回答或得分为 0）→ Novice（新手）
+   即使得分很高，也要确保等级与上述阈值匹配。不要随意提高或降低等级。
 5. 不要虚构或幻想优点（如"潜力"或"好奇心"），如果用户明确表示"不知道"或未提供实质内容。
 6. 专注于对话记录中已证明的事实。

@@ -87,8 +92,13 @@ ${messages
 2. **中国語を混ぜないでください**。会話ログに中国語が含まれていても、レポートの内容はすべて日本語で記述してください。
 3. レポートの最初の行は, 必ず次の形式に従ってください："LEVEL: [Novice/Proficient/Advanced/Expert]"。
 4. 客観的であること。ユーザーが有効な回答を提供しなかった場合、またはスコアが 0 の場合、'Novice' と判定し、習熟度が証明されていないことを明示してください。
-5. ユーザーが「わからない」と言ったり、内容を提供しなかった場合に、長所（「ポテンシャル」や「好奇心」など）を捏造しないでください。
-6. 会話ログで証明された事実に集中してください。
+5. **レベル判定は以下のスコアしきい値に従うこと**：
+   - 平均スコア >= 9 → Expert
+   - 平均スコア >= 7 → Advanced
+   - 合格（有効な回答がありスコア > 0）→ Proficient
+   - 不合格（有効な回答なし、またはスコア 0）→ Novice
+6. ユーザーが「わからない」と言ったり、内容を提供しなかった場合に、長所（「ポテンシャル」や「好奇心」など）を捏造しないでください。
+7. 会話ログで証明された事実に集中してください。

 各ディメンションスコア：
 ${dimensionAvg}
@@ -115,8 +125,13 @@ IMPORTANT:
 1. **You MUST generate the report strictly in English.**
 2. START the report with exactly this format: "LEVEL: [Novice/Proficient/Advanced/Expert]" on the first line.
 3. Be OBJECTIVE. If the user provided no valid answers or scores are 0, you MUST identify them as 'Novice' and explicitly state they have NOT demonstrated mastery.
-4. DO NOT invent or hallucinate strengths (like 'potential' or 'curiosity') if the user explicitly said "I don't know" or provided no content.
-5. Focus on what was PROVEN in the conversation logs.
+4. **Level assignment MUST follow these score thresholds**:
+   - Average score >= 9 → Expert
+   - Average score >= 7 → Advanced
+   - Passed (has valid answers with score > 0) → Proficient
+   - Not passed (no valid answers or score is 0) → Novice
+5. DO NOT invent or hallucinate strengths (like 'potential' or 'curiosity') if the user explicitly said "I don't know" or provided no content.
+6. Focus on what was PROVEN in the conversation logs.

 DIMENSION SCORES:
 ${dimensionAvg}
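
The threshold rules added to the three report prompts can also be checked in code. Below is a minimal sketch, assuming a 0-10 overall average and a flag saying whether any valid answer was given; `levelFromAverage` and `hasValidAnswer` are hypothetical names that do not appear in this commit.

```typescript
type Level = 'Novice' | 'Proficient' | 'Advanced' | 'Expert';

// Hypothetical helper (not part of the commit): maps an overall average
// on a 0-10 scale to a level using the thresholds listed in the prompts.
function levelFromAverage(average: number, hasValidAnswer: boolean): Level {
  // Not passed: no valid answer, or a score of 0.
  if (!hasValidAnswer || average <= 0) return 'Novice';
  if (average >= 9) return 'Expert';
  if (average >= 7) return 'Advanced';
  // Passed, but below the Advanced threshold.
  return 'Proficient';
}
```

A helper like this could be used to cross-check the "LEVEL:" line the model returns instead of trusting the prompt alone.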
@@ -90,34 +90,83 @@ export const questionGeneratorNode = async (
    .map((q, i) => `Q${i + 1}: ${q.questionText}`)
    .join('\n');

-  const systemPromptZh = `你是一个信息提取工具。严格按以下步骤操作。
+  const systemPromptZh = `你是一个出题工具。严格按以下规则生成题目。

 ### 第一步：提取知识点
 阅读下方 Human 消息中的【知识库内容】，逐条列出其中包含的所有可考核知识点。
-每条以"知识点N："开头，引用原文语句。如果不足，诚实报告。
+每条以"知识点N："开头，引用原文语句。

-### 第二步：从知识点生成考题
-仅用第一步提取的知识点生成 1 道题。必须引用知识点编号。
+### 第二步：基于知识点出题
+仅用第一步提取的知识点生成题目。必须引用知识点编号。
+如果知识点数量不足（少于3个），输出空数组 [] 并停止。
+
+### 题型分配规则
+每生成 3 道题：
+- 第1、4、7...道：选择题（MULTIPLE_CHOICE），占 1/3
+- 第2、3、5、6...道：对话简答题（SHORT_ANSWER），占 2/3
+严格按照这个顺序循环，不要自行调整比例。
+
+### 出题范围限制
+出题内容必须严格限制在知识库范围内。每道题必须有知识点编号引用。
+以下情况绝对禁止：
+- 使用 LLM 自身知识编题
+- 引用知识库中不存在的概念
+- 题目内容超出知识库覆盖的主题
+
+### 选择题出题标准
+- 必须是场景驱动：描述一个真实工作场景，让用户判断最佳做法
+- 四个选项（A/B/C/D），只有一个正确，另外三个要有迷惑性
+- 难度：不是考概念背诵，是考实际应用判断
+- 正确答案必须附带解析，说明为什么对、错在哪
+- 出题依据必须引用第一步提取的知识点编号
+
+### 对话简答题出题标准
+- 开放式场景问题，不预设标准答案
+- 考察用户的理解深度和表达能力
+- 适合多轮追问展开讨论
+- 出题依据必须引用第一步提取的知识点编号

 ### 绝对禁止：
- 禁止使用知识库内容中不存在的任何概念、术语、数据
- 禁止使用你自己的知识
-${existingQuestionsText ? `- 禁止与已出题目重复：${existingQuestionsText}` : ''}
+- 禁止出纯概念题（如"提示词六要素是什么"）
+- 禁止出需要记忆具体数据的题
+- 禁止使用知识库之外的知识
+- 禁止生成与知识库主题无关的题目
+${existingQuestionsText ? `- 禁止与已出题目概念重复：${existingQuestionsText}` : ''}

-### 输出（纯 JSON 数组）：
-[
-  {
-    "knowledge_points": ["知识点引用"],
-    "question_text": "基于知识点的题目",
-    "key_points": ["评分要点"],
-    "difficulty": "STANDARD|ADVANCED|SPECIALIST",
-    "dimension": "prompt|llm|ide|devPattern|workCapability",
-    "basis": "知识库原文"
-  }
-]`;
-// dimension取值：prompt=提示词, llm=LLM原理, ide=IDE协作, devPattern=开发范式, workCapability=工作能力
+### 输出格式（严格遵循）
+选择题完整格式：
+{
+  "question_type": "MULTIPLE_CHOICE",
+  "question_text": "场景描述+问题，不超过120字",
+  "options": ["A) 选项1", "B) 选项2", "C) 选项3", "D) 选项4"],
+  "correct_answer": "A",
+  "judgment": "解析：为什么对、为什么错，不超过200字",
+  "key_points": ["考核要点", "2-3个"],
+  "difficulty": "STANDARD",
+  "dimension": "prompt",
+  "basis": "知识点N：参考来源"
+}

-  const systemPromptJa = `あなたは情報抽出ツールです。以下の手順に厳密に従ってください。
+对话简答题完整格式：
+{
+  "question_type": "SHORT_ANSWER",
+  "question_text": "开放式场景问题，不超过120字",
+  "key_points": ["期望的回答方向", "2-3个"],
+  "difficulty": "STANDARD",
+  "dimension": "prompt",
+  "basis": "知识点N：参考来源"
+}
+
+### 输出要求
+- 只输出 JSON 数组，不要其他文字
+- question_type 必须为 MULTIPLE_CHOICE 或 SHORT_ANSWER
+- dimension 只能取以下值之一：prompt、llm、ide、devPattern、workCapability
+- 每次生成 1 道题，以 JSON 数组格式输出
+- 选择题必须包含全部9个字段：question_type、question_text、options、correct_answer、judgment、key_points、difficulty、dimension、basis
+- 对话简答题必须包含全部6个字段：question_type、question_text、key_points、difficulty、dimension、basis
+- 每个字段的值不能为空`;
+
+  const systemPromptJa = `あなたは問題作成ツールです。以下の手順に厳密に従ってください。

 ### 第一歩：知識ポイントの抽出
 Human メッセージ内の【ナレッジベース内容】を読み、含まれるすべての評価可能な知識ポイントを箇条書きで抽出。
@@ -126,48 +175,76 @@ Human メッセージ内の【ナレッジベース内容】を読み、含ま
 ### 第二歩：知識ポイントから問題を作成
 第一歩で抽出した知識ポイントのみを使用して 1 問作成。知識ポイント番号を引用すること。

+### 問題タイプの割合
+3問中、約1問を選択問題、2問を対話式記述問題にしてください。全体で約30%/70%の割合。
+
+### 出題方向
+「AI協作スキル」に関する問題：
+- プロンプトの書き方（役割、タスク、背景、制約）
+- 複数ラウンドの対話テクニック
+- AIに先に質問させる方法
+- セッション管理（いつ継続、いつ新規）
+- よくある間違いと自己チェック
+- セキュリティ意識（機密データの取扱い）
+
+### 選択問題の基準
+- シナリオ駆動：実務シーンを想定
+- 4択（A/B/C/D）、正解は1つ
+- 正解には必ず解説を含める
+
+### 対話式記述問題の基準
+- オープンクエスチョン、正解なし
+- 理解の深さと表現力を評価
+
 ### 絶対禁止：
- ナレッジベースに存在しない概念、用語、データの使用
- 自身の知識の使用
-${existingQuestionsText ? `- 作成済み問題との重複禁止：${existingQuestionsText}` : ''}
+- 暗記問題の禁止
+- 知識ベースにない概念の使用禁止
+${existingQuestionsText ? `- 既出問題との重複禁止：${existingQuestionsText}` : ''}

-### 出力（純粋な JSON 配列）：
-[
-  {
-    "knowledge_points": ["知識ポイント参照"],
-    "question_text": "知識ポイントに基づく問題",
-    "key_points": ["採点ポイント"],
-    "difficulty": "STANDARD|ADVANCED|SPECIALIST",
-    "dimension": "prompt|llm|ide|devPattern|workCapability",
-    "basis": "ナレッジベースの原文"
-  }
-]`;
+### 出力
+JSON 配列のみ出力：
+選択問題：{"question_type":"MULTIPLE_CHOICE","question_text":"...","options":["A)...","B)...","C)...","D)..."],"correct_answer":"A","judgment":"...","key_points":["..."],"difficulty":"STANDARD","dimension":"prompt|llm|ide|devPattern|workCapability","basis":"..."}
+記述問題：{"question_type":"SHORT_ANSWER","question_text":"...","key_points":["..."],"difficulty":"STANDARD","dimension":"prompt|llm|ide|devPattern|workCapability","basis":"..."}`;

-  const systemPromptEn = `You are an information extraction tool. Follow these steps exactly.
+  const systemPromptEn = `You are a question generation tool. Follow these steps exactly.

 ### Step 1: Extract Knowledge Points
-Read the knowledge base content in the Human message. List ALL assessable knowledge points found.
+Read the knowledge base content in the Human message. List ALL assessable knowledge points.
 Each point must start with "KP N:" and quote the source text. If insufficient, honestly report.

 ### Step 2: Generate Question from Points
 Use ONLY the knowledge points from Step 1 to generate 1 question. Must reference KP numbers.

-### Absolutely Forbidden:
- Using any concept, term, or data NOT present in the knowledge base content
- Using your own knowledge
-${existingQuestionsText ? `- Repeating previous questions: ${existingQuestionsText}` : ''}
+### Type Mix
+Out of every 3 questions, approximately 1 should be MULTIPLE_CHOICE and 2 should be SHORT_ANSWER (dialogue-style). Roughly 30%/70% split.

-### Output (pure JSON array only):
-[
-  {
-    "knowledge_points": ["KP reference"],
-    "question_text": "Question based on the knowledge points",
-    "key_points": ["scoring points"],
-    "difficulty": "STANDARD|ADVANCED|SPECIALIST",
-    "dimension": "prompt|llm|ide|devPattern|workCapability",
-    "basis": "Source text from knowledge base"
-  }
-]`;
+### Topics
+AI collaboration skills:
+- Writing good prompts (role, task, context, constraints)
+- Multi-turn iteration techniques
+- Letting AI ask clarifying questions first
+- Session management (continue vs new window)
+- Common mistakes and self-review
+- Security awareness (handling sensitive data)
+
+### MC Standards
+- Scenario-driven: describe a real work scenario
+- 4 options (A/B/C/D), one correct
+- Must include judgment explaining why correct/incorrect
+
+### SA Standards
+- Open-ended, no predefined answer
+- Tests understanding depth and expression
+
+### Forbidden:
+- Pure concept recall questions
+- Questions requiring memorization of specific data
+${existingQuestionsText ? `- Repeating previous question concepts: ${existingQuestionsText}` : ''}
+
+### Output
+JSON array only. One question at a time.
+MC: {"question_type":"MULTIPLE_CHOICE","question_text":"...","options":["A)...","B)...","C)...","D)..."],"correct_answer":"A","judgment":"...","key_points":["..."],"difficulty":"STANDARD","dimension":"prompt|llm|ide|devPattern|workCapability","basis":"..."}
+SA: {"question_type":"SHORT_ANSWER","question_text":"...","key_points":["..."],"difficulty":"STANDARD","dimension":"prompt|llm|ide|devPattern|workCapability","basis":"..."}`;

 // dimension values: prompt=prompt engineering, llm=LLM principles, ide=IDE collaboration, devPattern=development paradigm, workCapability=work capability
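
The Chinese prompt asks for a fixed rotation (questions 1, 4, 7, ... are multiple choice, the rest short answer), while each call produces a single question. One way to make the rotation deterministic is to compute the type in code from the question's 1-based position and pass it to the model. The sketch below assumes that approach; `questionTypeForPosition` is a hypothetical helper, not something this commit defines.

```typescript
type QuestionType = 'MULTIPLE_CHOICE' | 'SHORT_ANSWER';

// Hypothetical helper: positions 1, 4, 7, ... are MULTIPLE_CHOICE,
// every other position is SHORT_ANSWER (a 1/3 to 2/3 split).
function questionTypeForPosition(position: number): QuestionType {
  return (position - 1) % 3 === 0 ? 'MULTIPLE_CHOICE' : 'SHORT_ANSWER';
}

// Types for the first six questions, in order.
const firstSix = [1, 2, 3, 4, 5, 6].map(questionTypeForPosition);
```

With one question per call, the position would presumably be the number of existing questions plus one.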

@@ -201,6 +278,42 @@ ${existingQuestionsText ? `- Repeating previous questions: ${existingQuestionsTe
        newQuestions = [newQuestions];
      }

+      // === Code-level validation: make sure the LLM output conforms to the spec ===
+      const VALID_DIMENSIONS = ['prompt', 'llm', 'ide', 'devPattern', 'workCapability'];
+      const VALID_TYPES = ['MULTIPLE_CHOICE', 'SHORT_ANSWER'];
+
+      const validatedQuestions = newQuestions.filter((q: any) => {
+        const qType = q.question_type;
+        // Trim only: lowercasing here would reject the camelCase values 'devPattern' and 'workCapability'.
+        const dim = q.dimension?.toString().trim();
+        const errors: string[] = [];
+
+        if (!VALID_TYPES.includes(qType)) errors.push(`invalid question_type: ${qType}`);
+        if (!dim || !VALID_DIMENSIONS.includes(dim)) errors.push(`invalid dimension: ${q.dimension}`);
+        if (!q.question_text || q.question_text.length < 5) errors.push('question_text missing or too short');
+
+        if (qType === 'MULTIPLE_CHOICE') {
+          if (!Array.isArray(q.options) || q.options.length < 2) errors.push('options missing or insufficient');
+          if (!q.correct_answer) errors.push('correct_answer missing');
+          if (!q.judgment) errors.push('judgment missing');
+        } else if (qType === 'SHORT_ANSWER') {
+          if (!Array.isArray(q.key_points) || q.key_points.length === 0) errors.push('key_points missing');
+        }
+
+        if (errors.length > 0) {
+          console.warn('[GeneratorNode] Validation failed for question:', errors.join('; '));
+          return false;
+        }
+        return true;
+      });
+
+      if (validatedQuestions.length === 0) {
+        console.warn('[GeneratorNode] All generated questions failed validation, using existing questions only');
+        return { questions: existingQuestions };
+      }
+
+      // Keep only the questions that passed validation
+      newQuestions = validatedQuestions;
+
      const dimensionMap: Record<string, string> = {
        // 中文
        '技术能力-提示词': 'prompt',
@@ -228,15 +341,27 @@ ${existingQuestionsText ? `- Repeating previous questions: ${existingQuestionsTe
          inferredDimension = dimensionMap[dimValue] || 'workCapability';
          console.log('[GeneratorNode] Dimension mapping:', { original: q.dimension, mapped: inferredDimension });
        }
-        return {
+
+        const qType = q.question_type === 'MULTIPLE_CHOICE' ? 'MULTIPLE_CHOICE' : 'SHORT_ANSWER';
+        const base = {
          id: (existingQuestions.length + 1).toString(),
          questionText: q.question_text,
-          questionType: 'SHORT_ANSWER',
-          keyPoints: q.key_points,
-          difficulty: q.difficulty,
-          basis: q.basis,
+          questionType: qType,
+          keyPoints: q.key_points || [],
+          difficulty: q.difficulty || 'STANDARD',
+          basis: q.basis || '',
          dimension: inferredDimension,
        };
+
+        if (qType === 'MULTIPLE_CHOICE') {
+          return {
+            ...base,
+            options: q.options || [],
+            correctAnswer: q.correct_answer || '',
+            judgment: q.judgment || '',
+          };
+        }
+        return base;
      });

      const questionsToGenerate = Math.max(1, limitCount - existingQuestions.length);
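
The validation step and the label-to-dimension mapping both interpret `q.dimension`, so they could share one normaliser. The sketch below assumes a case-insensitive match against the canonical ids, followed by a label lookup. `normalizeDimension` and `LABEL_MAP` are hypothetical names, and the map holds only the one label visible in this diff.

```typescript
const VALID_DIMENSIONS = ['prompt', 'llm', 'ide', 'devPattern', 'workCapability'] as const;
type Dimension = (typeof VALID_DIMENSIONS)[number];

// Hypothetical label map; only the entry shown in the diff is included here.
const LABEL_MAP: Record<string, Dimension> = {
  '技术能力-提示词': 'prompt',
};

// Returns the canonical dimension id, or undefined when the value is unknown.
function normalizeDimension(raw: unknown): Dimension | undefined {
  if (raw === undefined || raw === null) return undefined;
  const value = String(raw).trim();
  // Compare case-insensitively but return the canonical camelCase spelling.
  const canonical = VALID_DIMENSIONS.find(
    (d) => d.toLowerCase() === value.toLowerCase(),
  );
  return canonical ?? LABEL_MAP[value];
}
```

Returning `undefined` rather than a default leaves the choice between rejecting the question and falling back to 'workCapability' with the caller.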
@@ -91,6 +91,72 @@ export const graderNode = async (
    };
  }

+  // ── Rule-based grading: use structured followupMapping if available ──
+  if (currentQuestion.followupHints) {
+    let mapping: any = null;
+    if (typeof currentQuestion.followupHints === 'string') {
+      try { mapping = JSON.parse(currentQuestion.followupHints); } catch { /* malformed JSON: mapping stays null and grading falls through to the LLM path */ }
+    } else if (typeof currentQuestion.followupHints === 'object') {
+      mapping = currentQuestion.followupHints;
+    }
+    if (mapping && Array.isArray(mapping.branches)) {
+      const userAnswerText = typeof lastUserMessage.content === 'string'
+        ? lastUserMessage.content : JSON.stringify(lastUserMessage.content);
+
+      // Score based on keyword coverage
+      let bestScore = mapping.defaultScore ?? 5;
+      let matchedFollowup = mapping.defaultFollowup || '';
+      let matchedAll = true;
+      const maxFollowUps = mapping.maxFollowups ?? 2;
+
+      for (const branch of mapping.branches) {
+        const kws = branch.keywords || [];
+        const matchCount = kws.filter((kw: string) => userAnswerText.toLowerCase().includes(kw.toLowerCase())).length;
+        if (kws.length > 0 && matchCount >= kws.length * 0.5) {
+          const branchScore = branch.score ?? 7;
+          if (branchScore > bestScore) bestScore = branchScore;
+          if (branch.followup) matchedFollowup = branch.followup;
+        } else if (kws.length > 0 && matchCount === 0) {
+          matchedAll = false;
+        }
+      }
+
+      const completionThreshold = mapping.completionThreshold ?? 80;
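+      const trimmedAnswer = userAnswerText.trim();
+      const tooShort = trimmedAnswer.length < 8;
+      // "don't know" is itself 10 characters, so a `< 10` cap could never match it.
+      // The cap of 20 is a judgment call that admits short phrases such as "I don't know".
+      const saysIDontKnow = trimmedAnswer.length < 20 && (
+        trimmedAnswer.includes('不知道') || trimmedAnswer.toLowerCase().includes("don't know") || trimmedAnswer.includes('わかりません')
+      );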
+
+      let shouldFollowUp: boolean;
+      if (saysIDontKnow || tooShort) {
+        shouldFollowUp = false;
+        bestScore = Math.min(bestScore, 2);
+      } else if (bestScore >= completionThreshold / 10) {
+        shouldFollowUp = false;
+      } else if (currentFollowUpCount >= maxFollowUps) {
+        shouldFollowUp = false;
+      } else {
+        shouldFollowUp = true;
+      }
+
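+      // Only claim the key points were covered when the score actually cleared the completion threshold.
+      const closingFeedback = bestScore >= completionThreshold / 10 ? '回答已覆盖关键点。' : '回答未覆盖关键点。';
+      const feedbackMessage = new AIMessage(`Score: ${bestScore}/10\n\nFeedback: ${shouldFollowUp ? matchedFollowup : closingFeedback}`);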
+
+      const feedbackHistoryMessages = shouldFollowUp && matchedFollowup
+        ? [feedbackMessage, new AIMessage(matchedFollowup)]
+        : [feedbackMessage];
+
+      console.log('[GraderNode] Rule grading:', { score: bestScore, shouldFollowUp, matchedAll, followup: matchedFollowup?.substring(0, 60) });
+
+      return {
+        feedbackHistory: feedbackHistoryMessages,
+        scores: { [currentQuestion.id || currentQuestionIndex.toString()]: bestScore },
+        shouldFollowUp,
+        followUpCount: shouldFollowUp ? currentFollowUpCount + 1 : 0,
+        currentQuestionIndex: shouldFollowUp ? currentQuestionIndex : currentQuestionIndex + 1,
+      } as any;
+    }
+  }
+
  const systemPromptZh = `你是一位考官。请评分并给出反馈。

 规则：
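
The rule-based branch above scores an answer by keyword coverage: a branch matches when at least half of its keywords occur in the answer, and the highest matching branch score wins. The sketch below isolates that logic. The `Branch` shape is inferred from the fields the diff reads (`keywords`, `score`, `followup`), and both function names are hypothetical.

```typescript
// Shape inferred from the fields read in the diff; an assumption, not a documented type.
interface Branch {
  keywords?: string[];
  score?: number;
  followup?: string;
}

// Fraction of keywords found in the answer, compared case-insensitively.
function keywordCoverage(answer: string, keywords: string[]): number {
  if (keywords.length === 0) return 0;
  const haystack = answer.toLowerCase();
  const hits = keywords.filter((kw) => haystack.includes(kw.toLowerCase())).length;
  return hits / keywords.length;
}

// Highest score among branches with at least 50% keyword coverage,
// starting from the default score (5 in the diff).
function bestBranchScore(answer: string, branches: Branch[], defaultScore = 5): number {
  let best = defaultScore;
  for (const branch of branches) {
    if (keywordCoverage(answer, branch.keywords ?? []) >= 0.5) {
      best = Math.max(best, branch.score ?? 7);
    }
  }
  return best;
}
```

Substring matching is cheap, but it also matches inside longer words ("limit" matches "limited"), which may or may not be the intended behaviour.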
@@ -100,8 +166,10 @@ export const graderNode = async (
 问题：${currentQuestion.questionText}
 关键点：${currentQuestion.keyPoints.join(', ')}

-评分标准：准确性、完整性、深度。
-部分正确也给分（5-7分），完全不沾边才0-2分。
+评分标准：不要求深度，不要求使用特定术语，只看用户是否理解了概念。
+用户理解核心概念就给分。即使没有使用关键点中的原词，只要意思到位就算覆盖。
+例如关键点是"上下文窗口有限"，用户说"信息太多超过AI处理长度"也是覆盖。
+评分原则：往宽了给分，不确定时就给高分。明显正确就给8-10分，部分正确5-7分，完全不沾边才0-2分。

 返回JSON：
 - score: 0-10