feat: implement QuestionBank CRUD with pagination and template query
- Add pagination support to findAll (page, limit query params)
- Add findByTemplateId method to service
- Add GET /by-template/:templateId endpoint to controller
- Service already includes CRUD for QuestionBank and QuestionBankItem
@@ -0,0 +1,57 @@
FROM python:3.12-alpine

# Configure the standard APK repositories and install LibreOffice and its dependencies
RUN echo "https://dl-cdn.alpinelinux.org/alpine/v3.19/main" > /etc/apk/repositories && \
    echo "https://dl-cdn.alpinelinux.org/alpine/v3.19/community" >> /etc/apk/repositories && \
    apk update && \
    apk add --no-cache \
    libreoffice \
    libreoffice-common \
    libreoffice-writer \
    libreoffice-impress \
    libreoffice-calc \
    font-noto \
    font-noto-cjk \
    ttf-dejavu \
    imagemagick \
    && rm -rf /var/cache/apk/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Node.js, Chromium, and essential dependencies for Puppeteer
RUN apk add --no-cache \
    nodejs \
    npm \
    chromium \
    nss \
    freetype \
    harfbuzz \
    ca-certificates \
    ttf-dejavu \
    && rm -rf /var/cache/apk/*

# Configure Puppeteer to use installed Chromium
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

# Install Node dependencies
COPY package.json .
RUN npm install --registry=https://registry.npmmirror.com

WORKDIR /app
COPY main.py /app/
COPY md_to_pdf.js /app/

# Link node_modules to app directory so the script can find required modules
RUN ln -sf /node_modules /app/node_modules

# Create the mount directories
RUN mkdir -p /app/uploads /temp

# Expose the service port
EXPOSE 8100

# Start FastAPI
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8100", "--workers", "2"]
@@ -0,0 +1,214 @@
# LibreOffice FastAPI Document Conversion Service

## 📋 Overview

This is a standalone FastAPI-based document conversion service that converts Word, PPT, Excel, and other documents to PDF. It supports the document-processing pipeline used by the RAG high-precision mode.

## 🎯 Core Features

- **Format conversion**: Word/PPT/Excel → PDF
- **PDF passthrough**: PDF files are returned as-is, without conversion
- **Auto-generated documentation**: visit `/docs` for interactive API documentation
- **Health check**: monitor service status through the `/health` endpoint
- **Asynchronous processing**: high-performance processing built on FastAPI's asynchronous architecture
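
The asynchronous-processing point can be illustrated with a minimal sketch, assuming a blocking converter command is moved off the event loop with `asyncio.to_thread`. The helper name `run_conversion` is hypothetical and is not part of this service's code; the stand-in command in the usage note simply prints a word.

```python
import asyncio
import subprocess
import sys


async def run_conversion(cmd: list[str], timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Run a blocking command in a worker thread so the event loop stays responsive.

    Hypothetical helper for illustration only; the service itself may call
    its converter differently.
    """
    return await asyncio.to_thread(
        subprocess.run, cmd, capture_output=True, text=True, timeout=timeout
    )
```

A call such as `asyncio.run(run_conversion([sys.executable, "-c", "print('ok')"]))` returns the completed process; in a real handler the command would be the converter invocation instead.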

## 🚀 Quick Start

### Local Development

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the service
uvicorn main:app --reload --port 8100

# 3. Open the documentation
open http://localhost:8100/docs
```

### Docker Deployment

```bash
# 1. Build the image
docker build -t libreoffice-server .

# 2. Run the container
docker run -d \
  --name lo-converter \
  -p 8100:8100 \
  -v ./uploads:/uploads \
  -v ./temp:/temp \
  libreoffice-server

# 3. Health check
curl http://localhost:8100/health
```

## 📡 API Endpoints

### POST /convert

Converts a document to PDF.

**Request**:

```bash
curl -X POST -F "file=@test.docx" http://localhost:8100/convert
```

**Response**:

```json
{
  "pdf_path": "/uploads/test.pdf",
  "converted": true,
  "original": "test.docx",
  "file_size": 102400
}
```

**Supported formats**:

- `.pdf` - returned as-is
- `.doc`, `.docx` - Word documents
- `.ppt`, `.pptx` - PowerPoint presentations
- `.xls`, `.xlsx` - Excel spreadsheets
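
The format list above can be read as a small decision table. The following is a minimal sketch of that decision, assuming a two-way split between passthrough and conversion; the names `classify_upload`, `PASSTHROUGH_EXTENSIONS`, and `CONVERTIBLE_EXTENSIONS` are hypothetical and only cover the extensions listed in this README.

```python
import os

# Extensions copied from the list above; the grouping is an illustrative assumption.
PASSTHROUGH_EXTENSIONS = {".pdf"}
CONVERTIBLE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}


def classify_upload(filename: str) -> str:
    """Return "passthrough" or "convert" for a supported file, else raise ValueError."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in PASSTHROUGH_EXTENSIONS:
        return "passthrough"
    if ext in CONVERTIBLE_EXTENSIONS:
        return "convert"
    raise ValueError(f"Unsupported file format: {ext or '(none)'}")
```

The extension is lower-cased before the lookup, so `Report.DOCX` is treated the same as `report.docx`.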

### GET /health

Performs a health check.

**Response**:

```json
{
  "status": "healthy",
  "service": "libreoffice-converter",
  "version": "1.0.0",
  "uptime": 123.45
}
```
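
A monitoring script could interpret this payload roughly as follows. This is a minimal sketch, assuming the response body has the shape shown above; `is_healthy` is a hypothetical helper name, not something the service provides.

```python
import json


def is_healthy(raw_body: str) -> bool:
    """Return True when a /health response body reports status "healthy"."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        # A non-JSON body (for example a proxy error page) counts as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"
```

The body can come from any HTTP client; treating unparsable responses as unhealthy keeps the check conservative.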

### GET /docs

Auto-generated API documentation (Swagger UI).

## 🐳 Docker Compose Integration

Add the following to the main project's `docker-compose.yml`:

```yaml
services:
  libreoffice:
    build: ./libreoffice-server
    container_name: lo-converter
    volumes:
      - ./uploads:/uploads
      - ./temp:/temp
    ports:
      - "8100:8100"
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '1.0'
```

## 🔧 Environment Variables

The service itself needs no special environment variables, but the following directories can be configured through volume mounts:

- `/uploads` - document storage directory
- `/temp` - temporary file directory
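
Because both directories are expected to come from volume mounts, a startup check can catch a missing or read-only mount early. This is a minimal sketch under that assumption; `is_writable_dir` is a hypothetical helper and the service does not necessarily perform such a check.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if path is an existing directory in which a file can be created."""
    if not os.path.isdir(path):
        return False
    try:
        # The probe file is deleted automatically when the context exits.
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        return False
```

Calling it for `/uploads` and `/temp` inside the container would show whether the mounts are usable before the first conversion request arrives.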

## 📊 Performance Reference

| Document type | Pages | Conversion time | Notes |
|---------|------|---------|------|
| Word | 50 pages | about 10 s | Formatting preserved |
| PPT | 50 pages | about 15 s | Each page processed as an image |
| Excel | 10 pages | about 8 s | Converted as tables |
| PDF | Any | about 0 s | Returned directly |

## 🛠️ Debugging Tips

### Viewing Logs

```bash
docker logs -f lo-converter
```

### Conversion Test

```bash
# Prepare a test file
echo "test" > test.docx

# Run the conversion test
curl -X POST -F "file=@test.docx" http://localhost:8100/convert | jq
```

### Debugging Inside the Container

```bash
docker exec -it lo-converter sh
```

## 🔗 Dependencies

- **FastAPI**: modern Python web framework
- **Uvicorn**: ASGI server
- **LibreOffice**: document conversion engine
- **Pydantic**: data validation

## 📝 Notes

1. **File size**: keeping files under 100 MB is recommended.
2. **Timeout**: the default is 300 seconds. Adjust it in the code if needed.
3. **Concurrency**: 2 to 3 workers are recommended.
4. **Memory limit**: 1 GB is recommended.
5. **Temporary files**: periodic cleanup is required.
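
The periodic cleanup mentioned in the last note could look roughly like this. It is a minimal sketch, assuming files are judged stale by modification time; the function name `remove_stale_files` and the age threshold are hypothetical choices, not part of the service.

```python
import os
import time
from typing import List, Optional


def remove_stale_files(directory: str, max_age_seconds: float,
                       now: Optional[float] = None) -> List[str]:
    """Delete regular files older than max_age_seconds and return their names."""
    now = time.time() if now is None else now
    # Snapshot the directory first so deletions do not disturb the iteration.
    with os.scandir(directory) as it:
        entries = list(it)
    removed = []
    for entry in entries:
        if not entry.is_file(follow_symlinks=False):
            continue  # skip subdirectories and symlinks
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Run from a scheduler (cron, or a background task), for example `remove_stale_files("/temp", 24 * 3600)`, it removes files untouched for more than a day and leaves newer ones alone.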

## 🎯 Integration with the Main System

### Server-Side Call Example

```typescript
// server/src/libreoffice/libreoffice.service.ts
async convertToPDF(filePath: string): Promise<string> {
  const fileName = path.basename(filePath);
  const fileBuffer = await fs.readFile(filePath);

  const formData = new FormData();
  formData.append('file', fileBuffer, fileName);

  const response = await axios.post(
    `${this.baseUrl}/convert`,
    formData,
    { timeout: 300000 }
  );

  return response.data.pdf_path;
}
```
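
For callers written in Python, the same upload can be sketched with the standard library alone. This is an illustrative sketch, assuming the `/convert` endpoint accepts a multipart form field named `file` as in the curl example; `build_multipart` and `build_convert_request` are hypothetical helper names.

```python
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def build_convert_request(base_url: str, file_path: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request to the /convert endpoint."""
    with open(file_path, "rb") as fh:
        data = fh.read()
    body, content_type = build_multipart("file", os.path.basename(file_path), data)
    return urllib.request.Request(
        f"{base_url}/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Sending the request is then a matter of `urllib.request.urlopen(request, timeout=300)`, with the timeout mirroring the 300000 ms used in the TypeScript example.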

## 📚 Related Documentation

- [Main project README](../README.md)
- [Vision Pipeline design](../docs/VISION_PIPELINE_COMPLETE.md)
- [Deployment guide](../docs/DEPLOYMENT.md)

## 🚨 Troubleshooting

| Problem | Cause | Solution |
|------|------|---------|
| Conversion fails | LibreOffice is not installed | Check the dependencies in the Dockerfile |
| Timeout | The file is too large | Increase the timeout or split the file |
| Port conflict | Port 8100 is already in use | Change the port mapping |
| Permission error | Insufficient directory permissions | Check the volume permissions |
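
The port-conflict row can be checked before starting the container. This is a minimal sketch, assuming a TCP connection attempt to the local port is an acceptable probe; `port_in_use` is a hypothetical helper name.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0
```

If `port_in_use(8100)` returns `True`, choosing another host port in the `-p` mapping avoids the conflict.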

## 📄 License

MIT License
@@ -0,0 +1,191 @@
import io
import os
import subprocess
import time
from typing import Optional

from fastapi import FastAPI, File, HTTPException, UploadFile
from fastapi.responses import FileResponse, RedirectResponse
from PIL import Image  # Pillow library for image processing
from pydantic import BaseModel

# Response models
class ConvertResponse(BaseModel):
    pdf_path: str
    converted: bool
    original: Optional[str] = None
    file_size: Optional[int] = None
    error: Optional[str] = None

class HealthResponse(BaseModel):
    status: str
    service: str
    version: str
    uptime: float

# FastAPI Application
app = FastAPI(
    title="LibreOffice Document Conversion Service",
    description="Convert Word/PPT/Excel/PDF to PDF and support mixed content document processing",
    version="1.0.0",
    docs_url="/docs",
    redoc_url="/redoc"
)

start_time = time.time()

@app.get("/", include_in_schema=False)
async def root():
    """Redirect to documentation page"""
    return RedirectResponse(url="/docs")

@app.get("/health", response_model=HealthResponse)
async def health():
    """Health check interface"""
    return HealthResponse(
        status="healthy",
        service="libreoffice-converter",
        version="1.0.0",
        uptime=time.time() - start_time
    )

@app.post("/convert")
async def convert(file: UploadFile = File(...)):
    """
    Document conversion interface
    Returns: PDF file stream
    """
    try:
        # File format validation
        allowed_extensions = [
            '.pdf', '.doc', '.docx', '.ppt', '.pptx', '.xls', '.xlsx',
            '.md', '.txt', '.rtf', '.odt', '.ods', '.odp',
            '.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'
        ]
        file_ext = os.path.splitext(file.filename)[1].lower()

        if file_ext not in allowed_extensions:
            raise HTTPException(
                status_code=400,
                detail=f"Unsupported file format: {file_ext}. Supported formats: {', '.join(allowed_extensions)}"
            )

        # Check uploads directory existence
        upload_dir = "/app/uploads" if os.path.exists("/app/uploads") else "./uploads"
        os.makedirs(upload_dir, exist_ok=True)

        # Save uploaded file. Only the base name is used so that a crafted
        # filename containing path separators cannot escape upload_dir.
        filepath = os.path.join(upload_dir, os.path.basename(file.filename))
        with open(filepath, "wb") as buffer:
            content = await file.read()
            buffer.write(content)

        # For PDF files, return directly without conversion
        if file_ext == '.pdf':
            return FileResponse(filepath, filename=file.filename, media_type='application/pdf')

        if file_ext == '.md':
            # Use Node.js script to render Markdown to PDF
            expected_pdf = filepath.rsplit('.', 1)[0] + '.pdf'
            cmd = [
                'node',
                '/app/md_to_pdf.js',
                filepath,
                expected_pdf
            ]
        elif file_ext in ['.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp']:
            # For image files, use Pillow to convert to PDF
            expected_pdf = filepath.rsplit('.', 1)[0] + '.pdf'

            # Open image and save as PDF
            with Image.open(filepath) as img:
                # Convert RGBA mode to RGB (support for transparent images)
                if img.mode in ('RGBA', 'LA', 'P'):
                    # Convert to white background
                    background = Image.new('RGB', img.size, (255, 255, 255))
                    if img.mode == 'P':
                        img = img.convert('RGBA')
                    background.paste(img, mask=img.split()[-1] if img.mode in ('RGBA', 'LA') else None)
                    img = background
                elif img.mode != 'RGB':
                    img = img.convert('RGB')

                # Save as PDF
                img.save(expected_pdf, 'PDF', resolution=100.0, save_all=False)

            # Verify PDF generation completed
            if not os.path.exists(expected_pdf):
                raise HTTPException(
                    status_code=500,
                    detail="Image to PDF conversion succeeded but output file not found"
                )

            # Image conversion completed, return PDF file
            filename_base = os.path.splitext(file.filename)[0]
            return FileResponse(expected_pdf, filename=f"{filename_base}.pdf", media_type='application/pdf')
        else:
            # Conversion using LibreOffice
            cmd = [
                'soffice',
                '--headless',
                '--convert-to', 'pdf',
                '--outdir', upload_dir,
                filepath
            ]

        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=600,  # 10 minutes, to support complex Markdown conversion
        )

        # Combine stdout and stderr for logging and error reporting since capture_output uses PIPE
        combined_output = result.stdout if result.stdout else ""
        if result.stderr:
            combined_output += "\n" + result.stderr

        # Log the subprocess output (Node.js script or LibreOffice) for debugging
        print(f"Subprocess output: {combined_output}")

        if result.returncode != 0:
            print(f"Subprocess failed with return code: {result.returncode}")
            raise HTTPException(
                status_code=500,
                detail=f"Conversion failed: {combined_output}"
            )

        # Verify output file
        expected_pdf = filepath.rsplit('.', 1)[0] + '.pdf'
        if not os.path.exists(expected_pdf):
            raise HTTPException(
                status_code=500,
                detail="Conversion succeeded but output file not found"
            )

        filename_base = os.path.splitext(file.filename)[0]
        return FileResponse(expected_pdf, filename=f"{filename_base}.pdf", media_type='application/pdf')

    except HTTPException:
        raise
    except subprocess.TimeoutExpired:
        # Matches the 600-second timeout passed to subprocess.run above
        raise HTTPException(status_code=504, detail="Conversion timeout (600 seconds)")
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/version")
async def version():
    """Version information"""
    return {
        "service": "libreoffice-converter",
        "version": "1.0.0",
        "framework": "FastAPI",
        "libreoffice": "7.x"
    }
@@ -0,0 +1,498 @@
const fs = require('fs');
const { execSync } = require('child_process');
const path = require('path');
const puppeteer = require('puppeteer');

console.log('=== MD to PDF Converter Starting ===');
console.log('Node.js version:', process.version);
console.log('Working directory:', process.cwd());
console.log('Input path:', process.argv[2]);
console.log('Output path:', process.argv[3]);

// Arguments: node md_to_pdf.js <input_md_path> <output_pdf_path>
const inputPath = process.argv[2];
const outputPath = process.argv[3];

if (!inputPath || !outputPath) {
    console.error('Usage: node md_to_pdf.js <input_md_path> <output_pdf_path>');
    process.exit(1);
}

console.log(`Processing Markdown: ${inputPath}`);

(async () => {
    try {
        console.log('Reading input file...');
        let mdContent = fs.readFileSync(inputPath, 'utf8');
        console.log(`File read successfully, length: ${mdContent.length} characters`);

        // 1. Protect Math Blocks
        const mathBlocks = [];
        const placeholderPrefix = 'MATHBLOCK_PLACEHOLDER_';

        mdContent = mdContent.replace(/\$\$([\s\S]*?)\$\$/g, (match, p1) => {
            const id = mathBlocks.length;
            mathBlocks.push(`$$${p1}$$`);
            return `${placeholderPrefix}${id}`;
        });

        mdContent = mdContent.replace(/\$([^\$\n]+?)\$/g, (match, p1) => {
            const id = mathBlocks.length;
            mathBlocks.push(`$${p1}$`);
            return `${placeholderPrefix}${id}`;
        });
        console.log(`Protected ${mathBlocks.length} math blocks`);

        // 2. Convert to HTML using marked (CLI via npx or library?)
        // Since we are in a container, we should use the library directly if possible,
        // but the reference uses npx. To avoid npx/network dependency at runtime,
        // we will require 'marked' from node_modules (assuming we verify it's installed).
        const marked = require('marked');
        console.log('Parsing markdown content...');
        let finalHtml = marked.parse(mdContent);
        console.log('Markdown parsed successfully');

        // 3. Restore Math Blocks
        // Go from the highest index down so that placeholder 1 can never match the
        // prefix of placeholder 10. A replacer function is used because "$$" inside
        // a replacement string would be collapsed to a single "$".
        for (let index = mathBlocks.length - 1; index >= 0; index--) {
            finalHtml = finalHtml.replace(`${placeholderPrefix}${index}`, () => mathBlocks[index]);
        }

        // 4. Fix Mermaid syntax: undo the HTML escaping that marked applies inside code blocks
        finalHtml = finalHtml.replace(
            /<pre><code class="language-mermaid">([\s\S]*?)<\/code><\/pre>/g,
            (match, content) => {
                content = content.replace(/&quot;/g, '"')
                    .replace(/&#39;/g, "'")
                    .replace(/&gt;/g, '>')
                    .replace(/&lt;/g, '<')
                    .replace(/&amp;/g, '&');
                return `<div class="mermaid">${content}</div>`;
            }
        );

        // 5. Wrap in Template
        const template = `
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-markdown-css/5.2.0/github-markdown-light.min.css">

    <!-- Mermaid -->
    <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>

    <!-- MathJax -->
    <script>
        window.MathJax = {
            tex: {
                inlineMath: [['$', '$'], ['\\\\(', '\\\\)']],
                displayMath: [['$$', '$$'], ['\\\\[', '\\\\]']],
                processEscapes: false
            },
            startup: {
                pageReady: () => {
                    return MathJax.startup.defaultPageReady().then(() => {
                        const div = document.createElement('div');
                        div.id = 'mathjax-finished';
                        div.style.display = 'none';
                        document.body.appendChild(div);
                    });
                }
            }
        };
    </script>
    <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"></script>

    <style>
        body {
            box-sizing: border-box;
            margin: 0 auto;
            padding: 20px;
        }
        .mermaid {
            display: flex;
            justify-content: center;
            margin: 20px 0;
        }
        table {
            width: 100% !important;
            display: table !important;
        }
    </style>

    <!-- Embedded Mermaid Library -->
    <script>
        // This is a minimal stub to prevent errors when mermaid is referenced but not available
        if (typeof window.mermaid === 'undefined') {
            window.mermaid = {
                initialize: function() {},
                init: function() {},
                render: function() {}
            };
        }
    </script>

    <!-- MathJax configuration and library -->
    <script>
        window.MathJax = {
            tex: {
                inlineMath: [['$', '$'], ['\\\\(', '\\\\)']],
                displayMath: [['$$', '$$'], ['\\\\[', '\\\\]']],
                processEscapes: false
            },
            startup: {
                pageReady: () => {
                    return MathJax.startup.defaultPageReady().then(() => {
                        const div = document.createElement('div');
                        div.id = 'mathjax-finished';
                        div.style.display = 'none';
                        document.body.appendChild(div);
                    });
                }
            }
        };
    </script>
    <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"></script>
</head>
<body class="markdown-body">
    ${finalHtml}
    <script>
        // Initialize mermaid if it's available
        if (typeof mermaid !== 'undefined') {
            mermaid.initialize({ startOnLoad: true, theme: 'default', securityLevel: 'loose' });
        } else {
            console.log('Mermaid library not loaded, skipping initialization');
        }
    </script>
</body>
</html>`;

        console.log('Template prepared, starting PDF generation...');

        // 6. Generate PDF with Puppeteer
        console.log('Starting Puppeteer browser launch...');
        const browser = await puppeteer.launch({
            executablePath: '/usr/bin/chromium-browser', // Alpine location
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
                '--disable-background-timer-throttling',
                '--disable-renderer-backgrounding',
                '--disable-backgrounding-occluded-windows',
                '--memory-pressure-off',
                '--js-flags=--max-old-space-size=4096', // Raise the memory limit
                '--enable-features=NetworkService',
                '--disable-features=VizDisplayCompositor',
                '--disable-gpu',
                '--disable-web-security'
            ],
            headless: 'new',
            timeout: 120000 // Increased timeout for containerized environment
        });
        console.log('Browser launched successfully');

        const page = await browser.newPage();
        console.log('Page created successfully');

        // Set the page viewport and user agent
        await page.setViewport({ width: 1200, height: 800 });
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
        console.log('Viewport and user agent set');

        // Set the various timeouts; the defaults are lowered to avoid long waits
        await page.setDefaultNavigationTimeout(30000); // 30 seconds
        await page.setDefaultTimeout(30000); // 30 seconds
        console.log('Timeouts configured');

        // Intercept requests to block loading of all external resources
        await page.setRequestInterception(true);
        page.on('request', (req) => {
            // Block every external resource request (including CDN resources) to avoid network timeouts
            const url = req.url();

            if (url.startsWith('http') || url.startsWith('https') || url.startsWith('ftp')) {
                // Answer every external request with an empty response to avoid network timeout errors
                req.respond({
                    status: 200,
                    contentType: 'text/plain',
                    body: ''
                }).catch(() => {});
            } else {
                // Allow local and data: URL resources
                req.continue().catch(() => {});
            }
        });
        console.log('Request interception configured to block all external resources');

        // Listen for error events
        page.on('error', (error) => {
            console.error('Page error:', error);
        });

        page.on('pageerror', (error) => {
            console.error('Page error event:', error);
        });

        page.on('console', (msg) => {
            console.log('Browser console:', msg.text());
        });

        console.log('Error listeners attached');

        // Retry mechanism
        let success = false;
        let attempts = 0;
        const maxAttempts = 3;

        while (!success && attempts < maxAttempts) {
            attempts++;
            console.log(`Attempt ${attempts} of ${maxAttempts} for PDF generation...`);
            console.log(`HTML template length: ${template.length} characters`);

            try {
                console.log('About to navigate to data URL...');
                // Use a fast navigation option that does not wait for external resources
                await page.goto(`data:text/html;charset=UTF-8,${encodeURIComponent(template)}`, {
                    waitUntil: 'domcontentloaded', // Wait for the DOM to load, but not for resources
                    timeout: 30000 // Reduced timeout for faster failure
                });
                console.log('Page loaded successfully');

                // Wait for images to load (with a timeout; images that fail to load are skipped quickly)
                try {
                    console.log('Checking for images to load...');
                    await page.evaluate(async () => {
                        const images = Array.from(document.querySelectorAll('img'));
                        console.log(`Found ${images.length} images on the page`);
                        if (images.length > 0) {
                            // Rather than waiting for every image to load, wait briefly and move on
                            await new Promise((resolve) => {
                                setTimeout(() => {
                                    console.log(`Continuing after attempting to load ${images.length} images`);
                                    resolve();
                                }, 500); // Wait only 500 ms, whether or not the images have loaded
                            });
                        }
                    });
                } catch (e) {
                    console.warn('Error checking images:', e.message);
                }

                // Wait for MathJax rendering (with a timeout)
                console.log('Checking for MathJax...');
                let mathjaxFinished = false;
                let mermaidProcessed = false; // Variable declaration moved here
                try {
                    // Check whether the page contains math (MathJax normally processes $...$ or $$...$$ formulas)
                    const hasMathContent = await page.evaluate(() => {
                        const html = document.documentElement.innerHTML;
                        // Check for math markers
                        return html.includes('$') || html.includes('\\(') || html.includes('\\[') ||
                            html.includes('\\begin{') || html.includes('math-tex') ||
                            document.querySelectorAll('mjx-container').length > 0 ||
                            document.querySelectorAll('[class*="math"]').length > 0;
                    });

                    console.log(`Math content found: ${hasMathContent}`);

                    if (hasMathContent) {
                        console.log('Math content detected, waiting for MathJax...');

                        // Instead of waiting indefinitely for a specific selector, give MathJax a reasonable time to initialize
                        await new Promise(r => setTimeout(r, 1000)); // Brief 1-second wait

                        // Check again whether MathJax exists
                        const mathjaxExists = await page.evaluate(() => typeof window.MathJax !== 'undefined');

                        if (mathjaxExists) {
                            // If MathJax exists, wait for rendering to complete
                            await page.evaluate(async () => {
                                if (window.MathJax && window.MathJax.Hub) {
                                    await window.MathJax.Hub.Queue(['Typeset', window.MathJax.Hub]);
                                } else if (window.MathJax && window.MathJax.typesetPromise) {
                                    await window.MathJax.typesetPromise();
                                }
                            });

                            console.log('MathJax typesetting completed');
                            mathjaxFinished = true;
                        } else {
                            console.log('MathJax not found after content check');
                        }
                    } else {
                        console.log('No math content found, skipping MathJax wait');
                    }
                } catch (e) {
                    console.warn('Error checking MathJax:', e.message);
                }

                // If MathJax has not finished, wait a little longer
                if (!mathjaxFinished) {
                    console.log('Waiting 1 second before generating PDF...');
                    await new Promise(r => setTimeout(r, 1000));
                }

                // If Mermaid has not finished, wait a little longer
                if (!mermaidProcessed) {
                    console.log('Waiting 1 second before generating PDF...');
                    await new Promise(r => setTimeout(r, 1000));
                }

                // Wait for Mermaid diagrams to render
                console.log('Checking for Mermaid diagrams...');
                try {
                    // Check whether the page has any Mermaid diagram containers
                    const mermaidElementsCount = await page.evaluate(() => document.querySelectorAll('.mermaid').length);
                    console.log(`Mermaid diagrams found: ${mermaidElementsCount > 0}`);

                    if (mermaidElementsCount > 0) {
                        console.log(`Processing ${mermaidElementsCount} Mermaid diagrams...`);

                        // Check whether the Mermaid library exists and try to initialize it
                        const mermaidExists = await page.evaluate(() => typeof mermaid !== 'undefined');

                        if (mermaidExists) {
                            console.log('Mermaid library found, attempting to initialize...');

                            await page.evaluate(async () => {
                                // Check whether the mermaid object exists
                                if (typeof mermaid !== 'undefined' && mermaid.init) {
                                    try {
                                        // Try to initialize the Mermaid diagrams
                                        mermaid.init(undefined, '.mermaid');
                                    } catch (e) {
                                        console.log('Mermaid init error:', e.message);
                                    }
                                } else {
                                    console.log('Mermaid library not fully loaded, skipping initialization');
                                }

                                // Wait for rendering to complete (up to 5 seconds)
                                const startTime = Date.now();
                                while (Date.now() - startTime < 5000) {
                                    // Check for Mermaid diagrams that have not finished rendering
                                    const incompleteCharts = document.querySelectorAll('.mermaid:not(.mermaid-loaded)');
                                    if (incompleteCharts.length === 0) {
                                        break;
                                    }
                                    // Wait briefly, then retry
                                    await new Promise(r => setTimeout(r, 100));
                                }
                            });
                        } else {
                            console.log('Mermaid library not found in document, skipping processing');
                        }

                        console.log('Mermaid diagrams processed');
                        mermaidProcessed = true;
                    } else {
                        console.log('No Mermaid diagrams found, skipping wait');
                    }
                } catch (e) {
                    console.warn('Error processing Mermaid:', e.message);
                }

                // Wait for basic page rendering to complete (without waiting for all external resources)
                console.log('Waiting for basic page content to be loaded...');
                try {
                    // Wait for the DOMContentLoaded stage rather than the complete state
                    await page.waitForFunction(() => document.readyState !== 'loading', { timeout: 10000 }); // Reduced timeout
                    console.log('Page DOM loaded, readyState is not loading');
                } catch (e) {
                    console.warn('DOM did not finish loading, continuing...', e.message);
                }

                // Make sure all asynchronous work has finished before generating the PDF
                console.log('Waiting 2 seconds before generating PDF...');
                await new Promise(r => setTimeout(r, 2000));

                console.log('Generating PDF file...');
                await page.pdf({
                    path: outputPath,
                    format: 'A4',
                    printBackground: true,
                    scale: 0.75, // Scale down to fit more content
                    margin: { top: '10mm', right: '10mm', bottom: '10mm', left: '10mm' },
                    timeout: 120000
                });
                console.log('PDF generated successfully');

                success = true;
                console.log(`PDF successfully generated at ${outputPath}`);

            } catch (error) {
                console.error(`Attempt ${attempts} failed:`, error.message);
                console.error(`Error stack:`, error.stack);

                // Fatal errors are not retried
                if (error.message.includes('Protocol error') ||
                    error.message.includes('Target closed') ||
                    error.message.includes('Browser closed') ||
                    error.message.includes('Connection closed') ||
                    error.message.includes('Navigation failed') ||
                    error.message.includes('net::ERR_CONNECTION_CLOSED')) {
                    console.error('Fatal browser error occurred, aborting retries');
                    throw error;
                }

                if (attempts >= maxAttempts) {
                    // If every retry failed, fall back to the most simplified method
                    console.log('All attempts failed, trying most basic PDF generation...');
                    console.log('Creating a new page for basic method...');

                    // Create a fresh page to guarantee a clean state
                    const basicPage = await browser.newPage();
                    await basicPage.setViewport({ width: 1200, height: 800 });
                    await basicPage.setDefaultNavigationTimeout(60000);

                    await basicPage.goto(`data:text/html;charset=UTF-8,${encodeURIComponent(template)}`, {
                        waitUntil: 'domcontentloaded',
                        timeout: 120000 // Increased timeout for containerized environment
                    });

                    // Wait a short while
                    console.log('Waiting 2 seconds in basic method...');
                    await new Promise(r => setTimeout(r, 2000));

                    try {
                        console.log('Generating PDF with basic method...');
                        await basicPage.pdf({
                            path: outputPath,
                            format: 'A4',
                            printBackground: true,
                            scale: 0.75,
                            margin: { top: '10mm', right: '10mm', bottom: '10mm', left: '10mm' },
                            timeout: 300000 // Increased timeout for containerized environment
                        });
                        success = true;
                        console.log(`PDF generated using basic method at ${outputPath}`);
                        await basicPage.close();
                    } catch (basicError) {
                        console.error('Basic PDF generation also failed:', basicError.message);
                        console.error('Basic error stack:', basicError.stack);
                        await basicPage.close();
                        throw basicError;
                    }
                } else {
                    // Wait a while before retrying (gives the system time to recover)
                    const delay = 10000 * attempts; // Increase the delay with each attempt
                    console.log(`Waiting ${delay}ms before retry...`);
                    await new Promise(r => setTimeout(r, delay));
                }
            }
        }

        console.log('Closing browser...');
        await browser.close();
        console.log('Browser closed');
        console.log('=== MD to PDF Conversion Completed Successfully ===');

    } catch (err) {
        console.error('Error during conversion:', err);
        console.error('Error stack:', err.stack);
        process.exit(1);
    }
})();
@@ -0,0 +1,8 @@
{
  "name": "md-to-pdf-service",
  "version": "1.0.0",
  "dependencies": {
    "marked": "^11.1.1",
    "puppeteer": "^21.7.0"
  }
}
@@ -0,0 +1,5 @@
fastapi==0.104.1
uvicorn[standard]==0.24.0
python-multipart==0.0.6
pydantic==2.5.0
Pillow==10.1.0