jzhang533 committed
Commit 1f4004d · 1 Parent(s): da50597

working demo


Signed-off-by: Zhang Jun <[email protected]>

Files changed (4)
  1. .gitignore +50 -0
  2. README.md +77 -6
  3. app.py +790 -0
  4. requirements.txt +5 -0
.gitignore ADDED
@@ -0,0 +1,50 @@
1
+ __pycache__/
2
+ *.py[cod]
3
+ *$py.class
4
+
5
+ # Distribution / packaging
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ *.egg-info/
20
+ .installed.cfg
21
+ *.egg
22
+
23
+ # Environment variables
24
+ .env
25
+ .env.local
26
+ .env.development.local
27
+ .env.test.local
28
+ .env.production.local
29
+
30
+ # Temporary files
31
+ *.tmp
32
+ *.temp
33
+ /tmp/
34
+
35
+ # IDE
36
+ .vscode/
37
+ .idea/
38
+ *.swp
39
+ *.swo
40
+
41
+ # OS
42
+ .DS_Store
43
+ Thumbs.db
44
+
45
+ # Logs
46
+ *.log
47
+
48
+ # Test files
49
+ test_*.py
50
+ *_test.py
README.md CHANGED
@@ -1,14 +1,85 @@
1
  ---
2
- title: Doc2page
3
- emoji: 🦀
4
- colorFrom: purple
5
- colorTo: indigo
6
  sdk: gradio
7
  sdk_version: 5.47.2
8
  app_file: app.py
9
  pinned: false
10
  license: apache-2.0
11
- short_description: turn your document into webpage
12
  ---
13
 
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: Doc2Page - Document to Webpage Converter
3
+ emoji: πŸ„
4
+ colorFrom: blue
5
+ colorTo: purple
6
  sdk: gradio
7
  sdk_version: 5.47.2
8
  app_file: app.py
9
  pinned: false
10
  license: apache-2.0
11
+ short_description: Convert docs to webpages using PaddleOCR and ERNIE
12
  ---
13
 
14
+ # 📄➡️🌐 Doc2Page - Document to Webpage Converter
15
+
16
+ Convert your PDF documents or images into beautiful, responsive HTML webpages!
17
+
18
+ ## ✨ Features
19
+
20
+ - 📖 **Smart OCR**: Extract text from PDFs and images using PaddleOCR
21
+ - 🤖 **AI Enhancement**: Transform content into well-structured HTML using ERNIE
22
+ - 🎨 **Beautiful Output**: Generate responsive, styled webpages with modern CSS
23
+ - 🚀 **Easy Deployment**: Optional one-click deployment to GitHub Pages
24
+ - 📱 **Mobile Friendly**: Responsive design that works on all devices
25
+
26
+ ## 🔧 How It Works
27
+
28
+ 1. **Upload**: Drop your PDF or image file
29
+ 2. **Extract**: PaddleOCR extracts text and structure
30
+ 3. **Transform**: ERNIE converts to beautiful HTML
31
+ 4. **Deploy**: Optionally publish to GitHub Pages
32
+
33
+ ## 📁 Supported Formats
34
+
35
+ - **PDFs**: `.pdf`
36
+ - **Images**: `.png`, `.jpg`, `.jpeg`, `.bmp`, `.tiff`
37
+
38
+ ## 🚀 Quick Start
39
+
40
+ 1. Upload a document using the file picker
41
+ 2. Click "Convert to Webpage"
42
+ 3. Preview your generated webpage
43
+ 4. Download the HTML file
44
+ 5. Optionally deploy to GitHub Pages
45
+
46
+ ## ⚙️ Configuration
47
+
48
+ **Setup using .env file:**
49
+
50
+ 1. Copy the example environment file:
51
+ ```bash
52
+ cp .env.example .env
53
+ ```
54
+
55
+ 2. Edit the `.env` file with your credentials:
56
+ ```bash
57
+ # Required API Configuration for PP-StructureV3
58
+ API_URL=your_pp_structurev3_api_url
59
+ API_TOKEN=your_api_token
60
+
61
+ # Optional ERNIE API Configuration for enhanced HTML generation
62
+ # app.py reads the ERNIE credential from QIANFAN_TOKEN
63
+ QIANFAN_TOKEN=your_qianfan_token_here
64
+ ```
65
+
66
+ **Note:** The `.env` file is automatically loaded when the application starts. Without ERNIE credentials, the app will use a high-quality fallback HTML generator.
67
+
68
+ ## 🏗️ Technical Stack
69
+
70
+ - **Frontend**: Gradio for the web interface
71
+ - **OCR Engine**: PP-StructureV3 API (PaddlePaddle)
72
+ - **AI Processing**: ERNIE 4.5-X1.1-Preview (optional)
73
+ - **Image Processing**: Pillow
74
+
75
+ ## 📝 Example Use Cases
76
+
77
+ - Convert research papers to web format
78
+ - Digitize scanned documents
79
+ - Create web-friendly versions of presentations
80
+ - Transform printed materials to responsive websites
81
+ - Archive documents in searchable HTML format
82
+
83
+ ## 📄 License
84
+
85
+ This project is licensed under the Apache 2.0 License.
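The Configuration section above comes down to three environment variables that app.py in this commit reads with `os.getenv`: `API_URL` and `API_TOKEN` for PP-StructureV3, and `QIANFAN_TOKEN` for ERNIE. A minimal sketch of a startup check follows. The helper name `read_settings` and the shape of its return value are illustrative assumptions and do not appear in the commit.

```python
import os
from typing import Dict, Mapping, Optional


def read_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Report which features the environment enables (hypothetical helper)."""
    env = os.environ if env is None else env
    api_url = env.get("API_URL", "")
    api_token = env.get("API_TOKEN", "")
    qianfan_token = env.get("QIANFAN_TOKEN", "")
    return {
        # OCR needs both the endpoint and its token.
        "ocr_ready": bool(api_url and api_token),
        # Without a Qianfan token the app falls back to basic HTML conversion.
        "ernie_ready": bool(qianfan_token),
        # Names of required variables that are unset or empty.
        "missing": [name for name in ("API_URL", "API_TOKEN") if not env.get(name)],
    }


if __name__ == "__main__":
    print(read_settings())
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise without touching the real environment.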
app.py ADDED
@@ -0,0 +1,790 @@
1
+ import gradio as gr
2
+ import os
3
+ import tempfile
4
+ from pathlib import Path
5
+ import requests
6
+ import base64
7
+ import re
8
+ from typing import Tuple
9
+ import markdown
10
+ from dotenv import load_dotenv
11
+ from openai import OpenAI
12
+
13
+ # Load environment variables from .env file
14
+ load_dotenv()
15
+
16
+ # API Configuration
17
+ API_URL = os.getenv("API_URL", "")
18
+ API_TOKEN = os.getenv("API_TOKEN", "")
19
+
20
+
21
+ class Doc2PageConverter:
22
+ def __init__(self):
23
+ self.qianfan_token = os.getenv('QIANFAN_TOKEN')
24
+ self.qianfan_model = "ernie-x1.1-preview"
25
+ self.client = None
26
+
27
+ if self.qianfan_token:
28
+ self.client = OpenAI(
29
+ base_url="https://qianfan.baidubce.com/v2",
30
+ api_key=self.qianfan_token
31
+ )
32
+
33
+
34
+
35
+ def extract_text_with_api(self, file_path: str) -> str:
36
+ """Extract text and structure using PP-StructureV3 API"""
37
+ try:
38
+ if not API_URL or not API_TOKEN:
39
+ raise ValueError(
40
+ "API_URL and API_TOKEN must be configured in .env file")
41
+
42
+ # Determine file type
43
+ file_extension = Path(file_path).suffix.lower()
44
+ if file_extension == ".pdf":
45
+ file_type = 0 # PDF
46
+ else:
47
+ file_type = 1 # Image
48
+
49
+ # Read file content
50
+ with open(file_path, "rb") as f:
51
+ file_bytes = f.read()
52
+
53
+ # Encode file to base64
54
+ file_data = base64.b64encode(file_bytes).decode("ascii")
55
+
56
+ # Prepare API request
57
+ headers = {
58
+ "Authorization": f"token {API_TOKEN}",
59
+ "Content-Type": "application/json",
60
+ }
61
+
62
+ # Use default settings for simplicity
63
+ payload = {
64
+ "file": file_data,
65
+ "fileType": file_type,
66
+ "useFormulaRecognition": True,
67
+ "useChartRecognition": False,
68
+ "useDocOrientationClassify": False,
69
+ "useDocUnwarping": False,
70
+ "useTextlineOrientation": False,
71
+ "useSealRecognition": True,
72
+ "useRegionDetection": True,
73
+ "useTableRecognition": True,
74
+ "layoutThreshold": 0.5,
75
+ "layoutNms": True,
76
+ "layoutUnclipRatio": 1.0,
77
+ "textDetLimitType": "min",
78
+ "textDetLimitSideLen": 736,
79
+ "textDetThresh": 0.30,
80
+ "textDetBoxThresh": 0.60,
81
+ "textDetUnclipRatio": 1.5,
82
+ "textRecScoreThresh": 0.00,
83
+ "sealDetLimitType": "min",
84
+ "sealDetLimitSideLen": 736,
85
+ "sealDetThresh": 0.20,
86
+ "sealDetBoxThresh": 0.60,
87
+ "sealDetUnclipRatio": 0.5,
88
+ "sealRecScoreThresh": 0.00,
89
+ "useOcrResultsWithTableCells": True,
90
+ "useE2eWiredTableRecModel": False,
91
+ "useE2eWirelessTableRecModel": False,
92
+ "useWiredTableCellsTransToHtml": False,
93
+ "useWirelessTableCellsTransToHtml": False,
94
+ "useTableOrientationClassify": True,
95
+ }
96
+
97
+ # Call API
98
+ response = requests.post(
99
+ API_URL,
100
+ json=payload,
101
+ headers=headers,
102
+ timeout=300, # 5 minutes timeout
103
+ )
104
+
105
+ response.raise_for_status()
106
+ result = response.json()
107
+
108
+ # Process API response
109
+ layout_results = result.get("result", {}).get(
110
+ "layoutParsingResults", [])
111
+
112
+ markdown_content_list = []
113
+ markdown_list = []
114
+
115
+ for res in layout_results:
116
+ markdown_data = res["markdown"]
117
+ markdown_text = markdown_data["text"]
118
+ img_path_to_url = markdown_data["images"]
119
+
120
+ # Embed images into markdown
121
+ markdown_content = self.embed_images_into_markdown_text(
122
+ markdown_text, img_path_to_url
123
+ )
124
+ markdown_content_list.append(markdown_content)
125
+
126
+ # Prepare for concatenation
127
+ markdown_with_content = markdown_data.copy()
128
+ markdown_with_content["text"] = markdown_content
129
+ markdown_list.append(markdown_with_content)
130
+
131
+ # Concatenate all pages
132
+ concatenated_markdown = self.concatenate_markdown_pages(markdown_list)
133
+
134
+ return concatenated_markdown
135
+
136
+ except requests.exceptions.RequestException as e:
137
+ raise RuntimeError(f"API request failed: {str(e)}")
138
+ except Exception as e:
139
+ print(f"Error in API extraction: {e}")
140
+ return ""
141
+
142
+ def embed_images_into_markdown_text(self, markdown_text, markdown_images):
143
+ """Embed images into markdown text"""
144
+ for img_path, img_url in markdown_images.items():
145
+ markdown_text = markdown_text.replace(
146
+ f'<img src="{img_path}"', f'<img src="{img_url}"'
147
+ )
148
+ return markdown_text
149
+
150
+ def concatenate_markdown_pages(self, markdown_list):
151
+ """Concatenate markdown pages into single document"""
152
+ markdown_texts = ""
153
+ previous_page_last_element_paragraph_end_flag = True
154
+
155
+ for res in markdown_list:
156
+ page_first_element_paragraph_start_flag: bool = res["isStart"]
157
+ page_last_element_paragraph_end_flag: bool = res["isEnd"]
158
+
159
+ if (
160
+ not page_first_element_paragraph_start_flag
161
+ and not previous_page_last_element_paragraph_end_flag
162
+ ):
163
+ last_char_of_markdown = (markdown_texts[-1]
164
+ if markdown_texts else "")
165
+ first_char_of_handler = res["text"][0] if res["text"] else ""
166
+
167
+ last_is_chinese_char = (
168
+ re.match(r"[\u4e00-\u9fff]", last_char_of_markdown)
169
+ if last_char_of_markdown
170
+ else False
171
+ )
172
+ first_is_chinese_char = (
173
+ re.match(r"[\u4e00-\u9fff]", first_char_of_handler)
174
+ if first_char_of_handler
175
+ else False
176
+ )
177
+ if not (last_is_chinese_char or first_is_chinese_char):
178
+ markdown_texts += " " + res["text"]
179
+ else:
180
+ markdown_texts += res["text"]
181
+ else:
182
+ markdown_texts += "\n\n" + res["text"]
183
+ previous_page_last_element_paragraph_end_flag = (
184
+ page_last_element_paragraph_end_flag
185
+ )
186
+
187
+ return markdown_texts
188
+
189
+ def markdown_to_html_with_ernie(self, markdown_text: str) -> str:
190
+ """Convert markdown to HTML using ERNIE API"""
191
+ if not self.client:
192
+ # Fallback to basic markdown conversion if no API client
193
+ return self.basic_markdown_to_html(markdown_text)
194
+
195
+ try:
196
+ prompt = f"""Please convert the following markdown text into a modern, clean HTML page. Use contemporary typography with the Inter font family and clean design principles. Make it visually appealing with proper CSS styling, responsive design, and excellent readability.
197
+
198
+ Design requirements:
199
+ - Use Inter font from Google Fonts
200
+ - Clean, modern spacing and typography
201
+ - Subtle shadows and rounded corners
202
+ - Good color contrast and hierarchy
203
+ - Responsive design that works on all devices
204
+ - Include proper HTML structure with head, body, and semantic elements
205
+
206
+ Important: Add a footer at the bottom with "Powered by PaddleOCR and ERNIE" where PaddleOCR links to https://github.com/PaddlePaddle/PaddleOCR and ERNIE links to https://huggingface.co/BAIDU. Style it with modern, subtle styling.
207
+
208
+ Markdown content:
209
+ {markdown_text}
210
+
211
+ IMPORTANT: Return ONLY the raw HTML code starting with <!DOCTYPE html> and ending with </html>. Do NOT wrap it in markdown code blocks or add any explanations. I need the pure HTML content that can be directly saved as an .html file."""
212
+
213
+ messages = [{"role": "user", "content": prompt}]
214
+
215
+ response = self.client.chat.completions.create(
216
+ model=self.qianfan_model,
217
+ messages=messages,
218
+ max_tokens=64000,
219
+ )
220
+
221
+ html_content = (response.choices[0].message.content or "").strip()
222
+
223
+ # Clean up markdown code block markers if present
224
+ if html_content.startswith('```html'):
225
+ html_content = html_content[7:] # Remove ```html
226
+ elif html_content.startswith('```'):
227
+ html_content = html_content[3:] # Remove ```
228
+
229
+ if html_content.endswith('```'):
230
+ html_content = html_content[:-3] # Remove ending ```
231
+
232
+ # Strip any extra whitespace
233
+ html_content = html_content.strip()
234
+
235
+ return html_content
236
+
237
+ except Exception as e:
238
+ print(f"Error calling ERNIE API: {e}")
239
+ return self.basic_markdown_to_html(markdown_text)
240
+
241
+ def basic_markdown_to_html(self, markdown_text: str) -> str:
242
+ """Fallback markdown to HTML conversion"""
243
+ html = markdown.markdown(markdown_text)
244
+
245
+ # Wrap in a complete HTML document with styling
246
+ complete_html = f"""
247
+ <!DOCTYPE html>
248
+ <html lang="en">
249
+ <head>
250
+ <meta charset="UTF-8">
251
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
252
+ <title>Converted Document</title>
253
+ <style>
254
+ /* Modern, clean typography */
255
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');
256
+
257
+ * {{
258
+ margin: 0;
259
+ padding: 0;
260
+ box-sizing: border-box;
261
+ }}
262
+
263
+ body {{
264
+ font-family: 'Inter', system-ui, -apple-system, sans-serif;
265
+ font-weight: 400;
266
+ line-height: 1.7;
267
+ color: #1a1a1a;
268
+ max-width: 850px;
269
+ margin: 0 auto;
270
+ padding: 32px 24px;
271
+ background: #fafafa;
272
+ font-size: 16px;
273
+ }}
274
+
275
+ .container {{
276
+ background: #ffffff;
277
+ padding: 48px;
278
+ border-radius: 12px;
279
+ box-shadow: 0 1px 3px rgba(0,0,0,0.08), 0 4px 24px rgba(0,0,0,0.04);
280
+ border: 1px solid rgba(0,0,0,0.06);
281
+ }}
282
+
283
+ /* Typography hierarchy */
284
+ h1, h2, h3, h4, h5, h6 {{
285
+ font-weight: 600;
286
+ color: #0f0f0f;
287
+ margin: 32px 0 16px 0;
288
+ letter-spacing: -0.02em;
289
+ }}
290
+
291
+ h1 {{
292
+ font-size: 2.25rem;
293
+ font-weight: 700;
294
+ margin-top: 0;
295
+ margin-bottom: 24px;
296
+ border-bottom: 2px solid #e5e7eb;
297
+ padding-bottom: 16px;
298
+ }}
299
+
300
+ h2 {{
301
+ font-size: 1.75rem;
302
+ margin-top: 48px;
303
+ }}
304
+
305
+ h3 {{
306
+ font-size: 1.375rem;
307
+ margin-top: 40px;
308
+ }}
309
+
310
+ h4 {{
311
+ font-size: 1.125rem;
312
+ }}
313
+
314
+ p {{
315
+ margin-bottom: 20px;
316
+ color: #374151;
317
+ line-height: 1.75;
318
+ }}
319
+
320
+ /* Code styling */
321
+ code {{
322
+ font-family: 'SF Mono', Consolas, 'Liberation Mono', monospace;
323
+ background-color: #f3f4f6;
324
+ color: #1f2937;
325
+ padding: 3px 6px;
326
+ border-radius: 4px;
327
+ font-size: 0.875rem;
328
+ font-weight: 500;
329
+ }}
330
+
331
+ pre {{
332
+ background-color: #f8fafc;
333
+ border: 1px solid #e5e7eb;
334
+ padding: 20px;
335
+ border-radius: 8px;
336
+ overflow-x: auto;
337
+ margin: 24px 0;
338
+ font-size: 0.875rem;
339
+ line-height: 1.6;
340
+ }}
341
+
342
+ pre code {{
343
+ background: none;
344
+ padding: 0;
345
+ border-radius: 0;
346
+ }}
347
+
348
+ /* Blockquotes */
349
+ blockquote {{
350
+ border-left: 4px solid #6366f1;
351
+ padding-left: 20px;
352
+ margin: 24px 0;
353
+ font-style: normal;
354
+ color: #4b5563;
355
+ background-color: #f8fafc;
356
+ padding: 16px 20px;
357
+ border-radius: 0 8px 8px 0;
358
+ }}
359
+
360
+ /* Images */
361
+ img {{
362
+ max-width: 100%;
363
+ height: auto;
364
+ border-radius: 8px;
365
+ margin: 20px 0;
366
+ box-shadow: 0 4px 12px rgba(0,0,0,0.1);
367
+ }}
368
+
369
+ /* Tables */
370
+ table {{
371
+ border-collapse: collapse;
372
+ width: 100%;
373
+ margin: 24px 0;
374
+ background: #ffffff;
375
+ border-radius: 8px;
376
+ overflow: hidden;
377
+ box-shadow: 0 1px 3px rgba(0,0,0,0.1);
378
+ }}
379
+
380
+ th, td {{
381
+ padding: 16px;
382
+ text-align: left;
383
+ border-bottom: 1px solid #e5e7eb;
384
+ }}
385
+
386
+ th {{
387
+ background-color: #f9fafb;
388
+ font-weight: 600;
389
+ color: #374151;
390
+ font-size: 0.875rem;
391
+ text-transform: uppercase;
392
+ letter-spacing: 0.05em;
393
+ }}
394
+
395
+ tr:last-child td {{
396
+ border-bottom: none;
397
+ }}
398
+
399
+ /* Lists */
400
+ ul, ol {{
401
+ margin: 16px 0 20px 24px;
402
+ color: #374151;
403
+ }}
404
+
405
+ li {{
406
+ margin-bottom: 8px;
407
+ line-height: 1.6;
408
+ }}
409
+
410
+ /* Links */
411
+ a {{
412
+ color: #6366f1;
413
+ text-decoration: none;
414
+ font-weight: 500;
415
+ }}
416
+
417
+ a:hover {{
418
+ color: #4f46e5;
419
+ text-decoration: underline;
420
+ }}
421
+ /* Footer */
422
+ .footer {{
423
+ margin-top: 64px;
424
+ padding-top: 24px;
425
+ border-top: 1px solid #e5e7eb;
426
+ text-align: center;
427
+ font-size: 14px;
428
+ color: #6b7280;
429
+ font-weight: 400;
430
+ }}
431
+
432
+ .footer a {{
433
+ color: #6366f1;
434
+ font-weight: 500;
435
+ text-decoration: none;
436
+ }}
437
+
438
+ .footer a:hover {{
439
+ color: #4f46e5;
440
+ text-decoration: underline;
441
+ }}
442
+ </style>
443
+ </head>
444
+ <body>
445
+ <div class="container">
446
+ {html}
447
+ <div class="footer">
448
+ Powered by <a href="https://github.com/PaddlePaddle/PaddleOCR" target="_blank">PaddleOCR</a> and
449
+ <a href="https://huggingface.co/BAIDU" target="_blank">ERNIE</a>
450
+ </div>
451
+ </div>
452
+ </body>
453
+ </html>
454
+ """
455
+ return complete_html
456
+
457
+ def process_document(self, file_path: str) -> Tuple[str, str]:
458
+ """Process uploaded document and convert to HTML"""
459
+ try:
460
+ file_extension = Path(file_path).suffix.lower()
461
+
462
+ # Check supported formats
463
+ if file_extension == '.pdf' or file_extension in [
464
+ '.png', '.jpg', '.jpeg', '.bmp', '.tiff']:
465
+ # Process with PP-StructureV3 API
466
+ markdown_content = self.extract_text_with_api(file_path)
467
+ else:
468
+ return ("Error: Unsupported file format. "
469
+ "Please upload PDF or image files."), ""
470
+
471
+ if not markdown_content.strip():
472
+ return ("Warning: No text content extracted "
473
+ "from the document."), ""
474
+
475
+ # Convert markdown to HTML using ERNIE or fallback
476
+ html_content = self.markdown_to_html_with_ernie(markdown_content)
477
+
478
+ return markdown_content, html_content
479
+
480
+ except Exception as e:
481
+ return f"Error processing document: {str(e)}", ""
482
+
483
+ # Initialize converter
484
+ converter = Doc2PageConverter()
485
+
486
+ def process_upload(file):
487
+ """Process uploaded file and return markdown and HTML"""
488
+ if file is None:
489
+ return "Please upload a file.", "", ""
490
+
491
+ try:
492
+ # Process the document
493
+ markdown_result, html_result = converter.process_document(file.name)
494
+
495
+ if html_result:
496
+ return "Document processed successfully!", markdown_result, html_result
497
+ else:
498
+ return markdown_result, "", "" # Error message in markdown_result
499
+
500
+ except Exception as e:
501
+ return f"Error: {str(e)}", "", ""
502
+
503
+ def save_html_file(html_content, filename="converted_page"):
504
+ """Save HTML content to file for download"""
505
+ if not html_content:
506
+ return None
507
+
508
+ # Create temporary file
509
+ temp_file = tempfile.NamedTemporaryFile(mode='w', encoding='utf-8', suffix='.html', delete=False,
510
+ prefix=f"{filename}_")
511
+ temp_file.write(html_content)
512
+ temp_file.close()
513
+
514
+ return temp_file.name
515
+
516
+ # Create custom theme for a clean, modern look
517
+ custom_theme = gr.themes.Default(
518
+ primary_hue="blue",
519
+ secondary_hue="gray",
520
+ neutral_hue="gray",
521
+ font=("Inter", "system-ui", "sans-serif"),
522
+ font_mono=("SF Mono", "Consolas", "monospace")
523
+ ).set(
524
+ body_background_fill="#fafafa",
525
+ background_fill_primary="#ffffff",
526
+ background_fill_secondary="#f8f9fa",
527
+ border_color_primary="#e5e7eb",
528
+ button_primary_background_fill="#6366f1",
529
+ button_primary_background_fill_hover="#4f46e5",
530
+ button_primary_text_color="#ffffff",
531
+ )
532
+
533
+ # Create Gradio interface
534
+ with gr.Blocks(
535
+ title="Doc2Page - Simple Document Converter",
536
+ theme=custom_theme,
537
+ css="""
538
+ .gradio-container {
539
+ max-width: 1200px !important;
540
+ margin: auto;
541
+ padding: 32px 16px;
542
+ }
543
+
544
+ /* Enhanced button styling */
545
+ .gr-button {
546
+ font-weight: 500;
547
+ border-radius: 10px;
548
+ font-size: 14px;
549
+ transition: all 0.2s ease;
550
+ box-shadow: 0 2px 4px rgba(99, 102, 241, 0.1);
551
+ }
552
+
553
+ .gr-button:hover {
554
+ transform: translateY(-1px);
555
+ box-shadow: 0 4px 8px rgba(99, 102, 241, 0.2);
556
+ }
557
+
558
+ /* Input styling */
559
+ .gr-textbox, .gr-file {
560
+ border-radius: 10px;
561
+ font-family: 'Inter', system-ui, sans-serif;
562
+ border: 1px solid #e5e7eb;
563
+ transition: border-color 0.2s ease;
564
+ }
565
+
566
+ .gr-textbox:focus, .gr-file:focus {
567
+ border-color: #6366f1;
568
+ box-shadow: 0 0 0 3px rgba(99, 102, 241, 0.1);
569
+ }
570
+
571
+ /* Typography */
572
+ h1 {
573
+ font-weight: 700;
574
+ color: #1a1a1a;
575
+ margin-bottom: 8px;
576
+ font-size: 2.5rem;
577
+ }
578
+
579
+ .app-description {
580
+ color: #6b7280;
581
+ font-size: 18px;
582
+ margin-bottom: 40px;
583
+ font-weight: 400;
584
+ }
585
+
586
+ /* Tab styling */
587
+ .gr-tab {
588
+ border-radius: 8px 8px 0 0;
589
+ font-weight: 500;
590
+ }
591
+
592
+ /* Card-like sections */
593
+ .gr-column {
594
+ background: rgba(255, 255, 255, 0.5);
595
+ border-radius: 12px;
596
+ padding: 16px;
597
+ margin: 8px;
598
+ }
599
+
600
+ /* Status styling */
601
+ .gr-textbox[data-testid*="status"] {
602
+ background-color: #f8fafc;
603
+ border: 1px solid #e2e8f0;
604
+ }
605
+
606
+ /* Download section styling */
607
+ .download-section {
608
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
609
+ border-radius: 12px;
610
+ padding: 20px;
611
+ color: white;
612
+ margin-top: 20px;
613
+ }
614
+ """
615
+ ) as app:
616
+
617
+ # Header
618
+ gr.Markdown(
619
+ "# Doc2Page",
620
+ elem_classes="main-title"
621
+ )
622
+ gr.Markdown(
623
+ "🥃 Transform your documents into beautiful webpages!",
624
+ elem_classes="app-description"
625
+ )
626
+
627
+ # Main interface
628
+ with gr.Row():
629
+ with gr.Column(scale=1, min_width=350):
630
+ with gr.Group():
631
+ gr.Markdown("### 📄 Upload Document")
632
+ file_input = gr.File(
633
+ label="Choose your file",
634
+ file_types=[".pdf", ".png", ".jpg", ".jpeg", ".bmp", ".tiff"],
635
+ file_count="single",
636
+ height=140
637
+ )
638
+
639
+ process_btn = gr.Button(
640
+ "✨ Convert to Webpage",
641
+ variant="primary",
642
+ size="lg",
643
+ scale=1
644
+ )
645
+
646
+ status_output = gr.Textbox(
647
+ label="Status",
648
+ placeholder="Ready to convert your document...",
649
+ interactive=False,
650
+ lines=3,
651
+ max_lines=3
652
+ )
653
+
654
+ with gr.Column(scale=2):
655
+ gr.Markdown("### 📋 Results")
656
+ with gr.Tabs():
657
+ with gr.TabItem("❀️ Preview", id="preview"):
658
+ html_preview = gr.HTML(
659
+ label="",
660
+ value="<div style='padding: 40px; text-align: center; color: #6b7280;'>Your converted webpage will appear here</div>",
661
+ )
662
+
663
+ with gr.TabItem("📝 Markdown Source", id="markdown"):
664
+ markdown_output = gr.Textbox(
665
+ label="",
666
+ placeholder="Extracted markdown content will appear here...",
667
+ lines=22,
668
+ interactive=False,
669
+ show_copy_button=True
670
+ )
671
+
672
+ with gr.TabItem("🌐 HTML Source", id="html"):
673
+ html_output = gr.Code(
674
+ label="",
675
+ language="html",
676
+ lines=22,
677
+ interactive=False
678
+ )
679
+
680
+ # Success & Download section
681
+ with gr.Row(visible=False) as download_section:
682
+ with gr.Column():
683
+ gr.Markdown("""
684
+ <div style="background: linear-gradient(135deg, #10b981, #059669); border-radius: 12px; padding: 20px; color: white; text-align: center; margin: 20px 0;">
685
+ <h3 style="margin: 0 0 8px 0; color: white;">✅ Conversion Successful!</h3>
686
+ <p style="margin: 0; opacity: 0.9;">Your document has been converted to a beautiful webpage</p>
687
+ </div>
688
+ """)
689
+
690
+ with gr.Row():
691
+ with gr.Column(scale=1):
692
+ gr.Markdown("### 📥 Download Your Webpage")
693
+ download_btn = gr.File(
694
+ label="HTML File",
695
+ visible=True
696
+ )
697
+
698
+ with gr.Column(scale=1):
699
+ gr.Markdown("### 🚀 Quick Deploy Guide")
700
+ gr.Markdown("""
701
+ 1. **GitHub Pages**: Upload as `index.html` to your repo
702
+ 2. **Netlify**: Drag & drop the file to netlify.app
703
+ 3. **Vercel**: Use their simple file deployment
704
+ 4. **Local**: Double-click to open in browser
705
+ """, elem_classes="deploy-guide")
706
+
707
+ # Event handlers
708
+ def process_and_update(file):
709
+ status, markdown_content, html_content = process_upload(file)
710
+
711
+ # Create download file if HTML was generated
712
+ download_file = None
713
+ show_download = False
714
+
715
+ if html_content:
716
+ filename = Path(file.name).stem if file else "converted_page"
717
+ download_file = save_html_file(html_content, filename)
718
+ show_download = True
719
+
720
+ # Preview content with better styling when no content
721
+ preview_content = html_content if html_content else """
722
+ <div style='padding: 60px 20px; text-align: center; color: #6b7280;
723
+ background: #f9fafb; border-radius: 8px; border: 2px dashed #d1d5db;'>
724
+ <h3 style='color: #9ca3af; margin: 0;'>No preview available</h3>
725
+ <p style='margin: 8px 0 0 0;'>Convert a document to see the preview</p>
726
+ </div>
727
+ """
728
+
729
+ return (
730
+ status, # status_output
731
+ markdown_content, # markdown_output
732
+ html_content, # html_output
733
+ preview_content, # html_preview
734
+ download_file, # download_btn
735
+ gr.update(visible=show_download) # download_section
736
+ )
737
+
738
+ process_btn.click(
739
+ fn=process_and_update,
740
+ inputs=[file_input],
741
+ outputs=[
742
+ status_output,
743
+ markdown_output,
744
+ html_output,
745
+ html_preview,
746
+ download_btn,
747
+ download_section
748
+ ]
749
+ )
750
+
751
+ # Footer
752
+ gr.Markdown(
753
+ """
754
+ <div style="text-align: center; padding: 20px 0; margin-top: 40px; border-top: 1px solid #e5e7eb; color: #6b7280; font-size: 14px;">
755
+ Powered by <a href="https://github.com/PaddlePaddle/PaddleOCR" target="_blank" style="color: #6366f1; text-decoration: none;">PaddleOCR</a>
756
+ for text extraction and <a href="https://huggingface.co/BAIDU" target="_blank" style="color: #6366f1; text-decoration: none;">ERNIE</a>
757
+ for HTML generation
758
+ </div>
759
+ """,
760
+ elem_id="footer"
761
+ )
762
+
763
+ # Tips section
764
+ with gr.Accordion("💡 Tips for Best Results", open=False):
765
+ gr.Markdown("""
766
+ **File Types:** PDF, PNG, JPG, JPEG, BMP, TIFF
767
+
768
+ **For Best OCR Results:**
769
+ - Use high-resolution, clear images
770
+ - Ensure good contrast between text and background
771
+ - Avoid skewed or rotated documents
772
+ - PDFs generally produce the best results
773
+
774
+ **🚀 Deploy to GitHub Pages:**
775
+ 1. Create a new GitHub repository or use an existing one
776
+ 2. Download the generated HTML file from above
777
+ 3. Upload it to your repository as `index.html`
778
+ 4. Go to repository Settings → Pages
779
+ 5. Select "Deploy from a branch" → Choose "main" branch
780
+ 6. Your page will be live at `https://yourusername.github.io/yourrepository`
781
+
782
+ **💡 Pro Tips:**
783
+ - Enable custom domains in GitHub Pages settings
784
+ - Use GitHub Actions for automated deployments
785
+ - Consider using Jekyll themes for enhanced styling
786
+ """)
787
+
788
+
789
+ if __name__ == "__main__":
790
+ app.launch()
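The page-joining rule in `concatenate_markdown_pages` is easier to follow in isolation. The sketch below restates it as a standalone function, assuming each page is a dict with the `text`, `isStart`, and `isEnd` keys that app.py reads from the API response. The name `join_pages` is illustrative and not part of the commit.

```python
import re
from typing import Dict, List

# Same character range app.py uses to detect Chinese text.
CJK = re.compile(r"[\u4e00-\u9fff]")


def join_pages(pages: List[Dict[str, object]]) -> str:
    """Join per-page markdown, merging paragraphs that run across a page break."""
    out = ""
    prev_ended = True  # the document start counts as a finished paragraph
    for page in pages:
        text = str(page["text"])
        # A page continues the previous one only if it does not open a new
        # paragraph AND the previous page did not close its last paragraph.
        if not page["isStart"] and not prev_ended:
            last = out[-1] if out else ""
            first = text[0] if text else ""
            if CJK.match(last) or CJK.match(first):
                out += text  # Chinese text is joined without a space
            else:
                out += " " + text
        else:
            # The first page also takes this branch, so the result starts
            # with a blank line, as it does in app.py.
            out += "\n\n" + text
        prev_ended = bool(page["isEnd"])
    return out


if __name__ == "__main__":
    demo = [
        {"text": "The quick brown", "isStart": True, "isEnd": False},
        {"text": "fox jumps.", "isStart": False, "isEnd": True},
        {"text": "New paragraph.", "isStart": True, "isEnd": True},
    ]
    print(repr(join_pages(demo)))
```

The first two demo pages merge into one sentence because the first page leaves its paragraph open and the second does not start a new one; the third page starts a fresh paragraph.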
requirements.txt ADDED
@@ -0,0 +1,5 @@
1
+ gradio==5.47.2
2
+ requests
3
+ markdown
4
+ python-dotenv
5
+ openai
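Only `gradio` is pinned in this requirements.txt; the other four packages float. One way to see which versions a working environment actually resolved, for example before pinning them, is sketched below using the standard library's `importlib.metadata`. The helper name `installed_versions` is an illustrative assumption.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Distribution names as listed in requirements.txt.
PACKAGES = ("gradio", "requests", "markdown", "python-dotenv", "openai")


def installed_versions(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}=={ver}" if ver else f"{name}  # not installed")
```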