Introduction to Parsing Features

Basic Functions

Parse text, tables, formulas, and layout from PDFs and restore them to Markdown, LaTeX, or Word format (Word output does not include layout restoration)
Use cases: Provide higher-quality data for large language model training and retrieval-augmented generation (RAG)
Core scenarios: Chinese and English papers, financial and annual reports, middle school science test papers, books of various kinds, and more

Features

Remove Headers and Footers from PDFs

Removes elements that repeat at the top or bottom of each page, such as page numbers, journal names, and author names in papers
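
The idea of dropping lines that repeat at page edges can be illustrated with a minimal Python sketch. This is not the product's actual algorithm. It assumes each page is already available as a list of text lines, and the function name `strip_repeated_margins` and the `min_fraction` threshold are hypothetical.

```python
import re
from collections import Counter


def strip_repeated_margins(pages, min_fraction=0.6):
    """Drop first/last lines that repeat on most pages.

    pages: list of pages, each a list of text lines in reading order.
    min_fraction: share of pages a line must appear on to count as a
    header or footer (an illustrative default, not a product setting).
    """
    def normalize(line):
        # Replace digit runs so "Page 3" and "Page 4" compare equal.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for lines in pages:
        # Only the first and last line of a page are candidates.
        counts.update({normalize(line) for line in lines[:1] + lines[-1:]})

    threshold = min_fraction * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and normalize(body[0]) in repeated:
            body = body[1:]
        if body and normalize(body[-1]) in repeated:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

Normalizing digits before counting is what lets a changing page number be treated as the same footer on every page.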

Universal Table Recognition

Outputs recognized tables in HTML format (Markdown tables have no syntax for merged cells)
Not limited to specific table types; performs well in general scenarios
Supports rotated tables on a page (rotated either left or right)
Supports recognition of formulas, images, and paragraphs within tables
Supports merging tables that span pages: removes "continued" table text, merges cells split across pages, and removes duplicate headers
Does not support recognition of nested tables
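
Because tables are returned as HTML, downstream code has to account for `rowspan` and `colspan`. The following sketch uses only Python's standard `html.parser` module to expand merged cells into a rectangular grid. It assumes a single, non-nested table with explicit closing tags; the names `TableGrid` and `html_table_to_grid` are hypothetical helpers, not part of the product.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect <td>/<th> cells of one non-nested table into a 2-D grid."""

    def __init__(self):
        super().__init__()
        self.grid = []       # list of rows, each a list of cell strings
        self._pending = {}   # (row, col) -> text carried down by a rowspan
        self._row = -1
        self._col = 0
        self._cell = None    # (rowspan, colspan, text parts) while in a cell

    def _fill_pending(self):
        # Insert values carried down from rowspans in earlier rows.
        while (self._row, self._col) in self._pending:
            self.grid[self._row].append(self._pending.pop((self._row, self._col)))
            self._col += 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
            self.grid.append([])
        elif tag in ("td", "th"):
            a = dict(attrs)
            self._cell = (int(a.get("rowspan", 1)), int(a.get("colspan", 1)), [])

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[2].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            rowspan, colspan, parts = self._cell
            text = "".join(parts).strip()
            self._fill_pending()
            for dc in range(colspan):
                # Repeat the text in every column the cell spans.
                self.grid[self._row].append(text)
                for dr in range(1, rowspan):
                    self._pending[(self._row + dr, self._col + dc)] = text
            self._col += colspan
            self._cell = None
        elif tag == "tr":
            self._fill_pending()


def html_table_to_grid(html):
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    return parser.grid
```

Repeating the merged cell's text in every position it covers is one possible design choice; leaving the covered positions empty would work equally well for layouts where duplication is misleading.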

Formula Recognition

Supports recognition of mixed text and formulas, as well as formulas containing Chinese characters
Supports most formulas except extremely large equation systems and matrices

Layout Restoration

Restores complex layout documents to single-column text flow
Supports most layouts except newspaper-style multi-column layouts
Currently supports multi-level headings (h1-h5)
Partial support for code block indentation
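
To inspect the heading hierarchy in an exported Markdown file, a short script is enough. The sketch below assumes the export uses ATX-style headings (`#` through `#####`), which this document does not confirm; `heading_outline` is a hypothetical helper name.

```python
import re

# One to five '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,5})\s+(.*\S)\s*$")


def heading_outline(markdown):
    """Return (level, title) pairs for h1-h5 headings outside code fences."""
    outline = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            # Toggle on every fence line so '#' comments in code are skipped.
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline
```

Lines with six or more `#` characters are ignored, matching the h1-h5 range stated above.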

Supported Languages

Supported languages: Chinese (Simplified/Traditional), English, Western European languages, Japanese
Future support planned: Russian, Hindi, Arabic

Handwriting Recognition

Recognition of handwritten text and formulas is supported and continues to be improved

Parsing Tutorial

Step 1: Upload Document

Click the "Start Parsing File" button or drag and drop a PDF file onto the upload area
Supports single-file upload of PDF documents up to 300 MB
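
Before uploading, a file can be checked locally against the stated limits. This is a minimal sketch: `check_upload` is a hypothetical helper, and treating "300MB" as 300 × 1024 × 1024 bytes is an assumption, since the exact unit is not stated.

```python
from pathlib import Path

# Assumption: "300MB" is interpreted as 300 MiB.
MAX_UPLOAD_BYTES = 300 * 1024 * 1024


def check_upload(path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    p = Path(path)
    if not p.is_file():
        return [f"{p} is not a file"]

    problems = []
    with p.open("rb") as fh:
        # PDF files begin with the signature "%PDF-" followed by a version.
        if fh.read(5) != b"%PDF-":
            problems.append("file does not start with the PDF signature '%PDF-'")

    size = p.stat().st_size
    if size > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size} bytes, above the {MAX_UPLOAD_BYTES}-byte limit")
    return problems
```

Checking the signature rather than the file extension catches renamed files that are not actually PDFs.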

Step 2: Start Parsing

Page Range: Select the page range to parse (All/Specific pages/Specific range)
Click the "Confirm Processing" button; the system then begins processing the document
Processing progress is displayed in real-time
After parsing is complete, you can preview results and download files
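
The three page-range modes (all pages, specific pages, a specific range) can be modeled as a single selection string. The syntax below ("all", "1,3", "5-7") and the function `parse_page_selection` are illustrative assumptions, not the interface the tool exposes.

```python
def parse_page_selection(spec, page_count):
    """Turn a selection such as 'all', '1,3' or '5-7' into sorted 1-based page numbers."""
    spec = spec.strip().lower()
    if spec in ("", "all"):
        return list(range(1, page_count + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "5-7" is inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not 1 <= start <= end <= page_count:
            raise ValueError(
                f"invalid page selection {part!r} for a {page_count}-page document"
            )
        pages.update(range(start, end + 1))
    return sorted(pages)
```

For example, `parse_page_selection("1, 3, 5-7", 10)` yields pages 1, 3, 5, 6, and 7, while a selection outside the document raises `ValueError`.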

Step 3: Preview Parsing Results

Parsing Results: View document elements identified by the system, such as titles, paragraphs, tables, images, etc.
Operation Menu:
- Copy parsing results as Markdown
- Export as Markdown, Word, and other formats
- Single-column/double-column toggle

Step 4: Download Parsing Results

Click the "Export" icon and select the file format to save (Markdown, Word, etc.)
The parsing results are then saved to your local machine
