Doc2X API v2 PDF Interface Documentation

Basic Information

Base URL

https://v2.doc2x.noedgeai.com

Important Reminders

  1. Connect to the API directly. Regions outside mainland China may see significant network instability, which can interrupt file uploads.
  2. Not recommended for large-scale online services. Computing resources are limited, so tasks may queue (polling then shows a progress of 0). The API is better suited to batch data processing.
  3. Once the status interface returns results, save any images you need promptly, either by downloading them yourself or by retrieving them through the export interface. The server retains results for only 24 hours.

Authorization

First, obtain an API Key (of the form sk-xxx) from open.noedgeai.com.

Add it to the HTTP request headers:

bash
Authorization: Bearer sk-xxx
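
A minimal Python sketch of building this header. The helper name `auth_headers` is hypothetical and not part of the API; it only wraps the key in the Bearer format shown above.

```python
def auth_headers(api_key: str) -> dict:
    """Return HTTP headers that carry the API key as a Bearer token."""
    key = api_key.strip()
    if not key:
        raise ValueError("API key is empty")
    return {"Authorization": f"Bearer {key}"}
```

The returned dict can be passed as the `headers` argument of the `requests` calls used in the examples in this document.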

POST /api/v2/parse/preupload File Pre-upload

Large file upload interface, file size <= 1GB

Request Parameters

None

Request Example

bash
curl -X POST 'https://v2.doc2x.noedgeai.com/api/v2/parse/preupload' \
--header 'Authorization: Bearer sk-xxx'

Response Example

json
{
  "code": "success",
  "data": {
    "uid": "0192d745-5776-7261-abbd-814df3af3449",
    "url": "https://doc2x-pdf.oss-cn-beijing.aliyuncs.com/tmp/0192d745-5776-7261-abbd-814df3af3449.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256..."
  }
}

  1. After obtaining the url, upload the file with an HTTP PUT request to the url field of the response.
  2. After the upload completes, poll the /api/v2/parse/status interface for results. Uploads go to Alibaba Cloud OSS, so the speed depends on your network (overseas users may experience upload failures).

Interface Description

[Flow chart not reproduced here]

Errors such as task-count or page-count limits are not returned by this interface; they are reported by the status interface.

Python Example

python
import json
import time
import requests as rq

base_url = "https://v2.doc2x.noedgeai.com"
secret = "sk-xxx"

def preupload():
    url = f"{base_url}/api/v2/parse/preupload"
    headers = {
        "Authorization": f"Bearer {secret}"
    }
    res = rq.post(url, headers=headers)
    if res.status_code == 200:
        data = res.json()
        if data["code"] == "success":
            return data["data"]
        else:
            raise Exception(f"get preupload url failed: {data}")
    else:
        raise Exception(f"get preupload url failed: {res.text}")

def put_file(path: str, url: str):
    with open(path, "rb") as f:
        res = rq.put(url, data=f) # body is file binary stream
        if res.status_code != 200:
            raise Exception(f"put file failed: {res.text}")

def get_status(uid: str):
    url = f"{base_url}/api/v2/parse/status?uid={uid}"
    headers = {
        "Authorization": f"Bearer {secret}"
    }
    res = rq.get(url, headers=headers)
    if res.status_code == 200:
        data = res.json()
        if data["code"] == "success":
            return data["data"]
        else:
            raise Exception(f"get status failed: {data}")
    else:
        raise Exception(f"get status failed: {res.text}")

upload_data = preupload()
print(upload_data)
url = upload_data["url"]
uid = upload_data["uid"]

put_file("test.pdf", url)

while True:
    status_data = get_status(uid)
    print(status_data)
    if status_data["status"] == "success":
        result = status_data["result"]
        with open("result.json", "w") as f:
            json.dump(result, f)
        break
    elif status_data["status"] == "failed":
        detail = status_data["detail"]
        raise Exception(f"parse failed: {detail}")
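    elif status_data["status"] == "processing":
        # still queued or running: report progress and poll again
        progress = status_data["progress"]
        print(f"progress: {progress}")
        time.sleep(3)
    else:
        # unexpected status: wait before polling again to avoid a tight loop
        time.sleep(3)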

Notes

  • The server needs some time to fetch the file after it is uploaded to OSS, so the status does not show the task as in progress immediately after the upload; allow up to about 20 seconds.
  • The upload link is valid for 5 minutes after it is issued, so start the upload promptly.
  • Upload links cannot be reused: if the HTTP PUT fails (status_code != 200), you can retry with the same link, but once a PUT returns 200 the link can no longer be used.
  • The page count is not known before the file is uploaded, so rate limit errors (parse_concurrency_limit, parse_task_limit_exceeded) are only reported by the status interface.
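
The notes above suggest that a polling loop should tolerate a period with no visible progress and should not wait forever. A minimal sketch follows; the name `wait_for_parse`, its parameters, and the 900-second default deadline are assumptions made for illustration and are not part of the API. The status fetcher is passed in as a callable so the loop can be exercised without network access.

```python
import time
from typing import Callable


def wait_for_parse(fetch_status: Callable[[], dict],
                   interval: float = 3.0,
                   timeout: float = 900.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll fetch_status() until the task succeeds, fails, or the deadline passes.

    fetch_status must return the "data" object of the status response.
    """
    deadline = clock() + timeout
    while True:
        data = fetch_status()
        status = data.get("status")
        if status == "success":
            return data["result"]
        if status == "failed":
            raise RuntimeError(f"parse failed: {data.get('detail')}")
        # "processing" (progress may stay at 0 while the task is queued or the
        # server is still fetching the upload) or any other value: keep waiting.
        if clock() >= deadline:
            raise TimeoutError("parse did not finish before the deadline")
        sleep(interval)
```

With the `get_status` function from the Python example above, a call could look like `wait_for_parse(lambda: get_status(uid))`.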

GET /api/v2/parse/status View Asynchronous Status

After submitting a file through the asynchronous upload flow above, poll this interface for the task status. Recommended polling interval: 1 to 3 seconds.

Results in the cloud (including images on the CDN) can only be queried for 24 hours. Export and save them as soon as possible.

View Asynchronous Status Request Parameters

Request Headers

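| Name | Description | Example Value |
| --- | --- | --- |
| Authorization | API key | Bearer sk-usui9lodl89p7r51suvo0awdawd |

Query Parameters

| Name | Location | Type | Required | Description |
| --- | --- | --- | --- | --- |
| uid | query | string | Yes | Asynchronous task id |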

View Asynchronous Status Request Example

bash
curl --request GET 'https://v2.doc2x.noedgeai.com/api/v2/parse/status?uid=01920000-0000-0000-0000-000000000000' \
--header 'Authorization: Bearer sk-xxx'

python
import requests

url = 'https://v2.doc2x.noedgeai.com/api/v2/parse/status?uid=01920000-0000-0000-0000-000000000000'
headers = {'Authorization': 'Bearer sk-xxx'}

response = requests.get(url, headers=headers)

print(response.text)

View Asynchronous Status Response Example

json
{
  "code": "success",
  "data": {
    "progress": 100,
    "status": "success",
    "detail": "",
    "result": {
      "version": "v2",
      "pages": [
        {
          "url": "",
          "page_idx": 0,
          "page_width": 1802,
          "page_height": 2332,
          "md": ""
        }
      ]
    }
  }
}

Failed Case

json
{
  "code": "parse_error",
  "msg": "Parse error"
}
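
Both response shapes above can be handled by checking the code field before reading data. A minimal sketch follows; `Doc2XError` and `unwrap` are hypothetical names introduced here and are not part of the API or any official client.

```python
class Doc2XError(Exception):
    """Raised when a response body reports a code other than "success"."""

    def __init__(self, code: str, msg: str = ""):
        super().__init__(f"{code}: {msg}" if msg else code)
        self.code = code
        self.msg = msg


def unwrap(body: dict) -> dict:
    """Return body["data"] for a success response, otherwise raise Doc2XError."""
    code = body.get("code")
    if code == "success":
        return body.get("data", {})
    raise Doc2XError(str(code), body.get("msg", ""))
```

Keeping the error code on the exception lets the caller decide whether to wait, retry, or give up.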

Field Explanations

| Field | Meaning | Example |
| --- | --- | --- |
| data.progress | Task progress, integer from 0 to 100 | 100 |
| data.status | Task status: processing (in progress), failed, or success | |
| data.detail | Detailed error information when status=failed | Parse failed, file too large |
| result.pages | Results (list of pages) | |
| page.url | Page URL; not empty if small images exist on the page, otherwise empty | https://cdn.noedgeai.com/xxx.jpg |
| page.page_idx | Page index, starting from 0 | |
| page.page_width / page.page_height | Page width / height, in pixels | |
| page.md | Markdown text of this page | |
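
As an illustration of these fields, the sketch below joins the per-page Markdown of a result object into one string. The function name `pages_to_markdown` and the blank-line separator are assumptions for this example; the export interface described later remains the documented way to obtain files.

```python
def pages_to_markdown(result: dict, separator: str = "\n\n") -> str:
    """Concatenate page.md of every page, ordered by page.page_idx."""
    pages = sorted(result.get("pages", []), key=lambda p: p["page_idx"])
    return separator.join(p.get("md", "") for p in pages)
```

Sorting by page_idx is a defensive choice so the output does not depend on the order of the pages list.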

POST /api/v2/convert/parse Request Export File (Asynchronous)

Export File Request Parameters

| Name | Location | Type | Required | Description |
| --- | --- | --- | --- | --- |
| uid | body | json | Yes | Parse task id |
| to | body | json | Yes | Export format; supported values: md, tex, docx |
| formula_mode | body | json | Yes | Formula export mode. Use normal by default; use dollar to export md files with $ formula markers |
| filename | body | json | No | Filename of the exported md/tex file (without extension); defaults to output.md/output.tex; only valid for md and tex |
| merge_cross_page_forms | body | bool | No | Merge tables that span pages |
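
A minimal sketch of assembling the request body from the parameters above. `build_export_payload` is a hypothetical helper; the check on `to` uses only the three formats listed in the table, and rejecting `filename` for docx is a design choice of this sketch based on the "only valid for md and tex" remark.

```python
from typing import Optional

ALLOWED_FORMATS = {"md", "tex", "docx"}  # formats listed in the table above


def build_export_payload(uid: str,
                         to: str,
                         formula_mode: str = "normal",
                         filename: Optional[str] = None,
                         merge_cross_page_forms: Optional[bool] = None) -> dict:
    """Build the JSON body for POST /api/v2/convert/parse."""
    if to not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported export format: {to!r}")
    payload = {"uid": uid, "to": to, "formula_mode": formula_mode}
    if filename is not None:
        if to not in {"md", "tex"}:
            raise ValueError("filename is only valid for md and tex exports")
        payload["filename"] = filename
    if merge_cross_page_forms is not None:
        payload["merge_cross_page_forms"] = bool(merge_cross_page_forms)
    return payload
```

Optional parameters are omitted from the payload unless they are set, so the server-side defaults apply.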

Export File Request Example

bash
curl --location --request POST 'https://v2.doc2x.noedgeai.com/api/v2/convert/parse' \
--header 'Authorization: Bearer sk-xxx' \
--header 'Content-Type: application/json' \
--data-raw '{
    "uid": "01920000-0000-0000-0000-000000000000",
    "to": "md",
    "formula_mode": "normal",
    "filename": "my_markdown.md",
    "merge_cross_page_forms": false
}'

python
import requests
import json

url = "https://v2.doc2x.noedgeai.com/api/v2/convert/parse"
headers = {
    "Authorization": "Bearer sk-xxx",
    "Content-Type": "application/json",
}

data = {
    "uid": "01920000-0000-0000-0000-000000000000",
    "to": "md",
    "formula_mode": "normal",
    "filename": "my_markdown.md",
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.text)

Export File Response Example

json
{
  "code": "success",
  "data": {
    "status": "processing",
    "url": ""
  }
}

Note: /api/v2/convert/parse only triggers the export task. To check the task status afterwards, poll /api/v2/convert/parse/result; do not call /api/v2/convert/parse repeatedly.

GET /api/v2/convert/parse/result Export Get Results

Export Get Results Request Parameters

Export Get Results Request Headers

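| Name | Description | Example Value |
| --- | --- | --- |
| Authorization | API key | Bearer sk-usui9lodl89p7r51suvo0awdawd |

Export Get Results Query Parameters

| Name | Location | Type | Required | Description |
| --- | --- | --- | --- | --- |
| uid | query | string | Yes | Asynchronous task id |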

Export Get Results Request Example

bash
curl --location --request GET 'https://v2.doc2x.noedgeai.com/api/v2/convert/parse/result?uid=01920000-0000-0000-0000-000000000000' \
--header 'Authorization: Bearer sk-xxx'

python
import requests

url = 'https://v2.doc2x.noedgeai.com/api/v2/convert/parse/result?uid=01920000-0000-0000-0000-000000000000'
headers = {'Authorization': 'Bearer sk-xxx'}

response = requests.get(url, headers=headers)

print(response.text)

Export Get Results Response Example

json
{
  "code": "success",
  "data": {
    "status": "success",
    "url": "https://doc2x-backend.s3.cn-north-1.amazonaws.com.cn/objects/xxx/convert_tex_none.zip?..."
  }
}

The response has the same format as that of /api/v2/convert/parse. Once status is success, use the returned url to download the file.

Download File from URL

After /api/v2/convert/parse/result or /api/v2/convert/parse returns a successful response, download the file with an HTTP GET request to the returned url.

Note: In some cases the returned url represents & as \u0026; replace each \u0026 with & before requesting it.
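
A minimal sketch of that replacement. `normalize_download_url` is a hypothetical helper name, and the URL in the usage line is a placeholder. The literal \u0026 typically appears when the raw response text is handled as a string; a JSON parser such as `response.json()` decodes the escape on its own.

```python
def normalize_download_url(url: str) -> str:
    """Replace the literal six-character sequence \\u0026 with an ampersand."""
    return url.replace("\\u0026", "&")


# Placeholder URL for illustration only.
print(normalize_download_url("https://example.com/file.zip?a=1\\u0026b=2"))
```

URLs that already contain plain & characters pass through unchanged.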

Download File from URL Request Example

bash
curl -L -o downloaded_file.zip "https://doc2x-backend.s3.cn-north-1.amazonaws.com.cn/objects/xxx/convert_tex_none.zip?..."

python
import requests

response = requests.get("https://doc2x-backend.s3.cn-north-1.amazonaws.com.cn/objects/xxx/convert_tex_none.zip?...")

with open('downloaded_file.zip', 'wb') as f:
    f.write(response.content)

Error Codes

HTTP Status Codes

  • HTTP status 429 means the API rate limit was exceeded; wait for previously submitted tasks to complete.
  • HTTP status 200 can still carry a business-level error, identified by the code field of the response body.

Error Code Descriptions

| Error Code | Reason | Solution |
| --- | --- | --- |
| parse_task_limit_exceeded | Task number limit exceeded | The number of tasks being processed has reached the limit; wait for previously submitted tasks to complete |
| parse_concurrency_limit | Task file page limit exceeded | The number of pages being processed has reached the limit; wait for previously submitted tasks to complete |
| parse_quota_limit | Insufficient parsing page quota | The currently available page quota is insufficient |
| parse_error | Parse error | Wait briefly and retry; if the error persists, contact support |
| parse_create_task_error | Task creation failed | Wait briefly and retry; if the error persists, contact support |
| parse_status_not_found | Status expired or uid error | Wait briefly and retry; if the error persists, contact support |
| parse_file_too_large | Single file size exceeds limit | Single files are currently limited to <= 300M; split the PDF |
| parse_page_limit_exceeded | Single file page count exceeds limit | Single files are currently limited to <= 2000 pages; split the PDF |
| parse_file_lock | File parsing failed | The file is locked for one day to prevent repeated parsing. This may be a PDF compatibility issue; try re-printing the PDF and retrying. If it still fails, report the request_id to support |
| parse_file_not_pdf | Uploaded file is not a PDF | Parse files with a .pdf extension |
| parse_file_invalid | Parse file error or invalid | The PDF cannot be parsed, usually because of PDF format issues or a non-standard PDF |
| parse_timeout | Processing time exceeds 15min | Usually the content is too long to process within 15min; try splitting the PDF |
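
The Solution column can be turned into a small client-side decision helper. The sketch below is one possible grouping, not something the API defines: the function name `suggested_action`, the three action labels, and the assignment of codes to groups are assumptions derived from the table above.

```python
# Codes whose documented solution is to wait and retry.
WAIT_AND_RETRY = {
    "parse_task_limit_exceeded",
    "parse_concurrency_limit",
    "parse_error",
    "parse_create_task_error",
    "parse_status_not_found",
}
# Codes whose documented solution is to split the PDF.
SPLIT_PDF = {
    "parse_file_too_large",
    "parse_page_limit_exceeded",
    "parse_timeout",
}


def suggested_action(http_status: int, code: str) -> str:
    """Map an HTTP status and business code to "ok", "wait", "split", or "stop"."""
    if http_status == 429:
        return "wait"   # rate limit: let earlier tasks finish first
    if code == "success":
        return "ok"
    if code in WAIT_AND_RETRY:
        return "wait"
    if code in SPLIT_PDF:
        return "split"
    return "stop"       # quota, lock, invalid file, or an unknown code
```

Unknown codes fall through to "stop" so that a new or unexpected error is never retried blindly.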

Useful Integrations

Packaged Python Library - pdfdeal

Coze Plugin - PDF Recognition from URL

Other References