Open
Labels
bug (Something isn't working)
Description
result_from_official_unstructured.json
result_from_testing_server_installed_unstructured.json
- First element: the official service returns type Title, but the testing server returns UncategorizedText.
- Language: the testing server reports kor (Korean), but the document is Chinese, so language detection looks wrong.
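The two differences above can be checked mechanically. The following is a minimal sketch, assuming each attached JSON file is a list of element objects with a "type" key and a "metadata" object that may contain "languages". The helper names (load_elements, summarize_first_element, compare_results) are hypothetical and not part of unstructured.

```python
import json


def load_elements(path):
    """Load a JSON result file that is assumed to hold a list of element dicts."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize_first_element(elements):
    """Return (type, languages) of the first element, or (None, None) if empty."""
    if not elements:
        return None, None
    first = elements[0]
    # "metadata" and "languages" are assumed key names; missing keys yield None.
    return first.get("type"), first.get("metadata", {}).get("languages")


def compare_results(official, testing):
    """Map each differing field to an (official, testing) pair of values."""
    o_type, o_langs = summarize_first_element(official)
    t_type, t_langs = summarize_first_element(testing)
    diffs = {}
    if o_type != t_type:
        diffs["type"] = (o_type, t_type)
    if o_langs != t_langs:
        diffs["languages"] = (o_langs, t_langs)
    return diffs
```

With both attachments saved locally, calling compare_results(load_elements("result_from_official_unstructured.json"), load_elements("result_from_testing_server_installed_unstructured.json")) should return only the fields that differ, which makes it easier to confirm whether the mismatch is limited to the element type and the detected language.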
Additional info: the testing server configuration
self.extraction_config = {
    'strategy': 'hi_res',  # high-precision strategy
    'model_name': "detectron2_onnx",  # layout detection model: yolox, detectron2_onnx
    'chunking_strategy': 'by_title',
    # Add OCR support (in case the PDF is a scanned document)
    'ocr_languages': 'chi_tra',
    'languages': ['chi_tra', 'chi_sim', 'eng'],
    # Image extraction settings
    'extract_images_in_pdf': True,
    'extract_image_block_types': ["Image", "Table"],
    # Important: these parameters were added to improve title recognition
    'pdf_infer_table_structure': True,
    'infer_table_structure': True,  # infer table structure
    'include_page_breaks': False,
    'include_metadata': True,
    # Chunking parameters
    'max_characters': 4000,
    'new_after_n_chars': 3800,
    'combine_text_under_n_chars': 500,
    #'pdf_image_dpi': 300
    # Other
    #'extract_image_block_to_payload': False,  # do not store images in the payload (saves space)
}
** File source: a publicly available document found online via Google, used only for testing.