bug/parse Chinese document (Traditional Chinese) with attachment #4119

@onmagic

Description

govform02.pdf

result_from_official_unstructured.json

result_from_testing_server_installed_unstructured.json

  1. First element: the official Unstructured output gives type Title, while the testing server gives UncategorizedText.
  2. Language: the testing server reports kor, but it should be Chinese. This looks like a language-detection problem.

Additional info: the testing server config
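The two differences above can be checked mechanically. The following is a minimal sketch, assuming both attached JSON files are top-level lists of element objects with a `type` field and a `metadata.languages` field (the usual shape of Unstructured JSON output). The helper names `load_elements`, `summarize`, and `diff_summaries` are hypothetical and not part of any library.

```python
import json
from pathlib import Path


def load_elements(path):
    """Load a list of element dicts from a JSON result file (assumed layout)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(elements, limit=5):
    """Return (type, languages) pairs for the first `limit` elements."""
    return [
        (el.get("type"), el.get("metadata", {}).get("languages"))
        for el in elements[:limit]
    ]


def diff_summaries(official, testing, limit=5):
    """Return (index, official_pair, testing_pair) wherever the two runs disagree."""
    pairs = zip(summarize(official, limit), summarize(testing, limit))
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


if __name__ == "__main__":
    # File names are taken from the attachments listed in this issue.
    official = load_elements("result_from_official_unstructured.json")
    testing = load_elements("result_from_testing_server_installed_unstructured.json")
    for index, a, b in diff_summaries(official, testing):
        print(f"element {index}: official={a} testing={b}")
```

Because chunking and layout models can change the number of elements, comparing by position is only meaningful for the first few elements, which is why `limit` defaults to a small value.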

self.extraction_config = {
            'strategy'                      : 'hi_res',                 # high-precision strategy
            'model_name'                    : "detectron2_onnx",        # layout detection model: yolox, detectron2_onnx
            
            'chunking_strategy'             : 'by_title',
            
            
            # Add OCR support (in case the PDF is a scan)
            'ocr_languages'                 : 'chi_tra',
            'languages'                     : ['chi_tra', 'chi_sim', 'eng'],
            
            # Image extraction config
            'extract_images_in_pdf'         : True,
            'extract_image_block_types'     : ["Image", "Table"],
            
            # Important: add these parameters to improve title recognition
            'pdf_infer_table_structure'     : True,
            'infer_table_structure'         : True,         # infer table structure
            'include_page_breaks'           : False,
            'include_metadata'              : True,
            
            # Adjust chunking parameters
            'max_characters'                : 4000,
            'new_after_n_chars'             : 3800,
            'combine_text_under_n_chars'    : 500,
            
            #'pdf_image_dpi'                 : 300
            
            # Other
            #'extract_image_block_to_payload'    : False,  # do not store images in the payload (saves space)
        }
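
For anyone trying to reproduce this, here is a minimal sketch of how a config dict like the one above might be passed to `partition_pdf`. It assumes the dict is simply unpacked as keyword arguments, which may not match how the testing server actually calls the library. `build_partition_kwargs` and `run_partition` are hypothetical helpers. Dropping `ocr_languages` when `languages` is set reflects my understanding that `ocr_languages` is the older, deprecated way of passing languages. That is an assumption worth verifying against the installed version.

```python
def build_partition_kwargs(config):
    """Copy the config, dropping 'ocr_languages' when 'languages' is already set.

    Assumption: 'languages' is the preferred argument and 'ocr_languages'
    is a deprecated equivalent, so only one of them is forwarded.
    """
    kwargs = dict(config)
    if kwargs.get("languages"):
        kwargs.pop("ocr_languages", None)
    return kwargs


def run_partition(pdf_path, config):
    """Partition a PDF with the given config (requires unstructured with PDF extras)."""
    # Imported lazily so the helper above can be used without the library installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=pdf_path, **build_partition_kwargs(config))


if __name__ == "__main__":
    config = {
        "strategy": "hi_res",
        "ocr_languages": "chi_tra",
        "languages": ["chi_tra", "chi_sim", "eng"],
    }
    elements = run_partition("govform02.pdf", config)
    for element in elements[:5]:
        print(element.category, element.metadata.languages)
```

Printing the category and `metadata.languages` of the first few elements makes it easy to see whether the Title and language results change when only one language argument is passed.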

** File source: found publicly online via Google, used for testing only.
