Description
@aymenfurter
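Uploading a file fails with an HTTP 500 error.
Could it be that the file is not being passed through Document Intelligence? PDF files upload fine, but Word documents and other file formats fail. The log below shows the upload route calling get_pdf_page_count on the uploaded file, and PyPDF2 raising "EOF marker not found".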
2024-10-21T02:43:41.099251851Z INFO:geventwebsocket.handler:100.100.0.115 - - [2024-10-21 02:43:41] "GET /indexes/espp/files?is_restricted=false HTTP/1.1" 200 171 0.007272
2024-10-21T02:43:41.376999492Z INFO:geventwebsocket.handler:100.100.0.115 - - [2024-10-21 02:43:41] "GET /indexes HTTP/1.1" 200 169 0.045113
2024-10-21T02:43:42.110639426Z INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'https://strwxmbueydoikkg.queue.core.windows.net/indexing/messages?numofmessages=REDACTED&visibilitytimeout=REDACTED'
2024-10-21T02:43:42.110684370Z Request method: 'GET'
2024-10-21T02:43:42.110695080Z Request headers:
2024-10-21T02:43:42.110703155Z 'x-ms-version': 'REDACTED'
2024-10-21T02:43:42.110710829Z 'Accept': 'application/xml'
2024-10-21T02:43:42.110718764Z 'User-Agent': 'azsdk-python-storage-queue/12.11.0 Python/3.11.10 (Linux-5.15.164.1-1.cm2-x86_64-with-glibc2.36)'
2024-10-21T02:43:42.110726770Z 'x-ms-date': 'REDACTED'
2024-10-21T02:43:42.110734664Z 'x-ms-client-request-id': '48ae17ca-8f56-11ef-8667-3e4e57cc0722'
2024-10-21T02:43:42.110741838Z 'Authorization': 'REDACTED'
2024-10-21T02:43:42.110749071Z No body was attached to the request
2024-10-21T02:43:42.115860276Z INFO:azure.core.pipeline.policies.http_logging_policy:Response status: 200
2024-10-21T02:43:42.115884081Z Response headers:
2024-10-21T02:43:42.115894380Z 'Cache-Control': 'no-cache'
2024-10-21T02:43:42.115903427Z 'Transfer-Encoding': 'chunked'
2024-10-21T02:43:42.115911342Z 'Content-Type': 'application/xml'
2024-10-21T02:43:42.115918606Z 'Server': 'Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0'
2024-10-21T02:43:42.115925689Z 'x-ms-request-id': 'dec4a70e-2003-0054-0263-23f559000000'
2024-10-21T02:43:42.115933443Z 'x-ms-client-request-id': '48ae17ca-8f56-11ef-8667-3e4e57cc0722'
2024-10-21T02:43:42.115940537Z 'x-ms-version': 'REDACTED'
2024-10-21T02:43:42.115947640Z 'Date': 'Mon, 21 Oct 2024 02:43:41 GMT'
2024-10-21T02:43:45.406998738Z ERROR:root:Error getting PDF page count: EOF marker not found
2024-10-21T02:43:45.407628523Z ERROR:main:Exception on /indexes/espp/upload [POST]
2024-10-21T02:43:45.407666334Z Traceback (most recent call last):
2024-10-21T02:43:45.407677155Z File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 1473, in wsgi_app
2024-10-21T02:43:45.407685611Z response = self.full_dispatch_request()
2024-10-21T02:43:45.407693956Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407702672Z File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 882, in full_dispatch_request
2024-10-21T02:43:45.407711248Z rv = self.handle_user_exception(e)
2024-10-21T02:43:45.407718973Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407726527Z File "/usr/local/lib/python3.11/site-packages/flask_cors/extension.py", line 178, in wrapped_function
2024-10-21T02:43:45.407734732Z return cors_after_request(app.make_response(f(*args, **kwargs)))
2024-10-21T02:43:45.407741976Z ^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407749510Z File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 880, in full_dispatch_request
2024-10-21T02:43:45.407757455Z rv = self.dispatch_request()
2024-10-21T02:43:45.407765060Z ^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407773124Z File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 865, in dispatch_request
2024-10-21T02:43:45.407780558Z return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return]
2024-10-21T02:43:45.407802430Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407810214Z File "/app/app/api/routes.py", line 213, in _upload_file
2024-10-21T02:43:45.407817638Z num_pages = get_pdf_page_count(file_buffer)
2024-10-21T02:43:45.407825593Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407833478Z File "/app/app/ingestion/pdf_processing.py", line 24, in get_pdf_page_count
2024-10-21T02:43:45.407841393Z reader = PdfReader(pdf_bytes)
2024-10-21T02:43:45.407849067Z ^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407857022Z File "/usr/local/lib/python3.11/site-packages/PyPDF2/_reader.py", line 319, in __init__
2024-10-21T02:43:45.407864606Z self.read(stream)
2024-10-21T02:43:45.407872281Z File "/usr/local/lib/python3.11/site-packages/PyPDF2/_reader.py", line 1415, in read
2024-10-21T02:43:45.407879564Z self._find_eof_marker(stream)
2024-10-21T02:43:45.407887118Z File "/usr/local/lib/python3.11/site-packages/PyPDF2/_reader.py", line 1471, in _find_eof_marker
2024-10-21T02:43:45.407894743Z raise PdfReadError("EOF marker not found")
2024-10-21T02:43:45.407902166Z PyPDF2.errors.PdfReadError: EOF marker not found
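The traceback suggests the upload handler tries to count PDF pages for every file, including ones that are not PDFs. A possible guard, as a rough sketch only: check for the `%PDF-` signature before counting pages, and treat a parse failure as "page count unknown" instead of letting it surface as a 500. The names `looks_like_pdf` and `safe_page_count` are hypothetical and not part of the project; `count_pages` stands in for the existing `get_pdf_page_count`, and I am assuming it accepts a file-like object.

```python
import io
from typing import Callable, Optional

# Every PDF file starts with this signature; .docx files are ZIP
# archives and start with b"PK\x03\x04" instead.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the uploaded bytes start with the PDF signature."""
    return data.startswith(PDF_MAGIC)


def safe_page_count(
    data: bytes,
    count_pages: Callable[[io.BytesIO], int],
) -> Optional[int]:
    """Return the page count for PDFs, or None when it cannot be determined.

    `count_pages` is a stand-in for the app's own page-counting helper
    (assumption: it takes a binary file-like object and returns an int).
    """
    if not looks_like_pdf(data):
        # Not a PDF (for example a Word document): skip page counting.
        return None
    try:
        return count_pages(io.BytesIO(data))
    except Exception:
        # In the real app this would more likely be narrowed to
        # PyPDF2.errors.PdfReadError; kept broad here so the sketch
        # runs without PyPDF2 installed.
        return None
```

With something like this in the upload route, a Word file would skip the PDF step and continue to whatever processing the other formats are supposed to get, and a truncated PDF would no longer crash the request.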