Commit 2cb1046

yngvarhuang and virgilwong authored
fix: The doc file cannot be parsed(infiniflow#11092) (infiniflow#11093)
### What problem does this PR solve?

The doc file cannot be parsed(infiniflow#11092)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: virgilwong <hyhvirgil@gmail.com>
1 parent a880beb commit 2cb1046

1 file changed

Lines changed: 8 additions & 1 deletion

rag/app/naive.py

@@ -724,8 +724,15 @@ def chunk(filename, binary=None, from_page=0, to_page=100000,
     elif re.search(r"\.doc$", filename, re.IGNORECASE):
         callback(0.1, "Start to parse.")
 
+        try:
+            from tika import parser as tika_parser
+        except Exception as e:
+            callback(0.8, f"tika not available: {e}. Unsupported .doc parsing.")
+            logging.warning(f"tika not available: {e}. Unsupported .doc parsing for {filename}.")
+            return []
+
         binary = BytesIO(binary)
-        doc_parsed = parser.from_buffer(binary)
+        doc_parsed = tika_parser.from_buffer(binary)
         if doc_parsed.get('content', None) is not None:
             sections = doc_parsed['content'].split('\n')
             sections = [(_, "") for _ in sections if _]
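
The change above imports `tika` lazily inside the `.doc` branch and returns an empty list when the import fails, so a missing dependency only affects `.doc` files. The following is a minimal standalone sketch of that guarded-import pattern, not the project's code. The helper names `parse_doc` and `split_sections`, the `module_name` parameter, and the choice to return `[]` when no content is extracted are assumptions made for illustration.

```python
import importlib
import logging
from io import BytesIO


def split_sections(content):
    """Split extracted text into (text, "") pairs, skipping empty lines."""
    return [(line, "") for line in content.split("\n") if line]


def parse_doc(binary, filename, callback, module_name="tika.parser"):
    """Hypothetical helper that mirrors the guarded import in the diff."""
    try:
        # Import lazily so a missing dependency is only hit for .doc files.
        doc_parser = importlib.import_module(module_name)
    except Exception as e:
        # Report the problem through the progress callback and the log,
        # then return no chunks instead of raising.
        callback(0.8, f"{module_name} not available: {e}. Unsupported .doc parsing.")
        logging.warning(
            f"{module_name} not available: {e}. Unsupported .doc parsing for {filename}."
        )
        return []

    parsed = doc_parser.from_buffer(BytesIO(binary))
    content = parsed.get("content", None)
    # Assumption: treat missing content as "no sections".
    return split_sections(content) if content is not None else []
```

Passing a module name that cannot be imported exercises the fallback path: the callback receives progress `0.8` with the explanatory message, and the function returns an empty list rather than raising.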