Make sure you have:
- the models folder; if you don't have it, get it from: https://drive.google.com/drive/folders/11BEK-gFWjFB1Qb3mxHMg1OCYQn-QtV29?usp=drive_link
  /models
    /Layout
    /MFD
    /MFR
    README.MD
- 2.EngineeringHistory3Books_text.parquet; if you don't have it, get it from: https://drive.google.com/file/d/1DwXRLUqc7W4fLAtZR3XWiLva0Dc2VBAY/view?usp=sharing
- the conda environment used for this part: LMMRAGwithGPU (from computer 391)
- a .env file; for testing you can use .env_for_testing (see the loading sketch below)
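This README does not spell out how the .env file is consumed; a minimal sketch, assuming the notebooks load it with python-dotenv and with SOME_API_KEY standing in for whatever variables they actually read:

```python
# Minimal sketch: load the test environment file with python-dotenv.
# "SOME_API_KEY" is a hypothetical variable name, not one defined by this repo.
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path=".env_for_testing")  # use ".env" for a real run
print(os.environ.get("SOME_API_KEY"))        # sanity-check that the file was picked up
```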
Part 1 includes:
- image extraction
- caption generation
- text OCR
These three steps use the same environment; run them in order 1 -> 2 -> 3:
- imageextract.ipynb -> produces a cropped-image folder and a full-page folder, plus a .json file pairing each image with its page number
- captiongeneration.ipynb -> produces a .json file of images and their associated captions
- textOCR.ipynb -> produces a .json file of the OCR'd text
After these three steps you will have the following outputs (a quick inspection sketch follows the list):
- imagecaption.json
- text.json
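The exact schemas of these files are whatever the notebooks write; a minimal sketch for loading and inspecting them, with assumed shapes, looks like this:

```python
# Minimal sketch: inspect the Part 1 outputs.
# The layouts described in the comments are assumptions; check the real files.
import json

with open("imagecaption.json") as f:
    imagecaption = json.load(f)  # assumed: image file -> caption / page info

with open("text.json") as f:
    text = json.load(f)          # assumed: per-page OCR text

print(f"{len(imagecaption)} captioned images, {len(text)} OCR text entries")
```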
Part 2 includes:
- embed.ipynb
- rag.ipynb
You may reuse the conda env from part 1.
- After Part 1 you have one .json file for the image dataset and one for the text dataset.
- Parquet files: run embed.ipynb to read the two .json files above, embed both, and store the results in xxx_text.parquet and xxx_image.parquet (a sketch of this step follows below).
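A minimal sketch of what this step might look like, assuming sentence-transformers for the embeddings, that imagecaption.json maps image paths to captions (embedded here as a text proxy for the images), and that text.json is a list of per-page records; the model name, field names, and "xxx" prefix are placeholders, not necessarily what embed.ipynb actually uses:

```python
# Minimal sketch of the embed step; schemas, model choice, and file names are assumptions.
import json
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# text.json: assumed list of {"page": ..., "text": ...} records
with open("text.json") as f:
    df_text = pd.DataFrame(json.load(f))
df_text["embedding"] = model.encode(df_text["text"].tolist()).tolist()

# imagecaption.json: assumed mapping of image path -> caption; the caption
# is embedded as a stand-in for the image itself.
with open("imagecaption.json") as f:
    pairs = json.load(f)
df_image = pd.DataFrame([{"image": k, "caption": v} for k, v in pairs.items()])
df_image["embedding"] = model.encode(df_image["caption"].tolist()).tolist()

# "xxx" stands for the dataset prefix used by the notebooks.
df_text.to_parquet("xxx_text.parquet")
df_image.to_parquet("xxx_image.parquet")
```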
- RAG: run rag.ipynb to perform the vector search and get the RAG results (a retrieval sketch follows below).
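A minimal sketch of the retrieval step, assuming brute-force cosine similarity over the text parquet; rag.ipynb may instead use a proper vector index or also search the image embeddings:

```python
# Minimal sketch of vector search over the embedded parquet; not the exact rag.ipynb logic.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the model used in embed.ipynb
df = pd.read_parquet("xxx_text.parquet")         # "xxx" is the dataset prefix
emb = np.vstack(df["embedding"].to_numpy())      # shape: (n_chunks, dim)

def search(query: str, k: int = 5) -> pd.DataFrame:
    """Return the top-k rows most similar to the query by cosine similarity."""
    q = model.encode([query])[0]
    scores = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-12)
    top = np.argsort(-scores)[:k]
    return df.iloc[top].assign(score=scores[top])

# Example query against the engineering-history text chunks.
print(search("early steam engine development", k=3))
```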