Commit d333423

Merge pull request #585 from h2oai/control_embedding_migration
Control embedding migration
2 parents 738d7ac + fe6aaef commit d333423

31 files changed: +938 -404 lines changed
README.md (+4 -2)

    @@ -17,6 +17,7 @@ Query and summarize your documents or just chat with local private GPT LLMs usin
     - **Inference Servers** support (HF TGI server, vLLM, Gradio, ExLLaMa, OpenAI)
     - **OpenAI-compliant Python client API** for client-server control
     - **Evaluate** performance using reward models
    +- **Quality** maintained with over 250 unit and integration tests taking over 4 GPU-hours

     ### Getting Started

    @@ -128,9 +129,10 @@ GPU and CPU mode tested on variety of NVIDIA GPUs in Ubuntu 18-22, but any moder
     - To run h2oGPT tests:
     ```bash
     wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin
    -pip install requirements-parser
    -pytest -s -v tests client/tests
    +pip install requirements-parser pytest-instafail
    +pytest --instafail -s -v tests client/tests
     ```
    +or tweak/run `tests/test4gpus.sh` to run tests in parallel.

     ### Help

docs/FAQ.md (+17 -1)

    @@ -182,11 +182,27 @@ This warning can be safely ignored.
     - `CUDA_VISIBLE_DEVICES`: Standard list of CUDA devices to make visible.
     - `PING_GPU`: ping GPU every few minutes for full GPU memory usage by torch, useful for debugging OOMs or memory leaks
     - `GET_GITHASH`: get git hash on startup for system info. Avoided normally as can fail with extra messages in output for CLI mode
    -
    +- `H2OGPT_SCRATCH_PATH`: Choose base scratch folder for scratch databases and files
    +- `H2OGPT_BASE_PATH`: Choose base folder for all files except scratch files
     These can be useful on HuggingFace spaces, where one sets secret tokens because CLI options cannot be used.

     > **_NOTE:_** Scripts can accept different environment variables to control query arguments. For instance, if a Python script takes an argument like `--load_8bit=True`, the corresponding ENV variable would follow this format: `H2OGPT_LOAD_8BIT=True` (regardless of capitalization). It is important to ensure that the environment variable is assigned the exact value that would have been used for the script's query argument.

    +### How to run functions in src from Python interpreter
    +
    +E.g.
    +```python
    +import sys
    +sys.path.append('src')
    +from src.gpt_langchain import get_supported_types
    +non_image_types, image_types, video_types = get_supported_types()
    +print(non_image_types)
    +print(image_types)
    +for x in image_types:
    +    print(' - `.%s` : %s Image (optional),' % (x.lower(), x.upper()))
    +print(video_types)
    +```
    +
     ### GPT4All not producing output.

     Please contact GPT4All team. Even a basic test can give empty result.
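
The NOTE in the docs/FAQ.md hunk maps a CLI flag such as `--load_8bit=True` to the environment variable `H2OGPT_LOAD_8BIT=True`. A minimal sketch of that naming convention follows. The helpers `env_var_name` and `env_override` are hypothetical and not part of h2oGPT, whose actual lookup code is not shown in this diff; the "regardless of capitalization" remark is not modeled here.

```python
import os


def env_var_name(arg_name: str, prefix: str = "H2OGPT_") -> str:
    """Return the environment variable name for a CLI argument name."""
    # Strip leading dashes so both 'load_8bit' and '--load_8bit' work.
    return prefix + arg_name.lstrip("-").upper()


def env_override(arg_name: str, default=None, environ=None):
    """Return the raw string set in the environment, else the default.

    Hypothetical helper: the value is returned unparsed, mirroring the
    note's advice to pass the exact value the CLI flag would receive.
    """
    environ = os.environ if environ is None else environ
    return environ.get(env_var_name(arg_name), default)


if __name__ == "__main__":
    print(env_var_name("--load_8bit"))
    print(env_override("load_8bit", environ={"H2OGPT_LOAD_8BIT": "True"}))
```

Passing `environ` explicitly keeps the sketch testable without mutating the process environment.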

docs/README_LangChain.md (+67 -2)

    @@ -44,9 +44,72 @@ Open-source data types are supported, .msg is not supported due to GPL-3 require
     - `.odt`: Open Document Text,
     - `.pptx` : PowerPoint Document,
     - `.ppt` : PowerPoint Document,
    +- `.apng` : APNG Image (optional),
    +- `.blp` : BLP Image (optional),
    +- `.bmp` : BMP Image (optional),
    +- `.bufr` : BUFR Image (optional),
    +- `.bw` : BW Image (optional),
    +- `.cur` : CUR Image (optional),
    +- `.dcx` : DCX Image (optional),
    +- `.dds` : DDS Image (optional),
    +- `.dib` : DIB Image (optional),
    +- `.emf` : EMF Image (optional),
    +- `.eps` : EPS Image (optional),
    +- `.fit` : FIT Image (optional),
    +- `.fits` : FITS Image (optional),
    +- `.flc` : FLC Image (optional),
    +- `.fli` : FLI Image (optional),
    +- `.fpx` : FPX Image (optional),
    +- `.ftc` : FTC Image (optional),
    +- `.ftu` : FTU Image (optional),
    +- `.gbr` : GBR Image (optional),
    +- `.gif` : GIF Image (optional),
    +- `.grib` : GRIB Image (optional),
    +- `.h5` : H5 Image (optional),
    +- `.hdf` : HDF Image (optional),
    +- `.icb` : ICB Image (optional),
    +- `.icns` : ICNS Image (optional),
    +- `.ico` : ICO Image (optional),
    +- `.iim` : IIM Image (optional),
    +- `.im` : IM Image (optional),
    +- `.j2c` : J2C Image (optional),
    +- `.j2k` : J2K Image (optional),
    +- `.jfif` : JFIF Image (optional),
    +- `.jp2` : JP2 Image (optional),
    +- `.jpc` : JPC Image (optional),
    +- `.jpe` : JPE Image (optional),
    +- `.jpeg` : JPEG Image (optional),
    +- `.jpf` : JPF Image (optional),
    +- `.jpg` : JPG Image (optional),
    +- `.jpx` : JPX Image (optional),
    +- `.mic` : MIC Image (optional),
    +- `.mpeg` : MPEG Image (optional),
    +- `.mpg` : MPG Image (optional),
    +- `.msp` : MSP Image (optional),
    +- `.pbm` : PBM Image (optional),
    +- `.pcd` : PCD Image (optional),
    +- `.pcx` : PCX Image (optional),
    +- `.pgm` : PGM Image (optional),
     - `.png` : PNG Image (optional),
    -- `.jpg` : JPEG Image (optional),
    -- `.jpeg` : JPEG Image (optional).
    +- `.pnm` : PNM Image (optional),
    +- `.ppm` : PPM Image (optional),
    +- `.ps` : PS Image (optional),
    +- `.psd` : PSD Image (optional),
    +- `.pxr` : PXR Image (optional),
    +- `.qoi` : QOI Image (optional),
    +- `.ras` : RAS Image (optional),
    +- `.rgb` : RGB Image (optional),
    +- `.rgba` : RGBA Image (optional),
    +- `.sgi` : SGI Image (optional),
    +- `.tga` : TGA Image (optional),
    +- `.tif` : TIF Image (optional),
    +- `.tiff` : TIFF Image (optional),
    +- `.vda` : VDA Image (optional),
    +- `.vst` : VST Image (optional),
    +- `.webp` : WEBP Image (optional),
    +- `.wmf` : WMF Image (optional),
    +- `.xbm` : XBM Image (optional),
    +- `.xpm` : XPM Image (optional).

     To support image captioning, on Ubuntu run:
     ```bash
    @@ -326,6 +389,8 @@ For links to direct to the document and download to your local machine, the orig

     * [docquery](https://github.com/impira/docquery) like PrivateGPT but uses LayoutLM.

    +* [KhoJ](https://github.com/khoj-ai/khoj) but also access from emacs or Obsidian.
    +
     * [ChatPDF](https://www.chatpdf.com/) but h2oGPT is open-source and private and many more data types.

     * [Sharly](https://www.sharly.ai/) but h2oGPT is open-source and private and many more data types. Sharly and h2oGPT both allow sharing work through UserData shared collection.

reqs_optional/requirements_optional_langchain.txt (+2 -1)

    @@ -20,7 +20,8 @@ chromadb==0.3.25
     unstructured[local-inference]==0.7.4
     #pdf2image==1.16.3
     #pytesseract==0.3.10
    -pillow
    +pillow>=10.0.0
    +posthog>=3.0.1

     pdfminer.six==20221105
     urllib3

requirements.txt (+8)

    @@ -66,3 +66,11 @@ text-generation==0.6.0
     tiktoken==0.4.0
     # optional: for OpenAI endpoint or embeddings (requires key)
     openai==0.27.8
    +
    +requests>=2.31.0
    +urllib3>=1.26.16
    +filelock>=3.12.2
    +joblib>=1.3.1
    +tqdm>=4.65.0
    +tabulate>=0.9.0
    +packaging>=23.1

src/cli.py (+5 -3)

    @@ -15,7 +15,7 @@ def run_cli( # for local function:
             score_model=None, load_8bit=None, load_4bit=None, load_half=None,
             load_gptq=None, load_exllama=None, use_safetensors=None, revision=None,
             use_gpu_id=None, tokenizer_base_model=None,
    -        gpu_id=None, local_files_only=None, resume_download=None, use_auth_token=None,
    +        gpu_id=None, n_jobs=None, local_files_only=None, resume_download=None, use_auth_token=None,
             trust_remote_code=None, offload_folder=None, rope_scaling=None, max_seq_len=None, compile_model=None,
             # for some evaluate args
             stream_output=None, async_output=None, num_async=None,
    @@ -40,11 +40,13 @@ def run_cli( # for local function:
             raise_generate_gpu_exceptions=None, load_db_if_exists=None, use_llm_if_no_docs=None,
             my_db_state0=None, selection_docs_state0=None, dbs=None, langchain_modes=None, langchain_mode_paths=None,
             detect_user_path_changes_every_query=None,
    -        use_openai_embedding=None, use_openai_model=None, hf_embedding_model=None, cut_distance=None,
    +        use_openai_embedding=None, use_openai_model=None,
    +        hf_embedding_model=None, migrate_embedding_model=None,
    +        cut_distance=None,
             answer_with_sources=None,
             append_sources_to_answer=None,
             add_chat_history_to_context=None,
    -        db_type=None, n_jobs=None, first_para=None, text_limit=None, verbose=None, cli=None, reverse_docs=None,
    +        db_type=None, first_para=None, text_limit=None, verbose=None, cli=None, reverse_docs=None,
             use_cache=None,
             auto_reduce_chunks=None, max_chunks=None, model_lock=None, force_langchain_evaluate=None,
             model_state_none=None,

src/client_test.py (+2 -1)

    @@ -49,6 +49,7 @@
     from bs4 import BeautifulSoup # pip install beautifulsoup4

     from enums import DocumentSubset, LangChainAction
    +from tests.utils import get_inf_server

     debug = False

    @@ -58,7 +59,7 @@
     def get_client(serialize=True):
         from gradio_client import Client

    -    client = Client(os.getenv('HOST', "http://localhost:7860"), serialize=serialize)
    +    client = Client(get_inf_server(), serialize=serialize)
         if debug:
             print(client.view_api(all_endpoints=True))
         return client
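
The second hunk replaces the inline `os.getenv('HOST', "http://localhost:7860")` with `get_inf_server()` imported from `tests.utils`, whose body is not part of this diff. A minimal sketch of what such a helper could look like, assuming it only centralizes the same `HOST` lookup. The name `get_inf_server_sketch` is a hypothetical stand-in, not the real implementation.

```python
import os


def get_inf_server_sketch(default: str = "http://localhost:7860") -> str:
    """Return the server URL from the HOST variable, else a local default.

    Assumption: mirrors the expression removed from get_client(); the
    real tests.utils.get_inf_server may do more than this.
    """
    return os.getenv("HOST", default)


if __name__ == "__main__":
    print(get_inf_server_sketch())
```

Moving the lookup into one function lets the client and the test suite agree on the server address without repeating the default URL.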

src/eval.py (+6 -4)

    @@ -24,7 +24,7 @@ def run_eval( # for local function:
             score_model=None, load_8bit=None, load_4bit=None, load_half=None,
             load_gptq=None, load_exllama=None, use_safetensors=None, revision=None,
             use_gpu_id=None, tokenizer_base_model=None,
    -        gpu_id=None, local_files_only=None, resume_download=None, use_auth_token=None,
    +        gpu_id=None, n_jobs=None, local_files_only=None, resume_download=None, use_auth_token=None,
             trust_remote_code=None, offload_folder=None, rope_scaling=None, max_seq_len=None, compile_model=None,
             # for evaluate args beyond what's already above, or things that are always dynamic and locally created
             temperature=None,
    @@ -60,11 +60,13 @@ def run_eval( # for local function:
             raise_generate_gpu_exceptions=None, load_db_if_exists=None, use_llm_if_no_docs=None,
             my_db_state0=None, selection_docs_state0=None, dbs=None, langchain_modes=None, langchain_mode_paths=None,
             detect_user_path_changes_every_query=None,
    -        use_openai_embedding=None, use_openai_model=None, hf_embedding_model=None, cut_distance=None,
    +        use_openai_embedding=None, use_openai_model=None,
    +        hf_embedding_model=None, migrate_embedding_model=None,
    +        cut_distance=None,
             answer_with_sources=None,
             append_sources_to_answer=None,
             add_chat_history_to_context=None,
    -        db_type=None, n_jobs=None, first_para=None, text_limit=None, verbose=None, cli=None, reverse_docs=None,
    +        db_type=None, first_para=None, text_limit=None, verbose=None, cli=None, reverse_docs=None,
             use_cache=None,
             auto_reduce_chunks=None, max_chunks=None,
             model_lock=None, force_langchain_evaluate=None,
    @@ -121,7 +123,7 @@ def run_eval( # for local function:
         num_examples = len(examples)
         scoring_path = 'scoring'
         # if no permissions, assume may not want files, put into temp
    -    scoring_path = makedirs(scoring_path, tmp_ok=True)
    +    scoring_path = makedirs(scoring_path, tmp_ok=True, use_base=True)
         if eval_as_output:
             used_base_model = 'gpt35'
             used_lora_weights = ''
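
The last hunk adds `use_base=True` to the `makedirs` call, next to the existing `tmp_ok=True` and the comment about falling back to a temp location when permissions are missing. The helper's body is not in this diff, so the following is only a sketch under stated assumptions: that `use_base` places relative paths under the `H2OGPT_BASE_PATH` folder documented in the docs/FAQ.md hunk, and that `tmp_ok` falls back to a temporary directory on a permission error. `makedirs_sketch` is a hypothetical name.

```python
import os
import tempfile


def makedirs_sketch(path: str, tmp_ok: bool = False, use_base: bool = False) -> str:
    """Create `path` and return the directory that was actually created.

    Hypothetical behavior inferred from argument names only:
    - use_base: put a relative path under $H2OGPT_BASE_PATH when it is set.
    - tmp_ok: on PermissionError, fall back to a fresh temporary directory.
    """
    base = os.getenv("H2OGPT_BASE_PATH")
    if use_base and base and not os.path.isabs(path):
        path = os.path.join(base, path)
    try:
        os.makedirs(path, exist_ok=True)
        return path
    except PermissionError:
        if not tmp_ok:
            raise
        # No write permission: the caller accepted a temp location instead.
        return tempfile.mkdtemp(prefix=os.path.basename(path) + "_")


if __name__ == "__main__":
    print(makedirs_sketch("scoring", tmp_ok=True, use_base=True))
```

Returning the final path, as the call site `scoring_path = makedirs(...)` suggests, lets the caller keep working even when the directory ends up somewhere other than requested.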
