Why ingest and then chat are using only 6 cores? #610
-
I was thinking about upgrading the CPU to a 16-core / 32-thread model, but I've noticed that only 6 cores (out of 12) are used in my tests.
Replies: 3 comments 3 replies
-
Maybe this line has to be multiplied by 2?
On Intel that is the case; you need to work out how much to increase it by: multiply the number of CPU cores by the number of threads supported by each core. I just tried it and it looks faster to me 😃
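For what it's worth, a minimal sketch of the arithmetic being suggested (assuming psutil is installed; the exact line being patched isn't quoted here):

```python
import psutil

# Hypothetical illustration of the "multiply by 2" suggestion above.
physical_cores = psutil.cpu_count(logical=False)      # e.g. 6 on a 6-core/12-thread Intel CPU
logical_threads = psutil.cpu_count(logical=True)      # e.g. 12 with Hyper-Threading enabled
threads_per_core = logical_threads // physical_cores  # usually 2 on Intel with HT

# The proposed increase: core count multiplied by the threads each core supports.
n_threads = physical_cores * threads_per_core  # equals the logical thread count
print(f"{physical_cores} cores x {threads_per_core} threads/core = {n_threads} threads")
```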
-
There is a fundamental issue here. That said, it will get better as we improve the overlap of operations and enable effective GPU usage. Some tools need specific arguments (often very lightly documented) to make use of more CPU and/or GPU resources. Those changes need to be made carefully to avoid breaking things for the bulk of users who don't have the fastest CPU or the most expensive GPU.
-
@johnbrisbin I agree Chroma is single-threaded, but at least for loading the documents maybe we can try a somewhat similar solution with threading. To make your Python code use more cores, you need to design your program so that the task is divided into independent subtasks. Then you can use multi-threading or multi-processing to execute these subtasks in parallel on different cores.

Keep in mind, however, that multi-threading in Python can be tricky because of the Global Interpreter Lock (GIL), which allows only one thread to execute Python bytecode at a time. So for CPU-bound tasks, multi-threading may not provide any speedup and could even slow down your program. Multi-processing does not have this issue because each process has its own Python interpreter and memory space, but communication between processes is slower than between threads, and starting a new process is slower than starting a new thread. Here is a threaded version of load_documents:

```python
import concurrent.futures
import glob
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List

from tqdm import tqdm

# Document, LOADER_MAPPING and load_single_document come from the existing ingest.py


def load_documents(source_dir: str, ignored_files: List[str] = []) -> List[Document]:
    """
    Loads all documents from the source documents directory, ignoring specified files
    """
    all_files = []
    for ext in LOADER_MAPPING:
        all_files.extend(
            glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True)
        )
    filtered_files = [file_path for file_path in all_files if file_path not in ignored_files]

    with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
        results = []
        with tqdm(total=len(filtered_files), desc='Loading new documents', ncols=80) as pbar:
            future_to_file = {executor.submit(load_single_document, file_path): file_path
                              for file_path in filtered_files}
            for future in concurrent.futures.as_completed(future_to_file):
                file_path = future_to_file[future]
                try:
                    data = future.result()
                except Exception as exc:
                    print(f'{file_path} generated an exception: {exc}')
                else:
                    results.extend(data)
                pbar.update()
    return results
```
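If the per-document parsing turns out to be CPU-bound and the GIL limits the threaded version, the same pattern can be tried with processes instead of threads. This is only a sketch under the same assumptions as above (Document, LOADER_MAPPING and load_single_document from ingest.py); it additionally requires load_single_document and its return values to be picklable, and load_documents_mp is a hypothetical name:

```python
import glob
import os
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import List

from tqdm import tqdm


def load_documents_mp(source_dir: str, ignored_files: List[str] = []) -> List[Document]:
    """Process-based variant: each worker has its own interpreter, so the GIL is not shared."""
    all_files = []
    for ext in LOADER_MAPPING:
        all_files.extend(glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True))
    filtered_files = [f for f in all_files if f not in ignored_files]

    results = []
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        future_to_file = {executor.submit(load_single_document, fp): fp for fp in filtered_files}
        with tqdm(total=len(filtered_files), desc='Loading new documents', ncols=80) as pbar:
            for future in as_completed(future_to_file):
                try:
                    results.extend(future.result())
                except Exception as exc:
                    print(f'{future_to_file[future]} generated an exception: {exc}')
                pbar.update()
    return results
```

Process start-up and pickling overhead can outweigh the gain for many small files, so it's worth measuring both variants.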
@sime2408,
For privateGPT.py you can set the thread count as high as you like by passing this parameter to LlamaCpp: add n_threads=psutil.cpu_count(logical=False). The False value gets you the number of physical cores; a True value gets the number of virtual threads. This will use up all the threads and push CPU usage to 100% (on Windows). For all the redlining of the CPU, I am unsure if it is really much faster. I compared the Llama print times after the query and saw little difference. Maybe it doesn't show up there.
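In privateGPT.py that would look roughly like this; a sketch only, since the other constructor arguments are whatever the script already passes (model_path and model_n_ctx below stand in for the existing variables):

```python
import psutil
from langchain.llms import LlamaCpp

# Sketch: hand llama.cpp an explicit thread count.
# logical=False -> physical cores, logical=True -> hardware (virtual) threads.
llm = LlamaCpp(
    model_path=model_path,  # placeholder for the variable privateGPT.py already uses
    n_ctx=model_n_ctx,      # placeholder for the variable privateGPT.py already uses
    n_threads=psutil.cpu_count(logical=False),
    verbose=False,
)
```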
On ingestion, it can get a lot better. Just enabling the GPU in the embedding LLM makes it about 7 times faster. Overlapping and threading the various actions can bring that num…
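For reference, a hedged sketch of what "enabling the GPU in the embedding LLM" could look like with the HuggingFaceEmbeddings wrapper used by ingest.py (assumes a CUDA-capable PyTorch install; embeddings_model_name is a placeholder for whatever the script already reads from its configuration):

```python
from langchain.embeddings import HuggingFaceEmbeddings

# Sketch: run the sentence-transformers embedding model on the GPU
# instead of the default CPU device.
embeddings = HuggingFaceEmbeddings(
    model_name=embeddings_model_name,  # placeholder for the existing variable
    model_kwargs={"device": "cuda"},   # "cpu" is the default
)
```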