Why ingest and then chat are using only 6 cores? #610
-
I was thinking about upgrading the CPU to a 16-core / 32-thread model, but I've noticed that only 6 cores (out of 12) are used in my tests.
Replies: 3 comments 3 replies
-
Maybe this line has to be multiplied by 2?
On Intel that is the case; you need to work out how much to increase it by: multiply the number of CPU cores by the number of threads supported by each core. I just tried it and it looks faster to me 😃
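For what it's worth, a minimal sketch of the arithmetic being suggested (assuming psutil is installed; the exact line being patched isn't quoted here):

```python
import psutil

# Hypothetical illustration of the "multiply by 2" suggestion above.
physical_cores = psutil.cpu_count(logical=False)      # e.g. 6 on a 6-core/12-thread Intel CPU
logical_threads = psutil.cpu_count(logical=True)      # e.g. 12 with Hyper-Threading enabled
threads_per_core = logical_threads // physical_cores  # usually 2 on Intel with HT

# The proposed increase: core count multiplied by the threads each core supports.
n_threads = physical_cores * threads_per_core  # equals the logical thread count
print(f"{physical_cores} cores x {threads_per_core} threads/core = {n_threads} threads")
```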
-
There is a fundamental issue here. That said, it will get better as we improve the overlap of operations and enable effective GPU usage. Some tools need specific arguments (often very lightly documented) to make use of more CPU and/or GPU resources. Those changes need to be made carefully to avoid breaking things for the bulk of users who don't have the fastest CPU or the most expensive GPU.
-
@johnbrisbin I agree Chroma is single-threaded, but at least for loading the documents maybe we can try a somewhat similar solution with threading. To make your Python code use more cores, you need to design your program so that the task is divided into independent subtasks. Then you can use multi-threading or multi-processing to execute these subtasks in parallel on different cores.

Keep in mind, however, that multi-threading in Python can be tricky because of the Global Interpreter Lock (GIL), which allows only one thread to execute Python bytecode at a time. So for CPU-bound tasks, multi-threading may not provide any speedup and could even slow down your program. Multi-processing does not have this issue because each process has its own Python interpreter and memory space, but communication between processes is slower than between threads, and starting a new process is slower than starting a new thread. Here is a threaded version of load_documents:

```python
import concurrent.futures
import glob
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List

from tqdm import tqdm

# Document, LOADER_MAPPING and load_single_document come from the existing ingest.py


def load_documents(source_dir: str, ignored_files: List[str] = []) -> List[Document]:
    """
    Loads all documents from the source documents directory, ignoring specified files
    """
    all_files = []
    for ext in LOADER_MAPPING:
        all_files.extend(
            glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True)
        )
    filtered_files = [file_path for file_path in all_files if file_path not in ignored_files]

    with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
        results = []
        with tqdm(total=len(filtered_files), desc='Loading new documents', ncols=80) as pbar:
            future_to_file = {executor.submit(load_single_document, file_path): file_path
                              for file_path in filtered_files}
            for future in concurrent.futures.as_completed(future_to_file):
                file_path = future_to_file[future]
                try:
                    data = future.result()
                except Exception as exc:
                    print(f'{file_path} generated an exception: {exc}')
                else:
                    results.extend(data)
                pbar.update()
    return results
```
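If the per-document parsing turns out to be CPU-bound and the GIL limits the threaded version, the same pattern can be tried with processes instead of threads. This is only a sketch under the same assumptions as above (Document, LOADER_MAPPING and load_single_document from ingest.py); it additionally requires load_single_document and its return values to be picklable, and load_documents_mp is a hypothetical name:

```python
import glob
import os
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import List

from tqdm import tqdm


def load_documents_mp(source_dir: str, ignored_files: List[str] = []) -> List[Document]:
    """Process-based variant: each worker has its own interpreter, so the GIL is not shared."""
    all_files = []
    for ext in LOADER_MAPPING:
        all_files.extend(glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True))
    filtered_files = [f for f in all_files if f not in ignored_files]

    results = []
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        future_to_file = {executor.submit(load_single_document, fp): fp for fp in filtered_files}
        with tqdm(total=len(filtered_files), desc='Loading new documents', ncols=80) as pbar:
            for future in as_completed(future_to_file):
                try:
                    results.extend(future.result())
                except Exception as exc:
                    print(f'{future_to_file[future]} generated an exception: {exc}')
                pbar.update()
    return results
```

Process start-up and pickling overhead can outweigh the gain for many small files, so it's worth measuring both variants.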
@sime2408,
For privateGPT.py you can set the thread count as high as you like by passing this parameter to LlamaCpp: add n_threads=psutil.cpu_count(logical=False). The False value gets you the number of physical cores; a True value gets the number of virtual threads. This will use up all the threads and push CPU usage to 100% (on Windows). For all the redlining of the CPU, I am unsure if it is really much faster. I compared the Llama print times after the query and saw little difference. Maybe it doesn't show up there.
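In privateGPT.py that would look roughly like this; a sketch only, since the other constructor arguments are whatever the script already passes (model_path and model_n_ctx below stand in for the existing variables):

```python
import psutil
from langchain.llms import LlamaCpp

# Sketch: hand llama.cpp an explicit thread count.
# logical=False -> physical cores, logical=True -> hardware (virtual) threads.
llm = LlamaCpp(
    model_path=model_path,  # placeholder for the variable privateGPT.py already uses
    n_ctx=model_n_ctx,      # placeholder for the variable privateGPT.py already uses
    n_threads=psutil.cpu_count(logical=False),
    verbose=False,
)
```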
On ingestion, it can get a lot better. Just enabling the GPU in the embedding LLM makes it about 7 times faster. Overlapping and threading the various actions can bring that num…
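For reference, a hedged sketch of what "enabling the GPU in the embedding LLM" could look like with the HuggingFaceEmbeddings wrapper used by ingest.py (assumes a CUDA-capable PyTorch install; embeddings_model_name is a placeholder for whatever the script already reads from its configuration):

```python
from langchain.embeddings import HuggingFaceEmbeddings

# Sketch: run the sentence-transformers embedding model on the GPU
# instead of the default CPU device.
embeddings = HuggingFaceEmbeddings(
    model_name=embeddings_model_name,  # placeholder for the existing variable
    model_kwargs={"device": "cuda"},   # "cpu" is the default
)
```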