Commit c1e1f2a

Fix distributed processing and various fixes (#89)
* simplified process_distributed.sh
* removed useless parameter --mmore-folder
* fixed distributed processing
* by default, processing should still not be distributed
* Fixed documentation of distributed processing
* by default, processing should not be using the fast mode
* removed the acknowledgements line
* small logic fixes
* reformatting with black and deleted leftover TODOs
* sorting imports
* changed docs with pyment
1 parent 007a41e commit c1e1f2a

25 files changed

Lines changed: 166 additions & 157 deletions

README.md

Lines changed: 0 additions & 4 deletions
@@ -129,7 +129,3 @@ Don't hesitate to star the project :star: if you find it interesting! (you would
129129
## License
130130

131131
This project is licensed under the Apache 2.0 License, see the [LICENSE :mortar_board:](LICENSE) file for details.
132-
133-
## Acknowledgements
134-
135-
This project is part of the [**OpenMeditron**](https://huggingface.co/OpenMeditron) initiative developed in [LiGHT](https://www.light-laboratory.org/) lab at EPFL/Yale/CMU Africa in collaboration with the [**SwissAI**](https://www.swiss-ai.org/) initiative. Thank you Scott Mahoney, Mary-Anne Hartley

docs/distributed_processing.md

Lines changed: 7 additions & 5 deletions
@@ -29,7 +29,7 @@ Other important configuration options:
2929
- `output_folder`: Where processed results will be stored
3030
- `use_fast_processors`: Set to `true` for faster processing (may reduce accuracy)
3131

32-
### 2. Install Dependencies on All Nodes
32+
### 2. Install Dependencies on all Nodes
3333

3434
On each node, run:
3535

@@ -51,7 +51,7 @@ pip install -e .
5151
#### Step 1: Start the Master Node (Rank 0)
5252

5353
```bash
54-
bash scripts/process_distributed.sh --mmore-folder /path/to/mmore --config-path /path/to/config.yaml --rank 0
54+
bash scripts/process_distributed.sh --config-file /path/to/config.yaml --rank 0
5555
```
5656

5757
The master node will:
@@ -64,14 +64,16 @@ The master node will:
6464
On each additional node, run:
6565

6666
```bash
67-
bash scripts/process_distributed.sh --mmore-folder /path/to/mmore --config-path /path/to/config.yaml --rank 1
67+
bash scripts/process_distributed.sh --config-path /path/to/config.yaml --rank 1
6868
```
6969

70-
Replace `rank 1` with a unique rank number for each node (1, 2, 3, etc.).
70+
Replace `rank 1` with a unique rank number for each node (1, 2, 3, etc.). The node should be ready in a matter of 5 seconds.
7171

7272
#### Step 3: Begin Processing
7373

74-
Once all nodes are running, return to the master node and type `go` when prompted to start the processing.
74+
Once all nodes are running, return to the master node and type `go`. The master node proceeds to crawl the input folder, split the workload among connected nodes and make them start their work.
75+
76+
The dask server will be automatically shut down by the master node at the end of the processing. This will also shut down the dask workers on all the connected nodes.
7577

7678
## Monitoring Progress
7779
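Review note: the worker step documented in this file relies on every node being able to read the scheduler file named in the config. Below is a minimal sketch, assuming that file lives on a filesystem shared by all nodes, of a guard a worker node could run before starting its Dask worker. The helper name `wait_for_file` and its default timeout are hypothetical and are not part of `process_distributed.sh`.

```bash
#!/bin/bash

# Hypothetical helper: block until a non-empty file exists at PATH,
# or give up after TIMEOUT seconds (default 30).
wait_for_file() {
    local path="$1"
    local timeout="${2:-30}"
    local waited=0

    while [ ! -s "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out waiting for $path" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A worker node could call `wait_for_file "$scheduler_file" 60` and only start `dask worker` if it returns 0, which turns a silent hang into an explicit error message.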

examples/process/config.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ dispatcher_config:
55
distributed: false
66
dashboard_backend_url: null
77
extract_images: true
8-
#scheduler_file: scheduler-file.json #put absolute path!
8+
scheduler_file: /mmore/scheduler-file.json #put absolute path!
99
process_batch_sizes:
1010
- URLProcessor: 40
1111
- DOCXProcessor: 100
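
Review note: `process_distributed.sh` reads `distributed` and `scheduler_file` from this config with `grep` and `awk`. A plain `grep 'scheduler_file:'` also matches a commented-out line such as the one this hunk removes. The sketch below is one possible stricter reader, assuming simple `key: value` lines with no quoting; the function name `read_yaml_key` is hypothetical and this is not a general YAML parser.

```bash
#!/bin/bash

# Hypothetical helper: print the value of the first uncommented "KEY: value"
# line in FILE, dropping any trailing inline comment.
read_yaml_key() {
    local file="$1"
    local key="$2"
    # Lines starting with '#' do not match, because only whitespace may
    # precede the key. The value stops at the first space or '#'.
    sed -n "s/^[[:space:]]*${key}:[[:space:]]*\([^#[:space:]]*\).*/\1/p" "$file" | head -n 1
}
```

With the config in this hunk, `read_yaml_key config.yaml scheduler_file` would print `/mmore/scheduler-file.json` without the trailing comment.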

examples/rag/evaluation/rag_evaluator_example.py

Lines changed: 3 additions & 3 deletions
@@ -1,9 +1,9 @@
11
import argparse
2-
from ....src.mmore.rag.evaluator import EvalConfig, RAGEvaluator
3-
from ....src.mmore.rag.llm import LLMConfig, LLM
4-
from ....src.mmore.index.indexer import DBConfig
52

63
from dotenv import load_dotenv
4+
5+
from ....src.mmore.rag.evaluator import RAGEvaluator
6+
77
load_dotenv()
88

99
MOCK_EVALUATOR_CONFIG = './examples/rag/evaluation/rag_eval_example_config.yaml'

scripts/process_distributed.sh

Lines changed: 14 additions & 38 deletions
@@ -2,28 +2,22 @@
22

33
# Default values
44
CONFIG_PATH=""
5-
MMORE_FOLDER=$(pwd)
65
RANK=""
76

87
# Helper function to show usage
98
usage() {
10-
echo "Usage: $0 --mmore-folder <path> --config-path <path> --rank <value>"
9+
echo "Usage: $0 --mmore-folder <path> --config-file <path> --rank <value>"
1110
echo ""
1211
echo "Required arguments:"
13-
echo " --mmore-folder Absolute path to the mmore folder."
14-
echo " --config_path Absolute path to the config.yaml file."
12+
echo " --config-file Absolute path to the config.yaml file."
1513
echo " --rank Node rank."
1614
exit 1
1715
}
1816

1917
# Parse command-line arguments
2018
while [[ $# -gt 0 ]]; do
2119
case $1 in
22-
--mmore-folder)
23-
MMORE_FOLDER="$2"
24-
shift 2
25-
;;
26-
--config-path)
20+
--config-file)
2721
CONFIG_PATH="$2"
2822
shift 2
2923
;;
@@ -40,31 +34,13 @@ done
4034

4135

4236
# Check required arguments
43-
if [[ -z "$MMORE_FOLDER" || -z "$CONFIG_PATH" || -z "$RANK" ]]; then
37+
if [[ -z "$CONFIG_PATH" || -z "$RANK" ]]; then
4438
echo "Error: Missing required arguments."
4539
usage
4640
fi
4741

4842
# Update and install dependencies
4943
echo "Updating system and installing dependencies..."
50-
sudo apt-get update
51-
sudo apt-get install -y --no-install-recommends \
52-
nano curl ffmpeg libsm6 libxext6 chromium-browser libnss3 libgconf-2-4 libxi6 libxrandr2 \
53-
libxcomposite1 libxcursor1 libxdamage1 libxfixes3 libxrender1 libasound2 libatk1.0-0 \
54-
libgtk-3-0 libreoffice libjpeg-dev
55-
56-
# Install Rye
57-
echo "Setting up UV"
58-
curl -LsSf https://astral.sh/uv/install.sh | sh
59-
60-
# Navigate to the project directory
61-
echo "Navigating to mmore folder: $MMORE_FOLDER"
62-
cd "$MMORE_FOLDER" || { echo "Directory $MMORE_FOLDER does not exist! Exiting."; exit 1; }
63-
export PATH="$HOME/.local/bin:$PATH"
64-
65-
# Sync Rye to install dependencies
66-
echo "Syncing UV (installing dependencies)"
67-
pip install -e '.[process]'
6844

6945
# Extract the distributed configuration from the YAML file
7046
distributed=$(grep -A3 'dispatcher_config:' "$CONFIG_PATH" | grep 'distributed:' | awk '{print $2}')
@@ -73,47 +49,47 @@ scheduler_file=$(grep 'scheduler_file:' "$CONFIG_PATH" | awk '{print $2}')
7349

7450
# Configure environment variables
7551
echo "Setting up environment variables"
76-
export PATH="/.venv/bin:$PATH"
7752
export DASK_DISTRIBUTED__WORKER__DAEMON=False
7853

7954
# Dask part of the script
80-
source .venv/bin/activate
8155

8256
if [ "$distributed" = "true" ]; then
83-
pip list | grep dask
8457
echo "Distributed mode enabled"
8558
# Start the Dask scheduler if the current node is the MASTER (rank 0)
8659
if [ "$RANK" -eq 0 ]; then
8760
echo "Starting the scheduler because it is the MASTER node (rank 0)"
88-
dask -h
89-
dask scheduler --scheduler-file "$scheduler_file" &
61+
dask scheduler --scheduler-file "$scheduler_file" &> dask_scheduler.log &
62+
SCHEDULER_PID=$!
9063
fi
9164

9265
# Start the Dask worker
9366
echo "Starting the worker of every node"
94-
dask worker --scheduler-file "$scheduler_file" &
67+
dask worker --scheduler-file "$scheduler_file" &> "dask_scheduler_worker_$RANK.log" &
9568
fi
9669

9770

9871
# Run the end-to-end test if the current node is the MASTER (rank 0)
9972
if [ "$RANK" -eq 0 ]; then
10073
echo "Running the end-to-end test in the MASTER node (rank 0)"
101-
echo "Command to execute: python \"$MMORE_FOLDER/src/mmore/run_process.py\" --config_file \"$CONFIG_PATH\""
74+
echo "Command to execute: python -m mmore process --config-file \"$CONFIG_PATH\""
10275
echo "Should maybe exit here and wait until all the workers are ready!"
10376
echo "Type 'go' to execute the command, or type 'exit' to stop and run it manually later."
10477

10578
# waiting for the user to type 'go' or 'exit'
10679
while true; do
10780
read -r user_input
10881
if [ "$user_input" = "go" ]; then
109-
python "$MMORE_FOLDER/src/mmore/run_process.py" --config_file "$CONFIG_PATH"
82+
echo "Starting processing"
83+
python -m mmore process --config-file "$CONFIG_PATH"
11084
break
11185
elif [ "$user_input" = "exit" ]; then
11286
echo "Exiting without running the command. You can run it manually later:"
113-
echo "python \"$MMORE_FOLDER/src/mmore/run_process.py\" --config_file \"$CONFIG_PATH\""
87+
echo "python -m mmore process --config-file \"$CONFIG_PATH\""
11488
exit 0
11589
else
11690
echo "Invalid input. Type 'go' to run the command or 'exit' to stop."
11791
fi
11892
done
119-
fi
93+
94+
kill -9 $SCHEDULER_PID
95+
fi
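
Review note: as far as this diff shows, `SCHEDULER_PID` is only assigned when `distributed` is `true`, while `kill -9 $SCHEDULER_PID` runs on rank 0 in both modes. With an empty variable, `kill -9` is called without a PID and prints a usage error. The sketch below is one way to guard the shutdown and to try `SIGTERM` before anything harsher; the function name `stop_scheduler` is hypothetical and not part of the script.

```bash
#!/bin/bash

# Hypothetical helper: stop a background process only if a PID was recorded
# and the process is still alive.
stop_scheduler() {
    local pid="$1"

    if [ -z "$pid" ]; then
        echo "no scheduler to stop"
        return 0
    fi

    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid"                        # SIGTERM lets the process clean up
        wait "$pid" 2>/dev/null || true    # reap it if it is our child
        echo "scheduler stopped"
    else
        echo "scheduler already exited"
    fi
}
```

In the script this would replace the bare `kill -9` with `stop_scheduler "$SCHEDULER_PID"`, which is a no-op in non-distributed runs.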

scripts/runai/entrypoint.sh

Lines changed: 3 additions & 4 deletions
@@ -1,6 +1,6 @@
11
#!/bin/bash
22

3-
SCRIPT="run_process.py"
3+
SCRIPT="process"
44
while getopts e:s:c:p: flag
55
do
66
case "${flag}" in
@@ -12,7 +12,7 @@ done
1212

1313
# Going to repo dir
1414
if [ -z "$REPO_PATH" ]; then
15-
REPO_PATH="/mnt/mlo/scratch/homes/$(whoami)/mmore"
15+
REPO_PATH="/mmore" # change to the actual repo path
1616
fi
1717
cd $REPO_PATH
1818

@@ -21,8 +21,7 @@ set -o allexport
2121
source .env
2222
set +o allexport
2323

24-
# TODO: Update when final version is released (Install libraries)
25-
pip install -e '.[rag]'
24+
pip install -e .
2625

2726
echo "Start time: $(date)"
2827

scripts/setup.sh

Lines changed: 1 addition & 0 deletions
@@ -12,4 +12,5 @@ sudo apt install -y ffmpeg libsm6 libxext6 chromium-browser libnss3 libgconf-2-4
1212
# Install UV
1313
curl -LsSf https://astral.sh/uv/install.sh | sh
1414
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrcuv sync
15+
uv venv
1516
source .venv/bin/activate

src/mmore/__main__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
11
from .cli import main
22

33
if __name__ == "__main__":
4-
main()
4+
main()
