Finetune model by ivanleomk · Pull Request #44 · 567-labs/fastllm

ivanleomk · 2024-02-12T15:40:23Z

	🚀 This PR description was created by Ellipsis for commit `4533fef`.

Summary:

This PR introduces a new model fine-tuning script and a performance graph generation script, along with model performance data.

Key points:

Added finetune.py for fine-tuning a model on the Quora dataset.
Added graph.py for generating performance graphs.
Included model_perf.json and model_perf_dense.json containing model performance data.

Generated with ❤️ by ellipsis.dev

… results that used just the base model.

ivanleomk · 2024-02-12T15:41:04Z

+        print(
+            f"Iteration {iteration+1}: {curr_eval}, eval: ( {best_eval_perf}, {time_step} )"
+        )
+        if curr_eval > best_eval_perf:


Early Stopping condition is here

ivanleomk · 2024-02-12T15:41:35Z

Had to write some custom code for an early stopping condition since the sentence transformers lib doesn't seem to support training till we see no improvement in test loss

jxnl · 2024-02-12T15:43:15Z

+
+    best_eval = (evaluator(model, output_path=None), 0)
+    print(f"Initial Eval: {best_eval}")
+    for iteration in range(6):


parameterize this

jxnl · 2024-02-12T15:43:35Z

+            print(f"Previous Eval: {best_eval_perf}, new Eval : {curr_eval}")
+            best_eval = (curr_eval, iteration)
+
+    predictions, test_labels = generate_prediction_labels(


why not generate it per epoch? then we have more data

jxnl · 2024-02-12T15:43:52Z

+        model.fit(
+            train_objectives=[(train_dataloader, train_loss)],
+            evaluator=evaluator,
+            epochs=1,


why epochs=1 rather than for n_epochs

jxnl · 2024-02-12T15:44:27Z

+    import json
+
+    PERCENTAGES = [0.7, 0.9]
+    DATASET_SIZE = 261317


why is this 261317

jxnl · 2024-02-12T15:44:50Z

+    except FileNotFoundError:
+        model_perf = {}
+
+    for item in finetune_model.map(PERCENTAGES):


instead of doing percentage just do

[1000, 2000, ...]

and jsut double it and plot on log

jxnl · 2024-02-12T15:46:24Z

+        "recall": 0.8528957528957529,
+        "AUC": 0.8779285474909108
+    },
+    "0.7": {


{
"accuracy": 0.891340875994175,
"precision": 0.7265252425587897,
"recall": 0.8528957528957529,
"AUC": 0.8779285474909108,
"model": ...
"epoch": ...
"data_size": ...
...
}

and just make an array

Then you can just

pd.from_jsonlines()

ellipsis-dev

Changes requested. Reviewed entire PR up to commit 4533fef

Reviewed 454 lines of code across 4 files in 1 minute(s) and 26 second(s).

See details

Skipped files: 6 (please contact us to request support for these files)
Confidence threshold: 50%
Drafted 1 additional comments.
Workflow ID: wflow_fiizYlIVVBMUV0za

View 1 draft comments

These comments were drafted by Ellipsis, but were filtered out of the final review. They're included here so you can see our internal thought process and help you configure your ellipsis.yaml.

Drafted 1 additional comments

Comment at `applications/finetune-quora-embeddings/finetune.py:124`

Consider optimizing this function by using matrix operations to calculate cosine similarity for all pairs at once, instead of using a for loop. This could significantly improve performance for large datasets.

Something look wrong? You can customize Ellipsis by editing the ellipsis.yaml for this repository.

Generated with ❤️ by ellipsis.dev

ellipsis-dev · 2024-02-12T17:47:21Z

+            id2_exists = id2 in embedding_ids
+            print(f"id1 exists: {id1_exists}, id2 exists: {id2_exists}")
+            print(f"Encountered an exception: {e}")
+            raise e


Consider improving the exception handling in this function. Instead of simply printing and re-raising the exception, it would be better to log the exception and continue processing the remaining data. This would prevent the entire process from stopping due to a single error.

ellipsis-dev · 2024-02-12T17:47:21Z

+@stub.function(
+    image=image, gpu=GPU_CONFIG, volumes={DATASET_DIR: DATASET_VOLUME}, timeout=86400
+)
+def finetune_model(dataset_pct: float):


Consider breaking down this function into smaller, more manageable functions. This would improve readability and maintainability of the code.

ellipsis-dev

Changes requested. Reviewed entire PR up to commit 4533fef

Reviewed 454 lines of code across 4 files in 1 minute(s) and 42 second(s).

See details

Skipped files: 6 (please contact us to request support for these files)
Confidence threshold: 50%
Drafted 1 additional comments.
Workflow ID: wflow_0IVKtfFqnWVYjnCD

View 1 draft comments

These comments were drafted by Ellipsis, but were filtered out of the final review. They're included here so you can see our internal thought process and help you configure your ellipsis.yaml.

Drafted 1 additional comments

Comment at `applications/finetune-quora-embeddings/finetune.py:220`

Consider adding more detailed logging to this function. For example, you could log the progress of the training process, such as the current epoch and the loss at each step. This would make it easier to monitor the training process and diagnose any issues.

Something look wrong? You can customize Ellipsis by editing the ellipsis.yaml for this repository.

Generated with ❤️ by ellipsis.dev

ellipsis-dev · 2024-02-12T17:47:53Z

+            id2_exists = id2 in embedding_ids
+            print(f"id1 exists: {id1_exists}, id2 exists: {id2_exists}")
+            print(f"Encountered an exception: {e}")
+            raise e


Consider improving the exception handling here. Instead of just printing the exception and re-raising it, it would be better to log the exception and continue processing the remaining data. This would prevent the entire process from stopping due to a single error.

ellipsis-dev · 2024-02-12T17:47:53Z

+    """
+    import numpy as np
+
+    texts = []


Consider optimizing this function by creating the 'texts' and 'ids' lists in a single pass over the 'text_to_embed' generator. This would be more efficient than the current implementation, which iterates over the generator twice.

ellipsis-dev · 2024-02-12T17:47:53Z

+        start_index = idx * batch_size
+        end_index = min((idx + 1) * batch_size, len(embeddings_first_sentence))
+
+        batch_embeddings_sentence_1 = np.stack(


Consider optimizing this function by creating the 'batch_embeddings_sentence_1' and 'batch_embeddings_sentence_2' arrays once outside the loop and updating them in each iteration. This would reduce the memory footprint and potentially improve performance.

ivanleomk added 2 commits February 11, 2024 00:47

Refactored Justin's script for fine-tuning and generated some initial…

147f211

… results that used just the base model.

Current Status

4533fef

ivanleomk commented Feb 12, 2024

View reviewed changes

jxnl reviewed Feb 12, 2024

View reviewed changes

jxnl added the ellipsis:review label Feb 12, 2024

jxnl marked this pull request as ready for review February 12, 2024 17:45

ellipsis-dev Bot reviewed Feb 12, 2024

View reviewed changes

Modified finetune.py and updated the generted benchmarks with the values

3e60115

ellipsis-dev Bot removed the ellipsis:review label May 3, 2024

Conversation

ivanleomk commented Feb 12, 2024 • edited by ellipsis-dev Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ivanleomk commented Feb 12, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

Drafted 1 additional comments

Comment at applications/finetune-quora-embeddings/finetune.py:124

Uh oh!

ellipsis-dev Bot Feb 12, 2024

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev Bot Feb 12, 2024

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

Drafted 1 additional comments

Comment at applications/finetune-quora-embeddings/finetune.py:220

Uh oh!

ellipsis-dev Bot Feb 12, 2024

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev Bot Feb 12, 2024

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev Bot Feb 12, 2024

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ivanleomk commented Feb 12, 2024 •

edited by ellipsis-dev Bot

Loading

Comment at `applications/finetune-quora-embeddings/finetune.py:124`

Comment at `applications/finetune-quora-embeddings/finetune.py:220`