Skip to content

Finetune model#44

Open
ivanleomk wants to merge 3 commits into
mainfrom
finetune-model
Open

Finetune model#44
ivanleomk wants to merge 3 commits into
mainfrom
finetune-model

Conversation

@ivanleomk

@ivanleomk ivanleomk commented Feb 12, 2024

Copy link
Copy Markdown
Contributor
Ellipsis 🚀 This PR description was created by Ellipsis for commit 4533fef.

Summary:

This PR introduces a new model fine-tuning script and a performance graph generation script, along with model performance data.

Key points:

  • Added finetune.py for fine-tuning a model on the Quora dataset.
  • Added graph.py for generating performance graphs.
  • Included model_perf.json and model_perf_dense.json containing model performance data.

Generated with ❤️ by ellipsis.dev

print(
f"Iteration {iteration+1}: {curr_eval}, eval: ( {best_eval_perf}, {time_step} )"
)
if curr_eval > best_eval_perf:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early Stopping condition is here

@ivanleomk

Copy link
Copy Markdown
Contributor Author

Had to write some custom code for an early stopping condition since the sentence transformers lib doesn't seem to support training till we see no improvement in test loss


best_eval = (evaluator(model, output_path=None), 0)
print(f"Initial Eval: {best_eval}")
for iteration in range(6):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parameterize this

print(f"Previous Eval: {best_eval_perf}, new Eval : {curr_eval}")
best_eval = (curr_eval, iteration)

predictions, test_labels = generate_prediction_labels(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not generate it per epoch? then we have more data

model.fit(
train_objectives=[(train_dataloader, train_loss)],
evaluator=evaluator,
epochs=1,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why epochs=1 rather than for n_epochs

import json

PERCENTAGES = [0.7, 0.9]
DATASET_SIZE = 261317

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this 261317

except FileNotFoundError:
model_perf = {}

for item in finetune_model.map(PERCENTAGES):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of doing percentage just do

[1000, 2000, ...]

and jsut double it and plot on log

"recall": 0.8528957528957529,
"AUC": 0.8779285474909108
},
"0.7": {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
"accuracy": 0.891340875994175,
"precision": 0.7265252425587897,
"recall": 0.8528957528957529,
"AUC": 0.8779285474909108,
"model": ...
"epoch": ...
"data_size": ...
...
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and just make an array

Then you can just

pd.from_jsonlines()

@jxnl jxnl marked this pull request as ready for review February 12, 2024 17:45

@ellipsis-dev ellipsis-dev Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes requested. Reviewed entire PR up to commit 4533fef

Reviewed 454 lines of code across 4 files in 1 minute(s) and 26 second(s).

See details
  • Skipped files: 6 (please contact us to request support for these files)
  • Confidence threshold: 50%
  • Drafted 1 additional comments.
  • Workflow ID: wflow_fiizYlIVVBMUV0za
View 1 draft comments

These comments were drafted by Ellipsis, but were filtered out of the final review. They're included here so you can see our internal thought process and help you configure your ellipsis.yaml.

Drafted 1 additional comments

Comment at applications/finetune-quora-embeddings/finetune.py:124

Consider optimizing this function by using matrix operations to calculate cosine similarity for all pairs at once, instead of using a for loop. This could significantly improve performance for large datasets.


Something look wrong? You can customize Ellipsis by editing the ellipsis.yaml for this repository.

Generated with ❤️ by ellipsis.dev

id2_exists = id2 in embedding_ids
print(f"id1 exists: {id1_exists}, id2 exists: {id2_exists}")
print(f"Encountered an exception: {e}")
raise e

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider improving the exception handling in this function. Instead of simply printing and re-raising the exception, it would be better to log the exception and continue processing the remaining data. This would prevent the entire process from stopping due to a single error.

@stub.function(
image=image, gpu=GPU_CONFIG, volumes={DATASET_DIR: DATASET_VOLUME}, timeout=86400
)
def finetune_model(dataset_pct: float):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider breaking down this function into smaller, more manageable functions. This would improve readability and maintainability of the code.

@ellipsis-dev ellipsis-dev Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes requested. Reviewed entire PR up to commit 4533fef

Reviewed 454 lines of code across 4 files in 1 minute(s) and 42 second(s).

See details
  • Skipped files: 6 (please contact us to request support for these files)
  • Confidence threshold: 50%
  • Drafted 1 additional comments.
  • Workflow ID: wflow_0IVKtfFqnWVYjnCD
View 1 draft comments

These comments were drafted by Ellipsis, but were filtered out of the final review. They're included here so you can see our internal thought process and help you configure your ellipsis.yaml.

Drafted 1 additional comments

Comment at applications/finetune-quora-embeddings/finetune.py:220

Consider adding more detailed logging to this function. For example, you could log the progress of the training process, such as the current epoch and the loss at each step. This would make it easier to monitor the training process and diagnose any issues.


Something look wrong? You can customize Ellipsis by editing the ellipsis.yaml for this repository.

Generated with ❤️ by ellipsis.dev

id2_exists = id2 in embedding_ids
print(f"id1 exists: {id1_exists}, id2 exists: {id2_exists}")
print(f"Encountered an exception: {e}")
raise e

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider improving the exception handling here. Instead of just printing the exception and re-raising it, it would be better to log the exception and continue processing the remaining data. This would prevent the entire process from stopping due to a single error.

"""
import numpy as np

texts = []

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider optimizing this function by creating the 'texts' and 'ids' lists in a single pass over the 'text_to_embed' generator. This would be more efficient than the current implementation, which iterates over the generator twice.

start_index = idx * batch_size
end_index = min((idx + 1) * batch_size, len(embeddings_first_sentence))

batch_embeddings_sentence_1 = np.stack(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider optimizing this function by creating the 'batch_embeddings_sentence_1' and 'batch_embeddings_sentence_2' arrays once outside the loop and updating them in each iteration. This would reduce the memory footprint and potentially improve performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants