feat: parallel evaluations #238
base: dev
Conversation
Much cleaner code than what we had before. I'm not sure the parallelism will actually happen: the evals will certainly run concurrently, but the device is pinned to one CUDA GPU as I read the code. Will the GPU context-switch, or am I thinking about this wrong?
# Process evaluation results and update scores (old evaluation logic)
for uid in evaluation_uids:
    result = eval_results.get(uid)
    if result is not None:
Suggestion: you might have reduced the diff churn by inverting the condition in the for loop:
if result is None:
    continue
# ... code goes here, no longer wrapped in the if
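A minimal sketch of that guard-clause version, reusing the names from the quoted diff (the trailing comment stands in for the existing scoring logic, which is not shown here):

```python
for uid in evaluation_uids:
    result = eval_results.get(uid)
    if result is None:
        continue
    # ... the existing score-update logic continues here at one indentation level ...
```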
Yes. I think this makes the code so much more readable as well. I had this in another PR that was not merged, but I think it helps a lot to follow the happy path.
Taking it out of evaluations; eventually we want it in something like weights, and also to separate scoring from evals.
neurons/validator.py
Outdated
f"validator/weights/{eval_uid}": self.weights[
    eval_uid
].item(),
f"validator/slash/{uid}/score_before": old_score,
I think we're double logging this on line 871?
neurons/validator.py
Outdated
},
step=self.global_step,
)
tplr.logger.info(
    f"{tplr.P(self.sync_window, tplr.T() - scoring_start)} Computed scores and weights"
self.metrics_logger.log(
I think we're double logging this on line 879?
Yes, you read it right. We currently utilise the H100, so the expectation is that we perform all the evals on one GPU and don't impose higher infra costs on validators.
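For illustration, a minimal sketch of the pattern being discussed, assuming an asyncio-driven validator and an HF-style model whose output exposes `.loss`; `evaluate_uid`, `evaluate_all`, and `batches_by_uid` are hypothetical names, not the PR's code. `asyncio.gather` makes the per-UID evals concurrent at the Python level, but with everything pinned to `cuda:0` the kernels still execute on one GPU's stream, so this is single-device concurrency rather than multi-GPU parallelism:

```python
import asyncio
import torch

device = torch.device("cuda:0")  # assumption: the single GPU the validator is pinned to


async def evaluate_uid(uid: int, model, batches) -> tuple[int, float]:
    """Average loss over one miner's batches; the forward pass runs in a worker thread."""
    total_loss, count = 0.0, 0
    for batch in batches:
        input_ids = torch.tensor(batch, dtype=torch.long, device=device)
        loss = await asyncio.to_thread(
            lambda ids=input_ids: model(input_ids=ids, labels=ids).loss.item()
        )
        total_loss += loss
        count += 1
    return uid, total_loss / max(count, 1)


async def evaluate_all(uids, model, batches_by_uid) -> dict[int, float]:
    # Concurrent at the Python level; the kernels still serialize on cuda:0.
    pairs = await asyncio.gather(
        *(evaluate_uid(uid, model, batches_by_uid[uid]) for uid in uids)
    )
    return dict(pairs)
```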
I see some issues. Is this working in its current state? Looks like we should run ruff format as well. But when we get this in, it will honestly be so much easier to follow how the evaluation is done.
@@ -42,9 +42,10 @@
     "catch_up_threshold": 15,
     "catch_up_batch_size": 5,
     "catch_up_timeout": 300,
-    "uids_per_window": 2,
+    "uids_per_window": 4,
+    "hparams.parallel_eval_uids": 4,
Why hparams before?
Also curious: what was the reason for lowering the minimum peers?
Testing locally; I should have put this in draft, it wasn't ready.
Why hparams before?
I don't understand this, can you please rephrase your question?
My bad. Why prefix parallel eval uids with hparams?
It's how it currently works: we have a max number of UIDs to evaluate in the window, and this tells us how many of them should run in parallel.
Why not call it parallel_eval_uids instead of hparams.parallel_eval_uids?
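Purely for illustration, a sketch of why the flat key is easier to work with; the `hparams.json` filename and the `SimpleNamespace` wrapper are assumptions, not the project's actual loading code:

```python
import json
from types import SimpleNamespace

# Assumed loading code for illustration only.
with open("hparams.json") as f:
    hparams = SimpleNamespace(**json.load(f))

uids_per_window = hparams.uids_per_window        # max UIDs evaluated per window
parallel_eval_uids = hparams.parallel_eval_uids  # how many of them run in parallel

# With the dotted key the attribute isn't a valid identifier, so access degrades to:
# parallel_eval_uids = getattr(hparams, "hparams.parallel_eval_uids")
```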
neurons/validator.py
Outdated
input_ids = torch.tensor(batch, dtype=torch.long).to(
    model_own_data_eval.device
# Evaluate sync metrics.
sync_result = await self.evaluate_miner_sync(uid)
Probably we can move the sync scoring to the evaluation module as well?
self.weights = torch.zeros_like(self.final_moving_avg_scores)
evaluated_mask = torch.zeros_like(
    self.final_moving_avg_scores, dtype=torch.bool
)
evaluated_mask[list(self.evaluated_uids)] = True
positive_mask = (
    self.final_moving_avg_scores > 0
) & evaluated_mask

if positive_mask.any():
    self.weights[positive_mask] = min_power_normalization(
        self.final_moving_avg_scores[positive_mask],
        power=self.hparams.power_normalisation,
    )
    weight_sum = self.weights.sum().item()
    tplr.logger.debug(f"Weight sum: {weight_sum}")
    if abs(weight_sum - 1.0) > 1e-6:
        tplr.logger.warning(
            f"Weights sum to {weight_sum}, expected close to 1.0"
        )
else:
    tplr.logger.info(
        "No positive scores found, all weights set to 0"
    )
Could you use the function you added in evaluation here?
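A minimal sketch of what pulling the quoted weight computation into a reusable helper (for example in the evaluation module) could look like. The `min_power_normalization` below is a self-contained stand-in so the snippet runs; the real function from tplr should be used instead:

```python
import torch


def min_power_normalization(scores: torch.Tensor, power: float) -> torch.Tensor:
    # Stand-in for the existing tplr helper: normalize powered scores to sum to 1.
    powered = scores.pow(power)
    total = powered.sum()
    return powered / total if total > 0 else torch.zeros_like(powered)


def compute_weights(
    final_moving_avg_scores: torch.Tensor,
    evaluated_uids: set,
    power: float,
) -> torch.Tensor:
    """Zero weights everywhere except evaluated UIDs with positive moving-average scores."""
    weights = torch.zeros_like(final_moving_avg_scores)
    evaluated_mask = torch.zeros_like(final_moving_avg_scores, dtype=torch.bool)
    evaluated_mask[list(evaluated_uids)] = True
    positive_mask = (final_moving_avg_scores > 0) & evaluated_mask
    if positive_mask.any():
        weights[positive_mask] = min_power_normalization(
            final_moving_avg_scores[positive_mask], power=power
        )
    return weights
```

The validator could then just assign `self.weights = compute_weights(self.final_moving_avg_scores, self.evaluated_uids, self.hparams.power_normalisation)` and keep the weight-sum logging around that call.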
src/tplr/evaluation.py
Outdated
outputs = model(input_ids=input_ids, labels=labels)
total_loss += outputs.loss.item()
count += 1
del input_ids, labels, outputs
I don't think the `del` and `torch.cuda.empty_cache()` will help with anything practical; memory release doesn't work synchronously.
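For context, a small illustrative snippet (not project code) of why these calls rarely help: `del` only drops the Python references, the CUDA caching allocator marks those blocks reusable once nothing references them, and kernels are launched asynchronously, so the host runs ahead of the GPU. `torch.cuda.empty_cache()` just returns unused cached blocks to the driver; it doesn't free live tensors, and calling it every batch mostly adds overhead.

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x                    # queued on the current stream, returns immediately
    del x, y                     # drops references; blocks become reusable by the allocator
    torch.cuda.synchronize()     # wait for the queued kernels to actually finish
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    torch.cuda.empty_cache()     # hands cached-but-unallocated memory back to the driver
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```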
Description
TODO