Draft
Changes from all commits (82 commits)
04d4970
Add loki settings
nurbal Oct 11, 2024
862ef96
config update : loki settings
nurbal Oct 14, 2024
a695a60
add loki logger to CLI main
nurbal Oct 15, 2024
940cf30
wip
nurbal Oct 25, 2024
321a9d3
wip
nurbal Oct 30, 2024
77449e4
remove some print
nurbal Nov 4, 2024
149d213
fix strange syntax ??
nurbal Nov 5, 2024
def75d5
lint
nurbal Nov 5, 2024
88f1dc3
fix test config
nurbal Nov 5, 2024
1a0a280
wip
nurbal Nov 5, 2024
5cc2451
lint
nurbal Nov 5, 2024
f261fad
move OTLP config to `LoggingConfig`
nurbal Nov 12, 2024
7e54cf4
Merge commit '67f000a6888c803f6169bab52b1978b3a264e7c7' into SARC-368…
nurbal Nov 12, 2024
f2480c1
update test `test_tracer_with_multiple_clusters_and_dates_and_prometh…
nurbal Nov 12, 2024
e61bac9
lint
nurbal Nov 12, 2024
d06482d
fix config files
nurbal Nov 14, 2024
1c20277
update
nurbal Nov 18, 2024
6b2e8eb
Merge commit '6cd17ec3689f5941ae59d4563858b546397543e8' into SARC-368…
nurbal Nov 19, 2024
af92b35
Merge commit '85bd044a1315e48acece43b9007379f4a5aac582' into SARC-368…
nurbal Nov 27, 2024
b0b2be6
update poetry.lock
nurbal Nov 27, 2024
478c896
Merge commit '87ad1b11e95eb157fba66bd108bf5a7d99227cb7' into SARC-368…
nurbal Dec 9, 2024
8c6a480
attempt to add a functionnal test to OTPL / Loki with a HTTPServer
nurbal Jan 9, 2025
c5977e5
update poetry.lock
nurbal Jan 9, 2025
1075a84
Merge branch 'master' into SARC-368-loki-connect
nurbal Jan 9, 2025
3ab2c68
Add test_loki_logging_handler
nurbal Jan 27, 2025
0a6c9a7
Merge branch 'master' into SARC-368-loki-connect
nurbal Jan 27, 2025
f27d6c6
Merge branch 'master' into SARC-368-loki-connect
nurbal Jan 27, 2025
aea577e
some comments on test_loki_logging_handler
nurbal Feb 10, 2025
ea53b3f
fix imports position ni the file
nurbal Feb 10, 2025
f77ee66
take `verbose` command-line parameter into account
nurbal Feb 10, 2025
6b569f8
black
nurbal Feb 10, 2025
38a5c9a
update config file
nurbal Feb 10, 2025
486472f
update sarc-dev.json
nurbal Feb 10, 2025
3b4afe1
black
nurbal Feb 11, 2025
ab430fe
fix logging level priority between command-line and config file
nurbal Feb 13, 2025
b6ca44f
black
nurbal Feb 13, 2025
939dd07
ADD sanity check for jobs from users without @mila.quebec email address
nurbal Feb 13, 2025
aefeb94
black
nurbal Feb 16, 2025
aa3f1f3
fix tests
nurbal Feb 16, 2025
aba7973
fix lint
nurbal Feb 16, 2025
6bb94b5
Merge branch 'fix_account_matching' into deployed
nurbal Feb 16, 2025
d023399
fix some return typings
nurbal Feb 16, 2025
ad08c52
lint
nurbal Feb 16, 2025
9594441
Merge branch 'fix_account_matching' into deployed
nurbal Feb 17, 2025
374f71b
Merge branch 'master' into fix_account_matching
nurbal Feb 21, 2025
cdbe3ed
Merge commit '36a5364defdba77e5b5e4d39dd69957c7e07087b' into deployed
nurbal Mar 27, 2025
13674d1
Use uv, python 3.11 and mongod 8.0.x
abergeron Mar 26, 2025
9124db2
ensure that the version of mongo in podman is somewhat fixed
abergeron Mar 26, 2025
334a730
Better use of uv with tox
abergeron Mar 26, 2025
f74408f
Udpate readme for uv
abergeron Mar 26, 2025
1cac255
Replace the duplicate readme with a symlink
abergeron Mar 26, 2025
8ae3b59
Fix references to poetry in docs and scripts
abergeron Mar 26, 2025
f989bed
hopefully make test env faster to install
abergeron Mar 26, 2025
79b0765
Add mention to install pandoc to the readme
abergeron Mar 27, 2025
2bdafec
update deployment doc to set python version to 3.11
nurbal Apr 7, 2025
3930e0d
Merge commit '2bdafec94fe07fe9375f8574ff5a9af3dad863b1' into deployed
nurbal Apr 7, 2025
4eb0deb
Merge branch 'master' into fix_account_matching
nurbal Apr 8, 2025
8194de7
Merge branch 'master' into fix_account_matching
nurbal Apr 25, 2025
bef7379
add unit test
nurbal Apr 25, 2025
6ccef54
lint
nurbal Apr 25, 2025
724ca94
Revert "lint"
nurbal Apr 25, 2025
292826a
lint
nurbal Apr 25, 2025
d41723f
Merge branch 'master' into fix_account_matching
nurbal Apr 29, 2025
8bf1ca5
Merge branch 'master' into deployed
nurbal May 1, 2025
ba1934e
Merge branch fix_account_matching into deployed
nurbal May 1, 2025
f6a35fc
Merge branch 'master' into fix_account_matching
nurbal May 1, 2025
b123cd0
Merge branch 'fix_account_matching' into deployed
nurbal May 2, 2025
ce17e6d
fix pyproject.toml
nurbal May 5, 2025
15c1d50
Merge commit '49eaa1b749e60a66740bde3519acfef6c6663f8e' into deployed
nurbal May 8, 2025
7af486f
Merge branch 'master' into deployed
nurbal May 15, 2025
84e8124
Update sarc-prod.yaml
nurbal May 15, 2025
ac67a5c
Update sarc-prod.yaml
nurbal May 15, 2025
496db41
replace print by logging messages during account matching...
nurbal May 16, 2025
ff0188e
Merge branch 'account_matching_log_error_messages' into deployed
nurbal May 16, 2025
553296e
parametrise check_cluster_response() with cluster name (optional)
nurbal Dec 17, 2024
1c4898a
added `--once` parameter to `health check` command
nurbal Dec 17, 2024
c7b66e1
fix `check_cluster_response`
nurbal Feb 3, 2025
c503464
add some sample checks functions
nurbal Feb 3, 2025
b22618e
added `--write` optionnal parameter to `sarc health check --once` and…
nurbal Feb 3, 2025
759247e
wip
nurbal Feb 6, 2025
803feab
add UsersInJobsCheck HealthCheck
nurbal Feb 17, 2025
44d8cf0
set default logging level to WARNING (previously: DEBUG)
nurbal Feb 21, 2025
14 changes: 7 additions & 7 deletions config/sarc-prod.yaml
@@ -19,7 +19,7 @@ sarc:
   sshconfig: "~/.ssh/config"
   clusters:
     mila:
-      host: mila
+      host: localhost
       timezone: America/Montreal
       accounts:
       sacct_bin: "/opt/slurm/bin/sacct"
@@ -31,7 +31,7 @@ sarc:
       start_date: '2022-04-01'
       billing_is_gpu: true
     narval:
-      host: narval.computecanada.ca
+      host: robot.narval.alliancecan.ca
       timezone: America/Montreal
       accounts:
         - rrg-bengioy-ad_gpu
@@ -44,12 +44,12 @@ sarc:
       duc_storage_command: duc ls -d /project/.duc_databases/rrg-bengioy-ad.sqlite /project/rrg-bengioy-ad
       diskusage_report_command: diskusage_report --project --all_users
       prometheus_url: https://mila-thanos.calculquebec.ca
-      prometheus_headers_file: ../../SARC_secrets/secrets/drac_prometheus/headers.json
+      prometheus_headers_file: ../SARC_secrets/secrets/drac_prometheus/headers.json
       start_date: '2022-04-01'
       rgu_start_date: '2023-11-28'
       gpu_to_rgu_billing: ../../SARC_secrets/secrets/gpu_to_rgu_billing_narval.json
     beluga:
-      host: beluga.computecanada.ca
+      host: robot.beluga.alliancecan.ca
       timezone: America/Montreal
       accounts:
         - rrg-bengioy-ad_gpu
@@ -62,12 +62,12 @@ sarc:
       duc_storage_command: duc ls -d /project/.duc_databases/rrg-bengioy-ad.sqlite /project/rrg-bengioy-ad
       diskusage_report_command: diskusage_report --project --all_users
       prometheus_url: https://mila-thanos.calculquebec.ca
-      prometheus_headers_file: ../../SARC_secrets/secrets/drac_prometheus/headers.json
+      prometheus_headers_file: ../SARC_secrets/secrets/drac_prometheus/headers.json
       start_date: '2022-04-01'
       rgu_start_date: '2024-04-03'
       gpu_to_rgu_billing: ../../SARC_secrets/secrets/gpu_to_rgu_billing_beluga.json
     graham:
-      host: graham.computecanada.ca
+      host: robot.graham.alliancecan.ca
       timezone: America/Toronto
       accounts:
         - rrg-bengioy-ad_gpu
@@ -84,7 +84,7 @@ sarc:
       rgu_start_date: '2024-04-03'
       gpu_to_rgu_billing: ../../SARC_secrets/secrets/gpu_to_rgu_billing_graham.json
     cedar:
-      host: cedar.computecanada.ca
+      host: robot.cedar.alliancecan.ca
       timezone: America/Vancouver
       accounts:
         - rrg-bengioy-ad_gpu
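The host changes above repoint every DRAC cluster at its robot.* endpoint and the mila cluster at localhost. A minimal sketch for eyeballing the result, assuming PyYAML is installed and the file keeps the layout shown in this diff:

import yaml

# Load the clusters section of the production config and print each SSH host,
# to verify the new robot.* endpoints. The path is relative to the repo root.
with open("config/sarc-prod.yaml") as f:
    cfg = yaml.safe_load(f)

for name, cluster in cfg["sarc"]["clusters"].items():
    print(f"{name}: host={cluster['host']}")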
27 changes: 16 additions & 11 deletions sarc/account_matching/make_matches.py
@@ -334,19 +334,24 @@ def _manual_matching(DLD_data, DD_persons, override_matches_mila_to_cc):
         drac_account_username,
     ) in override_matches_mila_to_cc.items():
         if mila_email_username not in DD_persons:
-            raise ValueError(
-                f'"{mila_email_username}" is not found in the actual sources.'
-                "This was supplied to `override_matches_mila_to_cc` in the `make_matches.py` file, "
+            msg = (
+                f'"{mila_email_username}" is not found in the actual sources.\n'
+                f"This was supplied to `override_matches_mila_to_cc` in the `make_matches.py` file, "
                 f"but there are not such entries in LDAP.\n"
-                "Someone messed up the manual matching by specifying a Mila email username that does not exist."
+                f"Someone messed up the manual matching by specifying a Mila email username that does not exist, or not ANYMORE."
             )
-        # Note that `matching[drac_account_username]` is itself a dict
-        # with user information from CC. It's not just a username string.
-        if drac_account_username in matching:
-            assert isinstance(matching[drac_account_username], dict)
-            DD_persons[mila_email_username][drac_source] = matching[
-                drac_account_username
-            ]
+            # we don't want to raise an error here because it will break the pipeline
+            # we will just log the error and move on
+            logging.error(msg)
+            # raise ValueError(msg)
+        else:
+            # Note that `matching[drac_account_username]` is itself a dict
+            # with user information from CC. It's not just a username string.
+            if drac_account_username in matching:
+                assert isinstance(matching[drac_account_username], dict)
+                DD_persons[mila_email_username][drac_source] = matching[
+                    drac_account_username
+                ]


 def _make_matches_status_report(DLD_data, DD_persons):
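The change above swaps a hard ValueError for a logged error, so a single stale entry in `override_matches_mila_to_cc` no longer aborts the whole matching run. A self-contained sketch of the same log-and-continue pattern; the function and data names here are illustrative, not from the PR:

import logging

def apply_overrides(overrides: dict, persons: dict) -> None:
    # Log unknown keys and keep processing the rest, rather than raising
    # and breaking the pipeline.
    for email, account in overrides.items():
        if email not in persons:
            logging.error("override references unknown user %s; skipping", email)
            continue
        persons[email]["drac_account"] = account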
120 changes: 120 additions & 0 deletions sarc/alerts/checks.py
@@ -0,0 +1,120 @@
+import random
+from dataclasses import dataclass
+from datetime import timedelta
+
+from sarc.alerts.common import CheckResult, HealthCheck
+from sarc.alerts.db_sanity_checks.users_accounts import check_users_in_jobs
+from sarc.alerts.usage_alerts.cluster_response import check_cluster_response
+from sarc.alerts.usage_alerts.cluster_scraping import check_nb_jobs_per_cluster_per_time
+
+
+# this is a simple check that will fail 50% of the time
+# it uses a custom result class to add more context to the result
+@dataclass
+class HelloWorldResult(CheckResult):
+    custom_comment: str = ""
+
+
+@dataclass
+class HelloWorldCheck(HealthCheck):
+    __result_class__ = HelloWorldResult
+
+    def check(self):
+        if random.random() < 0.5:
+            return self.fail(
+                custom_comment="Hello, HealthMonitor World! You were chosen randomly to fail..."
+            )
+        return self.ok(custom_comment="Hello, HealthMonitor World!")
+
+
+# this is a simple check that will fail 50% of the time
+# it uses the statuses dictionary to add more context information to the result
+@dataclass
+class HelloWorld2Check(HealthCheck):
+    example_additionnal_param: str = "default_value"
+
+    def check(self):
+        random_number = random.random()
+        if random_number < 0.5:
+            return self.fail(
+                statuses={
+                    "comment": "Hello, HealthMonitor World! You were chosen randomly to fail...",
+                    "random_number": random_number,
+                    "example_additionnal_param": self.example_additionnal_param,
+                }
+            )
+        return self.ok(
+            statuses={
+                "comment": "Hello, HealthMonitor World!",
+                "random_number": random_number,
+                "example_additionnal_param": self.example_additionnal_param,
+            }
+        )
+
+
+# checks whether the cluster responded in the last `days` days
+@dataclass
+class ClusterResponseCheck(HealthCheck):
+    days: int = 7
+
+    def check(self):
+        cluster_name = self.parameters["cluster_name"]
+        days = self.days
+        # days = 7
+        if check_cluster_response(
+            time_interval=timedelta(days=days), cluster_name=cluster_name
+        ):
+            return self.ok
+        return self.fail(
+            statuses={
+                "comment": f" Cluster {cluster_name} has not been scraped in the last {days} days."
+            }
+        )
+
+
+@dataclass
+class ClusterJobScrapingCheck(HealthCheck):
+    time_interval: int = 7
+    time_unit: int = 1
+    stddev: int = 2
+    verbose: bool = False
+
+    def check(self):
+        time_interval = timedelta(days=self.time_interval)
+        time_unit = timedelta(days=self.time_unit)
+        cluster_name = self.parameters["cluster_name"]
+        nb_stddev = self.stddev
+        verbose = self.verbose
+        if check_nb_jobs_per_cluster_per_time(
+            time_interval=time_interval,
+            time_unit=time_unit,
+            cluster_names=[cluster_name],
+            nb_stddev=nb_stddev,
+            verbose=verbose,
+        ):
+            return self.ok
+        return self.fail(
+            statuses={
+                "comment": f"Cluster {cluster_name} has not enough jobs scrapped",
+                "time_interval": time_interval,
+                "time_unit": time_unit,
+                "stddev": nb_stddev,
+            }
+        )
+
+
+@dataclass
+class UsersInJobsCheck(HealthCheck):
+    time_interval: int = 7  # days
+
+    def check(self):
+        time_interval = timedelta(days=self.time_interval)
+        missing_users = check_users_in_jobs(time_interval=time_interval)
+        if not missing_users:
+            return self.ok
+        return self.fail(
+            statuses={
+                "comment": f"Missing users in jobs: {missing_users}",
+                "time_interval": time_interval,
+            }
+        )
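How these checks are instantiated and scheduled is defined by the HealthCheck base class in sarc.alerts.common, which this PR does not show. A hedged sketch of driving one directly, assuming the base class needs no extra constructor arguments and that CheckResult exposes the `status` field the CLI code below relies on:

from sarc.alerts.checks import HelloWorldCheck

check = HelloWorldCheck()  # constructor arguments are an assumption
result = check.check()  # a HelloWorldResult built by self.ok()/self.fail()
print(result.status, result.custom_comment)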
15 changes: 13 additions & 2 deletions sarc/alerts/usage_alerts/cluster_response.py
@@ -7,7 +7,9 @@
 logger = logging.getLogger(__name__)


-def check_cluster_response(time_interval: timedelta = timedelta(days=7)):
+def check_cluster_response(
+    time_interval: timedelta = timedelta(days=7), cluster_name=None
+):
     """
     Check if we scraped clusters recently.
     Log a warning for each cluster not scraped since `time_interval` from now.
@@ -24,11 +26,18 @@ def check_cluster_response(time_interval: timedelta = timedelta(days=7)):
     # Get the oldest date allowed from now
     oldest_allowed_date = current_date - time_interval
     # Check each available cluster
-    for cluster in get_available_clusters():
+    clusters = (
+        [c for c in get_available_clusters() if c.cluster_name == cluster_name]
+        if cluster_name
+        else get_available_clusters()
+    )
+    result = True
+    for cluster in clusters:
         if cluster.end_date is None:
             logger.warning(
                 f"[{cluster.cluster_name}] no end_date available, cannot check last scraping"
             )
+            result = False
         else:
             # Cluster's latest scraping date should be in `cluster.end_date`.
             # NB: We assume cluster's `end_date` is stored as a date string,
@@ -44,3 +53,5 @@ def check_cluster_response(time_interval: timedelta = timedelta(days=7)):
                 f"oldest required: {oldest_allowed_date}, "
                 f"current time: {current_date}"
             )
+            result = False
+    return result
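With these additions the function doubles as a predicate: it still logs one warning per stale cluster, but now also returns False if any checked cluster is missing an end_date or was scraped too long ago. A short usage sketch, assuming a configured SARC environment:

from datetime import timedelta
from sarc.alerts.usage_alerts.cluster_response import check_cluster_response

all_fresh = check_cluster_response()  # every cluster, default 7-day window
mila_fresh = check_cluster_response(timedelta(days=3), cluster_name="mila")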
5 changes: 5 additions & 0 deletions sarc/alerts/usage_alerts/cluster_scraping.py
@@ -65,6 +65,8 @@ def check_nb_jobs_per_cluster_per_time(
     else:
         cluster_names = sorted(df["cluster_name"].unique())

+    result = True  # by default, everything's ok
+
     # Iter for each cluster.
     for cluster_name in cluster_names:
         # Select only jobs for current cluster,
@@ -127,3 +129,6 @@ def check_nb_jobs_per_cluster_per_time(
                 f"minimum required for this cluster: {threshold} ({avg} - {nb_stddev} * {stddev}); "
                 f"time unit: {time_unit}"
             )
+            result = False
+
+    return result
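The failure criterion logged above is a lower bound of avg - nb_stddev * stddev jobs per time unit. A worked instance of the arithmetic, with made-up numbers:

# A cluster averaging 500 jobs per time unit with a standard deviation of 100
# and nb_stddev=2 fails the check on any unit with fewer than 300 jobs.
avg, stddev, nb_stddev = 500.0, 100.0, 2
threshold = avg - nb_stddev * stddev
assert threshold == 300.0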
24 changes: 11 additions & 13 deletions sarc/cli/health/check.py
@@ -5,7 +5,7 @@

 import gifnoc

-from sarc.alerts.common import CheckStatus
+from sarc.alerts.common import CheckStatus, config
 from sarc.alerts.runner import CheckRunner
 from sarc.config import config
@@ -16,6 +16,7 @@
 class HealthCheckCommand:
     config: Path = None
     once: bool = False
+    write: bool = False

     name: str = None

@@ -24,24 +25,21 @@ def execute(self) -> int:
         with gifnoc.use(self.config):
             if self.name:
                 # only run one check, once (no CheckRunner)
-                check = hcfg.checks[self.name]
-                results = check(write=False)
-                pprint(results)
-                for k, status in results.statuses.items():
-                    print(f"{status.name} -- {k}")
-                print(f"{results.status.name}")
+                check = config.checks[self.name]
+                results = check(write=self.write)
+                if results.status == CheckStatus.OK:
+                    print(f"Check '{check.name}' succeeded.")
+                else:
+                    print(f"Check '{check.name}' failed.")
+                pprint(results)
             elif self.once:
                 # run all checks, once (no CheckRunner)
-                for check in [c for c in hcfg.checks.values() if c.active]:
-                    results = check(write=False)
+                for check in [c for c in config.checks.values() if c.active]:
+                    results = check(write=self.write)
                     if results.status == CheckStatus.OK:
                         print(f"Check '{check.name}' succeeded.")
                     else:
                         print(f"Check '{check.name}' failed.")
                     pprint(results)
-                    for k, status in results.statuses.items():
-                        print(f"{status.name} -- {k}")
-                    print(f"{results.status.name}")
             else:
                 try:
                     runner = CheckRunner(directory=hcfg.directory, checks=hcfg.checks)
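With the new `write` field, both the single-check and --once paths can persist their results instead of always passing write=False. A hedged sketch of driving the command object directly, assuming it is a plain dataclass as the field declarations suggest; the field values are illustrative:

from sarc.cli.health.check import HealthCheckCommand

# name=None and once=True take the `elif self.once` branch above:
cmd = HealthCheckCommand(config=None, once=True, write=True, name=None)
exit_code = cmd.execute()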
2 changes: 1 addition & 1 deletion sarc/logging.py
@@ -68,6 +68,6 @@ def setupLogging(verbose_level: int = 0):
         handlers=[logging.StreamHandler()],
         format="%(asctime)-15s::%(levelname)s::%(name)s::%(message)s",
         level=verbose_levels.get(
-            verbose_level, logging.DEBUG
+            verbose_level, logging.WARNING
         ),  # Default log level, if not specified in config
     )
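The effect of this one-word change: a verbosity level missing from the mapping now falls back to WARNING instead of DEBUG. A small sketch; the real verbose_levels dict is defined earlier in sarc/logging.py, so the one here is an assumption:

import logging

verbose_levels = {1: logging.INFO, 2: logging.DEBUG}  # assumed mapping
# verbose_level=0 is not in the mapping, so the new fallback applies:
assert verbose_levels.get(0, logging.WARNING) == logging.WARNING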
7 changes: 4 additions & 3 deletions sarc/users/supervisor.py
@@ -1,3 +1,4 @@
+import logging
 import re
 from dataclasses import dataclass, field
 from itertools import chain
@@ -136,17 +137,17 @@ def make_list(errors):

         def show_error(msg, array):
             if len(array) > 0:
-                print(f"{msg} {make_list(array)}")
+                logging.error(f"{msg} {make_list(array)}")

         show_error(" Missing supervisors:", self.no_supervisors)
         show_error(" Too many supervisors:", self.too_many_supervisors)
         show_error(" Prof and Student:", self.prof_and_student)

         if self.unknown_supervisors:
-            print(f" Unknown supervisors: {self.unknown_supervisors}")
+            logging.warning(f" Unknown supervisors: {self.unknown_supervisors}")

         if self.unknown_group:
-            print(f" Unknown group: {self.unknown_group}")
+            logging.warning(f" Unknown group: {self.unknown_group}")


 def _extract_supervisors_from_groups(