Skip to content

Commit 7c733d1

Browse files
lebricesatyaog
andauthored
Fix --persist mila code bug and intermittent connection errors [MT-78] (#101)
* Fix --persist bug and unneeded cd $SCRATCH Signed-off-by: Fabrice Normandin <[email protected]> * Fix broken tests, add integration test dir Signed-off-by: Fabrice Normandin <[email protected]> * Fix check_passwordless bug, set banner_timeout=60 Signed-off-by: Fabrice Normandin <[email protected]> * Remove unneeded mark and test run Signed-off-by: Fabrice Normandin <[email protected]> * Centralize references to connect_kwargs Signed-off-by: Fabrice Normandin <[email protected]> * Remove outdated todo Signed-off-by: Fabrice Normandin <[email protected]> * Fix bug in check_disk_quota and add test Signed-off-by: Fabrice Normandin <[email protected]> * Remove duplicate (moved) test_slurm_remote.py file Signed-off-by: Fabrice Normandin <[email protected]> * Add/improve integration test for `mila code` Signed-off-by: Fabrice Normandin <[email protected]> * Hide the 'which lfs' command Signed-off-by: Fabrice Normandin <[email protected]> * Fix typing error in test_commands.py Signed-off-by: Fabrice Normandin <[email protected]> * Fix bug and misleading type for `alloc` argument `alloc` needs to be a list of strings, but it was typed as `Sequence[str]`, which allows `str` to be passed (since `str`s are sequences of `str`s). This changes it to `list[str]` which is stricter and correct. Incidentally, there was an undetected bug in the regression test at `tests/integration/test_code_command.py::test_code` because I was passing the allocation flags (a string) as salloc. Signed-off-by: Fabrice Normandin <[email protected]> * Change test fixture, adjust tests Signed-off-by: Fabrice Normandin <[email protected]> * Adjust the way we fetch SLURM accounts in tests Signed-off-by: Fabrice Normandin <[email protected]> * Fix unused imports in conftest.py Signed-off-by: Fabrice Normandin <[email protected]> * Fix typing error in python 3.8 Signed-off-by: Fabrice Normandin <[email protected]> * Fix other type error in python 3.8 Signed-off-by: Fabrice Normandin <[email protected]> * Apply suggestions from code review Co-authored-by: satyaog <[email protected]> * Add a `currently_in_a_test` function Signed-off-by: Fabrice Normandin <[email protected]> * Fix same issue `mila serve` commands Signed-off-by: Fabrice Normandin <[email protected]> --------- Signed-off-by: Fabrice Normandin <[email protected]> Co-authored-by: satyaog <[email protected]>
1 parent b823319 commit 7c733d1

File tree

18 files changed

+561
-279
lines changed

18 files changed

+561
-279
lines changed

.github/workflows/build.yml

Lines changed: 4 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -33,7 +33,7 @@ jobs:
3333
python -m pip install --upgrade pip
3434
pip install poetry
3535
- name: Set up Python ${{ matrix.python-version }}
36-
uses: actions/setup-python@v4
36+
uses: actions/setup-python@v5
3737
with:
3838
python-version: ${{ matrix.python-version }}
3939
cache: poetry
@@ -59,9 +59,6 @@ jobs:
5959
- name: Test with pytest
6060
run: poetry run pytest --cov=milatools --cov-report=xml --cov-append
6161

62-
- name: Test with pytest (with -s flag)
63-
run: poetry run pytest --cov=milatools --cov-report=xml --cov-append -s
64-
6562
- name: Upload coverage reports to Codecov
6663
uses: codecov/codecov-action@v3
6764
with:
@@ -96,7 +93,7 @@ jobs:
9693
options: --health-cmd="mysqladmin ping" --health-interval=10s --health-timeout=5s --health-retries=3
9794

9895
steps:
99-
- uses: actions/checkout@v3
96+
- uses: actions/checkout@v4
10097

10198
# NOTE: Replacing this with our customized version of
10299
# - uses: koesterlab/setup-slurm-action@v1
@@ -120,7 +117,7 @@ jobs:
120117
ssh -o 'StrictHostKeyChecking no' localhost id
121118
122119
- name: Set up Python ${{ matrix.python-version }}
123-
uses: actions/setup-python@v3
120+
uses: actions/setup-python@v5
124121
with:
125122
python-version: ${{ matrix.python-version }}
126123

@@ -131,7 +128,7 @@ jobs:
131128
poetry install --with=dev
132129
133130
- name: Launch integration tests
134-
run: poetry run pytest tests/cli/test_slurm_remote.py --cov=milatools --cov-report=xml --cov-append -s -vvv --log-level=DEBUG
131+
run: poetry run pytest tests/integration --cov=milatools --cov-report=xml --cov-append -s -vvv --log-level=DEBUG
135132
timeout-minutes: 3
136133
env:
137134
SLURM_CLUSTER: localhost

milatools/cli/commands.py

Lines changed: 80 additions & 52 deletions
Original file line numberDiff line numberDiff line change
@@ -51,6 +51,8 @@
5151
MilatoolsUserError,
5252
SSHConnectionError,
5353
T,
54+
cluster_to_connect_kwargs,
55+
currently_in_a_test,
5456
get_fully_qualified_hostname_of_compute_node,
5557
get_fully_qualified_name,
5658
make_process,
@@ -492,7 +494,7 @@ def code(
492494
persist: bool,
493495
job: str | None,
494496
node: str | None,
495-
alloc: Sequence[str],
497+
alloc: list[str],
496498
cluster: Cluster = "mila",
497499
):
498500
"""Open a remote VSCode session on a compute node.
@@ -525,13 +527,12 @@ def code(
525527
if command is None:
526528
command = os.environ.get("MILATOOLS_CODE_COMMAND", "code")
527529

528-
if remote.hostname != "graham": # graham doesn't use lustre for $HOME
529-
try:
530-
check_disk_quota(remote)
531-
except MilatoolsUserError:
532-
raise
533-
except Exception as exc:
534-
logger.warning(f"Unable to check the disk-quota on the cluster: {exc}")
530+
try:
531+
check_disk_quota(remote)
532+
except MilatoolsUserError:
533+
raise
534+
except Exception as exc:
535+
logger.warning(f"Unable to check the disk-quota on the cluster: {exc}")
535536

536537
vscode_extensions_folder = Path.home() / ".vscode/extensions"
537538
if vscode_extensions_folder.exists() and no_internet_on_compute_nodes(cluster):
@@ -618,6 +619,8 @@ def code(
618619
"The editor was closed. Reopen it with <Enter>"
619620
" or terminate the process with <Ctrl+C>"
620621
)
622+
if currently_in_a_test():
623+
break
621624
input()
622625

623626
except KeyboardInterrupt:
@@ -714,7 +717,7 @@ def serve_list(purge: bool):
714717

715718

716719
class StandardServerArgs(TypedDict):
717-
alloc: Sequence[str]
720+
alloc: list[str]
718721
"""Extra options to pass to slurm."""
719722

720723
job: str | None
@@ -922,7 +925,7 @@ def _standard_server(
922925
name: str | None,
923926
node: str | None,
924927
job: str | None,
925-
alloc: Sequence[str],
928+
alloc: list[str],
926929
port_pattern=None,
927930
token_pattern=None,
928931
):
@@ -1074,54 +1077,83 @@ def _standard_server(
10741077
proc.kill()
10751078

10761079

1077-
def _get_disk_quota_usage(
1078-
remote: Remote, print_command_output: bool = True
1080+
def _parse_lfs_quota_output(
1081+
lfs_quota_output: str,
10791082
) -> tuple[tuple[float, float], tuple[int, int]]:
1080-
"""Checks the disk quota on the $HOME filesystem on the mila cluster.
1081-
1082-
Returns whether the quota is exceeded, in terms of storage space or number of files.
1083-
"""
1083+
"""Parses space and # of files (usage, limit) from the output of `lfs quota`."""
1084+
lines = lfs_quota_output.splitlines()
1085+
1086+
header_line: str | None = None
1087+
header_line_index: int | None = None
1088+
for index, line in enumerate(lines):
1089+
if (
1090+
len(line_parts := line.strip().split()) == 9
1091+
and line_parts[0].lower() == "filesystem"
1092+
):
1093+
header_line = line
1094+
header_line_index = index
1095+
break
1096+
assert header_line
1097+
assert header_line_index is not None
1098+
1099+
values_line_parts: list[str] = []
1100+
# The next line may overflow to two (or maybe even more?) lines if the name of the
1101+
# $HOME dir is too long.
1102+
for content_line in lines[header_line_index + 1 :]:
1103+
additional_values = content_line.strip().split()
1104+
assert len(values_line_parts) < 9
1105+
values_line_parts.extend(additional_values)
1106+
if len(values_line_parts) == 9:
1107+
break
10841108

1085-
# NOTE: This is what the output of the command looks like on the Mila cluster:
1086-
#
1087-
# $ lfs quota -u $USER /home/mila
1088-
# Disk quotas for usr normandf (uid 1471600598):
1089-
# Filesystem kbytes quota limit grace files quota limit grace
1090-
# /home/mila 101440844 0 104857600 - 936140 0 1048576 -
1091-
# uid 1471600598 is using default block quota setting
1092-
# uid 1471600598 is using default file quota setting
1093-
#
1094-
home_disk_quota_output = remote.get_output(
1095-
"lfs quota -u $USER $HOME", hide=not print_command_output
1096-
)
1097-
lines = home_disk_quota_output.splitlines()
1109+
assert len(values_line_parts) == 9, values_line_parts
10981110
(
10991111
_filesystem,
11001112
used_kbytes,
1101-
_quota1,
1113+
_quota_kbytes,
11021114
limit_kbytes,
1103-
_grace1,
1115+
_grace_kbytes,
11041116
files,
1105-
_quota2,
1117+
_quota_files,
11061118
limit_files,
1107-
_grace2,
1108-
) = lines[2].strip().split()
1119+
_grace_files,
1120+
) = values_line_parts
11091121

1110-
used_gb = float(int(used_kbytes.strip()) / (1024) ** 2)
1111-
max_gb = float(int(limit_kbytes.strip()) / (1024) ** 2)
1122+
used_gb = int(used_kbytes.strip()) / (1024**2)
1123+
max_gb = int(limit_kbytes.strip()) / (1024**2)
11121124
used_files = int(files.strip())
11131125
max_files = int(limit_files.strip())
11141126
return (used_gb, max_gb), (used_files, max_files)
11151127

11161128

11171129
def check_disk_quota(remote: Remote) -> None:
1118-
cluster = (
1119-
"mila" # todo: if we run this on CC, then we should use `diskusage_report`
1120-
)
1121-
# todo: Check the disk-quota of other filesystems if needed.
1122-
filesystem = "$HOME"
1130+
cluster = remote.hostname
1131+
1132+
# NOTE: This is what the output of the command looks like on the Mila cluster:
1133+
#
1134+
# Disk quotas for usr normandf (uid 1471600598):
1135+
# Filesystem kbytes quota limit grace files quota limit grace
1136+
# /home/mila/n/normandf
1137+
# 95747836 0 104857600 - 908722 0 1048576 -
1138+
# uid 1471600598 is using default block quota setting
1139+
# uid 1471600598 is using default file quota setting
1140+
1141+
# Need to assert this, otherwise .get_output calls .run which would spawn a job!
1142+
assert not isinstance(remote, SlurmRemote)
1143+
if not remote.get_output("which lfs", hide=True):
1144+
logger.debug("Cluster doesn't have the lfs command. Skipping check.")
1145+
return
1146+
11231147
logger.debug("Checking disk quota on $HOME...")
1124-
(used_gb, max_gb), (used_files, max_files) = _get_disk_quota_usage(remote)
1148+
1149+
home_disk_quota_output = remote.get_output("lfs quota -u $USER $HOME", hide=True)
1150+
if "not on a mounted Lustre filesystem" in home_disk_quota_output:
1151+
logger.debug("Cluster doesn't use lustre on $HOME filesystem. Skipping check.")
1152+
return
1153+
1154+
(used_gb, max_gb), (used_files, max_files) = _parse_lfs_quota_output(
1155+
home_disk_quota_output
1156+
)
11251157
logger.debug(
11261158
f"Disk usage: {used_gb:.1f} / {max_gb} GiB and {used_files} / {max_files} files"
11271159
)
@@ -1144,7 +1176,7 @@ def check_disk_quota(remote: Remote) -> None:
11441176
if used_gb >= max_gb or used_files >= max_files:
11451177
raise MilatoolsUserError(
11461178
T.red(
1147-
f"ERROR: Your disk quota on the {filesystem} filesystem is exceeded! "
1179+
f"ERROR: Your disk quota on the $HOME filesystem is exceeded! "
11481180
f"({reason}).\n"
11491181
f"To fix this, login to the cluster with `ssh {cluster}` and free up "
11501182
f"some space, either by deleting files, or by moving them to a "
@@ -1153,24 +1185,20 @@ def check_disk_quota(remote: Remote) -> None:
11531185
)
11541186
if max(size_ratio, files_ratio) > 0.9:
11551187
warning_message = (
1156-
f"WARNING: You are getting pretty close to your disk quota on the $HOME "
1188+
f"You are getting pretty close to your disk quota on the $HOME "
11571189
f"filesystem: ({reason})\n"
11581190
"Please consider freeing up some space in your $HOME folder, either by "
11591191
"deleting files, or by moving them to a more suitable filesystem.\n"
11601192
+ freeing_up_space_instructions
11611193
)
1162-
# TODO: Perhaps we could use the logger or the warnings package instead of just
1163-
# printing?
1164-
# logger.warning(UserWarning(warning_message))
1165-
# warnings.warn(UserWarning(T.yellow(warning_message)))
1166-
print(UserWarning(T.yellow(warning_message)))
1194+
logger.warning(UserWarning(warning_message))
11671195

11681196

11691197
def _find_allocation(
11701198
remote: Remote,
11711199
node: str | None,
11721200
job: str | None,
1173-
alloc: Sequence[str],
1201+
alloc: list[str],
11741202
cluster: Cluster = "mila",
11751203
job_name: str = "mila-tools",
11761204
):
@@ -1179,11 +1207,11 @@ def _find_allocation(
11791207

11801208
if node is not None:
11811209
node_name = get_fully_qualified_hostname_of_compute_node(node, cluster=cluster)
1182-
return Remote(node_name)
1210+
return Remote(node_name, connect_kwargs=cluster_to_connect_kwargs.get(cluster))
11831211

11841212
elif job is not None:
11851213
node_name = remote.get_output(f"squeue --jobs {job} -ho %N")
1186-
return Remote(node_name)
1214+
return Remote(node_name, connect_kwargs=cluster_to_connect_kwargs.get(cluster))
11871215

11881216
else:
11891217
alloc = ["-J", job_name, *alloc]

milatools/cli/local.py

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
import paramiko.ssh_exception
1010
from typing_extensions import deprecated
1111

12-
from .utils import CommandNotFoundError, T, shjoin
12+
from .utils import CommandNotFoundError, T, cluster_to_connect_kwargs, shjoin
1313

1414
logger = get_logger(__name__)
1515

@@ -76,7 +76,13 @@ def display(split_command: list[str] | tuple[str, ...] | str) -> None:
7676

7777
def check_passwordless(host: str) -> bool:
7878
try:
79-
with fabric.Connection(host) as connection:
79+
connect_kwargs_for_host = {"allow_agent": False}
80+
if host in cluster_to_connect_kwargs:
81+
connect_kwargs_for_host.update(cluster_to_connect_kwargs[host])
82+
with fabric.Connection(
83+
host,
84+
connect_kwargs=connect_kwargs_for_host,
85+
) as connection:
8086
results: fabric.runners.Result = connection.run(
8187
"echo OK",
8288
in_stream=False,

0 commit comments

Comments
 (0)