Server crashes when training model before resetting it #4

@tischi

Description

Sending a POST request to /train/seg before the model has been reset makes the server fail (see title).

Here is the error:

GPU is available
auto_bg_thresh: 0
c_ratio: 0.3
class_weights: [1, 10, 10]
crop_box: None
crop_size: [16, 128, 128]
dataset_name: elephant-demo
debug: False
device: cuda
false_weight: 3
is_livemode: False
is_pad: False
keep_axials: (True, True, True, False)
log_dir: /workspace/logs/seg_log
lr: 5e-05
model_path: /workspace/models/seg.pth
n_crops: 3
n_epochs: 3
output_prediction: False
p_thresh: None
patch_size: None
r_max: None
r_min: None
rotation_angle: 0
scale_factor_base: 0
scales: [2.48, 0.3119629, 0.3119629]
timepoint: 0
use_median: None
zpath_input: /workspace/datasets/elephant-demo/imgs.zarr
zpath_seg_label: /workspace/datasets/elephant-demo/seg_labels.zarr
zpath_seg_label_vis: /workspace/datasets/elephant-demo/seg_labels_vis.zarr
zpath_seg_output: /workspace/datasets/elephant-demo/seg_outputs.zarr
[2021-05-11 18:00:15,663] ERROR in app: Exception on /train/seg [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./main.py", line 544, in train_seg
    config.device)
  File "/usr/local/lib/python3.7/site-packages/elephant/models.py", line 313, in load_seg_models
    checkpoint = torch.load(model_path)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 525, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 212, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 193, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/models/seg.pth'
[pid: 10865|app: 0|req: 1/1] 127.0.0.1 () {40 vars in 519 bytes} [Tue May 11 18:00:15 2021] POST /train/seg => generated 290 bytes in 46 msecs (HTTP/1.1 500) 2 headers in 99 bytes (1 switches on core 0)
127.0.0.1 - - [11/May/2021:18:00:15 +0000] "POST /train/seg HTTP/1.1" 500 290 "-" "unirest-java/3.1.00" "-"
[pid: 10864|app: 0|req: 1/2] 127.0.0.1 () {40 vars in 517 bytes} [Tue May 11 18:00:38 2021] POST /reset/seg => generated 40 bytes in 4 msecs (HTTP/1.1 500) 2 headers in 90 bytes (1 switches on core 0)
127.0.0.1 - - [11/May/2021:18:00:38 +0000] "POST /reset/seg HTTP/1.1" 500 40 "-" "unirest-java/3.1.00" "-"
[pid: 10864|app: 0|req: 2/3] 127.0.0.1 () {40 vars in 517 bytes} [Tue May 11 18:00:47 2021] POST /reset/seg => generated 40 bytes in 0 msecs (HTTP/1.1 500) 2 headers in 90 bytes (1 switches on core 0)
127.0.0.1 - - [11/May/2021:18:00:47 +0000] "POST /reset/seg HTTP/1.1" 500 40 "-" "unirest-java/3.1.00" "-"

It would be nice if this were handled in a way that does not crash the server. At the moment, Elephant-client reports that training is in progress, and nothing else can be done from the client.
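One possible way to handle this, as a minimal sketch rather than the actual ELEPHANT server code: check that the checkpoint file exists before loading it, and turn the missing-file case into a clear error response instead of an unhandled exception. All names here (`ModelNotInitializedError`, `load_checkpoint`, `handle_train_request`) are hypothetical, `pickle.load` only stands in for `torch.load` so the sketch runs without PyTorch, and the 400 status code is an assumption about what the client should receive.

```python
import os
import pickle
import tempfile


class ModelNotInitializedError(Exception):
    """Hypothetical error: training was requested before a checkpoint exists."""


def load_checkpoint(model_path, loader=pickle.load):
    """Load a checkpoint, failing with a descriptive error if the file is missing.

    `loader` is a stand-in for torch.load in this sketch.
    """
    if not os.path.isfile(model_path):
        raise ModelNotInitializedError(
            f"No model found at {model_path}; reset the model before training."
        )
    with open(model_path, "rb") as f:
        return loader(f)


def handle_train_request(model_path, loader=pickle.load):
    """Hypothetical request handler returning (response body, HTTP status)."""
    try:
        checkpoint = load_checkpoint(model_path, loader)
    except ModelNotInitializedError as err:
        # Report the problem to the client instead of letting it surface as a 500.
        return {"error": str(err)}, 400
    return {"status": "training started", "keys": sorted(checkpoint)}, 200


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The checkpoint does not exist yet, so this returns the error response.
        print(handle_train_request(os.path.join(tmp, "seg.pth")))
```

With a response like this, the client could show the message and leave the "training in progress" state rather than getting stuck. An alternative would be to initialize a fresh model automatically when the file is missing, but that changes behavior more than just reporting the error.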
