MLexperiments-research-computervision-climatechange/devops at master · Tianguistengo/MLexperiments-research-computervision-climatechange · GitHub

Today, one of the big factors that can affect the accuracy of deployed models is whether the data
being used to generate predictions differs from the data used to train the model.

For example, changing economic conditions could drive new interest rates, affecting home purchasing predictions.
------This is called concept drift,------
whereby the patterns the model uses to make predictions no longer apply.
SageMaker Model Monitor automatically detects concept drift in deployed models and provides detailed alerts
that help identify the source of the problem.
All models trained in SageMaker automatically emit key metrics that can be collected and viewed in SageMaker Studio.
From inside SageMaker Studio you can configure data to be collected, how to view it, and when to receive alerts.


---
What is a data lake?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
You can store your data as-is, without having to first structure the data, and run different types
of analytics—from dashboards and visualizations to big data processing,
real-time analytics, and machine learning to guide better decisions.


-- devops sagemaker ML
https://aws.amazon.com/devops/what-is-devops/?nc1=f_cc
https://aws.amazon.com/sagemaker/


---

https://mlinproduction.com/why-is-it-important-to-monitor-machine-learning-models/

In traditional software, we explicitly plan for and express how a system should react in specific circumstances.
This is accomplished by writing automated tests to ensure applications perform as expected.

In contrast, the power of machine learning lies in how it generalizes from historical experience
and reacts to new unseen data without us explicitly describing each case.


Since we can’t explicitly test for all of the possible cases a machine learning system will see,
we need to continuously monitor that system to ensure it’s operating effectively.


Monitoring machine learning systems is about monitoring the quality of decision making that the system enables.
The quality of predictions is a function of many things including:

    1. The quality of the data fed to models at inference time.
    Monitoring an ML system isn’t just about monitoring a model; it also involves monitoring all the data sources fed to the model.

    2. Modeling assumptions remaining relatively constant –
    Models learn relationships between inputs and outputs from historical data. Since the real world is dynamic,
    these relationships are constantly changing.
    This means that model performance naturally degrades with time.
    Models must adapt to changing conditions, but detecting these changes depends on robust, continuous monitoring.

    3. The robustness and stability of predictions –
    It’s understood that input features to machine learning models are not independent.
    Hence, changes in any part of the system, including hyper-parameters, learning settings,
    sampling methods, convergence thresholds, and data selection, can cause unpredictable changes to model output.
    This is known as CACE: Changing Anything Changes Everything.

 ---------

 Pickle
 converts a Python object to a byte stream
 and allows it to be stored to disk and reloaded at a later time.

 It provides a good format to store machine learning models,
 provided that their intended applications are also built in Python.
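
A minimal sketch of the pickle round trip described above, using only the standard library. The `model` dictionary is a hypothetical stand-in for a trained model, and `save_model` / `load_model` are illustrative helper names, not part of any library.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a trained model: any picklable Python object works.
model = {"weights": [0.4, -1.2, 3.0], "bias": 0.5}


def save_model(obj, path):
    """Serialize obj to a byte stream and write it to disk."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_model(path):
    """Read the byte stream back and rebuild the object.

    Only unpickle files you trust: loading a pickle can execute arbitrary code.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    save_model(model, model_path)
    restored = load_model(model_path)

# The restored object is an equal copy, not the same object in memory.
print(restored == model, restored is model)
```

The loading side needs a compatible Python environment (and the same class definitions for custom objects), which is why pickle fits best when the serving application is also written in Python.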


ONNX, the Open Neural Network Exchange format,
is an open format that supports the storing and porting of predictive models across libraries and languages.
Most deep learning libraries support it, and sklearn also has an extension library to convert its models to the ONNX format.

---

Publish/subscribe messaging, or pub/sub messaging,
is a form of asynchronous service-to-service communication
used in serverless and microservices architectures.

In a pub/sub model, any message published to a topic is immediately received by all of the subscribers to the topic.
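
To make the fan-out behaviour concrete, here is a small in-process sketch. It is an assumption-laden toy: the `Broker` class and topic name are hypothetical, and delivery here is a synchronous function call, whereas a real pub/sub service delivers messages asynchronously between separate services.

```python
from collections import defaultdict


class Broker:
    """Toy message broker: maps each topic to a list of subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register handler to be called for every message on topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver message to every subscriber of topic; return how many got it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
audit_log = []
alerts = []

# Two independent subscribers on the same (hypothetical) topic.
broker.subscribe("model.predictions", audit_log.append)
broker.subscribe("model.predictions", lambda msg: alerts.append(msg["score"]))

delivered = broker.publish("model.predictions", {"id": 1, "score": 0.93})
ignored = broker.publish("unused.topic", {"id": 2})

print(delivered, ignored, audit_log, alerts)
```

The publisher never references the subscribers directly; it only knows the topic name, which is what keeps the services decoupled.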


---

https://mlinproduction.com/lessons-learned-from-15-years-of-monitoring-machine-learning-in-production/

    Lesson #1 – Monitoring performance metrics isn’t enough. You need to monitor input distributions.

After a week of tedious investigations, they discovered a huge drift in the data of the incoming requests,
due to campaigns encouraging new target audiences for loans. The model issued predictions for this new audience, despite the fact that
1) no labels existed for the newly targeted segments
and 2) its data distribution differed so significantly from the training dataset that the model’s predictions were irrelevant.

You need to measure and monitor all relevant aspects of your model to detect issues at the right time, including:

    - the model input distribution and its drift level
    - metrics describing model inference, as these could be proxies for detecting strange model behavior, which can lead to unexpected results!
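
One possible way to put a number on input drift for a single numeric feature is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical distribution of the training data and that of the incoming requests. This is a sketch of one technique, not the method used in the article; the function names, the sample data, and the `0.2` threshold are all assumptions chosen for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs (0.0 to 1.0)."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The gap between two step functions is maximal at one of the sample points.
    for x in sorted(set(ref) | set(cur)):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def has_drifted(reference, current, threshold=0.2):
    """Flag drift when the statistic exceeds a (hypothetical) alert threshold."""
    return ks_statistic(reference, current) > threshold


# Made-up feature values: training data, similar traffic, and shifted traffic.
training_values = [i / 100 for i in range(100)]
similar_traffic = [i / 100 + 0.01 for i in range(100)]
shifted_traffic = [i / 100 + 0.5 for i in range(100)]

print(has_drifted(training_values, similar_traffic))
print(has_drifted(training_values, shifted_traffic))
```

A check like this needs no labels, so it can run on incoming requests long before ground truth is available for the new audience.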


    Lesson #2 – Automatic retraining is not (necessarily) the solution


----

from Medium -Viktor B

Reproducibility.
Key to providing a high-quality service is the ability to easily retrain a model with updated versions of the dataset.
This means dataset versioning and automation for model retraining are needed.
Reproducibility is also needed to exactly recreate decisions made by production models in the development environment
to understand why certain decisions were made
and analyze cases where the model does not behave as expected.
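
A rough sketch of what dataset versioning plus a reproducible training run could look like. Everything here is a toy assumption: the dataset is a small list of dictionaries, the version is a content hash, and `train` is a hypothetical stand-in whose only randomness comes from an explicit seed.

```python
import hashlib
import json
import random


def dataset_fingerprint(rows):
    """Content hash of the dataset: any change to the data changes the version."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def train(rows, seed):
    """Toy 'training': average a seeded random half of the data.

    Returns the learned parameter together with what is needed to recreate it.
    """
    rng = random.Random(seed)  # local generator, so global state is untouched
    sample = rng.sample(rows, k=len(rows) // 2)
    parameter = sum(row["x"] for row in sample) / len(sample)
    return {
        "dataset_version": dataset_fingerprint(rows),
        "seed": seed,
        "parameter": parameter,
    }


dataset_v1 = [{"x": float(i)} for i in range(10)]
dataset_v2 = dataset_v1 + [{"x": 100.0}]

run_a = train(dataset_v1, seed=42)
run_b = train(dataset_v1, seed=42)  # same data and seed: identical result
run_c = train(dataset_v2, seed=42)  # updated data: new dataset version

print(run_a == run_b)
print(run_a["dataset_version"] == run_c["dataset_version"])
```

Recording the dataset version and seed next to each model is what makes it possible to recreate a production decision later in a development environment.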

QA.
All releases to production have to be quality assessed from --both-- ML and engineering viewpoints.
This limits the ability to iterate quickly and evaluate experimental workflows in the production environment,
but is important for data security and integrity.

Release documentation.
Releases are documented in order to trace model versions,
serving application and underlying container versions,
and the datasets used for model training and evaluation.
This is a requirement for compliant release management
and also helps to analyze suspicious decisions made by models in production.

ML depends on a stable infrastructure and tooling for running training, labeling, and evaluation jobs.


-------

Regularity in checkpointing

You can save a considerable amount of time during your deep learning experiments
if you set up model checkpointing correctly.

Generally, model checkpointing refers to saving your network model during the training process.

The strategy can vary. For example, one very common model checkpointing strategy is
to track the validation loss alongside the training loss
during the model training process,
note the point where the validation loss stops decreasing,
and save the weights of the corresponding model.

As you keep prototyping through your deep learning experiments,
you can reuse these checkpointed models,
which saves you from repeating a training process
from scratch that might have taken hours or days.
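
The "save when validation loss improves" strategy can be sketched without any deep learning framework. The per-epoch weights and losses below are made-up numbers standing in for a real training loop, and `train_with_checkpoints` and the JSON checkpoint layout are hypothetical choices for this example; a real framework would supply its own checkpoint format.

```python
import json
import tempfile
from pathlib import Path


def train_with_checkpoints(epoch_results, checkpoint_path):
    """Overwrite the checkpoint whenever the validation loss reaches a new best."""
    best_val_loss = float("inf")
    for epoch, (weights, train_loss, val_loss) in enumerate(epoch_results, start=1):
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            checkpoint = {
                "epoch": epoch,
                "weights": weights,
                "train_loss": train_loss,
                "val_loss": val_loss,
            }
            checkpoint_path.write_text(json.dumps(checkpoint))
    return best_val_loss


# Simulated (weights, training loss, validation loss) per epoch.
# Training loss keeps falling, but validation loss stops decreasing after epoch 3.
simulated_epochs = [
    ([0.1], 0.9, 0.80),
    ([0.2], 0.6, 0.60),
    ([0.3], 0.4, 0.50),
    ([0.4], 0.3, 0.55),
    ([0.5], 0.2, 0.70),
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_checkpoint.json"
    best = train_with_checkpoints(simulated_epochs, path)
    restored = json.loads(path.read_text())

print(restored["epoch"], restored["weights"], best)
```

Later experiments can start from the restored weights instead of repeating the earlier epochs.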


-------


Continuous Integration (CI)

is a development practice where developers
integrate code into a shared repository frequently,
preferably several times a day.

Each integration can then be verified by
an automated build
and automated tests.

While automated testing is not strictly part of CI, it is typically implied.
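
As an illustration of the kind of automated test a CI server might run on each integration, here is a small sketch using the standard library's `unittest`. The `normalize` function and its tests are hypothetical examples, not code from this repository.

```python
import unittest


def normalize(values):
    """Scale values so that they sum to 1.0."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


class NormalizeTest(unittest.TestCase):
    def test_result_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1, 2, 3])), 1.0)

    def test_preserves_proportions(self):
        self.assertEqual(normalize([1, 1]), [0.5, 0.5])

    def test_rejects_zero_total(self):
        with self.assertRaises(ValueError):
            normalize([0, 0])


# Run the tests programmatically; a CI job would fail the build
# when result.wasSuccessful() is False.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```

In practice the same tests would usually be started from the command line, for example with `python -m unittest`, as one step of the automated build.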


---

wiki

Build

In software development, a build is
the process of
converting source code files into
standalone software artifact(s)
that can be run on a computer, or the result of doing so.