@@ -5,22 +5,21 @@ Secure persistence with skops

  .. warning::

-     This feature is very early in development, which means the API is
-     unstable and it is **not secure** at the moment. Therefore, use the same
-     caution as you would for ``pickle``: Don't load from sources that you
-     don't trust. In the future, more security will be added.
+     This feature is heavily under development, which means the API is unstable
+     and there might be security issues at the moment. Therefore, use caution
+     when loading files from sources you don't trust.

  Skops offers a way to save and load sklearn models without using :mod:`pickle`.
- The ``pickle`` module is not secure, but with skops, you can securely save and
- load sklearn models without using ``pickle``.
+ The ``pickle`` module is not secure, but with skops, you can [more] securely
+ save and load models without using ``pickle``.

  ``Pickle`` is the standard serialization format for sklearn and for Python in
- general. One of the main advantages of ``pickle`` is that it can be used for
- almost all Python code but this flexibility also makes it inherently insecure.
- This is because loading certain types of objects requires the ability to run
- arbitrary code, which can be misused for malicious purposes. For example, an
- attacker can use it to steal secrets from your machine or install a virus. As
- the `Python docs
+ general (``cloudpickle`` and ``joblib`` use the same format). One of the main
+ advantages of ``pickle`` is that it can be used for almost all Python objects
+ but this flexibility also makes it inherently insecure. This is because loading
+ certain types of objects requires the ability to run arbitrary code, which can
+ be misused for malicious purposes. For example, an attacker can use it to steal
+ secrets from your machine or install a virus. As the `Python docs
  <https://docs.python.org/3/library/pickle.html#module-pickle>`__ say:

  .. warning::
@@ -31,26 +30,43 @@ the `Python docs
      untrusted source, or that could have been tampered with.
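The risk described in this warning can be reproduced with a few lines of standard-library code. The following is a minimal sketch, not taken from the skops documentation: the ``Payload`` class and the harmless ``eval`` expression are illustrative assumptions standing in for whatever an attacker would actually run.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle stream calls a function on load."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state.
        # A real attacker would put a shell command or similar here.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading never recreates Payload; it runs the stored call and returns its result.
result = pickle.loads(data)
print(result)  # prints 42
```

Nothing in ``pickle.loads`` asks for confirmation before the stored call runs, which is why loading a pickle from an untrusted source is unsafe.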

  In contrast to ``pickle``, the :func:`skops.io.dump` and :func:`skops.io.load`
- functions cannot be used to save arbitrary Python code, but they bypass
- ``pickle`` and are thus more secure.
+ functions have a more limited scope, while preventing users from running
+ arbitrary code or loading unknown and malicious objects.
+
+ When loading a file, :func:`skops.io.load`/:func:`skops.io.loads` will traverse
+ the input, check for known and unknown types, and will only construct those
+ objects if they are trusted, either by default or by the user.
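The audit-then-construct idea can be pictured with a toy allow-list check. This is a conceptual sketch only and not the skops implementation: the nested-dict layout, the ``__type__`` key, ``DEFAULT_TRUSTED``, and the three helper functions are hypothetical names chosen for illustration.

```python
# Hypothetical defaults; skops ships its own, much larger list.
DEFAULT_TRUSTED = {"builtins.dict", "builtins.list", "numpy.ndarray"}


def collect_types(node, found=None):
    """Recursively gather every type name declared in a nested description."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        if "__type__" in node:
            found.add(node["__type__"])
        for value in node.values():
            collect_types(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, found)
    return found


def find_untrusted(node, trusted=()):
    """Return declared types that neither the defaults nor the caller trust."""
    return sorted(collect_types(node) - DEFAULT_TRUSTED - set(trusted))


def audit(node, trusted=()):
    """Refuse to proceed while any declared type is still untrusted."""
    unknown = find_untrusted(node, trusted)
    if unknown:
        raise TypeError(f"Untrusted types found: {unknown}")
    return node  # a real loader would start constructing objects here


description = {
    "__type__": "builtins.dict",
    "content": [
        {"__type__": "numpy.ndarray"},
        {"__type__": "xgboost.core.Booster"},
    ],
}

unknown = find_untrusted(description)
print(unknown)  # prints ['xgboost.core.Booster']
audit(description, trusted=unknown)  # passes once the caller vouches for the type
```

The point of the design is that type names are inspected as plain strings before anything is imported or instantiated, so an unknown type is reported instead of executed.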
+
+ .. note::
+     You can try out converting your existing pickle files to the skops format
+     using this Space on Hugging Face Hub:
+     `pickle-to-skops <https://huggingface.co/spaces/scikit-learn/pickle-to-skops>`__.

  Usage
  -----

  The code snippet below illustrates how to use :func:`skops.io.dump` and
- :func:`skops.io.load`:
+ :func:`skops.io.load`. Note that one needs `XGBoost
+ <https://xgboost.readthedocs.io/en/stable/>`__ installed to run this:

  .. code:: python

-     from sklearn.linear_model import LogisticRegression
+     from xgboost.sklearn import XGBClassifier
+     from sklearn.model_selection import GridSearchCV, train_test_split
+     from sklearn.datasets import load_iris
      from skops.io import dump, load

-     clf = LogisticRegression(random_state=0, solver="liblinear")
-     clf.fit(X_train, y_train)
-     dump(clf, "my-logistic-regression.skops")
+     X, y = load_iris(return_X_y=True)
+     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
+     param_grid = {"tree_method": ["exact", "approx", "hist"]}
+     clf = GridSearchCV(XGBClassifier(), param_grid=param_grid).fit(X_train, y_train)
+     print(clf.score(X_test, y_test))
+     0.9666666666666667
+     dump(clf, "my-model.skops")
      # ...
-     loaded = load("my-logistic-regression.skops", trusted=True)
-     loaded.predict(X_test)
+     loaded = load("my-model.skops", trusted=True)
+     print(loaded.score(X_test, y_test))
+     0.9666666666666667

      # in memory
      from skops.io import dumps, loads
@@ -64,28 +80,35 @@ using :func:`skops.io.get_untrusted_types`:
  .. code:: python

      from skops.io import get_untrusted_types
-     unknown_types = get_untrusted_types(file="my-logistic-regression.skops")
+     unknown_types = get_untrusted_types(file="my-model.skops")
      print(unknown_types)
+     ['numpy.float64', 'numpy.int64', 'sklearn.metrics._scorer._passthrough_scorer',
+     'xgboost.core.Booster', 'xgboost.sklearn.XGBClassifier']
+
+ Note that everything in the above list is safe to load. We already have many
+ types included as trusted by default, and some of the above values might be
+ added to that list in the future.

  Once you check the list and you validate that everything in the list is safe,
  you can load the file with ``trusted=unknown_types``:

  .. code:: python

-     loaded = load("my-logistic-regression.skops", trusted=unknown_types)
+     loaded = load("my-model.skops", trusted=unknown_types)

  At the moment, we support the vast majority of sklearn estimators. This
  includes complex use cases such as :class:`sklearn.pipeline.Pipeline`,
- :class:`sklearn.model_selection.GridSearchCV`, classes using Cython code, such
- as :class:`sklearn.tree.DecisionTreeClassifier`, and more. If you discover an
- sklearn estimator that does not work, please open an issue on the skops `GitHub
- page <https://github.com/skops-dev/skops/issues>`_ and let us know.
-
- In contrast to ``pickle``, skops cannot persist arbitrary Python code. This
- means if you have custom functions (say, a custom function to be used with
+ :class:`sklearn.model_selection.GridSearchCV`, classes using objects defined in
+ Cython such as :class:`sklearn.tree.DecisionTreeClassifier`, and more. If you
+ discover an sklearn estimator that does not work, please open an issue on the
+ skops `GitHub page <https://github.com/skops-dev/skops/issues>`__ and let us
+ know.
+
+ At the moment, ``skops`` cannot persist arbitrary Python code. This means if
+ you have custom functions (say, a custom function to be used with
  :class:`sklearn.preprocessing.FunctionTransformer`), it will not work. However,
- most ``numpy`` and ``scipy`` functions should work. Therefore, you can actually
- save built-in functions like ``numpy.sqrt``.
+ most ``numpy`` and ``scipy`` functions should work. Therefore, you can save
+ objects having references to functions such as ``numpy.sqrt``.

  Supported libraries
  -------------------
@@ -96,32 +119,29 @@ most types from **numpy** and **scipy** should be supported, such as (sparse)
  arrays, dtypes, random generators, and ufuncs.

  Apart from this core, we plan to support machine learning libraries commonly
- used by the community. So far, those are:
+ used by the community. So far, we have tested the following libraries:

  - `LightGBM <https://lightgbm.readthedocs.io/>`_ (scikit-learn API)
  - `XGBoost <https://xgboost.readthedocs.io/en/stable/>`_ (scikit-learn API)
  - `CatBoost <https://catboost.ai/en/docs/>`_

  If you run into a problem using any of the mentioned libraries, this could mean
  there is a bug in skops. Please open an issue on `our issue tracker
- <https://github.com/skops-dev/skops/issues>`_ (but please check first if a
+ <https://github.com/skops-dev/skops/issues>`__ (but please check first if a
  corresponding issue already exists).

  Roadmap
  -------
-
- Currently, it is still possible to run insecure code when using skops
- persistence. For example, it's possible to load a save file that evaluates
- arbitrary code using :func:`eval`. However, we have concrete plans on how to
- mitigate this, so please stay updated.
-
- On top of trying to support persisting all relevant sklearn objects, we plan on
- making persistence extensible for other libraries. As a user, this means that
- if you trust a certain library, you will be able to tell skops to load code
- from that library. As a library author, there will be a clear path of what
- needs to be done to add secure persistence to your library, such that skops can
- save and load code from your library.
-
- To follow what features are currently planned, filter for the `"persistence"
- label <https://github.com/skops-dev/skops/labels/persistence>`_ in our GitHub
- issues.
+ There needs to be more testing to harden the loader and make sure we don't run
+ arbitrary code when it's not intended. However, the safety mechanisms already
+ in place should prevent most cases of abuse.
+
+ At the moment, persisting and loading arbitrary C extension types is not
+ possible, unless a Python object wraps around them and handles persistence and
+ loading via ``__getstate__`` and ``__setstate__``. We plan to develop an API
+ which would help third-party libraries to make their C extension types
+ ``skops`` compatible.
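The wrapping pattern mentioned in this paragraph might look like the following sketch. It uses the standard library's ``struct.Struct`` (implemented in C in CPython) as a stand-in for a third-party C extension type. The ``PackedRecord`` class is hypothetical, and how skops itself consumes these hooks is not shown here.

```python
import struct


class PackedRecord:
    """Hypothetical Python wrapper around a C-level ``struct.Struct``."""

    def __init__(self, fmt):
        self.fmt = fmt
        self._packer = struct.Struct(fmt)  # C object, kept out of the saved state

    def pack(self, *values):
        return self._packer.pack(*values)

    def __getstate__(self):
        # Expose only plain Python data that a persistence layer can store.
        return {"fmt": self.fmt}

    def __setstate__(self, state):
        # Rebuild the C object from the plain state instead of storing it.
        self.fmt = state["fmt"]
        self._packer = struct.Struct(self.fmt)


original = PackedRecord("<ih")
state = original.__getstate__()

# Restore without calling __init__, as persistence layers typically do.
restored = PackedRecord.__new__(PackedRecord)
restored.__setstate__(state)
```

Because the saved state holds only a format string, a loader never has to reconstruct the C object directly; the wrapper recreates it from trusted Python-level data.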
+
+ You can check on our `issue tracker
+ <https://github.com/skops-dev/skops/labels/persistence>`__ which features are
+ planned for the near future.