|
1 |
| -# Pyconstruct |
2 |
| -A Python library for declarative, constrained, structured-output prediction. |
| 1 | +Pyconstruct |
| 2 | +=========== |
3 | 3 |
|
4 |
| -## Coming soon |
| 4 | +<div align="center"> |
| 5 | + <img height="300px" src="docs/_static/images/pyconstruct.png"><br><br> |
| 6 | +</div> |
| 7 | + |
| 8 | +**Pyconstruct** is a Python library for declarative, constrained, |
| 9 | +structured-output prediction. When using Pyconstruct, the problem specification |
| 10 | +can be encoded in MiniZinc, a high-level constraint programming language. This |
| 11 | +means that domain knowledge can be declaratively included in the inference |
| 12 | +procedure as constraints over the optimization variables. |
| 13 | + |
| 14 | +Sounds complicated? A simple example will clear up the doubts! |
| 15 | + |
| 16 | + |
| 17 | +Getting started |
| 18 | +--------------- |
| 19 | + |
| 20 | +In the following example we will implement a simple OCR (Optical Character |
| 21 | +Recognition) model in a few lines of MiniZinc. |
| 22 | + |
| 23 | +First of all, let's fetch the data. Pyconstruct has a utility for getting some |
| 24 | +standard datasets: |
| 25 | + |
| 26 | +```python |
| 27 | + from pyconstruct import datasets |
| 28 | + ocr = datasets.load('ocr') |
| 29 | +``` |
| 30 | + |
| 31 | +The first time the dataset is loaded it will actually be fetched from the web |
| 32 | +and stored locally. You can now see the description of the dataset: |
| 33 | + |
| 34 | +```python |
| 35 | + print(ocr.descr) |
| 36 | +``` |
| 37 | + |
| 38 | +By default, structured objects are represented as Python dictionaries in |
| 39 | +Pyconstruct. Each object has several "attributes", each identified by a string. |
| 40 | +Each attribute value may be any basic Python data type: strings, integers, |
| 41 | +floats, lists or other dictionaries. In OCR, for instance, inputs `X` are |
| 42 | +represented as dictionaries containing two attributes: an integer containing the |
| 43 | +`length` of the word; a list of `16x8` matrices (`numpy.ndarray`) containing the |
| 44 | +bitmap images of each character in the word. The targets (labels) are also |
| 45 | +structured objects containing a single attribute `sequence`, a list of integers |
| 46 | +representing the letters associated with each image in the word. For instance: |
| 47 | + |
| 48 | +```python |
| 49 | + print(ocr.data[0]) |
| 50 | + print(ocr.targets[0]) |
| 51 | +``` |
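As a rough sketch of the layout described above, the snippet below builds one input/target pair by hand. The attribute names `length`, `images` and `sequence` follow the MiniZinc domain used in this README. The sample values, the helper `is_consistent`, the nested lists standing in for `numpy.ndarray`, and the 1-based letter encoding (a=1, ..., z=26) are assumptions made purely for illustration.

```python
HEIGHT, WIDTH = 16, 8  # bitmap size stated in the description above


def blank_image():
    """Return an all-zero HEIGHT x WIDTH bitmap as nested lists."""
    return [[0] * WIDTH for _ in range(HEIGHT)]


# Hypothetical hand-built sample; real samples come from datasets.load('ocr').
x = {"length": 3, "images": [blank_image() for _ in range(3)]}
y = {"sequence": [3, 1, 20]}  # assumed encoding: a=1, ..., z=26, so "cat"


def is_consistent(x, y):
    """Check that an input and a target describe a word of the same length."""
    same_length = x["length"] == len(x["images"]) == len(y["sequence"])
    right_shape = all(
        len(img) == HEIGHT and all(len(row) == WIDTH for row in img)
        for img in x["images"]
    )
    return same_length and right_shape


print(is_consistent(x, y))
```

A check of this kind can be handy before training, since a mismatch between `length` and the number of images would make the MiniZinc arrays ill-formed.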
| 52 | + |
| 53 | +After getting the data, we can start coding our problem. First of all, in |
| 54 | +Pyconstruct there are three main kinds of objects to interact with: Domains, |
| 55 | +Models and Learners. At a high level: a Domain defines the attributes and the |
| 56 | +constraints of the structured objects; a Model is an object containing some |
| 57 | +parameters that can be used to perform inference over a Domain; a Learner is an |
| 58 | +algorithm that can learn a Model from data. A Domain is also responsible for |
| 59 | +solving inference problems with respect to some Model, so the two classes are |
| 60 | +interdependent, but in general a Domain can be made to work with different |
| 61 | +Models. |
| 62 | + |
| 63 | +Several Models and Learners are already defined by Pyconstruct. All that is |
| 64 | +required to start training a model, apart from the data, is a Domain encoded in |
| 65 | +MiniZinc which defines how the attributes of the objects interact, as well as the |
| 66 | +constraints and the features of the objects. To do so, we need to create an |
| 67 | +`ocr.pmzn` file: |
| 68 | + |
| 69 | +```HTML+Django |
| 70 | + {% from 'pyconstruct.pmzn' import n_features, features, domain, solve %} |
| 71 | + |
| 72 | + {{ n_features('16 * 8 * 26') }} |
| 73 | + |
| 74 | + {% call domain(problem) %} |
| 75 | + |
| 76 | + int: length; |
| 77 | + array[1 .. length, 1 .. 16, 1 .. 8] of {0, 1}: images; |
| 78 | + |
| 79 | + array[1 .. length] of var 1 .. 26: sequence; |
| 80 | + |
| 81 | + |
| 82 | + {% call features(feature_type='int') %} |
| 83 | + [ |
| 84 | + sum(e in 1 .. length)(images[e, i, j] * (sequence[e] == s)) |
| 85 | + | i in 1 .. 16, j in 1 .. 8, s in 1 .. 26 |
| 86 | + ] |
| 87 | + {% endcall %} |
| 88 | + |
| 89 | + {% endcall %} |
| 90 | + |
| 91 | + {{ solve(problem, model, discretize=True) }} |
| 92 | +``` |
| 93 | + |
| 94 | +That's it! Now we can instantiate a `Domain` with our new `ocr.pmzn` file: |
| 95 | + |
| 96 | +```python |
| 97 | + from pyconstruct import Domain |
| 98 | + ocr_dom = Domain('ocr.pmzn') |
| 99 | +``` |
| 100 | + |
| 101 | +If you know MiniZinc, the above code will probably look a bit odd. That is |
| 102 | +because Pyconstruct by default uses a superset of MiniZinc defined by the PyMzn |
| 103 | +library. Essentially, that is MiniZinc with some templating provided by the |
| 104 | +Jinja2 library. Check out PyMzn for an explanation on how to use it fully. Here |
| 105 | +we'll explain the basics. |
| 106 | + |
| 107 | +The first line |
| 108 | +`{% from 'pyconstruct.pmzn' import n_features, features, domain, solve %}` |
| 109 | +imports a few useful macros from the `pyconstruct.pmzn` file. |
| 110 | + |
| 111 | +The second line `{{ n_features('16 * 8 * 26') }}` calls the `n_features` macro, |
| 112 | +which compiles into: |
| 113 | + |
| 114 | +```HTML+Django |
| 115 | + int: N_FEATURES = 16 * 8 * 26; |
| 116 | + set of int: FEATURES = 1 .. N_FEATURES; |
| 117 | +``` |
| 118 | + |
| 119 | +The MiniZinc code enclosed in the tags |
| 120 | +`{% call domain(problem) %} ... {% endcall %}` is processed on the basis of the |
| 121 | +value of `problem` the domain is called with. The variable `problem` is usually |
| 122 | +passed to the domain by an internal call of Pyconstruct through PyMzn. In this |
| 123 | +block goes the domain definition, including the variables and parameters of the |
| 124 | +objects, the constraints and the features. Notice that we have two MiniZinc |
| 125 | +parameters `length` and `images`, which match the attributes of the input |
| 126 | +objects of the OCR dataset, and one optimization variable `sequence` which |
| 127 | +matches the attribute of the output objects of the OCR dataset. This is valid |
| 128 | +for any problem: the examples are the inputs that are provided as dzn data, |
| 129 | +whereas the targets are the outputs of the model, which translate into |
| 130 | +optimization variables when solving inference. |
| 131 | + |
| 132 | +Inside the domain call we also call the `features` macro, which compiles into: |
| 133 | + |
| 134 | +```HTML+Django |
| 135 | + array[FEATURES] of var int: phi = [ |
| 136 | + sum(e in 1 .. length)(images[e, i, j] * (sequence[e] == s)) |
| 137 | + | i in 1 .. 16, j in 1 .. 8, s in 1 .. 26 |
| 138 | + ]; |
| 139 | +``` |
| 140 | + |
| 141 | +These are typical features used in OCR: for each symbol `s` and each pixel `(i, |
| 142 | +j)`, the feature counts the number of times in the sequence the `(i, j)` pixel |
| 143 | +is active for characters labeled with symbol `s`. |
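To make the feature definition concrete, here is a minimal plain-Python sketch that mirrors the MiniZinc comprehension above. It is not Pyconstruct code: the function name `phi`, its keyword parameters, and the tiny 2x2 bitmaps with 3 symbols are assumptions chosen to keep the example readable. Symbols are 1-based, as in the MiniZinc model, and the last generator (`s`) varies fastest.

```python
def phi(images, sequence, height=16, width=8, n_symbols=26):
    """Count, for each (pixel, symbol) pair, how many images labeled with
    that symbol have the pixel switched on."""
    return [
        sum(img[i][j] * (label == s) for img, label in zip(images, sequence))
        for i in range(height)
        for j in range(width)
        for s in range(1, n_symbols + 1)
    ]


# Toy word of two characters, both labeled with symbol 2 (hypothetical data).
images = [
    [[1, 0], [0, 1]],
    [[1, 1], [0, 0]],
]
sequence = [2, 2]

features = phi(images, sequence, height=2, width=2, n_symbols=3)
print(features)
```

The vector has `height * width * n_symbols` entries, which is why the real domain declares `16 * 8 * 26` features.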
| 144 | + |
| 145 | +The last line calls the `solve` macro, which compiles to a different solve |
| 146 | +statement depending on the `problem` and `model`. Possible values for `problem` |
| 147 | +are, for instance, `map` to find the object with highest score (dot product |
| 148 | +between weights and features) or `phi` to compute the feature vector given an |
| 149 | +input and an output object. The `model` is a dictionary containing the model's |
| 150 | +parameters, such as the weights `w` for a `LinearModel`. This object is also |
| 151 | +usually passed to the domain by Pyconstruct. |
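The `map` problem can be illustrated with a small brute-force sketch: enumerate every candidate sequence and keep the one whose dot product between weights and features is highest. This is only an illustration under stated assumptions. In Pyconstruct the search is delegated to a MiniZinc solver, and the names `phi`, `score`, `map_brute_force` as well as the toy sizes and weights below are hypothetical.

```python
from itertools import product

HEIGHT, WIDTH, N_SYMBOLS = 2, 2, 3  # toy sizes; the README domain uses 16, 8, 26


def phi(images, sequence):
    """Feature vector: per (pixel, symbol) counts, symbol index varying fastest."""
    return [
        sum(img[i][j] * (label == s) for img, label in zip(images, sequence))
        for i in range(HEIGHT)
        for j in range(WIDTH)
        for s in range(1, N_SYMBOLS + 1)
    ]


def score(w, images, sequence):
    """Dot product between the weights and the features of (images, sequence)."""
    return sum(wk * fk for wk, fk in zip(w, phi(images, sequence)))


def map_brute_force(w, images):
    """Return the highest-scoring sequence by exhaustive enumeration."""
    candidates = product(range(1, N_SYMBOLS + 1), repeat=len(images))
    return list(max(candidates, key=lambda seq: score(w, images, seq)))


# Hypothetical weights: pixel (0, 0) votes for symbol 1, pixel (1, 1) for symbol 3.
w = [0.0] * (HEIGHT * WIDTH * N_SYMBOLS)
w[(0 * WIDTH + 0) * N_SYMBOLS + 0] = 1.0
w[(1 * WIDTH + 1) * N_SYMBOLS + 2] = 1.0

images = [
    [[1, 0], [0, 0]],
    [[0, 0], [0, 1]],
]
best = map_brute_force(w, images)
print(best, score(w, images, best))
```

Exhaustive enumeration grows as `N_SYMBOLS ** length`, so it is only workable for toy inputs. Handing the problem to a constraint solver is what makes constrained domains tractable in practice.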
| 152 | + |
| 153 | +The above model is actually a partial example of the complete `ocr` domain |
| 154 | +available in Pyconstruct out-of-the-box. You can load the domain simply with: |
| 155 | + |
| 156 | +```python |
| 157 | + ocr_dom = Domain('ocr') |
| 158 | +``` |
| 159 | + |
| 160 | +After defining the domain, using the predefined one or the `ocr.pmzn` file, we |
| 161 | +can start learning by instantiating a learner, say a `StructuredPerceptron`, and |
| 162 | +fitting the data: |
| 163 | + |
| 164 | +```python |
| 165 | + from pyconstruct import StructuredPerceptron |
| 166 | + sp = StructuredPerceptron(domain=ocr_dom) |
| 167 | + sp.fit(ocr.data, ocr.targets) |
| 168 | +``` |
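For readers unfamiliar with the algorithm, the following is a minimal sketch of the textbook structured perceptron update: predict with the current weights and, on a mistake, add the features of the true output and subtract those of the predicted one. It is not Pyconstruct's implementation. The functions `phi`, `predict` and `fit`, the brute-force inference, and the toy data are all assumptions made for illustration.

```python
from itertools import product

HEIGHT, WIDTH, N_SYMBOLS = 2, 2, 2  # toy sizes; the README domain uses 16, 8, 26


def phi(x, y):
    """Per (pixel, symbol) counts for input x and output y."""
    return [
        sum(img[i][j] * (label == s)
            for img, label in zip(x["images"], y["sequence"]))
        for i in range(HEIGHT)
        for j in range(WIDTH)
        for s in range(1, N_SYMBOLS + 1)
    ]


def predict(w, x):
    """Brute-force MAP inference, standing in for a MiniZinc solver call."""
    def score(seq):
        return sum(wk * fk for wk, fk in zip(w, phi(x, {"sequence": list(seq)})))

    candidates = product(range(1, N_SYMBOLS + 1), repeat=x["length"])
    return {"sequence": list(max(candidates, key=score))}


def fit(data, targets, n_epochs=5):
    """Structured perceptron: w += phi(x, y) - phi(x, y_hat) on each mistake."""
    w = [0.0] * (HEIGHT * WIDTH * N_SYMBOLS)
    for _ in range(n_epochs):
        for x, y in zip(data, targets):
            y_hat = predict(w, x)
            if y_hat != y:
                w = [wk + ft - fp
                     for wk, ft, fp in zip(w, phi(x, y), phi(x, y_hat))]
    return w


# Hypothetical one-letter "words" with distinct active pixels.
data = [
    {"length": 1, "images": [[[1, 0], [0, 0]]]},
    {"length": 1, "images": [[[0, 0], [0, 1]]]},
]
targets = [{"sequence": [1]}, {"sequence": [2]}]

w = fit(data, targets)
print([predict(w, x) for x in data])
```

Each training step requires one inference call per example, which is one reason fitting on the full OCR dataset takes a while.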
| 169 | + |
| 170 | +This will take a while... If you need a quick benchmark, Pyconstruct contains |
| 171 | +pretrained models for many domains and learners (link). |
| 172 | + |
| 173 | + |
| 174 | +Install |
| 175 | +------- |
| 176 | +Pyconstruct can be installed through `pip`: |
| 177 | + |
| 178 | +```bash |
| 179 | +pip install pyconstruct |
| 180 | +``` |
| 181 | + |
| 182 | +Or by downloading the code from GitHub and running the following from the |
| 183 | +downloaded directory: |
| 184 | + |
| 185 | +```bash |
| 186 | +python setup.py install |
| 187 | +``` |
| 188 | + |
| 189 | +After installing Pyconstruct you will need to install **MiniZinc** as well. |
| 190 | +Download the latest release of MiniZincIDE and follow the instructions. |
| 191 | + |
| 192 | +Authors |
| 193 | +------- |
| 194 | +This project is developed at the SML research group at the University of Trento |
| 195 | +(Italy). Main developers and maintainers: |
| 196 | + |
| 197 | +* Paolo Dragone |
| 198 | +* Stefano Teso (now at KU Leuven) |
| 199 | +* Andrea Passerini |
5 | 200 |
|