Skip to content

System configuration docs update #71

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 36 additions & 13 deletions docs/docs/dev/configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,38 +8,61 @@ The default configuration file is located at [label_sleuth/config.json](https://

A custom configuration file can be applied by passing the `--config_path` parameter to the "start_label_sleuth" command. In that case the following command can be used to invoke Label Sleuth:

```
```text
python -m label_sleuth.start_label_sleuth --config_path <path_to_my_configuration_json>
```

Alternatively, it is possible to override specific configuration parameters at startup by appending them to the "start_label_sleuth" command. For example, to set up the system to work with text data in Arabic, one can set the system language by using the following command:

```
```text
python -m label_sleuth.start_label_sleuth --language Arabic
```


## Parameters

The following parameters can be set in the configuration file:
The parameters that can be set in the configuration file are presented in three different tables: general configuration, binary workspace configuration and multiclass workspace configuration.

### General configuration

The following table lists the general parameters:

| Parameter | Description |
|-----------|-------------|
| `first_model_positive_threshold` | Number of elements that must be assigned a positive label for the category in order to trigger the training of a classification model. <br /> <br /> _See also:_ The [training invocation](model_training.md#training-invocation) documentation. |
| `first_model_negative_threshold` | Number of elements that must be assigned a negative label for the category in order to trigger the training of a classification model. <br /> <br /> _See also:_ The [training invocation](model_training.md#training-invocation) documentation. |
| `changed_element_threshold` | Number of changes in user labels for the category -- relative to the last trained model -- that are required to trigger the training of a new model. A change can be a assigning a label (positive or negative) to an element, or changing an existing label. Note that both, `first_model_positive_threshold` and `first_model_negative_threshold`, must also be met for the training to be triggered. <br /> <br /> _See also:_ The [training invocation](model_training.md#training-invocation) documentation. |
| `training_set_selection_strategy` | Strategy to be used from [TrainingSetSelectionStrategy](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/training_set_selector/train_set_selector_api.py). A TrainingSetSelectionStrategy determines which examples will be sent to the classification models at training time - these will not necessarily be identical to the set of elements labeled by the user. For currently supported implementations see the [get_training_set_selector()](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/training_set_selector/training_set_selector_factory.py) function. <br /> <br /> _See also:_ The [training set selection](model_training.md#training-set-selection) documentation. |
| `model_policy` | Policy to be used from [ModelPolicies](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/core/model_policies.py). A [ModelPolicy](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/policy/model_policy.py) determines which type of classification model(s) will be used, and _when_ (e.g. always / only after a specific number of iterations / etc.). <br /> <br /> _See also:_ The [model selection](model_training.md#model-selection) documentation. |
| `active_learning_strategy` | Strategy to be used from [ActiveLearningCatalog](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/active_learning/core/catalog.py). An [ActiveLearner](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/active_learning/core/active_learning_api.py) module implements the strategy for recommending the next elements to be labeled by the user, aiming to increase the efficiency of the annotation process. <br /> <br /> _See also:_ The [active learning](active_learning.md) documentation. |
| `precision_evaluation_size` | Sample size to be used for estimating the precision of the current model when the precision evaluation function is invoked. <br /><br /> _Defaults to_ `20`. |
| `apply_labels_to_duplicate_texts` | Specifies how to treat elements with identical texts. If `true`, assigning a label to an element will also assign the same label to other elements which share the exact same text; if `false`, the label will only be assigned to the specific element labeled by the user. <br /><br /> _Defaults to_ `true`. |
| `language` | Specifies the chosen system-wide language. This determines some language-specific resources that will be used by models and helper functions (e.g., stop words). The list of supported languages can be found [here](languages.md). We welcome contributions of additional languages. <br /><br /> _Defaults to_ `ENGLISH`. |
| `login_required` | Specifies whether or not using the system will require user authentication. If `true`, the configuration file must also include a `users` parameter. <br /><br /> _Defaults to_ `false`. |
| `users` | Only relevant if `login_required` is `true`. Specifies the pre-defined login information in the following format: <pre>"users":[<br>&nbsp;{<br>&nbsp;&nbsp;&nbsp;"username": "<predefined_username1>",<br>&nbsp;&nbsp;&nbsp;"token":"<randomly_generated_token1>",<br>&nbsp;&nbsp;&nbsp;"password":"<predefined_user1_password>"<br>&nbsp;}<br>] </pre> * The list of usernames is static and currently all users have access to all the workspaces in the system. |
| `users` | Only relevant if `login_required` is `true`. Specifies the pre-defined login information in the following format: <pre>"users":[<br>&nbsp;{<br>&nbsp;&nbsp;&nbsp;"username": "<predefined_username1>",<br>&nbsp;&nbsp;&nbsp;"token":"<randomly_generated_token1>",<br>&nbsp;&nbsp;&nbsp;"password":"<predefined_user1_password>"<br>&nbsp;}<br>] </pre> * The list of usernames is static and currently all users have access to all the workspaces in the system. <br /><br /> _Defaults to an empty list_.|
| `main_panel_elements_per_page` | Number of elements per page in the main panel, i.e., document view. <br /><br /> _Defaults to_ `500`. |
| `sidebar_panel_elements_per_page` | Number of elements per page in the sidebar panels that use pagination. <br /><br /> _Defaults to_ `50`. |
| `snippet_max_token_length` | Max number of tokens after which a text snippet shown on the right panels is cut off.<br /><br /> _Defaults to_ `100`. |
| `right_to_left` | If `true` the text on the UI starts from the right of the page and continues to the left. <br /><br /> _Defaults to_ `false`. |
| `max_dataset_length` | The number of rows limit of the csv files. <br /><br /> _Defaults to_ `10000000`. |
| `max_document_name_length` | The max number of chars of document names. <br /><br /> _Defaults to_ `60`. |
| `cpu_workers` | The max number of CPUs to use for running jobs. <br /><br /> _Defaults to_ `100`. |
| `cpu_workers` | The max number of CPUs to use for running jobs. <br /><br /> _Defaults to_ `100`. |

### Binary workspaces configuration

The following table lists the parameters that are used to configure binary workspaces. In the configuration json file, these parameters have to be under the `binary_flow` object. If set as a command parameter, they have to be prefixed with the `binary_flow.` string (e.g., `--binary_flow.model_policy`).

| Parameter | Description |
|-----------|-------------|
| `first_model_positive_threshold` | Number of elements that must be assigned a positive label for the category in order to trigger the training of a classification model. <br /> <br /> _See also:_ The [training invocation](model_training.md#training-invocation) documentation. |
| `first_model_negative_threshold` | Number of elements that must be assigned a negative label for the category in order to trigger the training of a classification model. <br /> <br /> _See also:_ The [training invocation](model_training.md#training-invocation) documentation. |
| `changed_element_threshold` | Number of changes in user labels for the category -- relative to the last trained model -- that are required to trigger the training of a new model. A change can be a assigning a label (positive or negative) to an element, or changing an existing label. Note that both, `first_model_positive_threshold` and `first_model_negative_threshold`, must also be met for the training to be triggered. <br /> <br /> _See also:_ The [training invocation](model_training.md#training-invocation) documentation. |
| `training_set_selection_strategy` | Strategy to be used from [TrainingSetSelectionStrategy](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/training_set_selector/train_set_selector_api.py). A TrainingSetSelectionStrategy determines which examples will be sent to the classification models at training time - these will not necessarily be identical to the set of elements labeled by the user. For currently supported implementations see the [get_training_set_selector()](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/training_set_selector/training_set_selector_factory.py) function. <br /> <br /> _See also:_ The [training set selection](model_training.md#training-set-selection) documentation. |
| `model_policy` | Policy to be used from [ModelPolicies](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/core/model_policies.py). A [ModelPolicy](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/policy/model_policy.py) determines which type of classification model(s) will be used, and _when_ (e.g. always / only after a specific number of iterations / etc.). <br /> <br /> _See also:_ The [model selection](model_training.md#model-selection) documentation. |
| `active_learning_strategy` | Strategy to be used from [ActiveLearningCatalog](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/active_learning/core/catalog.py). An [ActiveLearner](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/active_learning/core/active_learning_api.py) module implements the strategy for recommending the next elements to be labeled by the user, aiming to increase the efficiency of the annotation process. <br /> <br /> _See also:_ The [active learning](active_learning.md) documentation. |
| `precision_evaluation_size` | Sample size to be used for estimating the precision of the current model when the precision evaluation function is invoked. <br /><br /> _Defaults to_ `20`. |

### Multiclass workspaces configuration

The following table lists the parameters that are used to configure binary workspaces. In the configuration json file, these parameters have to be under the `multiclass_flow` object. If set as a command parameter, they have to be prefixed with the `multiclass_flow.` string (e.g., `--multiclass_flow.model_policy`).

| Parameter | Description |
|-----------|-------------|
| `per_class_labeling_threshold` | Number of elements that must be assigned label in order to trigger the training of a classification model. <br /> <br /> _See also:_ The [training invocation](model_training.md#training-invocation) documentation. |
| `changed_element_threshold` | Number of changes in user labels for the category -- relative to the last trained model -- that are required to trigger the training of a new model. A change can be a assigning a label (positive or negative) to an element, or changing an existing label. Note that both, `first_model_positive_threshold` and `first_model_negative_threshold`, must also be met for the training to be triggered. <br /> <br /> _See also:_ The [training invocation](model_training.md#training-invocation) documentation. |
| `training_set_selection_strategy` | Strategy to be used from [TrainingSetSelectionStrategy](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/training_set_selector/train_set_selector_api.py). A TrainingSetSelectionStrategy determines which examples will be sent to the classification models at training time - these will not necessarily be identical to the set of elements labeled by the user. For currently supported implementations see the [get_training_set_selector()](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/training_set_selector/training_set_selector_factory.py) function. <br /> <br /> _See also:_ The [training set selection](model_training.md#training-set-selection) documentation. |
| `model_policy` | Policy to be used from [ModelPolicies](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/core/model_policies.py). A [ModelPolicy](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/models/policy/model_policy.py) determines which type of classification model(s) will be used, and _when_ (e.g. always / only after a specific number of iterations / etc.). <br /> <br /> _See also:_ The [model selection](model_training.md#model-selection) documentation. |
| `active_learning_strategy` | Strategy to be used from [ActiveLearningCatalog](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/active_learning/core/catalog.py). An [ActiveLearner](https://github.com/label-sleuth/label-sleuth/blob/main/label_sleuth/active_learning/core/active_learning_api.py) module implements the strategy for recommending the next elements to be labeled by the user, aiming to increase the efficiency of the annotation process. <br /> <br /> _See also:_ The [active learning](active_learning.md) documentation. |
| `accuracy_evaluation_size` | Sample size to be used for estimating the accuracy of the current model when the precision evaluation function is invoked. <br /><br /> _Defaults to_ `30`. |