This is an implementation of a service that stores and retrieves data structured according to given schemata.
Data is stored in collections. Each collection has a name and an associated schema. All data records in a collection have to adhere to that schema.
The canonical format for schemas is LinkML.
The service supports schemas that are based on DataLad's Thing schema, i.e. on https://concepts.datalad.org/s/things/v1/.
It assumes that the classes of stored records are subclasses of Thing and inherit the properties `pid` and `schema_type` from the Thing base class.
The general workflow of the service is as follows. We distinguish between two areas of a collection, an incoming area and a curated area. Data written to a collection is stored in a collection-specific incoming area. A curation process, which is outside the scope of the service, moves data from the incoming area of a collection to the curated area of the collection.
To submit a record to a collection, a token is required. The token defines read and write permissions for the incoming areas of collections and read permissions for the curated areas of collections. A token can carry permissions for multiple collections. In addition, the token carries a submitter ID. It also defines a token-specific zone in the incoming area, so any read and write operations on an incoming area are actually restricted to the token-specific zone in that incoming area. Multiple tokens can share the same zone, which allows multiple submitters to work together when storing records in the service.
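The token/zone model above can be sketched as follows. This is an illustrative model only; the class and field names are assumptions, not the service's actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch of the token/zone model described above.
@dataclass
class TokenGrant:
    submitter_id: str
    incoming_zone: str   # the token-specific zone inside the incoming area
    incoming_read: bool
    incoming_write: bool
    curated_read: bool

def can_write_incoming(grant: TokenGrant, zone: str) -> bool:
    # Writes to the incoming area are restricted to the token's own zone.
    return grant.incoming_write and zone == grant.incoming_zone

bob = TokenGrant("Bob", "new_rooms", True, True, True)
assert can_write_incoming(bob, "new_rooms")
assert not can_write_incoming(bob, "other_zone")
```

Two tokens whose `incoming_zone` is identical would see each other's incoming records, which is how shared submission works.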
The service provides an HTTP-based API to store and retrieve data objects, and to verify token capabilities.
The service is available on PyPI and can be installed with pip.
Execute the command `pip install dump-things-service` to install the service.
After installation the service can be started via the command `dump-things-service`.
The basic service configuration is done via command line parameters and configuration files.
The following command line parameters are supported:

- `<storage root>`: (mandatory) the path of a directory that serves as anchor for all relative paths given in the configuration files. Unless `-c/--config` is provided, the service will search for the configuration file in `<storage root>/.dumpthings.yaml`.
- `--host <IP-address>`: the IP address on which the service should accept connections (default: `0.0.0.0`).
- `--port <port>`: the port on which the service should accept connections (default: `8000`).
- `-c/--config <config-file>`: provide a path to the configuration file. The configuration file in `<storage root>/.dumpthings.yaml` will be ignored, if it exists at all.
- `--origins <origin>`: add a CORS origin host (repeat to add multiple CORS origin URLs).
- `--root-path <path>`: set the ASGI `root_path` for applications sub-mounted below a given URL path.
- `--sort-by <field>`: by default, result records are sorted by the field `pid`. This parameter allows overriding the sort field. The parameter can be repeated to define secondary, tertiary, etc. sorting fields. If a given field is not present in a record, that record will be sorted behind all records that possess the field.
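The `--sort-by` semantics (multi-field sorting, with records lacking a field ordered after records that have it) can be sketched like this; the function name is illustrative, not the service's internal API:

```python
# Illustrative sketch of the --sort-by ordering described above.
def sort_records(records, fields=("pid",)):
    def key(record):
        # (1, "") sorts after any (0, value) pair, so records that lack a
        # sort field end up behind all records that possess it.
        return tuple(
            (0, record[f]) if f in record else (1, "")
            for f in fields
        )
    return sorted(records, key=key)

records = [{"name": "b"}, {"pid": "2"}, {"pid": "1", "name": "a"}]
assert sort_records(records) == [{"pid": "1", "name": "a"}, {"pid": "2"}, {"name": "b"}]
```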
The service is configured via a configuration file that defines collections, paths for incoming and curated data for each collection, as well as token properties. Token properties include a submitter identification and, for each collection, an incoming zone specifier, permissions for reading and writing the incoming zone, and permission for reading the curated data of the collection.
A "formal" definition of the configuration file is provided by the class `GlobalConfig` in the file `dumpthings-server/config.py`.
Configurations are read in YAML format. The following example configuration file illustrates all options:
```yaml
type: collections  # has to be "collections"
version: 1  # has to be 1

# All collections are listed in "collections"
collections:
  # The following entry defines the collection "personal_records"
  personal_records:
    # The token, as defined below, that is used if no token is provided by a
    # client. Any token provided by a client will be OR-ed with the default
    # token, i.e. all permissions of the default token are added to the
    # client-provided token. In this way the effective permissions of a
    # client-provided token are always at least those of the default token.
    default_token: no_access
    # The path to the curated data of the collection. This path should
    # contain the ".dumpthings.yaml"-configuration for collections that is
    # described here: <https://concepts.datalad.org/dump-things/>.
    # A relative path is interpreted relative to the storage root, which is
    # provided on service start. An absolute path is used as-is.
    curated: curated/personal_records
    # The path to the incoming data of the collection.
    # Different collections should have different curated- and incoming-paths.
    incoming: /tmp/personal_records/incoming
    # Optionally, a list of classes that should receive store- or
    # validate-endpoints. If this list is present, all other classes defined
    # in the schema will be ignored, i.e. they will not receive store- and
    # validate-endpoints. The classes listed here must be in the schema.
    use_classes:
      - Organization
      - Person
      - Project
      - Agent
    # Optionally, a list of classes that will be ignored when store- or
    # validate-endpoints are created. If `use_classes` is present, the
    # entries of this list further reduce the classes that receive endpoints.
    # If `use_classes` is not present, the entries of this list reduce the
    # classes from the schema that receive endpoints. The classes listed here
    # must be listed in `use_classes` if that is defined. If `use_classes` is
    # not defined, they must be listed in the schema.
    ignore_classes:
      - Person
      - Project

  # The following entry defines the collection "rooms_and_buildings"
  rooms_and_buildings:
    default_token: basic_access
    curated: curated/rooms_and_buildings
    incoming: incoming/rooms_and_buildings

  # The following entry defines the collection "fixed_data", which does not
  # support data uploading, because there is no token that allows uploads to
  # "fixed_data".
  fixed_data:
    default_token: basic_access
    # If no upload is supported, the "incoming"-entry is not necessary.
    curated: curated/fixed_data_curated

# All tokens are listed in "tokens"
tokens:
  # The following entry defines the token "basic_access". This token allows
  # read-only access to the two collections "rooms_and_buildings" and
  # "fixed_data".
  basic_access:
    # The value of "user_id" will be added as an annotation to each record
    # that is uploaded with this token.
    user_id: anonymous
    # The collections for which the token holds rights are defined in
    # "collections"
    collections:
      # The rights that "basic_access" carries for the collection
      # "rooms_and_buildings" are defined here.
      rooms_and_buildings:
        # Access modes are defined here:
        # <https://github.com/christian-monch/dump-things-server/issues/67#issuecomment-2834900042>
        mode: READ_CURATED
        # A token- and collection-specific label that defines the "zone" in
        # which incoming records are stored. Multiple tokens can share the
        # same zone, for example if many clients with individual tokens work
        # together to build a collection.
        # (Since this token does not allow write access, "incoming_label" is
        # ignored and left empty here (TODO: it should not be required in
        # this case).)
        incoming_label: ''
      # The rights that "basic_access" carries for the collection
      # "fixed_data" are defined here.
      fixed_data:
        mode: READ_CURATED
        incoming_label: ''

  # The following entry defines the token "no_access". This token does not
  # allow any access and is used as a default token for the collection
  # "personal_records".
  no_access:
    user_id: nobody
    collections:
      personal_records:
        mode: NOTHING
        incoming_label: ''

  # The following entry defines the token "admin". It gives full access
  # rights to the collection "personal_records".
  admin:
    user_id: Admin
    collections:
      personal_records:
        mode: WRITE_COLLECTION
        incoming_label: 'admin_posted_records'

  # The following entry defines the token "contributor_bob". It gives full
  # access to "rooms_and_buildings" for a user with the ID "Bob".
  contributor_bob:
    user_id: Bob
    collections:
      rooms_and_buildings:
        mode: WRITE_COLLECTION
        incoming_label: new_rooms_and_buildings

  # The following entry defines the token "contributor_alice". It gives full
  # access to "rooms_and_buildings" for a user with the ID "Alice". Bob and
  # Alice share the same incoming zone, i.e. "new_rooms_and_buildings". That
  # means each can read incoming records that the other one posted.
  contributor_alice:
    user_id: Alice
    collections:
      rooms_and_buildings:
        mode: WRITE_COLLECTION
        incoming_label: new_rooms_and_buildings

  # The following entry defines a hashed token because the key `hashed` is
  # set to `True`. A hashed token has the structure `<id>-<sha256>`. It will
  # match an incoming token if the incoming token has the structure
  # `<id>-<content>` and sha256(`<content>`) equals `<sha256>`.
  # In this example, if the client presents the token `bob-hello`, it will
  # be granted access because `sha256('hello')` equals
  # `2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824`.
  bob-2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824:
    hashed: True
    collections:
      rooms_and_buildings:
        mode: WRITE_COLLECTION
        incoming_label: bob
```
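The hashed-token check described in the example above can be sketched in a few lines; the function name is illustrative and the real parsing in the service may differ in detail:

```python
import hashlib

# Sketch of the hashed-token matching described above: a configured token
# `<id>-<sha256>` matches a presented token `<id>-<content>` when
# sha256(<content>) equals the configured digest.
def matches(configured: str, presented: str) -> bool:
    cfg_id, _, digest = configured.partition("-")
    tok_id, _, content = presented.partition("-")
    return (
        cfg_id == tok_id
        and hashlib.sha256(content.encode()).hexdigest() == digest
    )

configured = "bob-2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
assert matches(configured, "bob-hello")
assert not matches(configured, "bob-goodbye")
```

Because only the digest is stored, the configuration file never contains the secret part of the token.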
The service currently supports the following backends for storing records:

- `record_dir`: this backend stores records as YAML files in a directory structure that is defined here. It reads the backend configuration from a "record collection configuration file" as described here.
- `sqlite`: this backend stores records in an SQLite database. There is an individual database file, named `__sqlite-records.db`, for each curated area and each incoming area.
- `record_dir+stl`: here `stl` stands for "schema-type-layer". This backend stores records in the same format as `record_dir`, but adds special treatment of the `schema_type` attribute in records. It removes the `schema_type` attribute from the top-level mapping of a record before storing it as a YAML file. When records are read from this backend, a `schema_type` attribute is added back into the record, using a schema to determine the correct class URI. In other words, all records stored with this backend have no top-level `schema_type` attribute, and all records read with this backend have a top-level `schema_type` attribute.
- `sqlite+stl`: this backend stores records in the same format as `sqlite`, but adds the same special treatment of the `schema_type` attribute as `record_dir+stl`.

Backends can be defined per collection in the configuration file.
The backend is used for the curated area and for the incoming areas of the collection.
If no backend is defined for a collection, the `record_dir+stl` backend is used by default.
The `+stl` backends can be useful if an endpoint returns records of multiple classes, because they allow clients to determine the class of each result record.
The service guarantees that backends of all types can co-exist independently in the same directory, i.e., there are no name collisions in files that are used by different backends (as long as no class name starts with `.` or `_`).
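The write/read round trip of the schema-type-layer can be sketched as follows. The class-name-to-URI mapping is a stand-in for the real schema-driven resolution, and the URI is illustrative only:

```python
# Minimal sketch of the "schema-type-layer" behaviour described above.
CLASS_URIS = {"Person": "https://example.org/schema/Person"}  # assumed mapping

def strip_schema_type(record: dict) -> dict:
    # On write: drop `schema_type` from the top-level mapping.
    return {k: v for k, v in record.items() if k != "schema_type"}

def add_schema_type(record: dict, class_name: str) -> dict:
    # On read: re-insert `schema_type`, resolved via the schema.
    return {"schema_type": CLASS_URIS[class_name], **record}

stored = strip_schema_type({"pid": "p1", "schema_type": "https://example.org/schema/Person"})
assert "schema_type" not in stored
read_back = add_schema_type(stored, "Person")
assert read_back["schema_type"] == "https://example.org/schema/Person"
```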
The following configuration snippet shows how to define a backend for a collection:
```yaml
...
collections:
  collection_with_default_record_dir+stl_backend:
    # This is a collection with the default backend, i.e. `record_dir+stl`,
    # and the default authentication, i.e. config-based authentication.
    default_token: anon_read
    curated: collection_1/curated

  collection_with_forgejo_authentication_source:
    # This is a collection with the default backend, i.e. `record_dir+stl`,
    # and a forgejo-based authentication source. That means it will use a
    # forgejo instance to determine the permissions of a token for this
    # collection. The instance is also used to determine the user-id and
    # the incoming label. In the case of forgejo, the user-id and the
    # incoming label are the forgejo login associated with the token.
    # We still need the name of a default token. If the token is defined in
    # this config file, its properties will be determined by the config
    # file. If the token is not defined in the config file, its properties
    # will be determined by the authentication sources, in this example by
    # the forgejo instance at `https://forgejo.example.com`.
    # If there is more than one authentication source, they will be tried
    # in the order they are defined in the config file.
    default_token: anon_read  # We still need a default token
    curated: collection_2/curated
    # Token permissions, user-ids (for record annotations), and incoming
    # labels can be determined by multiple authentication sources.
    # If no source is defined, `config` will be used, which reads token
    # information from the config file.
    # This example explicitly defines `config` and a second authentication
    # source, a `forgejo` authentication source.
    auth_sources:
      - type: forgejo  # requires `user`-read and `organization`-read permissions on the token
        # The API-URL of the forgejo instance that should be used
        url: https://forgejo.example.com/api/v1
        # An organization
        organization: data_handling
        # A team in the organization. The authorization of the team
        # determines the permissions of the token.
        team: data_entry_personal
        # `label_type` determines how an incoming label is created for a
        # Forgejo token. If `label_type` is `team`, the incoming label will
        # be `forgejo-team-<organization>-<team>`. If `label_type` is
        # `user`, the incoming label will be `forgejo-user-<user-login>`.
        label_type: team
        # An optional repository. The token will only be authorized if the
        # team has access to the repository. Note: if `repo` is set, the
        # token must have at least repository read permissions.
        repo: reference-repository
      # Fallback to the config file.
      - type: config  # check tokens from the configuration file
    # Multiple authorization sources are allowed. They will be tried in the
    # order defined in the config file. If an authorization source returns
    # permissions for a token, those permissions will be used and no other
    # authorization sources will be queried.
    # The default authorization source is `config`, which reads the token
    # permissions, user-id, and incoming label from the config file.

  collection_with_explicit_record_dir+stl_backend:
    default_token: anon_read
    curated: collection_3/curated
    backend:
      # The record_dir+stl-backend is identified by the type
      # "record_dir+stl". No more attributes are defined for this backend.
      type: record_dir+stl

  collection_with_sqlite_backend:
    default_token: anon_read
    curated: collection_4/curated
    backend:
      # The sqlite-backend is identified by the type "sqlite". It requires
      # a schema attribute that holds the URL of the schema that should be
      # used in this backend.
      type: sqlite
      schema: https://concepts.inm7.de/s/flat-data/unreleased.yaml
```

To authenticate and authorize a user based on tokens, dump-things-service uses authentication sources. There are currently two authentication sources: the configuration file and a Forgejo-based authentication source. Authentication sources can be configured per collection. If no authentication source is configured, the collection uses the configuration file.
Authentication sources can be defined individually for each collection.
The collection-level key `auth_sources` should contain a list of authentication source configurations.
Authentication sources are tried in the order in which they are listed until a token is successfully authenticated.
If no authentication source authenticates the token, the token is rejected.
If no authentication source is defined, the configuration file is used to authenticate tokens.
If an identical authentication source is defined multiple times, only the first instance is queried; all other instances are ignored.
Two authentication sources are identical if the contents of their keys match.
If an identical authentication source is listed multiple times in the configuration, the service issues a warning of the form `Ignoring duplicate authentication provider...`.
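The "identical sources are deduplicated, first instance wins" rule can be sketched as follows; the function name and warning wording are illustrative:

```python
import json

# Sketch of the duplicate-handling rule above: two authentication source
# configurations count as identical if all their keys match, and only
# the first occurrence is kept.
def dedup_sources(sources: list) -> list:
    seen, result = set(), []
    for src in sources:
        key = json.dumps(src, sort_keys=True)  # content-based identity
        if key in seen:
            print(f"Ignoring duplicate authentication provider {src['type']}")
            continue
        seen.add(key)
        result.append(src)
    return result

sources = [
    {"type": "config"},
    {"type": "forgejo", "url": "https://forgejo.example.com/api/v1"},
    {"type": "config"},  # duplicate: ignored
]
assert dedup_sources(sources) == sources[:2]
```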
These authentication sources are available:

- `config`: use the configuration file to authenticate tokens
- `forgejo`: use a Forgejo instance to authenticate tokens

All authentication source configurations contain the key `type`.
Additional keys are specific to the authentication source type.
The following configuration snippet contains an example authentication source configuration:
```yaml
collections:
  collection_with_config_and_forgejo_auth_sources:
    # Token permissions, user-ids (for record annotations), and incoming
    # labels can be determined by multiple authentication sources.
    # If no source is defined, `config` will be used, which reads token
    # information from the config file.
    # This example explicitly defines `config` and a second authentication
    # source, a `forgejo` authentication source.
    auth_sources:
      - type: forgejo  # requires `user`-read and `organization`-read permissions on the token
        # The API-URL of the forgejo instance that should be used
        url: https://forgejo.example.com/api/v1
        # An organization
        organization: data_handling
        # A team in the organization. The authorization of the team
        # determines the permissions of the token.
        team: data_entry_personal
        # `label_type` determines how an incoming label is created for a
        # Forgejo token. If `label_type` is `team`, the incoming label will
        # be `forgejo-team-<organization>-<team>`. If `label_type` is
        # `user`, the incoming label will be `forgejo-user-<user-login>`.
        label_type: team
        # An optional repository. The token will only be authorized if the
        # team has access to the repository. Note: if `repo` is set, the
        # token must have at least repository read permissions.
        repo: reference-repository
      # Fallback to the config file.
      - type: config  # check tokens from the configuration file
    # Multiple authorization sources are allowed. They will be tried in the
    # order defined in `auth_sources`. If an authorization source returns
    # permissions for a token, those permissions will be used and no other
    # authorization sources will be queried.
    # The default authorization source is `config`, which reads the token
    # permissions, user-id, and incoming label from the config file.
```
```yaml
...
collections:
  collection_with_config_authentication:
    default_token: anon_read
    curated: collection_5/curated
    auth_sources:
      - type: <must be 'config'>  # check tokens from the configuration file
...
```

The configuration file will be used to authenticate tokens.
```yaml
collections:
  collection_with_forgejo_authentication:
    default_token: anon_read
    curated: collection_5/curated
    auth_sources:
      - type: <must be 'forgejo'>
        url: <Forgejo API-URL>
        organization: <organization name>
        team: <team_name>
        label_type: <'team' or 'user'>
        repository: <repository name>  # Optional
...
```

The defined Forgejo instance will be used to authenticate a token.
The user ID is the email address of the user.
If `label_type` is set to `team`, the incoming label is `forgejo-team-<organization-name>-<team-name>`.
If `label_type` is set to `user`, the incoming label is `forgejo-user-<user-login>`.
The permissions will be fetched from the units `repo.code` and `repo.actions` of the team definition.
The following mapping is used:
| `repo.code` | curated_read | incoming_read | incoming_write | curated_write | zones_access |
|---|---|---|---|---|---|
| none | False | False | False | False | False |
| read | True | True | False | False | False |
| write | True | True | True | False | False |

| `repo.actions` | curated_read | incoming_read | incoming_write | curated_write | zones_access |
|---|---|---|---|---|---|
| none | False | False | False | False | False |
| read | False | False | False | False | False |
| write | True | True | True | True | True |
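The two mapping tables above can be expressed as lookup dictionaries. How the two units combine is an assumption here (the text does not spell it out); the sketch below OR-s them:

```python
# The mapping tables above as lookups; tuple order is
# (curated_read, incoming_read, incoming_write, curated_write, zones_access).
REPO_CODE = {
    "none":  (False, False, False, False, False),
    "read":  (True,  True,  False, False, False),
    "write": (True,  True,  True,  False, False),
}
REPO_ACTIONS = {
    "none":  (False, False, False, False, False),
    "read":  (False, False, False, False, False),
    "write": (True,  True,  True,  True,  True),
}

def combined(code_level: str, actions_level: str) -> tuple:
    # Assumption: a permission is granted if either unit grants it.
    return tuple(
        a or b
        for a, b in zip(REPO_CODE[code_level], REPO_ACTIONS[actions_level])
    )

assert combined("read", "none") == (True, True, False, False, False)
assert combined("none", "write") == (True, True, True, True, True)
```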
A Forgejo authentication source can authenticate Forgejo tokens that have at least the following read permissions:

- `User`: this is required to determine user-related information, i.e. user email and user login name.
- `Organization`: this is required to determine the membership of a user in a team of an organization.
- `Repository` (only if `repository` is set in the configuration): required to determine a team's access to the repository.
The service annotates submitted records with a submitter id and a timestamp.
Annotations consist of an annotation tag, defining the class of the annotation, and an annotation value.
By default the service will use the class http://purl.obolibrary.org/obo/NCIT_C54269 for the submitter id and the class http://semanticscience.org/resource/SIO_001083 for submission time.
(Both tags will be converted into CURIEs if the schema of the collection defines an appropriate prefix.)
The default annotation tag classes can be overridden in the configuration on a per collection basis.
To override the default tags, add a `submission_tags` attribute to a collection definition.
The `submission_tags` attribute should contain a mapping that maps either `submitter_id_tag` or `submission_time_tag`, or both, to an IRI or a CURIE.
If the schema defines a matching prefix, IRIs are automatically converted to CURIEs before storing the record.
The service validates that the prefix of a CURIE is defined in the schema of the collection.
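The IRI-to-CURIE conversion described above can be sketched with a plain prefix map, such as a LinkML schema provides. The prefix map below is an assumed example:

```python
# Sketch of IRI-to-CURIE conversion given a schema prefix map.
PREFIXES = {"obo": "http://purl.obolibrary.org/obo/"}  # assumed prefix map

def to_curie(iri: str, prefixes: dict = PREFIXES) -> str:
    for prefix, expansion in prefixes.items():
        if iri.startswith(expansion):
            # Replace the expansion with "<prefix>:".
            return f"{prefix}:{iri[len(expansion):]}"
    return iri  # no matching prefix: keep the IRI as-is

assert to_curie("http://purl.obolibrary.org/obo/NCIT_C54269") == "obo:NCIT_C54269"
assert to_curie("http://example.org/unknown") == "http://example.org/unknown"
```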
```yaml
type: collections
version: 1
collections:
  collection_1:
    default_token: basic_access
    curated: curated
    incoming: contributions
    submission_tags:
      submitter_id_tag: schema:user_id
      submission_time_tag: schema:time
...
```
The service supports the following command line parameters:
- `<storage root>`: this is a mandatory parameter that defines the directory that serves as root for relative `curated` and `incoming` paths. Unless the `-c/--config` option is given, the configuration is loaded from `<storage root>/.dumpthings.yaml`.
- `--host` (optional): the IP address of the host the service should run on
- `--port`: the port number the service should listen on
- `-c/--config`: if set, the service will read the configuration from the given path. Otherwise it will try to read the configuration from `<storage root>/.dumpthings.yaml`.
- `--log-level`: set the log level for the service; allowed values are `ERROR`, `WARNING`, `INFO`, and `DEBUG`. The default level is `WARNING`.
- `--root-path`: set the ASGI `root_path` for applications sub-mounted below a given URL path.
The service can be started with the following command:

```shell
dump-things-service /data-storage/store
```

In this example the service will run on the network location `0.0.0.0:8000` and provide access to the stores under `/data-storage/store`.
To run the service on a specific host and port, use the command line options `--host` and `--port`, for example:

```shell
dump-things-service /data-storage/store --host 127.0.0.1 --port 8000
```

Most endpoints require a collection. These correspond to the names of the "data record collection" directories (for example `myschema-v3-fmta` in Dump Things Service) in the stores.
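As an illustration, a record-submission request against a running instance can be built with the standard library; the base URL, collection name, class, token, and record fields are placeholder assumptions, and the request is only constructed here, not sent:

```python
import json
import urllib.request

# Assumed placeholders: local instance, collection "personal_records",
# class "Person", token "bob-hello".
record = {"pid": "example:person-1", "name": "Alice"}
req = urllib.request.Request(
    "http://127.0.0.1:8000/personal_records/record/Person?format=json",
    data=json.dumps(record).encode(),
    headers={
        "Content-Type": "application/json",
        "X-DumpThings-Token": "bob-hello",  # token with write permissions
    },
    method="POST",
)
# urllib.request.urlopen(req) would submit the record to the service.
assert req.method == "POST"
assert req.full_url.startswith("http://127.0.0.1:8000/personal_records/record/Person")
```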
The service provides the following user endpoints (in addition to user-endpoints there exist endpoints for curators, to view them check the /docs-path in an installed service):
-
POST /<collection>/record/<class>: an object of type<class>(defined by the schema associated with<collection>) can be posted to this endpoint. It will be stored in the incoming area for this collection and the user defined by the provided token. In order toPOSTan object to the service, you MUST provide a valid token in the HTTP-headerX-DumpThings-Tokenwith write permissions. The endpoint supports the query parameterformat, to select the format of the posted data. It can be set tojson(the default) or tottl(Terse RDF Triple Language, a.k.a. Turtle). If thejson-format is selected, the content-type should beapplication/json. If thettl-format is selected, the content-type should betext/turtle.
The service supports extraction of inlined records as described in Dump Things Service. On success, the endpoint will return a list of all stored records. This might be more than one record if the posted object contains inlined records. -
POST /<collection>/validate/record/<class>: an object of type<class>(defined by the schema associated with<collection>) can be posted to this endpoint. It will validate the posted data. In order toPOSTan object to the service, you MUST provide a valid token in the HTTP-headerX-DumpThings-Tokenwith write permissions. The endpoint supports the query parameterformat, to select the format of the posted data. It can be set tojson(the default) or tottl(Terse RDF Triple Language, a.k.a. Turtle). If thejson-format is selected, the content-type should beapplication/json. If thettl-format is selected, the content-type should betext/turtle.
The service supports extraction of inlined records as described in Dump Things Service. On success, the endpoint will return a list of all stored records. This might be more than one record if the posted object contains inlined records. -
GET /<collection>/records/<class>: retrieve all readable objects from collection<collection>that are of type<class>or any of its subclasses. Objects are readable if the default token for the collection allows reading of objects or if a token is provided that allows reading of objects in the collection. Objects from incoming spaces will take precedence over objects from curated spaces, i.e. if there are two objects with identicalpidin the curated space and in the incoming space, the object from the incoming space will be returned. The endpoint supports the query parameterformat, which determines the format of the query result. It can be set tojson(the default) or tottl, The endpoint supports the query parametermatching, which is interpreted bysqlite-backends and ignored byrecord_dir-backends. If given, the endpoint will only return records for which the JSON-string representation matches thematchingparameter. Matching supports the wildcard character%which matches any characters. For example, to search forAliceanywhere in the JSON-string representation of the record the matching parameter should be set to%Alice%or%alice%(matching is not case-sentitive). The result is a list of JSON-records or ttl-strings, depending on the selected format. -
- `GET /<collection>/records/p/<class>`: this endpoint (ending in `.../p/<class>`) provides the same functionality as the endpoint `GET /<collection>/records/<class>` (without `.../p/...`), but supports result pagination. In addition to the query parameters `format` and `matching`, it supports the query parameters `page` and `size`. The `page` parameter defines the page number to retrieve, starting with 1. The `size` parameter defines how many records should be returned per page. If no `size` parameter is given, the default value of 50 is used. Each response also contains the total number of records and the total number of pages in the result. The response is a JSON object with the following structure:

  ```
  {
    "items": [ <JSON-record or ttl-string> ],
    "total": <total number of records in the result>,
    "page": <current page number>,
    "size": <number of records per page>,
    "pages": <number of pages in the result>
  }
  ```

- `GET /<collection>/record?pid=<pid>`: retrieve the object with the pid `<pid>` from the collection `<collection>`, if the provided token allows reading. If the provided token allows reading of incoming and curated spaces, objects from incoming spaces take precedence. The endpoint supports the query parameter `format`, which determines the format of the query result. It can be set to `json` (the default) or to `ttl`.
- `GET /server`: this endpoint provides information about the server. The response is a JSON object with the following structure:

  ```
  {
    "version": "<version of the server>"
  }
  ```
- `GET /<collection>/records/`: retrieve all readable objects from collection `<collection>`. Objects are readable if the default token for the collection allows reading of objects or if a token is provided that allows reading of objects in the collection. Objects from incoming spaces take precedence over objects from curated spaces, i.e. if there are two objects with identical `pid` in the curated space and in the incoming space, the object from the incoming space is returned. The endpoint supports the query parameter `format`, which determines the format of the query result. It can be set to `json` (the default) or to `ttl`. The endpoint also supports the query parameter `matching`, which is interpreted by `sqlite`-backends and ignored by `record_dir`-backends. If given, the endpoint will only return records whose JSON-string representation matches the `matching` parameter. The result is a list of JSON records or ttl-strings, depending on the selected format.
- `GET /<collection>/records/p/`: this endpoint (ending in `.../p/`) provides the same functionality as the endpoint `GET /<collection>/records/` (without `.../p/`), but supports result pagination. In addition to the query parameters `format` and `matching`, it supports the query parameters `page` and `size`. The `page` parameter defines the page number to retrieve, starting with 1. The `size` parameter defines how many records should be returned per page. If no `size` parameter is given, the default value of 50 is used. Each response also contains the total number of records and the total number of pages in the result. The response is a JSON object with the following structure:

  ```
  {
    "items": [ <JSON-record or ttl-string> ],
    "total": <total number of records in the result>,
    "page": <current page number>,
    "size": <number of records per page>,
    "pages": <number of pages in the result>
  }
  ```
- `DELETE /<collection>/record?pid=<pid>`: delete the object with the pid `<pid>` from the incoming area of the collection `<collection>`, if the provided token allows writing to the incoming area. The result is either `True` if the object was deleted or `False` if the object did not exist or was not deleted.
- `GET /docs`: provides information about the API of the service, i.e. about all endpoints.
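To illustrate the query parameters described above, here is a small Python sketch that builds a URL for the paginated records endpoint and derives the `pages` field from `total` and `size`. The base URL, collection name `datamgt`, and class name `Person` are hypothetical placeholders, not names defined by the service:

```python
import math
from urllib.parse import urlencode

# Hypothetical deployment URL -- adjust to your installation.
BASE = "http://localhost:8000"

def records_url(collection, cls=None, matching=None, page=None, size=None, fmt="json"):
    # Build a URL for the paginated records endpoint; omitted parameters
    # fall back to the server-side defaults described above.
    params = {"format": fmt}
    if matching is not None:
        params["matching"] = matching  # '%' wildcards are percent-encoded as '%25'
    if page is not None:
        params["page"] = page
    if size is not None:
        params["size"] = size
    return f"{BASE}/{collection}/records/p/{cls or ''}?{urlencode(params)}"

# The 'pages' field of a paginated response follows from 'total' and 'size':
def expected_pages(total, size=50):
    return math.ceil(total / size)

url = records_url("datamgt", cls="Person", matching="%Alice%", page=2, size=10)
# url == "http://localhost:8000/datamgt/records/p/Person?format=json&matching=%25Alice%25&page=2&size=10"
```

Note that the `%` wildcard of the `matching` parameter must be percent-encoded on the wire; `urlencode` takes care of that automatically.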
The service supports a set of curation endpoints that give direct access to the curated area as well as to existing incoming areas.
This access requires a CURATOR token.
Details about the curation endpoints can be found in this issue.
The service can be configured in such a way that incoming records are immediately available in the curated area. To achieve this, the final path of the incoming zone must be the same as the curated area, for example:

```yaml
type: collections
version: 1
collections:
  datamgt:
    default_token: anon_read
    curated: datamgt/curated
    incoming: datamgt
tokens:
  anon_read:
    user_id: anonymous
    collections:
      datamgt:
        mode: READ_CURATED
        incoming_label: ""
  trusted-submitter-token:
    user_id: trusted_submitter
    collections:
      datamgt:
        mode: WRITE_COLLECTION
        incoming_label: "curated"
```

In this example the curated area is `datamgt/curated`, and the incoming area for the token `trusted-submitter-token` is `datamgt` plus the incoming zone `curated`, i.e. `datamgt/curated`, which is exactly the curated area defined for the collection `datamgt`.
The command `dump-things-copy-store` can be used to copy a collection from a `record_dir` (or `record_dir+stl`) store to a `sqlite` store.
The command expects a source and a destination store. Both are given in the format `<backend>:<directory-path>`, where `<backend>` is one of `record_dir`, `record_dir+stl`, `sqlite`, or `sqlite+stl`, and `<directory-path>` is the path to the directory of the store.
For example, to migrate a collection from a `record_dir` backend at the directory `<path-to-data>/penguis/curated` to a `sqlite` backend in the same directory, the following command can be used:
```
> dump-things-copy-store \
    record_dir:<path-to-data>/penguis/curated \
    sqlite:<path-to-data>/penguis/curated
```

Migration from a `record_dir+stl` backend works similarly, but a schema has to be supplied via the `-s/--schema` command line parameter. For example:

```
> dump-things-copy-store \
    --schema https://concepts.inm7.de/s/flat-data/unreleased.yaml \
    record_dir+stl:<path-to-data>/penguis/curated \
    sqlite:<path-to-data>/penguis/curated
```

(Note: a `record_dir:<path>` source can be used to copy without the schema type layer from a `record_dir+stl` backend. But in this case the copied records will not have a `schema_type` attribute, because the `record_dir` backend does not "put it back in", unlike a `record_dir+stl` backend.)
If the source backend is a `record_dir` or `record_dir+stl` backend and the store was manually modified outside the service (for example, by adding or removing files), it is recommended to run the command `dump-things-rebuild-index` on the source store before copying. This ensures that the index is up to date and all records are copied.
If any backend is a `record_dir+stl` backend, a schema has to be supplied via the `-s/--schema` command line parameter. The schema is used to determine the `schema_type` attribute of the records that are copied.
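The `<backend>:<directory-path>` store specification can be parsed by splitting on the first colon. The following is a minimal sketch of that format; the actual argument parsing of `dump-things-copy-store` may differ:

```python
# Backends supported by dump-things-copy-store, per the documentation above
VALID_BACKENDS = {"record_dir", "record_dir+stl", "sqlite", "sqlite+stl"}

def parse_store_spec(spec):
    """Split a '<backend>:<directory-path>' spec into its two parts."""
    backend, sep, path = spec.partition(":")
    if not sep or backend not in VALID_BACKENDS or not path:
        raise ValueError(f"invalid store spec: {spec!r}")
    return backend, path

backend, path = parse_store_spec("record_dir+stl:/data/penguis/curated")
# backend == "record_dir+stl", path == "/data/penguis/curated"
```

Because only the first colon is significant, directory paths that themselves contain colons remain intact.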
- `dump-things-rebuild-index`: this command rebuilds the persistent index of a `record_dir` store. This should be done after the `record_dir` store was modified outside the service, for example, by manually adding or removing files in the directory structure of the store.
- `dump-things-copy-store`: this command copies a collection that is stored in a source store to a destination store. For example, to copy a collection from a `record_dir` store at the directory `<path-to-data>/penguis/curated` to a `sqlite` store in the same directory, the following command can be used:

  ```
  > dump-things-copy-store \
      record_dir:<path-to-data>/penguis/curated \
      sqlite:<path-to-data>/penguis/curated
  ```

  The copy command will add the copied records to any existing records in the destination store. Note: when records are copied from a `record_dir` store, the index is used to locate the records in the source store. If the index is not up to date, the copied records might not be complete. In this case, it is recommended to run `dump-things-rebuild-index` on the source store before copying.
- `dump-things-pid-check`: this command checks the pids in all collections of a store to verify that they can be resolved (if they are in CURIE form). This is useful to validate the proper definition of prefixes after schema changes.
- `dump-things-create-merged-schema`: this command creates a new schema that statically contains all schemas that the original schema imported. The new schema is fully self-contained and does not reference any other schemas anymore.
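The kind of check performed by `dump-things-pid-check` can be illustrated with a small CURIE-expansion sketch. The prefix map below is hypothetical; in the service, the prefixes come from the schema definitions:

```python
def expand_curie(pid, prefix_map):
    """Expand a CURIE like 'ex:thing-1' using a prefix map. Returns None
    if the prefix is unknown and the pid therefore cannot be resolved."""
    if "://" in pid:
        return pid  # already a full IRI, nothing to expand
    prefix, sep, local = pid.partition(":")
    if not sep:
        return pid  # not in CURIE form
    iri_base = prefix_map.get(prefix)
    if iri_base is None:
        return None  # unresolvable: prefix is not defined
    return iri_base + local

prefixes = {"ex": "https://example.org/"}
expand_curie("ex:thing-1", prefixes)       # resolves to a full IRI
expand_curie("unknown:thing-1", prefixes)  # None -> would be flagged
```

A pid whose prefix disappeared after a schema change would return `None` here, which is exactly the situation `dump-things-pid-check` is meant to surface.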
If a schema was changed, for example if a prefix definition changed, the service might no longer be able to delete a record. In this case the record can be deleted manually if you have access to the storage root.
To delete the record, open a shell and navigate (`cd`) to the directory where the store is located.
The location can be determined from the configuration file.
Depending on the storage backend, the next steps differ.
For a `record_dir` backend:
- Delete the record from disk by removing it, e.g. `rm -f <path-to-record>`
- Run the command `dump-things-rebuild-index`

For a `sqlite` backend, run the command:

```
> sqlite3 __sqlite-records.db
```

If you know the pid of the record you want to delete, enter the following on the prompt to delete the record with pid `some-pid`:

```
> delete from thing where json_extract(thing.object, '$.pid') = 'some-pid';
```

If you know the IRI of the record you want to delete, enter the following on the prompt to delete the record with IRI `some-iri`:

```
> delete from thing where iri = 'some-iri';
```

This requires `sqlite3`.
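The effect of the delete statements above can be reproduced on a toy database. This sketch assumes, as the statements suggest, a table `thing` with columns `iri` and `object`; the real schema of `__sqlite-records.db` may contain additional columns:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE thing (iri TEXT, object TEXT)")
con.executemany(
    "INSERT INTO thing VALUES (?, ?)",
    [
        ("https://example.org/a", json.dumps({"pid": "some-pid"})),
        ("https://example.org/b", json.dumps({"pid": "other-pid"})),
    ],
)

# Delete by pid, as stored inside the JSON record
con.execute(
    "DELETE FROM thing WHERE json_extract(thing.object, '$.pid') = ?",
    ("some-pid",),
)

# Delete by IRI
con.execute("DELETE FROM thing WHERE iri = ?", ("https://example.org/b",))

remaining = con.execute("SELECT count(*) FROM thing").fetchone()[0]
print(remaining)  # -> 0
```

The `json_extract` function requires a SQLite build with JSON support, which is included in all recent versions.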
This work was funded, in part, by
- Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant TRR 379 (546006540, Q02 project)
- MKW-NRW: Ministerium für Kultur und Wissenschaft des Landes Nordrhein-Westfalen under the Kooperationsplattformen 2022 program, grant number: KP22-106A