Provide alternate formats of the data file #23

In its current format, the data file is useful for loading in its entirety, or in subsets. However, there are other potential use cases that the provided format does not serve efficiently.

Whilst it is possible to extract and filter the data in parallel, including with operations that compare a record against other records, this process is time-consuming and resource-intensive. An example is a complex query such as "find me DOIs that were published by Organisation X, and some creators do not have a name identifier, and then for each repository that any of these records exist in, show me a breakdown of what percentage of that repository's records match these criteria".
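To make the shape of that example query concrete, here is a minimal Python sketch run against a handful of in-memory records. The field names (`publisher`, `repository`, `creators`, `name_identifier`) and the sample values are assumptions for illustration only and are not the data file's actual schema.

```python
from collections import Counter

# Hypothetical, simplified records. Field names and values are assumptions
# made for this sketch, not the real schema of the data file.
records = [
    {"doi": "10.1234/a", "publisher": "Organisation X", "repository": "repo-1",
     "creators": [{"name": "A", "name_identifier": None}]},
    {"doi": "10.1234/b", "publisher": "Organisation X", "repository": "repo-1",
     "creators": [{"name": "B", "name_identifier": "id-1"}]},
    {"doi": "10.1234/c", "publisher": "Organisation Y", "repository": "repo-1",
     "creators": [{"name": "C", "name_identifier": None}]},
    {"doi": "10.1234/d", "publisher": "Organisation X", "repository": "repo-2",
     "creators": [{"name": "D", "name_identifier": None},
                  {"name": "E", "name_identifier": "id-2"}]},
]


def matches(record, publisher):
    """True if the record is from `publisher` and at least one creator
    has no name identifier."""
    missing = any(not c.get("name_identifier") for c in record["creators"])
    return record["publisher"] == publisher and missing


def repository_breakdown(records, publisher):
    """Return, per repository, the percentage of its records that match."""
    totals = Counter()   # every record in each repository
    matched = Counter()  # only the records meeting the criteria
    for record in records:
        totals[record["repository"]] += 1
        if matches(record, publisher):
            matched[record["repository"]] += 1
    # Only repositories with at least one matching record are reported.
    return {repo: 100.0 * n / totals[repo] for repo, n in matched.items()}


breakdown = repository_breakdown(records, "Organisation X")
```

The denominator for each repository counts all of its records, not only the matching ones, so the query has to touch every record in the file even though only a few fields are needed. That is the kind of access pattern where a columnar or indexed format would be expected to help.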

Users have asked whether the data file could be provided in formats better suited to these kinds of analysis-based use cases, such as [Apache Parquet](https://parquet.apache.org/). Other users have already loaded the data file into services such as [Google Bigtable](https://cloud.google.com/bigtable), precisely to deal with the problems involved in meeting non “data loading” use cases. We should therefore investigate whether it’s feasible and desirable for us to make this approach easier.
