
Provide alternate formats of the data file #23

@digitaldogsbody

In its current format, the data file is useful for loading in its entirety, or in subsets. However, there are other potential use cases that the provided format does not serve efficiently.

Whilst it is possible to extract and filter the data in parallel, including with comparative operations against other records (e.g. to answer a complex query such as "find me DOIs that were published by Organisation X, where some creators do not have a name identifier, and then for each repository that any of these records exist in, show me a breakdown of what percentage of that repository's records match these criteria"), this process is time-consuming and resource-intensive.
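
To make the shape of that example query concrete, here is a minimal Python sketch over a handful of in-memory records. The field names (`publisher`, `creators`, `nameIdentifiers`, `repository`) and the sample values are assumptions for illustration only and are not taken from the actual data file schema. The point is that the per-repository percentage needs a count of every record in each repository, not only the matching ones, so a naive implementation has to scan the whole file.

```python
from collections import Counter


def matches(record, publisher):
    """Return True if the record was published by `publisher` and at
    least one of its creators has no name identifier."""
    if record.get("publisher") != publisher:
        return False
    # A creator counts as unidentified if the (hypothetical)
    # "nameIdentifiers" field is missing or empty.
    return any(not c.get("nameIdentifiers") for c in record.get("creators", []))


def repository_breakdown(records, publisher):
    """For each repository holding at least one matching record, return
    the percentage of that repository's records that match."""
    totals = Counter()   # all records per repository
    matched = Counter()  # matching records per repository
    for record in records:
        repo = record["repository"]
        totals[repo] += 1
        if matches(record, publisher):
            matched[repo] += 1
    # `matched` only has keys for repositories with at least one match.
    return {repo: 100.0 * matched[repo] / totals[repo] for repo in matched}


# Made-up sample records, using the assumed field names above.
records = [
    {"doi": "10.1/a", "publisher": "Organisation X", "repository": "repo-1",
     "creators": [{"name": "A", "nameIdentifiers": []}]},
    {"doi": "10.1/b", "publisher": "Organisation X", "repository": "repo-1",
     "creators": [{"name": "B", "nameIdentifiers": [{"nameIdentifier": "example-id"}]}]},
    {"doi": "10.1/c", "publisher": "Organisation Y", "repository": "repo-2",
     "creators": [{"name": "C"}]},
    {"doi": "10.1/d", "publisher": "Organisation X", "repository": "repo-1",
     "creators": [{"name": "D", "nameIdentifiers": [{"nameIdentifier": "example-id"}]},
                  {"name": "E"}]},
    {"doi": "10.1/e", "publisher": "Organisation Y", "repository": "repo-1",
     "creators": [{"name": "F"}]},
]

breakdown = repository_breakdown(records, "Organisation X")
print(breakdown)  # {'repo-1': 50.0}
```

Only three fields per record are touched here, which is the kind of access pattern a columnar format is generally better suited to than a format that requires parsing each record in full.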

There have been user queries about providing the data file in formats more suited to these kinds of analysis-based use cases, such as Apache Parquet. Other users have already taken steps such as loading the data file into services such as Google BigTable, precisely to deal with the problems involved in meeting non “data loading” use cases. We should therefore investigate whether it’s feasible and desirable for us to make this approach easier.

Metadata

Labels

enhancement: New feature or request
product: Requires input from product before being done
question: Further information is requested
