Description
In its current format, the data file is useful for loading either in its entirety or in subsets. However, there are other potential use cases that the provided format does not serve efficiently.
It is possible to extract and filter the data in parallel, including with operations that compare a record against other records, but this process is time-consuming and resource-intensive. An example of such a complex query: "find DOIs that were published by Organisation X where some creators do not have a name identifier, and then, for each repository that any of these records exist in, show a breakdown of what percentage of that repository's records match these criteria".
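To make the shape of that example query concrete, here is a minimal Python sketch. It is an illustration only: the field names (`publisher`, `repository`, `creators`, `name_identifiers`), the sample records, and both function names are assumptions invented for this example and are not taken from the actual data file schema.

```python
from collections import Counter

# Made-up sample records; the field names are assumptions, not the real schema.
SAMPLE = [
    {"doi": "10.0000/example-1", "publisher": "Organisation X", "repository": "repo-a",
     "creators": [{"name": "A", "name_identifiers": []}]},
    {"doi": "10.0000/example-2", "publisher": "Organisation X", "repository": "repo-a",
     "creators": [{"name": "B", "name_identifiers": ["id-b"]}]},
    {"doi": "10.0000/example-3", "publisher": "Organisation Y", "repository": "repo-a",
     "creators": [{"name": "C", "name_identifiers": []}]},
    {"doi": "10.0000/example-4", "publisher": "Organisation X", "repository": "repo-a",
     "creators": [{"name": "D", "name_identifiers": ["id-d"]},
                  {"name": "E"}]},
    {"doi": "10.0000/example-5", "publisher": "Organisation Y", "repository": "repo-b",
     "creators": [{"name": "F", "name_identifiers": ["id-f"]}]},
    {"doi": "10.0000/example-6", "publisher": "Organisation X", "repository": "repo-c",
     "creators": [{"name": "G", "name_identifiers": []}]},
]


def matches(record, publisher):
    """True if the record is from `publisher` and at least one creator
    has no name identifier (missing or empty list)."""
    if record.get("publisher") != publisher:
        return False
    return any(not c.get("name_identifiers") for c in record.get("creators", []))


def repository_breakdown(records, publisher):
    """For each repository holding at least one matching record, return the
    percentage of that repository's records that match."""
    totals = Counter()   # all records per repository
    matched = Counter()  # matching records per repository
    for record in records:
        repo = record["repository"]
        totals[repo] += 1
        if matches(record, publisher):
            matched[repo] += 1
    # Repositories with no matching records never enter `matched`.
    return {repo: 100.0 * n / totals[repo] for repo, n in matched.items()}


if __name__ == "__main__":
    print(repository_breakdown(SAMPLE, "Organisation X"))
```

Even in this toy form, the per-repository totals require touching every record in the data set, not just the ones that match the filter. That full scan is what makes this kind of query costly against a format designed for record-by-record loading.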
There have been user queries about providing the data file in formats better suited to these kinds of analysis-based use cases, such as Apache Parquet. Other users have already taken steps such as loading the data file into services such as Google BigTable, precisely to deal with the problems involved in meeting non “data loading” use cases. We should therefore investigate whether it’s feasible and desirable for us to make this approach easier.