The Datastore gives us a way to store and retrieve immutable files and directories. It follows the Nexus model of publishing software, but for data.
The datastore stores items, which can be files or directories. Every item belongs to a group and has a name and a version. The group is just a string, but by convention it should be of the form org.allenai.corpora. The name is also just a string, but it should be CamelCase, for example WebSentences. The version is just an integer.
When you request an item from the datastore, it will download the item from S3 and put it into the cache, which is a file or directory on the local file system. The path it returns is a path to that file or directory. If it's already there, it skips the download and simply returns the path.
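The lookup described above can be sketched as a check-then-download helper. This is a minimal illustration, not the Datastore's actual implementation: `CacheSketch`, `cachedPath`, and the `download` callback are hypothetical names, and the real client fetches from S3 where this sketch calls `download`.

```scala
import java.nio.file.{Files, Path}

object CacheSketch {
  // Hypothetical helper, not the Datastore's real code: return the cached
  // path if it already exists, otherwise run `download` to populate it first.
  def cachedPath(cacheDir: Path, key: String)(download: Path => Unit): Path = {
    val target = cacheDir.resolve(key)
    if (!Files.exists(target)) {
      // Make sure the group directory exists before downloading into it.
      Files.createDirectories(target.getParent)
      download(target)
    }
    target
  }
}
```

A second call with the same key finds the file in place and returns the path without invoking `download` again.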
Datastores have names. Currently, we have the public datastore, and the private one. public is world-accessible, while private is limited to AI2. This is not a feature of the datastore, just a result of the bucket configuration in S3. The default datastore is public.
The Datastore also has a command line tool, which is documented separately.
To get a file from the default datastore, simply call this:
// Get version 4 of GreedyParserModel.json in the
// group org.allenai.parsers.poly-parser
val path: java.nio.file.Path =
  Datastore.filePath(
    "org.allenai.parsers.poly-parser",
    "GreedyParserModel.json",
    4)
To get a directory, call this:
// Get version 1 of the WordNet directory in the
// group org.allenai.otter
val path: java.nio.file.Path =
  Datastore.directoryPath(
    "org.allenai.otter",
    "WordNet",
    1)
You can do anything with the resulting path except write to it.
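If you do need a modifiable version of a cached file, one option is to copy it out of the cache first. The helper below is a hypothetical sketch for single files only (`ReadOnlyUse` and `writableCopy` are not part of the Datastore API); it uses nothing beyond `java.nio.file`.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object ReadOnlyUse {
  // Copy a cached file to a scratch location so edits never touch the cache.
  // Hypothetical helper; works for regular files, not directories.
  def writableCopy(cached: Path): Path = {
    val scratch = Files.createTempFile("datastore-copy", ".tmp")
    Files.copy(cached, scratch, StandardCopyOption.REPLACE_EXISTING)
    scratch
  }
}
```

The cached file stays untouched, so other processes sharing the cache still see the published content.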
To access a non-default datastore, for example the private one, call it like this:
val path: java.nio.file.Path =
  Datastore("private").directoryPath(
    "org.allenai.otter",
    "WordNet",
    1)
There is no way to automatically get the latest version from the datastore. This is by design. If you depend on the "latest" version of an item, your results are not reproducible, because someone might publish a new version and thus change what your code does.
There are two main ways to write to the datastore, one for files, and one for directories:
// publish BigModel.json under the name
// "GreedyParserModel.json", version 4
Datastore.publishFile(
  "BigModel.json",
  "org.allenai.parsers.poly-parser",
  "GreedyParserModel.json",
  4,
  false)

// publish the wordnet directory under the
// name "WordNet", version 1, and do so privately
Datastore("private").publishDirectory(
  "wordnet",
  "org.allenai.otter",
  "WordNet",
  1,
  false)
The datastore client needs to be authenticated with AWS. This happens using Amazon's default methods. In detail, this is what that means:
Create a file in ~/.aws/credentials, with the following content:
[default]
aws_access_key_id = <MYACCESSKEY>
aws_secret_access_key = <mysecretaccesskey>
Please replace <MYACCESSKEY> and <mysecretaccesskey> as appropriate. You can get these credentials in the AWS Console, under IAM/Users. Click on your username in the list of users, and then "Manage Access Keys" in the Security Credentials -> Access Credentials section. You should be able to add a key pair there.
There are access key files in the ops AWS keystore, located in /opt/ops/var/s3/ops-keystore/aws/datastore on production hosts. This includes a shell script you can source to pull in the environment variables needed. Simply add the following to your service script:
source /opt/ops/var/s3/ops-keystore/aws/datastore/credentials.sh
Since the datastore is just delegating authentication to the Amazon SDK, all the possibilities from the SDK work.
You can also go completely manual and create the datastore with an access key and secret key pair. To do this, create a datastore like this:
val datastore = Datastore("<myaccesskey>", "<mysecretkey>")
val privateDatastore = Datastore("private", "<myaccesskey>", "<mysecretkey>")
The cache location depends on the operating system. On the Mac, it's in ~/Library/Caches/org.allenai.datastore/. Everywhere else, it's in ~/.ai2/datastore. There are two ways to override this. In order of precedence:
- The AI2_DATASTORE_DIR environment variable.
- The org.allenai.datastore.dir Java system property.
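The precedence described above can be sketched as a small resolver. This is an illustration of the stated rules, not the Datastore's own code: `CacheLocation` and `resolve` are hypothetical names, the environment and system properties are passed in as plain maps so the logic is easy to exercise, and detecting the Mac by looking for "mac" in the OS name is an assumption.

```scala
import java.nio.file.{Path, Paths}

object CacheLocation {
  // Illustrative only: environment variable first, then system property,
  // then the OS-dependent default described in the text.
  def resolve(
    env: Map[String, String],
    props: Map[String, String],
    osName: String,
    home: String
  ): Path = {
    env.get("AI2_DATASTORE_DIR")
      .orElse(props.get("org.allenai.datastore.dir"))
      .map(dir => Paths.get(dir))
      .getOrElse {
        // Assumption: the OS name contains "mac" on macOS.
        if (osName.toLowerCase.contains("mac"))
          Paths.get(home, "Library", "Caches", "org.allenai.datastore")
        else
          Paths.get(home, ".ai2", "datastore")
      }
  }
}
```

Against the real environment, this could be called as `CacheLocation.resolve(sys.env, sys.props.toMap, sys.props("os.name"), sys.props("user.home"))`.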
The datastore is completely thread-safe. Similarly, two processes (and by extension two threads as well) requesting the same item at the same time will not fall over each other, and will not download the same file twice.
To achieve this, it assumes that temporary files are created on the same file system where the cache lives. This is the case in virtually all instances. However, if that is not the case, due to a change in cache location, or by virtue of a really quirky setup, concurrent access will no longer be safe.
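The same-file-system requirement is typical of code that writes to a temporary file and then renames it into place. Whether the Datastore does exactly this is an assumption; the sketch below only shows why the requirement matters, using the standard `Files.move` with `ATOMIC_MOVE`. `AtomicPublish` and `writeAtomically` are hypothetical names.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object AtomicPublish {
  // Assumed technique, not confirmed by the Datastore source: write to a
  // temp file next to the target, then move it into place in one step.
  def writeAtomically(target: Path, bytes: Array[Byte]): Path = {
    val dir = target.toAbsolutePath.getParent
    Files.createDirectories(dir)
    // The temp file lives in the target's directory, so both paths share
    // a file system and the move below can be atomic.
    val tmp = Files.createTempFile(dir, "download", ".tmp")
    Files.write(tmp, bytes)
    // Across file systems this throws AtomicMoveNotSupportedException.
    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
  }
}
```

Other readers see either no file or the complete file, never a partial download. If the temporary file sat on a different file system, the atomic move would fail.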
When the requested file or directory is in the cache, the datastore call returns immediately. When it's not, the call blocks while it downloads the item. This may cause problems with Akka, if Akka decides to kill a thread that doesn't return within a timeout. The workaround is to warm up the cache by requesting every item you will ever need before initializing the Actor system.
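One way to structure the warm-up is to collect every lookup as a function and force them all at startup. `CacheWarmup` and `warmUp` are hypothetical names for this sketch; each function would wrap a real call such as `Datastore.filePath(...)` or `Datastore.directoryPath(...)`.

```scala
import java.nio.file.Path

object CacheWarmup {
  // Hypothetical helper: run every lookup once, in order, and return the
  // resulting paths. Any blocking download happens here, before the actor
  // system exists.
  def warmUp(lookups: Seq[() => Path]): Seq[Path] =
    lookups.map(lookup => lookup()).toList
}
```

For example, pass `Seq(() => Datastore.filePath("org.allenai.parsers.poly-parser", "GreedyParserModel.json", 4), () => Datastore.directoryPath("org.allenai.otter", "WordNet", 1))` to `warmUp`, and create the Actor system only after it returns.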
It's entirely possible, encouraged even, to look at the bucket in S3. The bucket name is always <datastorename>.store.dev.allenai.org. The S3 keys follow the schema <group>/<name>-v<version> for files without extensions, <group>/<name>-v<version>.<ext> for files with extensions, and <group>/<name>-d<version>.zip for directories.
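The naming scheme can be written down as a few string helpers, which is handy when browsing the bucket by hand. This is a sketch of the schema as stated, not code from the Datastore: `DatastoreKeys` and its methods are hypothetical, and treating everything after the last dot as the extension is an assumption.

```scala
object DatastoreKeys {
  // Bucket name for a given datastore, e.g. "public" or "private".
  def bucket(datastoreName: String): String =
    s"$datastoreName.store.dev.allenai.org"

  // Key for a file. Assumption: the extension is whatever follows the
  // last dot in the item name; names without a dot have no extension.
  def fileKey(group: String, name: String, version: Int): String = {
    val dot = name.lastIndexOf('.')
    if (dot <= 0) s"$group/$name-v$version"
    else s"$group/${name.substring(0, dot)}-v$version${name.substring(dot)}"
  }

  // Key for a directory, which is stored as a zip archive.
  def directoryKey(group: String, name: String, version: Int): String =
    s"$group/$name-d$version.zip"
}
```

Under these assumptions, version 4 of GreedyParserModel.json in org.allenai.parsers.poly-parser maps to the key org.allenai.parsers.poly-parser/GreedyParserModel-v4.json.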