Solr API
LabCAS uses the Solr search engine in order to store, search, and retrieve metadata for the science data in the EDRN Cancer Biomarker Commons. LabCAS also provides an API that lets authenticated EDRN users run searches on the Solr API. This document will help you get started with this API.
The intent of this documentation is not to replace Solr's documentation. You are encouraged to read the Solr Common Query Parameters documentation to learn how to construct queries for Solr. Some example queries will be given within this document. Please note LabCAS uses Solr version 6.6.
This document uses Postman to make queries to the LabCAS Solr API, as it takes care of formulating URLs, quoting parameters, and so forth. Postman can also generate code for Python, C#, Java, etc., as well as for the `curl` command, so it serves as a nice umbrella technology.
If you prefer to write code instead of using Postman, you can craft queries for the LabCAS Solr API yourself. Two example programs are available (in Python) that demonstrate this capability:
- `cibbbcd_events.py` — this program extracts event IDs from the Solr "datasets" core for the LabCAS collection "Combined Imaging and Blood Biomarkers for Breast Cancer Diagnosis"
- `events_by_blind.py` — this program displays event IDs from the Solr "files" core given a blinded site ID as a parameter
You can read over the source code for these, or install the example programs onto your system for direct execution; see the README titled "Data Access API: Examples" and the source code for more information.
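If you do go the code route, the general pattern those programs follow can be sketched with just the Python standard library. This is a minimal sketch, not the example programs' actual code; everything other than the base URL and the standard Solr parameter names (`q`, `wt`, `rows`) is illustrative, so substitute your own EDRN credentials and query:

```python
# Minimal sketch of querying the LabCAS Solr API with the standard library
# only. Credentials are sent via HTTP Basic authentication.
import base64
import json
import urllib.parse
import urllib.request

BASE = 'https://edrn-labcas.jpl.nasa.gov/data-access-api'


def build_request(core, q, username, password, **params):
    '''Build an authenticated GET request against one Solr core
    ("collections", "datasets", or "files").'''
    query = urllib.parse.urlencode({'q': q, 'wt': 'json', **params})
    request = urllib.request.Request(f'{BASE}/{core}/select?{query}')
    token = base64.b64encode(f'{username}:{password}'.encode()).decode()
    request.add_header('Authorization', f'Basic {token}')
    return request


def solr_select(core, q, username, password, **params):
    '''Execute the query and return the decoded JSON response.'''
    request = build_request(core, q, username, password, **params)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example (needs valid EDRN credentials, so it is commented out):
# result = solr_select('files', 'biomarker', 'myuser', 'mypass', rows=5)
# print(result['response']['numFound'])
```

The same helper works for any of the three cores, since they share Solr's `/select` query interface.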
The remainder of this document will show how to use the Solr API directly using Postman.
First, download and install Postman. Postman is free software. There is also a web version, but for this document we'll use the desktop version.
Launch Postman for the first time, and in the lower-right, in the bottom status bar, click "⌂ Vault". The "vault" is where we'll store your EDRN username and password.
The first time you do this, you'll be prompted to encrypt your vault. That's a good security measure, so go ahead and click the Encrypt button. You'll get a "vault key" which you can use to unlock your vault in the future. You can save this key (a long hexadecimal number) in a safe place such as your password manager. Finally, press "Open Vault".
In the table of vault secrets, click "Add new secret" and name it `edrn_username`. For the value, put in your EDRN username. Under "Allowed Domains", enter https://edrn-labcas.jpl.nasa.gov.
Repeat this, but for the next secret call it `edrn_password`. For the value, put in your EDRN password. Use the same "Allowed Domain".
Finally, close the vault by clicking the ⤫ in the tab bar at the top.
We have created a Postman Collection that describes the LabCAS Solr API. With this, you won't have to worry about setting the URL, authorization, or query parameters.
Download the Postman Collection for the LabCAS Solr API.
Once downloaded, import it into your Postman from the "File → Import" menu.
Once you've got the Postman Collection imported, you should have a new item in your Postman Workspace, "LabCAS Solr API". You can expand the collection and see the three endpoints:
- Collections — describes the high-level science data collections in LabCAS
- Datasets — organizes the data in collections into groupings, typically associated with parts of a study (case versus control) or participants, or by other logical separation. Datasets can contain either other datasets (forming a hierarchy) or files in LabCAS
- Files — represents the metadata for individual files of scientific data, such as DICOM files. This core lets you retrieve the metadata for files. Note that downloading the actual files is handled by a separate API, not described here
To use these endpoints:
- Select Collections, Datasets, or Files
- Click the "Params" tab if it's not already visible
- Enter a Solr query in the `q` parameter; fill in other parameters as needed
- Press "Send"
As a test, try this:
- Select "Collections"
- In the "Params" tab, type "biomarker" into the `q` parameter (meaning "show all collections with the word `biomarker` in them")
- Leave all other parameters at their defaults
- Press "Send"
In the lower half of the screen, make sure the "Pretty" and "JSON" formats are selected. You should see around 16 collections that match. Feel free to try the other options, AI formatting features, code conversions, etc.
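The JSON that Postman pretty-prints has the standard Solr response shape: a `responseHeader` plus a `response` object whose `docs` list holds the matching records. A short sketch of pulling collection names out of such a response; the sample document below is illustrative, not real LabCAS data:

```python
# Sketch of unpacking a standard Solr JSON response; the sample document is
# illustrative, not real LabCAS data.
sample = {
    'responseHeader': {'status': 0},
    'response': {
        'numFound': 16,
        'start': 0,
        'docs': [
            {'id': 'c1', 'CollectionName': ['Sample Biomarker Collection']},
        ],
    },
}


def collection_names(solr_json):
    '''Return the CollectionName of each matching document.'''
    names = []
    for doc in solr_json['response']['docs']:
        value = doc.get('CollectionName', [])
        # Solr multi-valued fields arrive as lists; take the first entry.
        names.append(value[0] if isinstance(value, list) else value)
    return names


print(collection_names(sample))  # ['Sample Biomarker Collection']
```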
The following are a few queries you can try:
- "Return `eventID` for files with `CollectionName` of `Lung Team Project 2 Images` in JSON format"
  - Use the "Files" endpoint
  - Set `q` to `CollectionName:"Lung Team Project 2 Images"` — note the quotes since the name has spaces
  - Set `fl` to `eventID`
  - Set `wt` to `json`
  - Set `rows` to `999999` — adjust this as needed, or use `rows` + `start` to paginate
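The same query can be expressed outside Postman; the only subtle part is URL-encoding the quoted collection name. A sketch of building just the query string, using the parameter values from the steps above:

```python
import urllib.parse

# Parameter values from the steps above; urlencode escapes the embedded
# double quotes and spaces in q for us.
params = {
    'q': 'CollectionName:"Lung Team Project 2 Images"',
    'fl': 'eventID',
    'wt': 'json',
    'rows': 999999,
}
query_string = urllib.parse.urlencode(params)
print(query_string)
# q=CollectionName%3A%22Lung+Team+Project+2+Images%22&fl=eventID&wt=json&rows=999999
```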
- "All details of collections with `SpecimenType` of `Serum` in XML format"
  - Use the "Collections" endpoint
  - Set `q` to `SpecimenType:Serum`
  - Set `rows` to `99999` — adjust this as needed, or use `rows` + `start` to paginate
  - Set `wt` to `xml`
- "Top 10 `LeadPI` names and `LeadPIId` IDs of all datasets with `CollectionName` of `Lung Team Project 2 Images` in JSON format"
  - Use the "Datasets" endpoint
  - Set `q` to `CollectionName:"Lung Team Project 2 Images"`
  - Set `fl` to `LeadPI,LeadPIId`
  - Set `rows` to `10`
  - Set `wt` to `json`
- "The ID, data custodian, and data custodian email of the top 100 files with `City_of_Hope` in their IDs in CSV format"
  - Use the "Files" endpoint
  - Set `q` to `id:*City_of_Hope*`
  - Set `fl` to `id,DataCustodian,DataCustodianEmail`
  - Set `rows` to `100`
  - Set `wt` to `csv`
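The `rows` + `start` pagination mentioned above can be sketched as a simple loop. The `fetch_page` callable here is a hypothetical stand-in for whatever authenticated HTTP request you use; the fake in-memory index just demonstrates the loop's behavior:

```python
def iterate_docs(fetch_page, page_size=100):
    '''Yield every matching document, fetching page_size rows at a time.

    fetch_page(start, rows) must return a decoded Solr JSON response.
    '''
    start = 0
    while True:
        page = fetch_page(start, page_size)
        docs = page['response']['docs']
        if not docs:
            break
        yield from docs
        start += len(docs)
        if start >= page['response']['numFound']:
            break


# Demonstration with a fake in-memory "Solr" holding 250 documents:
fake_index = [{'id': str(n)} for n in range(250)]


def fake_fetch(start, rows):
    return {'response': {'numFound': len(fake_index),
                         'docs': fake_index[start:start + rows]}}


print(sum(1 for _ in iterate_docs(fake_fetch)))  # 250
```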
Please note that the Postman Collection provided above includes only a subset of the parameters that Solr supports. If you're confident in your programming skills, `curl` command usage, etc., feel free to use advanced parameters like `fq`, `facet`, `facet.field`, and so forth.
Consulting the Solr query documentation can be helpful in these cases, as well as custom Solr clients for your programming languages of choice.
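As a sketch of what those advanced parameters look like together, the dictionary below narrows datasets to one collection with `fq` and asks for per-value counts of the `LeadPI` field. The field names come from the examples above; the faceting behavior itself is standard Solr, not anything LabCAS-specific:

```python
import urllib.parse

# Sketch of advanced parameters: fq narrows the result set without affecting
# relevance scoring, and facet/facet.field ask Solr for per-value counts.
params = {
    'q': '*:*',
    'fq': 'CollectionName:"Lung Team Project 2 Images"',
    'facet': 'true',
    'facet.field': 'LeadPI',
    'rows': 0,  # we only want the facet counts, not the documents
    'wt': 'json',
}
query_string = urllib.parse.urlencode(params)
print(query_string)
```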
ChatGPT or other large language models can be instrumental in helping to express the `q`, `fq`, etc. parameter syntax without the need to fully understand Solr's query language.
As an example, this prompt presented to the ChatGPT "4o" model:
Write a curl command for Solr at https://edrn-labcas.jpl.nasa.gov/data-access-api/files/select that takes HTTP Basic username EDRNUSERNAME with password EDRNPASSWORD to return the non-empty "eventID" fields for the first 100 files where the "CollectionName" field is "Lung Team Project 2 Images"
produces a valid `curl` command as of this writing.
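For comparison, here is a sketch of the same query built in Python (not the model's actual output). The Solr range clause `eventID:[* TO *]` is the standard way to require a non-empty field:

```python
import urllib.parse

# Sketch of the query the prompt describes: the first 100 files in the named
# collection that have a non-empty eventID. [* TO *] is standard Solr syntax
# for "this field is present".
params = {
    'q': 'CollectionName:"Lung Team Project 2 Images" AND eventID:[* TO *]',
    'fl': 'eventID',
    'rows': 100,
    'wt': 'json',
}
url = ('https://edrn-labcas.jpl.nasa.gov/data-access-api/files/select?'
       + urllib.parse.urlencode(params))
print(url)
# Then fetch it with HTTP Basic authentication, e.g.:
#   curl -u EDRNUSERNAME:EDRNPASSWORD "$URL"
```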