📖 Update data request documentation, #1038
Open
the-bay-kay wants to merge 3 commits into e-mission:master from the-bay-kay:update-data-request
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,63 +1,14 @@ | ||
# Requesting Data as a Collaborator | ||
# Requesting & Using Data as a Collaborator | ||
--- | ||
|
||
The consent document for e-mission (https://e-mission.eecs.berkeley.edu/consent) allows the platform owner (@shankari in this case) to share **de-linked** raw data with other collaborators for research. | ||
The **Transportation Secure Data Center (TSDC)** hosts data collected by OpenPATH during a variety of surveys. This data can be used to replicate previous study findings, generate new visualizations, or simply to explore the platform's capabilities. To request data from a specific program, please visit the TSDC [website](https://www.nrel.gov/transportation/secure-transportation-data/index.html). | ||
|
||
> Time-delayed subsets of individual trajectory data, associated with their UUIDs but not email addresses, may by shared with collaborators, or released as research datasets to the community from time to time. If this is done, the time delay for sharing with collaborators will be at least one month, and the time delay for releasing to the community will be at least one year. Both collaborators and researchers will be asked to agree that they will publish only aggregate, non personally identifiable results, and will not re-share the data with others. | ||
|
||
It also allows other researchers to use it to conduct studies. In this case, all data, including the **link** between the email address and the UUID will be made available to the researcher. | ||
|
||
> If this platform is being used to collect data for a study conducted by another researcher, for example, from a Transportation Engineering Department, then you will be asked to assent to a separate document outlining the data association, retention and sharing policies for that study, **in addition to the policies above**. We will make all data, including the mapping between the email address and the UUID, directly available to the lead researcher for the main study. This will allow them to associate the automatically gathered information with demographic data, and any pre and post surveys that they conduct as part of their study. The other researcher may also choose to compensate you for your time, as described in the protocol document for that study. | ||
|
||
This document provides the procedure to request access to such kinds of data. Most of the procedure is common; differences between them are labelled **linked** and **de-linked**. | ||
|
||
## Setup GPG ## | ||
|
||
We will send and receive data encrypted/signed using GPG. | ||
1. The steps for creating a GPG keypair are at https://www.gnupg.org/gph/en/manual/c14.html. | ||
1. Create a keypair and export it. | ||
1. Send me (@shankari, [email protected]) the public key via email. | ||
|
||
## Data request ## | ||
|
||
### De-linked ### | ||
Next, you need to formally request access by filling out a pdf form. | ||
|
||
1. I will send you an encrypted version of the form you need to fill out and a copy of *my* public key. | ||
1. Decrypt it using https://www.gnupg.org/gph/en/manual/x110.html. | ||
1. Fill it out and sign it physically. | ||
1. Also sign it electronically https://www.gnupg.org/gph/en/manual/x135.html | ||
1. Encrypt it using my public key https://www.gnupg.org/gph/en/manual/x110.html and send it to me | ||
|
||
If all of this works, we know that we have bi-directional encrypted communication over email. Make sure to encrypt any privacy sensitive information (e.g. subsets of data for debugging) that you send to me in the future. | ||
|
||
### Linked ### | ||
You need to send me a copy of your IRB approval and your consent document to ensure that you have permission to collect data. | ||
|
||
## Data retrieval ## | ||
|
||
### De-linked ### | ||
1. As you can see from the consent document, you can get access to data that is time-delayed by 1 month. | ||
1. I will upload an encrypted zip file with ~ 3 months of data to google drive and send you a link. | ||
|
||
Note that this data is very privacy-sensitive, so think through your answers on the request form carefully and make sure that you follow them. Treat the data as you would like your data to be treated. | ||
|
||
### Linked ### | ||
1. I will upload an encrypted zip file with all your data to google drive and send you a link. | ||
|
||
|
||
### Both ### | ||
1. You need to decrypt it just like you decrypted the pdf form https://www.gnupg.org/gph/en/manual/x110.html. | ||
1. When unzipped, the data consists of multiple json files, one per user. | ||
1. The data will typically contain both raw sensed data (e.g. `background/location`) and processed data (e.g. `analysis/cleaned_trip`) | ||
1. Data formats for the json objects are at `emission/core/wrapper` (e.g. `emission/core/wrapper/location.py` and `emission/core/wrapper/cleanedtrip.py`) | ||
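
To get a feel for what one of these per-user files contains, the entries can be tallied by key before loading anything into a database. The following is a minimal sketch, not part of the e-mission tooling: it assumes each file holds a JSON array of entries and that each entry carries its key (such as `background/location`) under `metadata.key`. The helper names `load_user_dump` and `count_entries_by_key` are hypothetical.

```python
import json
from collections import Counter


def count_entries_by_key(entries):
    """Count entries per `metadata.key` (assumed location of the key)."""
    return Counter(
        entry.get("metadata", {}).get("key", "<missing key>")
        for entry in entries
    )


def load_user_dump(path):
    """Load one per-user file, assumed to hold a JSON array of entries."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a real per-user file.
    sample = [
        {"metadata": {"key": "background/location"}, "data": {}},
        {"metadata": {"key": "background/location"}, "data": {}},
        {"metadata": {"key": "analysis/cleaned_trip"}, "data": {}},
    ]
    for key, count in count_entries_by_key(sample).most_common():
        print(f"{key}: {count}")
```

For a real file, pass the result of `load_user_dump(path)` to `count_entries_by_key`. If the actual layout differs from the assumptions above, the wrapper classes in `emission/core/wrapper` are the authoritative description of the format.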
|
||
## Data analysis ## | ||
## Data Analysis - Server ## | ||
|
||
While it is possible to analyse the raw data, it is large, so you may want to load it into a database to work with. That will also allow you to write code that is compatible with the server, so that we can more easily incorporate your analysis into the standard e-mission server. | ||
|
||
### Install the server ### | ||
Follow the README and install e-mission server locally on your own laptop. | ||
Follow the [README](https://github.com/e-mission/e-mission-server) and install e-mission server locally on your own laptop. | ||
|
||
### Load the data ### | ||
Load the data into your local database. Since this data contains information from multiple users, and you presumably want to retain the UUIDs to correlate with other surveys that you might have performed, you should use the `load_multi_timeline_for_range.py` script. Since there are multiple files, the timeline will typically be a directory, and you should pass in the prefix. For example, if the user files are `all_users_sep_dec_2016/dump_0109c47b-e640-411e-8d19-e481c52d7130`, `all_users_sep_dec_2016/dump_026f8d13-4d7a-4f8f-8d35-0ec22b0f8f8b`, ..., you should run the following command line. | ||
|
@@ -95,16 +46,67 @@ You can also remove the data by using `bin/purge_database_json.py`, which will d | |
./e-mission-py.bash bin/debug/purge_multi_timeline_for_range.py all_users_sep_dec_2016 | ||
``` | ||
|
||
### Play with the data ### | ||
|
||
### Play with the Data ### | ||
An example ipython notebook that shows data access parameters is at | ||
https://github.com/e-mission/e-mission-server/blob/master/Timeseries_Sample.ipynb | ||
|
||
It has examples on how to access raw data, processed data, and plot points. | ||
Please use the timeseries interfaces as opposed to direct mongodb queries wherever possible. | ||
That will make it easier to migrate to other, more scalable timeseries later. | ||
|
||
Again, data formats are at | ||
https://github.com/e-mission/e-mission-server/tree/master/emission/core/wrapper | ||
|
||
Let me (@shankari) know if you have any further questions... | ||
## Alternative Analysis Methods ## | ||
|
||
There are a few ways to explore the data beyond the server. Generally, these methods require a "mongodump" file -- a collection of data, archived in `.tar.gz` format. Here are the broad steps you need to take in order to work with this data: | ||
|
||
|
||
1. **Start Docker**: Ensure you have docker installed on your machine, and a `docker-compose.yml` file saved to your chosen repository. The following command should start the development environment: | ||
```bash | ||
$ docker-compose -f [example-docker-compose].yml up | ||
``` | ||
Example docker config files can be found in the server repository [here](https://github.com/e-mission/e-mission-server/blob/d2f38bc18d5c415888451e7ad98d40325a74c999/emission/integrationTests/docker-compose.yml#L4). The general construction of a compose file is as follows: | ||
|
||
```yml | ||
version: "3" | ||
services: | ||
  db: | ||
    image: mongo:4.4.0 | ||
    volumes: | ||
      - mongo-data:/data/db | ||
    networks: | ||
      - emission | ||
    ports: | ||
      - "27017:27017" # May change depending on repo | ||
|
||
networks: | ||
  emission: | ||
|
||
volumes: | ||
  mongo-data: | ||
``` | ||
2. **Load your data**: There are a few ways to go about this: | ||
- Certain repositories will have a `load_mongodump.sh` script. Provided the correct Docker container was started in the previous step, this should load all of the data for you. | ||
- Depending on the data being analyzed, loading the entire mongodump may take a _very_ long time. Ensure that Docker's resource limits are increased accordingly, and that ample time is set aside for the loading process. | ||
- If only a portion of the data is needed, the mongodump can be unzipped and its individual components loaded into the Docker container. | ||
- First, unpack your mongo dump file by running `tar -xvf [your_mongo_dump.tar.gz]` | ||
- Navigate to the unzipped folder. Create a new directory, `./dump/Stage_database/`. Copy your data files into this new directory. | ||
- Copy the new `./dump/Stage_database` directory into your Docker container's `/tmp/` directory. This can be done by dragging and dropping the directory via the Docker Desktop client, or via the command line. | ||
- Using the following commands, connect to your Docker container and restore the data: | ||
```bash | ||
$ docker exec -it [your_docker_container_name] /bin/bash | ||
root@12345:/# cd tmp; mongorestore | ||
``` | ||
- More information on this approach can be found in the public dashboard [ReadMe](https://github.com/e-mission/em-public-dashboard/blob/main/README.md#large-dataset-workaround). | ||
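
The command-line variant of the partial-load steps above can be sketched as a short script. This is an illustrative sketch, not a script from any of the repositories: the function name `build_restore_commands`, the archive name, and the container name are placeholders, and the function only builds the commands rather than running them. It assumes that `./dump/Stage_database/` has been populated by hand after unpacking, and that `mongorestore` is run from `/tmp` so that it picks up the `dump` directory by default.

```python
import shlex


def build_restore_commands(archive, container, dump_dir="dump"):
    """Return the argv lists for a partial restore; nothing is executed here.

    Between the first and second command, the files to restore still have to
    be copied into `<dump_dir>/Stage_database/` manually.
    """
    return [
        # Unpack the mongodump archive.
        ["tar", "-xvf", archive],
        # Copy the prepared dump directory into the container's /tmp/.
        ["docker", "cp", dump_dir, f"{container}:/tmp/"],
        # Run mongorestore from /tmp inside the container.
        ["docker", "exec", container, "/bin/bash", "-c", "cd /tmp && mongorestore"],
    ]


if __name__ == "__main__":
    # Placeholder names; substitute your own archive and container.
    for argv in build_restore_commands("your_mongo_dump.tar.gz", "your_container_name"):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before running each one with `subprocess.run(argv, check=True)` or pasting them into a shell.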
|
||
|
||
In general, it is best to follow the instructions of the repository you are working with. There are subtle differences between them, and these instructions are intended as general guidance only. | ||
We should unify these, but obviously we should keep this documentation until we do.
||
|
||
### Public Dashboard ### | ||
This repository has several ipython notebooks that may be used to visualize raw data. For detailed instructions on working with the dashboard, please consult the repository's [ReadMe](https://github.com/e-mission/em-public-dashboard/blob/main/README.md). | ||
|
||
### Private Eval ### | ||
Like the public dashboard, this repository contains several notebooks that may be used to process raw data. Rather than focusing on visualization, these notebooks are designed to evaluate the efficacy of OpenPATH, test new algorithms, and provide some additional visualizations. Further details, including how to load data into this repository, may be found in the repository's [ReadMe](https://github.com/e-mission/e-mission-eval-private-data/blob/master/README.md). | ||
|
||
## Final Notes ## | ||
|
||
For more information on how data is formatted, feel free to explore the [emission/core/wrapper/](https://github.com/e-mission/e-mission-server/tree/master/emission/core/wrapper) portion of the server repository. | ||
|
||
Please contact @shankari if you have any further questions! |
This is actually the deprecated method now. We sometimes internally use the user specific dumps to reproduce errors, but for external users, they either get the mongodump, or download csv files from their admin dashboard.
Made a change to specify this is for internal testing only! Let me know if I should be more specific about it being a deprecated method.
Should I add a footnote about working with CSVs? I've only worked with the mongodump format, but could ask around for help writing a section on that process.