This repository is my attempt to analyze the merged raw data from the portals. My goal is to find correlations between variables that could support derived or automated annotations.
- Develop decision tree model
- Implement Hydra in project for experimentation
- Explore mutual information between features
- Explore connection between file format and assay because assay might constrain what file formats are available
- Connection between assay and study
- Connection between file name and resource type
- Connection between assay and datatype
- Connection between resource type and dataSubtype

Cleaning
- Replace acronyms with full spellings for more features
- Rename .env-template to .env
- Update the .env file with your information
snowflakeData.csv was taken from SAGE.PORTAL_RAW.PORTAL_MERGE on October 31st, 2023.
The data is pulled from Snowflake and dated when it was pulled. Any columns that are completely empty are removed from the dataset. The data was cleaned using ./cleaning.ipynb. The major work was cleaning up the names and spelling differences for variables like assay. I used Levenshtein distance to determine which spellings were similar. This is ongoing work as I work through the EDA part.
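As a rough illustration of the spelling cleanup described above, here is a minimal sketch of grouping near-duplicate values by Levenshtein distance. It is not the code in ./cleaning.ipynb: the function names, the lowercase normalization, the distance threshold of 2, and the sample assay spellings are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def group_similar(values, max_distance=2):
    """Greedily put each value in the first group whose representative is close enough."""
    groups = {}  # normalized representative -> original spellings
    for value in values:
        key = value.strip().lower()
        for representative in groups:
            if levenshtein(key, representative) <= max_distance:
                groups[representative].append(value)
                break
        else:
            groups[key] = [value]
    return groups


# Hypothetical assay spellings, not taken from the dataset.
assays = ["rnaSeq", "RNA-seq", "rna seq", "ChIPSeq", "chip-seq", "wholeGenomeSeq"]
groups = group_similar(assays)
for representative, spellings in groups.items():
    print(representative, spellings)
```

A fixed threshold can merge short terms that are genuinely different, so the resulting groups are best treated as candidates for manual review rather than final mappings.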
I pulled terms from OLS4 as a cross-reference for variables like assay types and file formats using ./ontologyScraper.py, in the hope of matching our annotations with a known third-party ontology source.
I also derived some file formats by splitting file names on ".". This makes zip files a larger portion of the data, so a good next step would be further cleanup that removes zip extensions to expose the underlying file types.
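One possible way to approach that next step is sketched below: take the last extension of a file name, but look past trailing compression extensions when another extension sits underneath. The function name, the set of compression suffixes, and the example file names are assumptions for illustration, not part of this repository.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed set of compression/archive suffixes to look through; adjust as needed.
COMPRESSION_SUFFIXES = {"gz", "zip", "bz2"}


def derive_file_format(file_name: str) -> Optional[str]:
    """Return the lowercase extension of file_name, skipping trailing compression suffixes."""
    suffixes = [s.lstrip(".").lower() for s in PurePosixPath(file_name).suffixes]
    # Drop compression suffixes only while another suffix remains underneath.
    while len(suffixes) > 1 and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    return suffixes[-1] if suffixes else None


# Hypothetical file names.
for name in ["sample.fastq.gz", "aligned.bam", "archive.zip", "README"]:
    print(name, "->", derive_file_format(name))
```

A name such as archive.zip carries no inner extension, so it still resolves to zip; recovering the underlying types for those files would require inspecting the archive contents.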
*Proportions are shown only for values that make up more than 5% of the dataset.
The most common file formats are "gz" and "BAM", which together account for about half of the files.
| fileFormat | count | proportion (%) |
|---|---|---|
| BAM | 33143 | 10.10 |
| gz | 133007 | 40.53 |
| dataSubtype | count | proportion (%) |
|---|---|---|
| raw | 124908 | 67.07 |
| processed | 51846 | 27.84 |
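Tables like the ones above can be produced with a simple count-and-filter step. The following is a minimal sketch using only the standard library; the function name, the choice to exclude missing values from the denominator, and the toy column are assumptions, and the 5% cutoff mirrors the note above.

```python
from collections import Counter


def proportion_table(values, min_percent=5.0):
    """Return (value, count, percent) rows for values above min_percent, largest first."""
    # Assumption: missing values (None) are excluded from the denominator.
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    rows = []
    for value, count in counts.most_common():
        percent = round(100 * count / total, 2)
        if percent > min_percent:
            rows.append((value, count, percent))
    return rows


# Toy column, not the real data.
data_subtype = ["raw"] * 67 + ["processed"] * 28 + ["normalized"] * 3 + ["metadata"] * 2
rows = proportion_table(data_subtype)
for value, count, percent in rows:
    print(f"| {value} | {count} | {percent:.2f} |")
```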
My initial EDA shows that file formats vary across assay types, which seems a little odd to me. I think splitting "RAW" from "Processed/Analysis" types would help differentiate which files are used in different processes.
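One way to put a number on how strongly assay constrains file format is the mutual information between the two columns. The sketch below computes it for two categorical sequences using only the standard library; the function name and the toy values are assumptions, and this is not an analysis that has been run on the dataset.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length categorical sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written in terms of counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy example: file format is fully determined by assay.
assay = ["rnaSeq", "rnaSeq", "ChIPSeq", "ChIPSeq"]
file_format = ["fastq", "fastq", "bam", "bam"]
print(mutual_information(assay, file_format))
```

A value near zero would suggest the two variables are close to independent, while larger values mean that knowing one narrows down the other. Running this separately on raw and processed files would be one way to test the split suggested above.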





