Portal Analytics

This repository is my attempt to analyze the merged raw data from the portals. My goal is to find correlations between variables that could support derived or automated annotations.

To Do

Develop decision tree model
Implement Hydra in project for experimentation
Explore mutual information between features
- Connection between file format and assay, because the assay might constrain which file formats are available
- Connection between assay and study
- Connection between file name and resource type
- Connection between assay and data type
- Connection between resource type and data subtype
Cleaning
- Replace acronyms with full spellings for more features
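As a starting point for the mutual information item, here is a minimal sketch that estimates mutual information between two categorical columns from co-occurrence counts, using only the standard library. The function name and the sample values are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate mutual information (in bits) between two categorical sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)), written with raw counts
        mi += p_xy * math.log2(count * n / (px[x] * py[y]))
    return mi


# Hypothetical values for illustration only
assay = ["rnaSeq", "rnaSeq", "wgs", "wgs"]
file_format = ["fastq", "fastq", "bam", "bam"]
print(mutual_information(assay, file_format))
```

A value near zero suggests the two features are close to independent, while larger values suggest that one feature constrains the other.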

Setup

Rename .env-template to .env.
Update the .env file with your information.

Data

snowflakeData.csv was pulled from SAGE.PORTAL_RAW.PORTAL_MERGE on October 31, 2023.

Cleaning

The data is pulled from Snowflake and labeled with the date of the pull. Any columns that are completely empty are removed from the dataset. The data was cleaned using ./cleaning.ipynb. The major work was to clean up the names and reconcile spelling differences in variables like assay. I used Levenshtein distance to determine which spellings were similar. This is ongoing work as I move through the EDA.
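The spelling comparison can be sketched as follows. This is a standard-library illustration, not the code in ./cleaning.ipynb. The helper names, the `max_distance` threshold, and the sample values are assumptions.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similar_pairs(values, max_distance=2):
    """List pairs of distinct values whose case-insensitive distance is small."""
    unique = sorted(set(values))
    pairs = []
    for i, left in enumerate(unique):
        for right in unique[i + 1:]:
            distance = levenshtein(left.lower(), right.lower())
            if distance <= max_distance:
                pairs.append((left, right, distance))
    return pairs


# Hypothetical spellings for illustration only
for pair in similar_pairs(["rnaSeq", "RNA-seq", "RNAseq", "wholeGenomeSeq"]):
    print(pair)
```

Pairs flagged this way still need a manual check, since a small distance does not guarantee that two values mean the same thing.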

I pulled terms from OLS4 using ./ontologyScraper.py to cross-reference variables such as assay types and file formats, in the hope of matching our annotations against a known third-party ontology source.
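Once the ontology terms are available locally, the matching step might look like the sketch below. It is not the logic of ./ontologyScraper.py. The function names, the `cutoff` value, and the sample labels are assumptions chosen for illustration.

```python
import difflib


def normalize(label):
    """Lowercase a label and drop everything except letters and digits."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def match_to_ontology(annotations, ontology_labels, cutoff=0.85):
    """Map each annotation to its closest ontology label, or None if none is close enough."""
    by_key = {normalize(label): label for label in ontology_labels}
    matches = {}
    for annotation in annotations:
        key = normalize(annotation)
        close = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
        matches[annotation] = by_key[close[0]] if close else None
    return matches


# Hypothetical labels for illustration; real terms would come from the OLS4 pull
ontology_labels = ["RNA-Seq", "whole genome sequencing", "ChIP-seq"]
annotations = ["rnaSeq", "Chip seq", "unknownAssay"]
print(match_to_ontology(annotations, ontology_labels))
```

Normalizing before comparing means differences in case, spacing, and punctuation do not block a match. The cutoff controls how much spelling variation is tolerated.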

I also derived some file formats by splitting file names on ".". This makes zip files a larger portion of the data, so a good next step would be to strip the zip extensions and look at the underlying file types.
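A minimal sketch of that derivation, including the proposed next step of looking past compression extensions, is shown below. The set of compression extensions and the function name are assumptions, and the result is lowercased, which may differ from the casing used in the dataset.

```python
COMPRESSION_EXTENSIONS = {"zip", "gz", "bz2", "xz"}  # assumed list, extend as needed


def derive_file_format(file_name, strip_compression=True):
    """Return the file format implied by the name's extension(s), or None."""
    parts = file_name.split(".")
    extensions = [part.lower() for part in parts[1:] if part]
    if not extensions:
        return None
    if strip_compression:
        # Drop trailing compression extensions, but keep one if nothing else remains
        while len(extensions) > 1 and extensions[-1] in COMPRESSION_EXTENSIONS:
            extensions.pop()
    return extensions[-1]


# Hypothetical file names for illustration only
for name in ["sample.fastq.gz", "sample.bam", "archive.zip", "README"]:
    print(name, derive_file_format(name))
```

With `strip_compression=False` the function returns only the last extension, so "sample.fastq.gz" yields "gz" rather than "fastq".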

EDA

*Proportions are shown only for values that make up more than 5% of the dataset.

File Formats

The most common file formats are "gz" and "BAM".

fileFormat	count	proportion (%)
BAM	33143	10.10
gz	133007	40.53
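A table like this can be produced with a short helper such as the one below. It is a sketch rather than the notebook code. The function name and the `min_percent` parameter are assumptions, with the default set to the 5% cutoff noted above.

```python
from collections import Counter


def proportion_table(values, min_percent=5.0):
    """Return (value, count, percent) rows for values above min_percent, largest first."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    for value, count in counts.most_common():
        percent = 100 * count / total
        if percent > min_percent:
            rows.append((value, count, round(percent, 2)))
    return rows


# Hypothetical values for illustration only
values = ["gz"] * 50 + ["BAM"] * 48 + ["txt"] * 2
for row in proportion_table(values):
    print(row)
```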

Study

Assay

Data Types

Data Subtypes

dataSubtype	count	proportion (%)
raw	124908	67.07
processed	51846	27.84

Resource Types

My initial EDA shows that the file formats vary across the assay types, which seems a little odd to me. I think splitting the data into "RAW" vs. "Processed/Analysis" types would help differentiate which files are used in which processes.
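One way to try that split is to count file formats per assay within a single data subtype, as in the sketch below. The record layout is an assumption: `fileFormat` and `dataSubtype` follow the column names in the tables above, `assay` is a guessed column name, and the sample records are made up.

```python
from collections import Counter, defaultdict


def formats_by_assay(records, subtype):
    """Count file formats per assay, restricted to one data subtype."""
    table = defaultdict(Counter)
    for record in records:
        if record["dataSubtype"] == subtype:
            table[record["assay"]][record["fileFormat"]] += 1
    return {assay: dict(counts) for assay, counts in table.items()}


# Hypothetical records for illustration only
records = [
    {"assay": "rnaSeq", "fileFormat": "fastq", "dataSubtype": "raw"},
    {"assay": "rnaSeq", "fileFormat": "fastq", "dataSubtype": "raw"},
    {"assay": "rnaSeq", "fileFormat": "BAM", "dataSubtype": "processed"},
    {"assay": "wgs", "fileFormat": "BAM", "dataSubtype": "processed"},
]
print(formats_by_assay(records, "raw"))
print(formats_by_assay(records, "processed"))
```

Comparing the two outputs shows whether the spread of file formats within an assay shrinks once raw and processed files are separated.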

References

Snowflake Python SDK

Programmatic access: Link
