This codebase supports the application of artificial intelligence to predicting the subcellular localization of solute carrier (SLC) transporter proteins. We developed an iterative method that harmonizes human annotations with AI model outputs. The repository provides a robust, modular pipeline for end-to-end SLC image analysis, covering data download, embedding generation, model training, and compartment-specific reporting.
- Overview
- Project Structure
- Prerequisites & Environment Setup
- Data Download
- Running the Pipeline
- Outputs & Results
- Troubleshooting & Tips
- Contact
## Overview

This project provides a complete workflow to:
- Download and validate large-scale imaging data
- Generate image embeddings using a pre-trained model
- Train and evaluate models for SLC compartment classification
- Produce detailed reports and summary statistics
The pipeline is modular, robust to interruptions, and easy to resume.
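For orientation, here is a minimal sketch of the embedding-generation step, assuming a pre-trained torchvision ResNet-50 with its classification head removed; the actual model and preprocessing are defined in `src/data/create_embeddings.py` and may differ.

```python
# Minimal sketch of embedding generation (assumes a torchvision
# ResNet-50 backbone; the actual model in src/data/create_embeddings.py
# may differ).
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained backbone with the classification head removed
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return a 2048-dimensional embedding for a single image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)
```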
## Project Structure

```
├── data/                  # Raw and processed data, including images and results
│   ├── file_download.sh   # Robust shell script for downloading images
│   └── ...
├── src/                   # Source code
│   ├── data/              # Data processing and embedding generation
│   ├── models/            # Model definitions
│   └── training/          # Training and evaluation scripts
├── main.py                # Main entry point for the pipeline
├── pyproject.toml         # Python dependencies
└── README.md              # This file
```
## Prerequisites & Environment Setup

- Python 3.10+
- Recommended: Linux/macOS with bash/zsh shell
- uv (fast Python package manager)
Install dependencies using uv:
```bash
# Install uv if not already installed
pip install uv

# Create and activate a virtual environment
uv venv .venv
source .venv/bin/activate

# Install all dependencies from pyproject.toml
uv pip install .
```

## Data Download

- Prepare the file list: Place your TSV file (e.g., `filelist_sample_HATAG.tsv`) in the `data/` directory.
- Run the download script:

  ```bash
  cd data
  bash file_download.sh 0 1000  # Download the first 1000 files (adjust as needed)
  ```
- The script is robust: it skips existing files, retries failed downloads, and validates images (see the validation sketch at the end of this section).
- To resume, or to download a different range, adjust the start/end row arguments.
- For large downloads, use `screen` or `tmux` to avoid interruption.
- Download annotated data: fetch the annotation spreadsheet directly from the RESOLUTE website: https://dataresolute.blob.core.windows.net/public/annotation/SLC_localization.xlsx
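The validation step mentioned above can be illustrated in Python; this is a sketch of the kind of check involved, assuming Pillow is available (the shell script's actual check may differ).

```python
# Sketch of the image-validation idea behind the download step
# (assumes Pillow; the shell script's actual check may differ).
from pathlib import Path
from PIL import Image

def is_valid_image(path: Path) -> bool:
    """Return True if the file parses as an image."""
    try:
        with Image.open(path) as img:
            img.verify()  # raises if the file is truncated or corrupt
        return True
    except Exception:
        return False

# Example: count valid images under data/ (the .jpg glob is an
# assumption; adjust it to match the actual file extensions).
count = sum(is_valid_image(p) for p in Path("data").glob("*.jpg"))
print(f"{count} valid images downloaded")
```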
## Running the Pipeline

- Activate your environment:

  ```bash
  source .venv/bin/activate
  ```

- Run the main analysis:

  ```bash
  python main.py
  ```

  This will:
  - Generate image embeddings
  - Save `embeddings.csv` and `file_list.csv`
  - Run compartment analysis and save results in `data/compartment_results/`

- Customize the analysis:
  - Edit `main.py` to change compartments, output directories, or embedding paths as needed (a hypothetical configuration sketch follows below).
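As a hypothetical illustration of the settings you might edit, the top of `main.py` could contain constants along these lines (names are illustrative, not the actual variables):

```python
# Hypothetical configuration block (illustrative names only; check
# main.py for the actual variables).
COMPARTMENTS = ["mitochondria", "plasma_membrane", "lysosome"]  # compartments to analyze
EMBEDDINGS_PATH = "embeddings.csv"        # pre-computed image embeddings
FILE_LIST_PATH = "file_list.csv"          # matching image file paths
OUTPUT_DIR = "data/compartment_results/"  # where reports are written
```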
## Outputs & Results

- `embeddings.csv`: Image embeddings for all processed images
- `file_list.csv`: List of image file paths
- `data/compartment_results/`: Per-compartment reports, classification metrics, and summary tables
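A quick way to sanity-check these outputs is to load them with pandas; this sketch assumes default CSV formatting, and the exact columns depend on the pipeline run.

```python
# Sanity-check the pipeline outputs (exact columns depend on the run).
import pandas as pd

embeddings = pd.read_csv("embeddings.csv")
file_list = pd.read_csv("file_list.csv")

print(embeddings.shape)   # rows = images, columns = embedding dimensions
print(file_list.head())   # first few image paths
```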
## Troubleshooting & Tips

- Resuming downloads: The shell script skips files that already exist and only counts valid images.
- Session persistence: For long downloads, use `screen` or `tmux` to avoid losing progress if your terminal disconnects.
- Missing dependencies: Ensure all packages declared in `pyproject.toml` are installed (`uv pip install .`).
- Custom data: Update paths in `main.py` and `src/data/create_embeddings.py` to match your data locations.
## Contact

For questions or support, please open an issue or contact the project maintainer.

The project is organized according to the cookiecutter machine learning template.