- Have you ever been awed by the vastness of the universe, the formation of stars, the Pillars of Creation, black holes, or Earth-like exoplanets beyond our solar system?
- This project is a step toward open science discovery: it aims to make exoplanet data clearer and more accessible across the scientific ecosystem, and to deliver trained AI and ML models on exoplanet datasets efficiently so that answers arrive faster.
- AstroFIL demonstrates an end-to-end workflow for building an ML model that classifies exoplanets using data from the NASA Exoplanet Archive. It integrates decentralized storage via Lighthouse (a Filecoin-based storage solution) to store datasets, trained models, and metadata. The project covers data retrieval, preprocessing, model training, decentralized storage, and inference in a single streamlined pipeline.
- It fosters decentralized security, real-time adaptability, efficiency, and scientific collaboration.
Preview: YouTube link
- Fetch Real-Time Scientific Papers: Queries the latest arXiv astro-ph abstracts and extracts scientific keywords using BERT-based named-entity recognition (NER).
- Generate Dynamic Dataset: Retrieves exoplanet data from NASA, synthesizes negative samples for classification, and labels accordingly.
- Train ML Model: Uses a Random Forest classifier to learn exoplanet classification based on four physical features.
- Store on Lighthouse/Filecoin: Uploads the dataset, model (`.joblib`), and metadata (`.json`) to Lighthouse Storage and returns IPFS CIDs.
- Inference + Decentralized Retrieval: The model is reloaded from its CID, and predictions are made on test data.
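
The dataset-generation step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `pscomppars` table, the four feature columns (`pl_rade`, `pl_bmasse`, `pl_orbper`, `pl_eqt`), the helper names, and the uniform sampling used for negative samples are all assumptions, since the README only says "four physical features".

```python
import random
from urllib.parse import urlencode

TAP_URL = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync"
# Assumed feature columns: radius, mass, orbital period, equilibrium temperature.
FEATURES = ["pl_rade", "pl_bmasse", "pl_orbper", "pl_eqt"]


def build_query_url(limit=500):
    """Build a NASA Exoplanet Archive TAP URL that returns CSV rows."""
    columns = ", ".join(FEATURES)
    not_null = " and ".join(f"{name} is not null" for name in FEATURES)
    query = f"select top {limit} {columns} from pscomppars where {not_null}"
    return TAP_URL + "?" + urlencode({"query": query, "format": "csv"})


def synthesize_negatives(positives, seed=0):
    """Create one synthetic label-0 row per confirmed planet.

    Hypothetical scheme: each feature is drawn independently and uniformly
    from the range observed in the confirmed planets.
    """
    rng = random.Random(seed)
    ranges = {
        name: (min(p[name] for p in positives), max(p[name] for p in positives))
        for name in FEATURES
    }
    negatives = []
    for _ in positives:
        row = {name: rng.uniform(*ranges[name]) for name in FEATURES}
        row["label"] = 0
        negatives.append(row)
    return negatives


def label_dataset(positives):
    """Label archive rows as confirmed (1) and append synthetic negatives (0)."""
    confirmed = [{**p, "label": 1} for p in positives]
    return confirmed + synthesize_negatives(positives)
```

Sampling each feature independently breaks the correlations present in real planets, which is one simple way to give the classifier something to separate; the real project may synthesize negatives differently.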
- 🔭 Data Source: NASA Exoplanet Archive API.
- 🧠 ML Model: Random Forest Classifier (scikit-learn).
- 📦 Decentralized Storage: Lighthouse + Filecoin/IPFS.
- 🧪 Inference Ready: Demonstrates real-time sample classification.
- 🔐 Robust Handling: Upload, download, and failure-safe CID operations.
- ♻️ Temp Management: Efficient tempfile cleanup.
- 📰 NER on ArXiv Abstracts: Keyword extraction from latest papers.
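
The "Robust Handling" and "Temp Management" features could look something like the sketch below. The helper names `temp_artifact` and `safe_cid_operation` are hypothetical and do not come from the project; the sketch only shows one common way to guarantee temp-file cleanup and to turn storage failures into a `None` result.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_artifact(suffix):
    """Yield a temp file path and always delete the file afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; callers reopen it themselves
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)


def safe_cid_operation(operation, *args):
    """Run an upload/download callable; return None instead of raising."""
    try:
        return operation(*args)
    except Exception as exc:  # broad on purpose: network, auth, or I/O errors
        print(f"Storage operation failed: {exc}")
        return None
```

A caller would wrap each upload or download, for example `with temp_artifact(".joblib") as path: ...`, so the file is removed even if the storage call fails partway through.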
Before running the project, ensure you have:
- Python 3.8+ installed.
- A Lighthouse API key (sign up at Lighthouse Storage).
- Fork and clone the repository:
  `git clone <repository-url>`
  `cd astro_fil`
- Create a virtual environment:
  `python -m venv venv`
  `source venv/bin/activate` (on Windows: `venv\Scripts\activate`)
- Install the required dependencies listed in `requirements.txt` (pandas, numpy, scikit-learn, joblib, requests, lighthouseweb3, python-dotenv, feedparser, transformers, torch, torchvision, torchaudio):
  `pip install -r requirements.txt`
- Set up the Lighthouse API key:
  - Obtain an API key from Lighthouse Storage.
  - Update the `API_KEY` variable in the script: `API_KEY = "your-lighthouse-api-key"`
- Run the script:
  `python -m astrofil`
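
As an alternative to hard-coding the key, it could be read from the environment. The sketch below is an assumption, not part of the project: the variable name `LIGHTHOUSE_API_KEY` and the helper `get_api_key` are hypothetical. Because `python-dotenv` is already in the requirements, calling its `load_dotenv()` first would let the same code pick the key up from a `.env` file.

```python
import os


def get_api_key(env_var="LIGHTHOUSE_API_KEY"):
    """Return the Lighthouse API key from the environment, or fail loudly."""
    key = os.environ.get(env_var, "").strip()
    # Reject both a missing key and the unedited placeholder value.
    if not key or key == "your-lighthouse-api-key":
        raise RuntimeError(f"Set {env_var} to a valid Lighthouse API key")
    return key
```

This keeps the key out of version control, which matters if the fork is public.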
The project consists of the following key functions:
- `create_sample_dataset()`: Downloads a subset of exoplanet data from NASA and labels it as "confirmed."
- `upload_to_lighthouse()`: Uploads a file to Lighthouse Storage and returns its CID.
- `download_from_lighthouse()`: Downloads a file from Lighthouse using its CID.
- `fetch_arxiv_astro_papers()`: Gets abstracts from arXiv (astro-ph).
- `extract_keywords()`: Uses BERT NER to extract topic keywords.
- `train_model()`: Trains a Random Forest Classifier on the dataset and evaluates accuracy.
- `main()`: Orchestrates the workflow: dataset creation, training, storage, and inference.
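
To make the training step concrete, here is a minimal sketch of what a function like `train_model()` might do. The signature, the 80/20 split, the hyperparameters, and the toy data are assumptions for illustration; the project's real function may differ. The last lines show the `.joblib` round trip that precedes an upload.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_model(X, y, seed=42):
    """Fit a Random Forest and report accuracy on a held-out 20% split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


# Toy data standing in for the real dataset: 4 features, binary label.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, scale=0.5, size=(100, 4))
X_neg = rng.normal(loc=-1.0, scale=0.5, size=(100, 4))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 100 + [0] * 100)

model, accuracy = train_model(X, y)
print(f"Held-out accuracy: {accuracy:.2f}")

# Persist and reload the model, as would happen before and after storage.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)
```

Fixing `random_state` keeps the split and the forest reproducible, which helps when the stored model and its metadata need to be compared across runs.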
- `pandas`, `numpy`, `scikit-learn`, `joblib`: ML pipeline
- `requests`, `feedparser`: Data fetching
- `lighthouseweb3`: IPFS/Filecoin storage
- `transformers`, `pipeline`: Keyword extraction (NER)
- ArXiv paper titles + abstracts ⟶ Keywords
- Keywords drive context, tracked with dataset & metadata
- NASA exoplanet dataset ⟶ Classifier ⟶ Decentralized upload
- Run inference on downloaded model + test data
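
The first two steps of this flow can be sketched as below. Several things here are assumptions: the project uses `feedparser`, whereas this sketch swaps in the standard library's `xml.etree.ElementTree` to stay dependency-free; the `astro-ph.EP` category narrows the README's "astro-ph"; and the metadata field names are hypothetical. Keyword extraction itself (BERT NER) is left out, so `keywords` is passed in as a plain list.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM = "{http://www.w3.org/2005/Atom}"  # namespace used by arXiv's Atom feed


def arxiv_query_url(max_results=5):
    """Build an arXiv API URL for the most recent papers in one category."""
    params = {
        "search_query": "cat:astro-ph.EP",  # assumed category
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


def parse_abstracts(atom_xml):
    """Extract (title, abstract) pairs from an arXiv Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        # Collapse the line breaks and runs of spaces arXiv leaves in text.
        title = " ".join(entry.findtext(f"{ATOM}title", "").split())
        summary = " ".join(entry.findtext(f"{ATOM}summary", "").split())
        papers.append((title, summary))
    return papers


def build_metadata(keywords, dataset_cid, model_cid):
    """Bundle keywords and CIDs into a JSON string (hypothetical schema)."""
    record = {
        "keywords": sorted(set(keywords)),
        "dataset_cid": dataset_cid,
        "model_cid": model_cid,
    }
    return json.dumps(record, indent=2)
```

Storing the keywords next to both CIDs is what lets a later reader see which research context a given dataset and model were produced under.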
- 🌐 Multi-label: Classify gas giants, terrestrials, and neutron stars.
- 🌌 Expand features: Add stellar eccentricity, distance, and magnitude.
- 🔄 Label diversity: Add real-world unconfirmed objects.
- 🧪 AutoML: Try XGBoost or GridSearch tuning.
- 🕸️ IPFS-based UI: Build browser-based querying via CID.
Built with curiosity and cosmos in mind. Explore decentralized space research with AstroFIL 🌠😊😍