feat(models): integrate OLMoEarth embedding model #2
Open
go-bananas-wwj wants to merge 18 commits into OpenGeoScope:main from
- OlmoEarthModel wrapper with MEAN pooling and L2-normalized cosine search
- Band reordering for MajorTOM [B01..B12] -> OLMoEarth format
- GeoTIFF path input support (single-file and directory-of-bands)
- Register OLMoEarth in ModelManager and generate_embeddings.py MODEL_MAP
- Add olmoearth-pretrain-minimal to requirements.txt

- OLMoEarth self-generated embeddings lack parquet_url/parquet_row columns
- _fetch_top_k_images now falls back to any model with valid metadata
- Map initialization and download_image_by_location iterate all models
- Ensures Image Search and Location Search work even when the active model has no embedded download metadata

- INTER_NEAREST produces artifacts on 12-band Sentinel-2 data
- INTER_CUBIC preserves spectral fidelity needed by OLMoEarth encoder

- embeddings_*/ (generated embedding directories)
- *.pkl, *.npy (cache and temporary arrays)
- olmoearth_*.png (test visualization outputs)

- Update modality tables: 4 -> 5 models across README, doc, CONTRIBUTING
- Add OLMoEarth row with Sentinel-2 + derived maps training data
- Add OLMoEarth embedding dataset link (WeijieWu/olmoearth_embdding)
- Add torch>=2.8 compatibility note in Quick Start
- Add OLMoEarth arXiv citation (Herzog et al., 2025)

…and fix warnings
- Remove unused shapely.geometry.box import
- Fix import ordering in generate_embeddings.py
- Replace unused variable 'c' with '_'
- Use iterable unpacking for columns list
- Apply ruff format to match project style
feat(models): integrate OLMoEarth embedding model
Overview
This PR adds OLMoEarth (Allen AI's Earth-system foundation model) as the 5th embedding model to EarthEmbeddingExplorer, alongside SigLIP, FarSLIP, SatCLIP, and DINOv2.
OLMoEarth is a multimodal, spatio-temporal foundation model trained on Sentinel-2 L2A and 6 derived geospatial maps (OpenStreetMap, WorldCover, SRTM DEM, etc.). It excels at capturing spectral and spatial patterns from 12-band multispectral imagery, making it a strong candidate for pure visual similarity search in remote-sensing applications.
Paper: Herzog et al., 2025 — OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
What Changed
| Commit | Files | Description |
| --- | --- | --- |
| feat(models) | `models/olmoearth_model.py`, `core/model_manager.py`, `generate_embeddings.py`, `requirements.txt` | … |
| fix(search_engine, ui) | `core/search_engine.py`, `ui/callbacks.py` | … `df_embed` when active model lacks `parquet_url` metadata |
| fix(majortom) | `MajorTOM/embedder/MajorTOM_Embedder.py` | … `INTER_CUBIC` instead of `INTER_NEAREST` for 12-band image quality |
| docs(readme, doc, contributing) | `README.md`, `doc.md`, `CONTRIBUTING.md` | … |
| chore(gitignore) | `.gitignore` | … (`embeddings_*/`, `*.pkl`, `*.npy`) |
| style(...) | `core/search_engine.py`, `generate_embeddings.py`, `ui/callbacks.py` | … |
Key Technical Details
1. Band Reordering
MajorTOM stores bands as `[B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B11, B12]`, but OLMoEarth expects `[B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, B09]`. The wrapper handles this transparently in `_prepare_input()`.
2. Pooling Strategy
We use `PoolingType.MEAN` (matching the official `allenai/olmoearth_ml4rs_tutorial`) rather than MAX pooling. Raw embeddings have an L2 norm of ~10.3 and are not normalized during storage/encoding, to stay consistent with the official tutorial.
3. Retrieval Normalization
To avoid polar clustering (polar embeddings have systematically higher norms, ~10.5 vs ~10.2), we apply `F.normalize()` only inside `search()`, converting dot products to cosine similarity in `[-1, 1]`.
4. Metadata Gap Workaround
Our self-generated OLMoEarth embeddings lack `parquet_url` and `parquet_row` columns. We added fallback logic so that `_fetch_top_k_images()`, `get_initial_plot()`, and `download_image_by_location()` can borrow metadata from any other loaded model. All 5 models share the same 248,719 samples in identical order (verified by `product_id`).
Environment Compatibility
`olmoearth-pretrain-minimal` requires `torch >= 2.8, < 2.9`. For users with older PyTorch versions, we recommend a dedicated conda environment.
Verification
- `ruff check .`: all checks passed
- `ruff format . --check`: all formatted
- `python -c "import app"`: all 5 models load successfully (248,719 records each)
Dataset
- `WeijieWu/olmoearth_embdding` (824 MB, 248,719 rows, 768-dim float32)
- `Major-TOM/Core-S2L2A-249k` (same as other 4 models)
Checklist
- … (`encode_image`, `search`)
- … `core/model_manager.py` and `generate_embeddings.py`
- … `requirements.txt`
- … `README.md`, `doc.md`, and `CONTRIBUTING.md`
- … `/workspace/EEE/embeddings/`)
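Illustrative sketch
For reviewers, items 1 and 3 under Key Technical Details can be illustrated with a minimal NumPy sketch. The two band lists are copied from the description above. Everything else is an assumption made for illustration: the helper names `reorder_bands`, `l2_normalize`, and `cosine_top_k` are hypothetical, the channel-last `(H, W, 12)` layout is assumed, and NumPy stands in for `F.normalize()`. This is not the code in `models/olmoearth_model.py`.

```python
import numpy as np

# Band orders as quoted in the PR description.
MAJORTOM_BANDS = ["B01", "B02", "B03", "B04", "B05", "B06",
                  "B07", "B08", "B8A", "B09", "B11", "B12"]
OLMOEARTH_BANDS = ["B02", "B03", "B04", "B08", "B05", "B06",
                   "B07", "B8A", "B11", "B12", "B01", "B09"]

# Position of each OLMoEarth band inside the MajorTOM stack.
REORDER_IDX = [MAJORTOM_BANDS.index(band) for band in OLMOEARTH_BANDS]


def reorder_bands(image):
    """Hypothetical helper: permute a channel-last (..., 12) MajorTOM array
    into the OLMoEarth band order. The channel-last layout is an assumption."""
    image = np.asarray(image)
    if image.shape[-1] != len(MAJORTOM_BANDS):
        raise ValueError(f"expected 12 bands, got {image.shape[-1]}")
    return image[..., REORDER_IDX]


def l2_normalize(x, eps=1e-12):
    """NumPy stand-in for F.normalize(x, dim=-1): divide by max(norm, eps)."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def cosine_top_k(query, embeddings, k=5):
    """Hypothetical search: stored embeddings stay raw; both sides are
    normalized only at query time, so scores are cosine similarities."""
    q = l2_normalize(query)       # shape (D,)
    e = l2_normalize(embeddings)  # shape (N, D)
    scores = e @ q                # shape (N,), values in [-1, 1]
    k = min(k, scores.shape[0])
    top = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return top, scores[top]
```

Normalizing at query time rather than at storage time keeps the stored vectors identical to the raw encoder output, while removing the norm difference (~10.5 vs ~10.2) from the ranking.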