| 1 | +# GitHub Copilot Instructions for GeoTessera |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +GeoTessera is a Python library for accessing and working with Tessera geospatial foundation model embeddings. The library provides tools to download, process, and visualize satellite imagery embeddings from Sentinel-1 and Sentinel-2 data at 10m resolution. |
| 6 | + |
| 7 | +### Core Architecture |
| 8 | + |
| 9 | +- **Two-step workflow**: Retrieve embeddings (numpy arrays) → Export to desired format (GeoTIFF/zarr) |
| 10 | +- **Registry system**: Parquet-based metadata registry for efficient tile lookup |
| 11 | +- **0.1-degree grid**: Tiles cover ~11km × 11km, named by center coordinates |
| 12 | +- **Direct HTTP downloads**: On-demand tile fetching with automatic cleanup |
| 13 | +- **Hash verification**: SHA256 checksums ensure data integrity by default |
| 14 | + |
| 15 | +## Technology Stack |
| 16 | + |
| 17 | +### Core Dependencies |
| 18 | + |
| 19 | +- **Python**: 3.11 or later (CI tests 3.11, 3.12, and 3.13)
| 20 | +- **CLI Framework**: `typer` with `rich` for interactive output |
| 21 | +- **Geospatial**: `rasterio`, `geopandas`, `rioxarray` for GIS operations |
| 22 | +- **Data Processing**: `numpy`, `pandas`, `pyarrow` (for Parquet registry) |
| 23 | +- **Visualization**: `matplotlib`, `scikit-learn` (PCA), `scikit-image` |
| 24 | +- **Storage**: `zarr`, `xarray`, `dask` for chunked data handling |
| 25 | + |
| 26 | +### Build System |
| 27 | + |
| 28 | +- **Package Manager**: Uses `uv` for dependency management (preferred) or `pip` |
| 29 | +- **Configuration**: `pyproject.toml` with setuptools backend |
| 30 | +- **Lock File**: `uv.lock` for reproducible builds |
| 31 | +- **Test Framework**: `cram` (shell-based functional testing) |
| 32 | +- **Linting**: `ruff` for code quality |
| 33 | + |
| 34 | +## Coding Standards |
| 35 | + |
| 36 | +### Python Style |
| 37 | + |
| 38 | +- Follow PEP 8 conventions |
| 39 | +- Use type hints for function signatures (e.g., `Optional[str]`, `Path`, etc.) |
| 40 | +- Use `typing_extensions.Annotated` for CLI argument annotations with `typer` |
| 41 | +- Prefer pathlib `Path` over string paths |
| 42 | +- Use f-strings for string formatting |
| 43 | + |
| 44 | +### Code Organization |
| 45 | + |
| 46 | +``` |
| 47 | +geotessera/ |
| 48 | +├── __init__.py # Package exports and version |
| 49 | +├── core.py # Main GeoTessera class |
| 50 | +├── registry.py # Parquet registry management |
| 51 | +├── cli.py # Main CLI commands |
| 52 | +├── registry_cli.py # Registry-specific CLI |
| 53 | +├── tiles.py # Tile operations |
| 54 | +├── visualization.py # Visualization functions |
| 55 | +├── web.py # Web map generation |
| 56 | +├── country.py # Country bounding box utilities |
| 57 | +└── progress.py # Progress tracking utilities |
| 58 | +``` |
| 59 | + |
| 60 | +### Key Patterns |
| 61 | + |
| 62 | +1. **Rich Console Output**: Use `rich.console.Console` and `rich.progress.Progress` for user-facing output |
| 63 | +2. **Logging**: Configure with `rich.logging.RichHandler` for pretty logs |
| 64 | +3. **Temporary Files**: Use `tempfile` for intermediate data so that it is cleaned up automatically
| 65 | +4. **Error Handling**: Provide clear error messages with context |
| 66 | +5. **CLI Structure**: Commands are typer apps with descriptive help text |
| 67 | + |
| 68 | +## Testing Guidelines |
| 69 | + |
| 70 | +### Test Framework: Cram |
| 71 | + |
| 72 | +Tests are written in `.t` files using cram (shell-based testing): |
| 73 | + |
| 74 | +```bash |
| 75 | +# Example test structure |
| 76 | +Setup environment: |
| 77 | + $ export TERM=dumb |
| 78 | + $ export TESTDIR="$CRAMTMP/test_outputs" |
| 79 | + |
| 80 | +Run command and check output: |
| 81 | + $ geotessera version |
| 82 | + [version number] |
| 83 | +``` |
| 84 | + |
| 85 | +### Test Structure |
| 86 | + |
| 87 | +- `tests/cli.t` - CLI command tests |
| 88 | +- `tests/hash.t` - Hash verification tests |
| 89 | +- `tests/viz.t` - Visualization tests |
| 90 | +- `tests/zarr.t` - Zarr format tests |
| 91 | + |
| 92 | +### Running Tests |
| 93 | + |
| 94 | +```bash |
| 95 | +# Run all tests |
| 96 | +env TERM=dumb TTY_INTERACTIVE=0 uv run cram tests -v |
| 97 | + |
| 98 | +# Run specific test file |
| 99 | +env TERM=dumb TTY_INTERACTIVE=0 uv run cram tests/cli.t -v |
| 100 | +``` |
| 101 | + |
| 102 | +### Testing Best Practices |
| 103 | + |
| 104 | +- Set `TERM=dumb` to disable ANSI output in tests |
| 105 | +- Use `$CRAMTMP` for temporary test data |
| 106 | +- Override `XDG_CACHE_HOME` for isolated caching |
| 107 | +- Check command exit codes and output patterns |
| 108 | +- Test both success and error cases |
| 109 | + |
| 110 | +## Build and Development Workflow |
| 111 | + |
| 112 | +### Initial Setup |
| 113 | + |
| 114 | +```bash |
| 115 | +# Clone repository |
| 116 | +git clone https://github.com/ucam-eo/geotessera |
| 117 | +cd geotessera |
| 118 | + |
| 119 | +# Install with uv (preferred) |
| 120 | +uv sync --locked --all-extras --dev |
| 121 | + |
| 122 | +# Or with pip |
| 123 | +pip install -e . |
| 124 | +``` |
| 125 | + |
| 126 | +### Development Commands |
| 127 | + |
| 128 | +```bash |
| 129 | +# Run tests |
| 130 | +env TERM=dumb TTY_INTERACTIVE=0 uv run cram tests -v |
| 131 | + |
| 132 | +# Run CLI locally |
| 133 | +uv run -m geotessera.cli --help |
| 134 | +python -m geotessera.cli --help |
| 135 | + |
| 136 | +# Lint and format code
| 137 | +ruff check . |
| 138 | +ruff format . |
| 139 | +``` |
| 140 | + |
| 141 | +### CI/CD |
| 142 | + |
| 143 | +- GitHub Actions workflow: `.github/workflows/ci.yml` |
| 144 | +- Multi-platform testing: Ubuntu, macOS (Intel & Apple Silicon) |
| 145 | +- Python versions: 3.11, 3.12, 3.13 |
| 146 | +- Dependencies: GDAL must be installed before Python packages |
| 147 | +- Tests run with `uv run cram tests -v` |
| 148 | + |
| 149 | +## Key Concepts to Remember |
| 150 | + |
| 151 | +### Coordinate System |
| 152 | + |
| 153 | +- Tiles use WGS84 coordinates (longitude, latitude) |
| 154 | +- Tile naming: `grid_{lon}_{lat}` (e.g., `grid_0.15_52.05`) |
| 155 | +- Bounding box format: `(min_lon, min_lat, max_lon, max_lat)` |
| 156 | +- GeoTIFF exports use UTM projection from landmask tiles |
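
The naming rule above can be sketched as a small helper. This is an illustration only: `tile_name` is a hypothetical function, not part of the GeoTessera API, and it assumes that tile centers sit at `x.x5` offsets on the 0.1-degree grid (inferred from the `grid_0.15_52.05` example) and that names always use two decimal places.

```python
import math


def tile_name(lon: float, lat: float) -> str:
    """Return the assumed name of the 0.1-degree tile containing (lon, lat).

    Hypothetical helper: assumes centers at x.x5 offsets and two-decimal names.
    """
    # Work in tenths of a degree so the floor picks the tile's lower edge,
    # then add half a tile (0.05 degrees) to reach the assumed center.
    lon_center = (math.floor(lon * 10) + 0.5) / 10
    lat_center = (math.floor(lat * 10) + 0.5) / 10
    return f"grid_{lon_center:.2f}_{lat_center:.2f}"


# A point anywhere inside a tile maps to that tile's center-based name.
print(tile_name(0.12, 52.09))
```

Points that fall exactly on a tile edge are ambiguous under this sketch; the library's own lookup should be treated as authoritative.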
| 157 | + |
| 158 | +### Data Files |
| 159 | + |
| 160 | +1. **Embeddings**: `grid_0.15_52.05.npy` - int8 quantized arrays (H×W×128) |
| 161 | +2. **Scales**: `grid_0.15_52.05_scales.npy` - float32 scale factors |
| 162 | +3. **Landmasks**: `grid_0.15_52.05.tiff` - UTM projection + land/water masks |
| 163 | +4. **Registry**: `registry.parquet` - Parquet metadata with tile locations & hashes |
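
To show how the embedding and scale files relate, here is a minimal dequantization sketch on synthetic data. It assumes the scales are stored per pixel with shape (H, W) and that the original values are recovered by multiplying the int8 array by its scale; the actual layout and formula used by GeoTessera may differ. `dequantize` is a hypothetical helper, not a library function.

```python
import numpy as np


def dequantize(embedding: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Multiply int8 embeddings by float32 scales (assumed per-pixel scales)."""
    # Assumption: embedding is (H, W, C) and scales is (H, W), so a trailing
    # axis is added to broadcast each pixel's scale across its channels.
    if scales.ndim == embedding.ndim - 1:
        scales = scales[..., np.newaxis]
    return embedding.astype(np.float32) * scales


# Synthetic stand-ins for the contents of the two .npy files.
rng = np.random.default_rng(0)
quantized = rng.integers(-128, 128, size=(4, 5, 128), dtype=np.int8)
scales = rng.random((4, 5), dtype=np.float32)

restored = dequantize(quantized, scales)
print(restored.shape, restored.dtype)
```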
| 164 | + |
| 165 | +### Hash Verification |
| 166 | + |
| 167 | +- Enabled by default for security |
| 168 | +- Verifies embedding, scale, and landmask files |
| 169 | +- Can be disabled: `--skip-hash` flag or `GEOTESSERA_SKIP_HASH=1` |
| 170 | +- Use `verify_hashes=False` parameter in Python API |
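
For orientation, checksum verification of a downloaded file can be sketched with the standard library alone. `sha256_of` and `verify_file` are hypothetical names used for illustration; this is not how GeoTessera necessarily implements its check, only the general technique.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large tiles are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hex: str) -> None:
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(
            f"SHA256 mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```

Raising on mismatch, rather than returning a boolean, keeps a corrupted download from being used silently.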
| 171 | + |
| 172 | +### Cache Behavior |
| 173 | + |
| 174 | +- Only `registry.parquet` is cached (a few MB)
| 175 | +- Embedding/landmask tiles downloaded to temp files, cleaned up immediately |
| 176 | +- Cache location: `~/.cache/geotessera` (Linux/macOS) or `%LOCALAPPDATA%/geotessera` (Windows) |
| 177 | +- Override with `--cache-dir` or `cache_dir` parameter |
| 178 | + |
| 179 | +## Common Operations |
| 180 | + |
| 181 | +### Adding a New CLI Command |
| 182 | + |
| 183 | +1. Add command function to `cli.py` with `@app.command()` decorator |
| 184 | +2. Use type hints with `typer.Option()` or `typer.Argument()` |
| 185 | +3. Add docstring for help text |
| 186 | +4. Use `rich.console.Console` for output |
| 187 | +5. Add test in appropriate `.t` file |
| 188 | + |
| 189 | +### Working with Embeddings |
| 190 | + |
| 191 | +```python |
| 192 | +# Fetch a single tile (gt is a GeoTessera instance)
| 193 | +embedding, crs, transform = gt.fetch_embedding(lon=0.15, lat=52.05, year=2024) |
| 194 | + |
| 195 | +# Fetch a region (bbox is (min_lon, min_lat, max_lon, max_lat))
| 196 | +tiles = gt.registry.load_blocks_for_region(bounds=bbox, year=2024) |
| 197 | +embeddings = gt.fetch_embeddings(tiles) |
| 198 | + |
| 199 | +# Export as GeoTIFF |
| 200 | +files = gt.export_embedding_geotiffs(bbox=bbox, output_dir="./output", year=2024) |
| 201 | +``` |
| 202 | + |
| 203 | +### Registry Operations |
| 204 | + |
| 205 | +```python |
| 206 | +# Initialize with custom registry |
| 207 | +gt = GeoTessera(registry_path="/path/to/registry.parquet") |
| 208 | + |
| 209 | +# Query available tiles |
| 210 | +tiles = gt.registry.load_blocks_for_region(bounds=bbox, year=2024) |
| 211 | + |
| 212 | +# Disable hash verification |
| 213 | +gt = GeoTessera(verify_hashes=False) |
| 214 | +``` |
| 215 | + |
| 216 | +## Documentation |
| 217 | + |
| 218 | +- Main docs: `README.md` - comprehensive usage guide |
| 219 | +- API docs: Built with Sphinx (see `docs/` directory) |
| 220 | +- Hosted at: https://geotessera.readthedocs.io |
| 221 | +- Changelog: `CHANGES.md` |
| 222 | + |
| 223 | +## Contributing |
| 224 | + |
| 225 | +When making changes: |
| 226 | + |
| 227 | +1. Keep modifications minimal and focused |
| 228 | +2. Follow existing code patterns and style |
| 229 | +3. Add tests for new functionality |
| 230 | +4. Update documentation if needed |
| 231 | +5. Ensure CI passes (build + tests on all platforms) |
| 232 | +6. Test with multiple Python versions if possible |
| 233 | + |
| 234 | +## Important Notes |
| 235 | + |
| 236 | +- **GDAL dependency**: Must be installed system-wide before Python packages |
| 237 | +- **Rich output**: Disable with `TERM=dumb` for non-interactive environments |
| 238 | +- **Registry updates**: New tiles require registry regeneration |
| 239 | +- **Embedding requests**: Users can request new geographic coverage via GitHub issues |
| 240 | +- **Minimal storage**: Only registry is cached; tiles are ephemeral |
| 241 | + |
| 242 | +## Version Information |
| 243 | + |
| 244 | +Version defined in `pyproject.toml` and exported from `geotessera/__init__.py`. |
| 245 | + |
| 246 | +## Links |
| 247 | + |
| 248 | +- Repository: https://github.com/ucam-eo/geotessera |
| 249 | +- PyPI: https://pypi.org/project/geotessera/ |
| 250 | +- Tessera Model: https://github.com/ucam-eo/tessera |
| 251 | +- Documentation: https://geotessera.readthedocs.io/ |
| 252 | +- Issue Tracker: https://github.com/ucam-eo/geotessera/issues |