Commit e955f72

Merge remote-tracking branch 'origin/main' into copilot/fix-tmp-tiles-discovery
2 parents 440e6d6 + 32171df commit e955f72

15 files changed, +1129 -315 lines changed

.github/copilot-instructions.md

Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
+# GitHub Copilot Instructions for GeoTessera
+
+## Project Overview
+
+GeoTessera is a Python library for accessing and working with Tessera geospatial foundation model embeddings. The library provides tools to download, process, and visualize satellite imagery embeddings from Sentinel-1 and Sentinel-2 data at 10m resolution.
+
+### Core Architecture
+
+- **Two-step workflow**: Retrieve embeddings (numpy arrays) → Export to desired format (GeoTIFF/zarr)
+- **Registry system**: Parquet-based metadata registry for efficient tile lookup
+- **0.1-degree grid**: Tiles cover ~11km × 11km, named by center coordinates
+- **Direct HTTP downloads**: On-demand tile fetching with automatic cleanup
+- **Hash verification**: SHA256 checksums ensure data integrity by default
+
+## Technology Stack
+
+### Core Dependencies
+
+- **Python**: 3.11, 3.12, or 3.13 required (3.11+ in general)
+- **CLI Framework**: `typer` with `rich` for interactive output
+- **Geospatial**: `rasterio`, `geopandas`, `rioxarray` for GIS operations
+- **Data Processing**: `numpy`, `pandas`, `pyarrow` (for Parquet registry)
+- **Visualization**: `matplotlib`, `scikit-learn` (PCA), `scikit-image`
+- **Storage**: `zarr`, `xarray`, `dask` for chunked data handling
+
+### Build System
+
+- **Package Manager**: Uses `uv` for dependency management (preferred) or `pip`
+- **Configuration**: `pyproject.toml` with setuptools backend
+- **Lock File**: `uv.lock` for reproducible builds
+- **Test Framework**: `cram` (shell-based functional testing)
+- **Linting**: `ruff` for code quality
+
+## Coding Standards
+
+### Python Style
+
+- Follow PEP 8 conventions
+- Use type hints for function signatures (e.g., `Optional[str]`, `Path`, etc.)
+- Use `typing_extensions.Annotated` for CLI argument annotations with `typer`
+- Prefer pathlib `Path` over string paths
+- Use f-strings for string formatting
+
+### Code Organization
+
+```
+geotessera/
+├── __init__.py          # Package exports and version
+├── core.py              # Main GeoTessera class
+├── registry.py          # Parquet registry management
+├── cli.py               # Main CLI commands
+├── registry_cli.py      # Registry-specific CLI
+├── tiles.py             # Tile operations
+├── visualization.py     # Visualization functions
+├── web.py               # Web map generation
+├── country.py           # Country bounding box utilities
+└── progress.py          # Progress tracking utilities
+```
+
+### Key Patterns
+
+1. **Rich Console Output**: Use `rich.console.Console` and `rich.progress.Progress` for user-facing output
+2. **Logging**: Configure with `rich.logging.RichHandler` for pretty logs
+3. **Temporary Files**: Use `tempfile` for intermediate data, clean up automatically
+4. **Error Handling**: Provide clear error messages with context
+5. **CLI Structure**: Commands are typer apps with descriptive help text
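
Reviewer sketch (not part of the commit): the temporary-file pattern from item 3 above could look roughly like this. It uses only the standard library; the helper name `stage_tile_bytes` and the file name `tile.npy` are hypothetical and do not come from GeoTessera.

```python
import tempfile
from pathlib import Path


def stage_tile_bytes(payload: bytes) -> int:
    """Hypothetical helper: stage bytes in a temp file and return their size.

    The temporary directory and everything inside it are removed when the
    `with` block exits, even if an exception is raised inside the block.
    """
    with tempfile.TemporaryDirectory(prefix="geotessera-") as tmp:
        tile_path = Path(tmp) / "tile.npy"  # illustrative file name
        tile_path.write_bytes(payload)
        size = tile_path.stat().st_size
    # The directory has been deleted here; only the computed value survives.
    return size
```

Tying cleanup to a context manager means no explicit `finally` block is needed around intermediate files.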
+
+## Testing Guidelines
+
+### Test Framework: Cram
+
+Tests are written in `.t` files using cram (shell-based testing):
+
+```bash
+# Example test structure
+Setup environment:
+  $ export TERM=dumb
+  $ export TESTDIR="$CRAMTMP/test_outputs"
+
+Run command and check output:
+  $ geotessera version
+  [version number]
+```
+
+### Test Structure
+
+- `tests/cli.t` - CLI command tests
+- `tests/hash.t` - Hash verification tests
+- `tests/viz.t` - Visualization tests
+- `tests/zarr.t` - Zarr format tests
+
+### Running Tests
+
+```bash
+# Run all tests
+env TERM=dumb TTY_INTERACTIVE=0 uv run cram tests -v
+
+# Run specific test file
+env TERM=dumb TTY_INTERACTIVE=0 uv run cram tests/cli.t -v
+```
+
+### Testing Best Practices
+
+- Set `TERM=dumb` to disable ANSI output in tests
+- Use `$CRAMTMP` for temporary test data
+- Override `XDG_CACHE_HOME` for isolated caching
+- Check command exit codes and output patterns
+- Test both success and error cases
+
+## Build and Development Workflow
+
+### Initial Setup
+
+```bash
+# Clone repository
+git clone https://github.com/ucam-eo/geotessera
+cd geotessera
+
+# Install with uv (preferred)
+uv sync --locked --all-extras --dev
+
+# Or with pip
+pip install -e .
+```
+
+### Development Commands
+
+```bash
+# Run tests
+env TERM=dumb TTY_INTERACTIVE=0 uv run cram tests -v
+
+# Run CLI locally
+uv run -m geotessera.cli --help
+python -m geotessera.cli --help
+
+# Lint code
+ruff check .
+ruff format .
+```
+
+### CI/CD
+
+- GitHub Actions workflow: `.github/workflows/ci.yml`
+- Multi-platform testing: Ubuntu, macOS (Intel & Apple Silicon)
+- Python versions: 3.11, 3.12, 3.13
+- Dependencies: GDAL must be installed before Python packages
+- Tests run with `uv run cram tests -v`
+
+## Key Concepts to Remember
+
+### Coordinate System
+
+- Tiles use WGS84 coordinates (longitude, latitude)
+- Tile naming: `grid_{lon}_{lat}` (e.g., `grid_0.15_52.05`)
+- Bounding box format: `(min_lon, min_lat, max_lon, max_lat)`
+- GeoTIFF exports use UTM projection from landmask tiles
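
Reviewer sketch (not part of the commit): the tile-naming convention above can be illustrated with a small function. It assumes that tile edges fall on multiples of 0.1 degree, so that centers sit at x.x5 as in the `grid_0.15_52.05` example, and that names use two decimal places. The function `tile_name` is hypothetical, not a GeoTessera API.

```python
import math


def tile_name(lon: float, lat: float) -> str:
    """Return the name of the 0.1-degree tile containing (lon, lat).

    Assumption: tile edges are multiples of 0.1 degree, so the center is
    the lower edge plus 0.05 (consistent with `grid_0.15_52.05`).
    """
    center_lon = math.floor(lon * 10) / 10 + 0.05
    center_lat = math.floor(lat * 10) / 10 + 0.05
    # Two decimals hide the tiny floating-point error in the sums above.
    return f"grid_{center_lon:.2f}_{center_lat:.2f}"


print(tile_name(0.17, 52.03))
```

Using `math.floor` rather than `int()` keeps the behaviour correct for negative longitudes and latitudes, where truncation would round toward zero.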
+
+### Data Files
+
+1. **Embeddings**: `grid_0.15_52.05.npy` - int8 quantized arrays (H×W×128)
+2. **Scales**: `grid_0.15_52.05_scales.npy` - float32 scale factors
+3. **Landmasks**: `grid_0.15_52.05.tiff` - UTM projection + land/water masks
+4. **Registry**: `registry.parquet` - Parquet metadata with tile locations & hashes
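
Reviewer sketch (not part of the commit): how an int8 embedding and its float32 scales might be combined. The source only states the dtypes and the H×W×128 embedding shape; the per-pixel scale shape `(H, W)` and the plain multiplication are assumptions here, and `dequantize` is a hypothetical helper rather than a GeoTessera function.

```python
import numpy as np


def dequantize(embedding_i8: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Convert int8 embeddings (H, W, C) to float32.

    Assumption: `scales` holds one float32 factor per pixel, shape (H, W),
    and the float value is simply int8_value * scale.
    """
    if scales.shape != embedding_i8.shape[:2]:
        raise ValueError(
            f"scale shape {scales.shape} does not match "
            f"embedding spatial shape {embedding_i8.shape[:2]}"
        )
    # Add a trailing axis so each pixel's scale broadcasts over its channels.
    return embedding_i8.astype(np.float32) * scales[..., np.newaxis]
```

If the real scale files use a different layout (for example one factor per channel), only the broadcasting line would need to change.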
+
+### Hash Verification
+
+- Enabled by default for security
+- Verifies embedding, scale, and landmask files
+- Can be disabled: `--skip-hash` flag or `GEOTESSERA_SKIP_HASH=1`
+- Use `verify_hashes=False` parameter in Python API
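
Reviewer sketch (not part of the commit): a SHA256 check of a downloaded file against an expected hex digest, using only `hashlib`. The function names are hypothetical, and how GeoTessera stores or looks up expected digests is not shown in this diff.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large tiles need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hex: str) -> bool:
    """Return True when the file's SHA256 digest matches `expected_hex`."""
    return sha256_of_file(path) == expected_hex.lower()
```

Reading in chunks keeps memory use flat regardless of tile size.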
+
+### Cache Behavior
+
+- Only registry.parquet is cached (~few MB)
+- Embedding/landmask tiles downloaded to temp files, cleaned up immediately
+- Cache location: `~/.cache/geotessera` (Linux/macOS) or `%LOCALAPPDATA%/geotessera` (Windows)
+- Override with `--cache-dir` or `cache_dir` parameter
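
Reviewer sketch (not part of the commit): one way the documented cache locations could be resolved. The precedence order (explicit override, then `LOCALAPPDATA` on Windows, then `XDG_CACHE_HOME`, then `~/.cache`) is an assumption, and `resolve_cache_dir` is a hypothetical helper, not the library's implementation.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(
    cache_dir: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
    platform: str = sys.platform,
) -> Path:
    """Pick a cache directory following the locations listed above."""
    if cache_dir:  # explicit override, e.g. a --cache-dir value
        return Path(cache_dir)
    if platform == "win32" and env.get("LOCALAPPDATA"):
        return Path(env["LOCALAPPDATA"]) / "geotessera"
    # XDG_CACHE_HOME falls back to ~/.cache when unset or empty.
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "geotessera"
```

Passing `env` and `platform` as parameters makes the function easy to exercise in tests without touching the real environment.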
+
+## Common Operations
+
+### Adding a New CLI Command
+
+1. Add command function to `cli.py` with `@app.command()` decorator
+2. Use type hints with `typer.Option()` or `typer.Argument()`
+3. Add docstring for help text
+4. Use `rich.console.Console` for output
+5. Add test in appropriate `.t` file
+
+### Working with Embeddings
+
+```python
+# Fetch single tile
+embedding, crs, transform = gt.fetch_embedding(lon=0.15, lat=52.05, year=2024)
+
+# Fetch region
+tiles = gt.registry.load_blocks_for_region(bounds=bbox, year=2024)
+embeddings = gt.fetch_embeddings(tiles)
+
+# Export as GeoTIFF
+files = gt.export_embedding_geotiffs(bbox=bbox, output_dir="./output", year=2024)
+```
+
+### Registry Operations
+
+```python
+# Initialize with custom registry
+gt = GeoTessera(registry_path="/path/to/registry.parquet")
+
+# Query available tiles
+tiles = gt.registry.load_blocks_for_region(bounds=bbox, year=2024)
+
+# Disable hash verification
+gt = GeoTessera(verify_hashes=False)
+```
+
+## Documentation
+
+- Main docs: `README.md` - comprehensive usage guide
+- API docs: Built with Sphinx (see `docs/` directory)
+- Hosted at: https://geotessera.readthedocs.io
+- Changelog: `CHANGES.md`
+
+## Contributing
+
+When making changes:
+
+1. Keep modifications minimal and focused
+2. Follow existing code patterns and style
+3. Add tests for new functionality
+4. Update documentation if needed
+5. Ensure CI passes (build + tests on all platforms)
+6. Test with multiple Python versions if possible
+
+## Important Notes
+
+- **GDAL dependency**: Must be installed system-wide before Python packages
+- **Rich output**: Disable with `TERM=dumb` for non-interactive environments
+- **Registry updates**: New tiles require registry regeneration
+- **Embedding requests**: Users can request new geographic coverage via GitHub issues
+- **Minimal storage**: Only registry is cached; tiles are ephemeral
+
+## Version Information
+
+Version defined in `pyproject.toml` and exported from `geotessera/__init__.py`.
+
+## Links
+
+- Repository: https://github.com/ucam-eo/geotessera
+- PyPI: https://pypi.org/project/geotessera/
+- Tessera Model: https://github.com/ucam-eo/tessera
+- Documentation: https://geotessera.readthedocs.io/
+- Issue Tracker: https://github.com/ucam-eo/geotessera/issues

.github/workflows/conda.yml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+name: Build
+
+on: [push, pull_request]
+
+jobs:
+  build_wheels:
+    name: Build wheels on ${{ matrix.os }}
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [windows-latest]
+        python-version: [3.11, 3.12, 3.13]
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: conda-incubator/setup-miniconda@v3
+        with:
+          auto-update-conda: true
+          activate-environment: true
+          environment-file: environment.yml
+          channels: conda-forge,anaconda,main
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install Local Package
+        run: pip install -e .
+
+      - name: Run CLI Tests
+        shell: pwsh
+        env:
+          TERM: dumb
+        run: .\tests\cli.ps1 -Verbose

README.md

Lines changed: 8 additions & 1 deletion
@@ -11,7 +11,7 @@ representation maps at 10m resolution. These embeddings compress a full year of
 temporal-spectral features into dense representations optimized for downstream
 geospatial analysis tasks. Read more details about [the model](https://github.com/ucam-eo/tessera).

-![Coverage map](https://raw.githubusercontent.com/ucam-eo/tessera-coverage-map/refs/heads/main/map.png)
+![Coverage map](https://github.com/ucam-eo/tessera-coverage-map/blob/main/map.png)

 ### Request missing embeddings

@@ -28,6 +28,13 @@ After you submit the request, we will **prioritize your ROI** and notify you via
 **A request for support**
 Due to limited compute resources, we're unable to fulfill embedding requests covering large geographic areas or requiring substantial processing time. To help us serve the community better, we kindly ask requesters, especially those from commercial organizations or those requiring large-scale processing, to sponsor their requests by providing us with Azure credits. Importantly, the resulting outputs will be contributed to our global embeddings database, making them freely available for the entire research and user community. This approach allows us to scale our service while building a shared resource that benefits everyone. If you are in a position to support us in this way, please contact Prof. S. Keshav at sk818@cam.ac.uk. We greatly appreciate your understanding and support in making Tessera more accessible to all.

+### Important Notice ⚠️
+On 20th August 2025, we updated the data processing pipeline of GeoTessera to resolve the issue of tiling artifacts, as shown below. We have retained the embeddings generated before August 20, as they remain effective for use in small-scale areas. After the 2024 embedding generation is completed, we will reprocess the tiles affected by tiling artifacts. If you observe such artifacts during use and they significantly impact performance, please raise the issue **[here](../../issues/new?template=embedding-request.yml&labels=embedding-request)**, and we will prioritize reprocessing your request.
+
+![Pipeline Change](https://github.com/ucam-eo/geotessera/blob/main/pipeline_change.png)
+
+Please note that if the artifacts you observe are slanted, this is not a bug in the pipeline but rather a result of the Sentinel-1/2 satellite trajectories. Currently, Tessera cannot completely eliminate such artifacts, as they reflect the inherent characteristics of the raw data. However, we have observed that they have minimal impact on downstream tasks.
+
 ## Table of Contents

 - [Installation](#installation)

environment.yml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+name: base
+channels:
+  - defaults
+  - conda-forge
+  - main
+dependencies:
+  - cram>=0.7
+  - dask
+  - geodatasets>=2024.8.0
+  - geopandas
+  - matplotlib
+  - numpy>=1.24.0
+  - pandas
+  - pyarrow>=17.0.0
+  - rasterio
+  - rich
+  - rioxarray
+  - scikit-image>=0.25.2
+  - scikit-learn>=1.7.1
+  - sphinx>=8.2.3
+  - typer
+  - xarray
+  - zarr
