
Commit d1aa0bd

Merge pull request #84 from avsm/geoparquet
Switch to Geoparquet registries, add a Tiles class, and sampling points for embeddings
2 parents bd0e6cd + 35f5a36 commit d1aa0bd

28 files changed, +6963 -3426 lines changed
.github/workflows/ci.yml

Lines changed: 3 additions & 0 deletions
@@ -37,3 +37,6 @@ jobs:
 
       - name: Build
         run: uv sync --locked --all-extras --dev
+
+      - name: Test
+        run: env TERM=dumb TTY_INTERACTIVE=0 uv run cram tests -v

CHANGES.md

Lines changed: 148 additions & 0 deletions
@@ -1,3 +1,151 @@
+## v0.7.0 (2025-11-09)
+
+This release moves to a Parquet-based registry for more efficient handling of
+the growing embeddings metadata for TESSERA. It no longer maintains a central
+cache; instead, the user specifies an embeddings directory in which the remote
+registry tiles are mirrored (as npy files) and additional mosaics and GeoTIFFs
+are generated. This makes efficient use of disk space given the large size of
+the embeddings.
+
+There are also new APIs for efficiently sampling embeddings for point data, and
+for generating mosaics for classifiers over ROIs.
+
+Note that there are significant interface changes throughout this release
+compared to 0.6; please read the migration notes below. The library will
+continue to evolve as we add more use cases, so please create issues on
+<https://github.com/ucam-eo/geotessera> with your wishlists!
+
+- **GeoParquet registry support**: Transitioned from text-based manifests to
+  Parquet files (`registry.parquet`, `landmasks.parquet`) for all tile metadata
+- **Remove caching layer for tiles**: All embedding and landmask tiles are
+  now directly downloaded to temporary files and only the Parquet registry is
+  cached, since users were finding that embeddings storage was being duplicated
+  in the old tile cache. This leads to a significant reduction in disk usage.
+- **Direct embeddings downloads**: Replaced Pooch with direct downloads
+  to temporary files with SHA256 verification.
+- **Lazy iterators** for reducing memory usage for large ROIs.
+
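Review note: the "direct downloads to temporary files with SHA256 verification" item can be pictured with the minimal sketch below. It is not the library's code; `write_verified_temp` is a hypothetical helper, and `data` stands in for a downloaded response body.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_verified_temp(data, expected_sha256):
    """Write bytes to a temporary file and keep it only if the hash matches.

    Hypothetical helper: `data` stands in for a downloaded response body.
    """
    fd, path = tempfile.mkstemp(suffix=".npy")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    if sha256_of_file(path) != expected_sha256:
        os.remove(path)  # do not leave a corrupt tile behind
        raise ValueError("SHA256 mismatch for downloaded tile")
    return path
```

Hashing in chunks keeps memory use flat even for large tiles, which matters when the file is only ever held on disk temporarily.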
+### CLI Changes
+
+- **New global options**:
+  - `--registry-path` - Specify `registry.parquet` file
+  - `--registry-url` - Specify registry URL
+  - `--cache-dir` - Control registry cache location (replaces `TESSERA_DATA_DIR`)
+  - Removed `--auto-update` and `--manifests-repo-url`
+
+- **Enhanced `info` command**: Shows tiles per year and total landmask counts using fast pandas operations
+- **Enhanced `coverage` command**: Generates a 3D globegl globe with coverage textures for HTML viewing.
+- **New `--dry-run` option for `download` command**: Calculate total download size without downloading
+  - Shows file count, total size, number of tiles, year, and format
+  - Accounts for existing files (resume capability) - only counts files that would be downloaded
+  - For NPY format: calculates exact sizes from registry for embeddings, scales, and landmasks
+  - For TIFF format: provides size estimates (4x quantized size due to float32 conversion)
+  - Useful for planning downloads and estimating bandwidth/storage requirements
+  - Usage: `geotessera download --bbox '...' --dry-run`
+
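Review note: the `--dry-run` accounting described above (skip files already on disk, exact sizes for NPY, a 4x estimate for TIFF) can be sketched roughly as follows. This is an illustration under assumptions, not the CLI's implementation: `dry_run_total` and its `entries` argument are hypothetical, and the 4x factor assumes quantized values are one byte each, which is what the changelog's float32 remark implies.

```python
import os


def dry_run_total(entries, output_dir, fmt="npy"):
    """Sum the bytes that would still need downloading.

    `entries` is a hypothetical list of (relative_path, registry_size_bytes)
    pairs; files already present under `output_dir` are skipped.
    """
    files = 0
    total = 0
    for rel_path, size in entries:
        if os.path.exists(os.path.join(output_dir, rel_path)):
            continue  # resume: this file is already on disk
        files += 1
        # Assumed: 1-byte quantized values widen to 4-byte float32 for TIFF.
        total += size * 4 if fmt == "tiff" else size
    return files, total
```

Returning both the file count and the byte total mirrors the two headline numbers the changelog says the option reports.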
+### Registry CLI Changes
+
+- **New `export-manifests` command**: Convert Parquet registry files to Pooch-format text manifests for backwards compatibility
+  - Reads `registry.parquet` and `landmasks.parquet` files
+  - Generates block-based text registry files in `registry/embeddings/` and `registry/landmasks/` subdirectories
+  - Creates separate entries for `.npy` and `_scales.npy` files with their respective hashes
+  - Useful for maintaining the tessera-manifests repository
+  - Usage: `geotessera-registry export-manifests /path/to/v1 --output-dir ~/src/git/ucam-eo/tessera-manifests`
+
+### Infrastructure Improvements
+
+- **CRAM test suite**: Added comprehensive CLI tests using CRAM (Command-line Regression Acceptance Testing)
+- **Dumb terminal support**: Added `TERM=dumb` support for non-interactive environments and CI pipelines
+- **Logging system**: Migrated from print statements to Python's standard `logging` module for better integration
+
+### Breaking Changes
+
+- **NPY Download Format**: `geotessera download --format npy` now saves **quantized** embeddings with scales instead of dequantized embeddings
+  - **New structure**: Files saved in `embeddings/{year}/grid_{lon}_{lat}.npy` (quantized) and `_scales.npy` (float32 scales)
+  - **Landmasks included**: Saved in `landmasks/landmask_{lon}_{lat}.tif` structure
+  - **No JSON metadata**: Removed JSON metadata files (use registry for metadata)
+  - **Resume capability**: Can interrupt and restart downloads without re-downloading existing files
+  - If you have existing NPY downloads, re-download with the new version. Downloaded directories can now be reused with `GeoTessera(embeddings_dir=...)`
+
+- **Registry API Changes**: Internal registry methods now return a tuple for better resource management
+  - `Registry.fetch()` now returns a `(file_path, needs_cleanup)` tuple instead of just a path
+  - `Registry.fetch_landmask()` now returns a `(file_path, needs_cleanup)` tuple instead of just a path
+  - These are internal changes - most users won't be affected
+
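Review note: a `(file_path, needs_cleanup)` return value pairs naturally with a `try`/`finally` on the caller's side. The sketch below shows that pattern only; `fake_fetch` is a hypothetical stand-in and does not reproduce how `Registry.fetch()` decides between local and temporary files.

```python
import os
import tempfile
from contextlib import contextmanager


def fake_fetch(local_copy=None):
    """Hypothetical stand-in returning a (file_path, needs_cleanup) tuple."""
    if local_copy is not None:
        return local_copy, False  # the caller must not delete a user's file
    fd, path = tempfile.mkstemp(suffix=".npy")
    os.close(fd)
    return path, True  # temporary download: the caller owns cleanup


@contextmanager
def fetched(local_copy=None):
    """Yield the fetched path; delete it afterwards only if it was temporary."""
    path, needs_cleanup = fake_fetch(local_copy)
    try:
        yield path
    finally:
        if needs_cleanup and os.path.exists(path):
            os.remove(path)
```

Wrapping the tuple in a context manager keeps the cleanup decision in one place, so call sites cannot forget to remove temporary files or accidentally remove reused ones.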
+- **Registry Format Requirements**: Updated schema for Parquet registry files
+  - `registry.parquet` now requires both `file_size` and `scales_hash` columns
+  - `landmasks.parquet` requires `file_size` column
+  - `file_size` used for accurate download progress reporting with total size
+  - `scales_hash` stores SHA256 hash for scales files separately from embedding hash
+  - Registry validation will fail if required columns are missing
+  - Regenerate registries with latest `geotessera-registry scan` to include new columns
+
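Review note: the required-column rule above amounts to a set difference. The sketch below is illustrative only: `missing_columns` and `validate` are hypothetical names, the column lists are passed in as plain Python lists rather than read from a Parquet file, and only the column names listed in the changelog are treated as required.

```python
REQUIRED_COLUMNS = {
    "registry.parquet": {"file_size", "scales_hash"},
    "landmasks.parquet": {"file_size"},
}


def missing_columns(filename, columns):
    """Return the required columns absent from `columns`, sorted by name."""
    return sorted(REQUIRED_COLUMNS[filename] - set(columns))


def validate(filename, columns):
    """Raise ValueError when a registry file lacks a required column."""
    missing = missing_columns(filename, columns)
    if missing:
        raise ValueError(f"{filename} is missing columns: {', '.join(missing)}")
```

A registry that fails this kind of check is one that needs regenerating with the latest scan, as the last bullet says.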
+- **Environment variables**: `TESSERA_REGISTRY_DIR` and `TESSERA_DATA_DIR` deprecated in favor of CLI parameters
+- **Registry format**: Completely new backend that migrates from text manifests to GeoParquet.
+- **Cache behavior**: Only the registry is now cached; tile data is not, so clients can manage their own disk usage.
+
+### New API Features
+
+- **`Tiles` class**: New abstraction for working with Tessera tiles
+  - Provides unified interface for tile manipulation as either GeoTIFF or dequantized NumPy arrays
+  - Simplifies conversion between formats
+  - Accessible via `from geotessera.tiles import Tiles`
+
+- **`GeoTessera(embeddings_dir=...)`**: New constructor parameter for local tile reuse
+  - Points to directory containing pre-downloaded tiles
+  - Expected structure: `embeddings/{year}/grid_{lon}_{lat}.npy` and `_scales.npy`, `landmasks/landmask_{lon}_{lat}.tif`
+  - Automatically uses local files when available, downloads only if missing
+
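Review note: the expected directory structure can be spelled out as path construction. This is a sketch under assumptions: `tile_paths` and `missing_files` are hypothetical helpers, the two-decimal coordinate formatting is assumed (the changelog only shows the `{lon}` and `{lat}` placeholders), and the scales file is assumed to share the embedding's stem.

```python
from pathlib import Path


def tile_paths(embeddings_dir, year, lon, lat):
    """Build the expected local paths for one tile.

    Assumption: coordinates are formatted with two decimals, e.g. 0.15, 52.05.
    """
    root = Path(embeddings_dir)
    stem = f"grid_{lon:.2f}_{lat:.2f}"
    return {
        "embedding": root / "embeddings" / str(year) / f"{stem}.npy",
        "scales": root / "embeddings" / str(year) / f"{stem}_scales.npy",
        "landmask": root / "landmasks" / f"landmask_{lon:.2f}_{lat:.2f}.tif",
    }


def missing_files(embeddings_dir, year, lon, lat):
    """Return the names of the files that would still need downloading."""
    paths = tile_paths(embeddings_dir, year, lon, lat)
    return [name for name, path in paths.items() if not path.exists()]
```

An empty result from `missing_files` corresponds to the "uses local files when available" case; anything else would trigger a download of just the missing pieces.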
+- **`sample_embeddings_at_points(points, year, embeddings_dir=None, refresh=False)`**: Efficient point sampling
+  - Extract embedding values at arbitrary lon/lat coordinates
+  - Supports multiple input formats: list of tuples, GeoJSON FeatureCollection, GeoPandas GeoDataFrame
+  - Automatically groups points by tile for efficient batch processing
+  - Optional metadata return (tile info, pixel coords, CRS)
+  - Can override instance `embeddings_dir` per call
+  - Example: `embeddings = gt.sample_embeddings_at_points([(lon, lat), ...], year=2024)`
+
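Review note: "groups points by tile" means each tile is loaded once and all of its points are sampled together. The sketch below shows one way to do the grouping; it is not the library's code, `group_points_by_tile` is a hypothetical name, and it assumes tiles are 0.1 degrees wide and aligned to multiples of 0.1 degrees.

```python
import math
from collections import defaultdict


def group_points_by_tile(points, tiles_per_degree=10):
    """Group (lon, lat) points by the tile that contains them.

    Keys are integer (column, row) tile indices, which avoids comparing
    floats. Values are lists of (original_index, lon, lat) so that results
    can be put back into input order after per-tile processing.
    """
    groups = defaultdict(list)
    for index, (lon, lat) in enumerate(points):
        key = (
            math.floor(lon * tiles_per_degree),
            math.floor(lat * tiles_per_degree),
        )
        groups[key].append((index, lon, lat))
    return dict(groups)
```

Using `math.floor` rather than `int` keeps negative longitudes and latitudes in the correct tile, since `int` truncates towards zero.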
+- **`fetch_embedding(..., refresh=False)`**: New parameter to force re-download
+  - When `refresh=True`, re-downloads even if local tiles exist in `embeddings_dir`
+  - Useful for updating tiles or verifying data integrity
+
+- **New Registry size query methods**: Public API for querying file sizes from registry
+  - `registry.get_tile_file_size(year, lon, lat)` - Get size of an embedding tile in bytes
+  - `registry.get_landmask_file_size(lon, lat)` - Get size of a landmask tile in bytes
+  - `registry.calculate_download_requirements(tiles, output_dir, format_type)` - Calculate total download size for a list of tiles
+  - These methods replace direct registry DataFrame access and provide proper error handling
+  - Used internally by CLI `--dry-run` option and available for programmatic use
+  - Example: `size = gt.registry.get_tile_file_size(2024, 0.15, 52.05)`
+
+- **`embeddings_count(bbox, year)`**: Get count of tiles in a bounding box
+  - Returns total number of embedding tiles within a geographic region
+  - Useful for planning downloads and estimating processing requirements
+  - Example: `count = gt.embeddings_count((min_lon, min_lat, max_lon, max_lat), 2024)`
+
+- **`export_coverage_map(output_file)`**: Export coverage data to JSON
+  - Generates global coverage map showing which tiles have embeddings for which years
+  - Returns dictionary with tile coverage information
+  - Optionally saves to JSON file for use in visualizations
+
+- **`generate_coverage_texture(coverage_data, output_file)`**: Generate coverage texture for globe visualization
+  - Creates 3600x1800 pixel equirectangular projection texture
+  - Each pixel represents a 0.1-degree tile, colored by coverage status
+  - Used by the `coverage` command for 3D globe visualizations, and also usable for your own visualizations
+
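Review note: a 3600x1800 equirectangular texture with one pixel per 0.1-degree tile implies a simple linear mapping from lon/lat to pixel indices. The sketch below assumes an orientation the changelog does not state (x grows eastwards from -180, y grows southwards from +90); `tile_to_pixel` is a hypothetical helper.

```python
import math

WIDTH, HEIGHT = 3600, 1800  # one pixel per 0.1-degree tile


def tile_to_pixel(lon, lat):
    """Map a lon/lat position to (x, y) in the coverage texture.

    Assumptions: x grows eastwards from -180, y grows southwards from +90.
    """
    x = math.floor((lon + 180.0) * 10)
    y = math.floor((90.0 - lat) * 10)
    # Clamp the eastern and southern edges onto the last pixel.
    return min(max(x, 0), WIDTH - 1), min(max(y, 0), HEIGHT - 1)
```

The clamp handles the two edge cases (`lon == 180` and `lat == -90`) that would otherwise index one pixel past the texture.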
+- **`dequantize_embedding(quantized_embedding, scales)`**: Public utility function for dequantization
+  - Converts quantized embeddings to float32 by multiplying with scale factors
+  - Useful when working directly with downloaded quantized NPY files; for normal usage, prefer the `Tiles` class.
+  - Example: `embedding = dequantize_embedding(quantized, scales)`
+
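Review note: "multiplying with scale factors" can be illustrated without any dependencies. The sketch below is not `dequantize_embedding` itself: it uses nested Python lists, assumes one scale per pixel shared across that pixel's channels, and produces Python floats rather than float32. With NumPy arrays the same idea is usually a cast followed by a broadcast multiplication.

```python
def dequantize(quantized, scales):
    """Multiply each pixel's integer channel values by that pixel's scale.

    Assumed layout: quantized[row][col] is a list of channel values and
    scales[row][col] is a single float for that pixel.
    """
    return [
        [[value * scale for value in pixel] for pixel, scale in zip(q_row, s_row)]
        for q_row, s_row in zip(quantized, scales)
    ]
```

For example, a one-row image with two pixels and two channels, `[[[1, -2], [4, 0]]]`, combined with scales `[[0.5, 0.25]]`, scales the first pixel by 0.5 and the second by 0.25.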
+### Migration Notes
+
+From v0.6.0 to v0.7.0:
+- Update initialization code to use the new `cache_dir` parameter instead of environment variables
+- Remove any custom `TESSERA_DATA_DIR` or `TESSERA_REGISTRY_DIR` environment variable usage
+- Expect reduced disk usage, since tiles are no longer cached, but potentially more downloads.
+- **If using NPY downloads**: Re-download tiles with the new format to get the quantized structure
+- **To reuse downloaded tiles**: Use `GeoTessera(embeddings_dir="path/to/tiles")` when initializing
+- **For point sampling**: Replace manual tile iteration with `sample_embeddings_at_points()`
+
 ## v0.6.0 (2025-09-15)
 
 - registry: Add support for a Parquet registry as an alternative source
