| 1 | +## v0.7.0 (2025-11-09) |
| 2 | + |
| 3 | +This release moves to a Parquet-based registry for more efficient handling of |
| 4 | +the growing embeddings metadata for TESSERA. It no longer maintains a central |
| 5 | +cache, instead preferring the user to specify an embeddings directory within |
| 6 | +which the remote registry tiles are mirrored (as npy files) and additional |
| 7 | +mosaics and GeoTIFFs are generated. This makes efficient use of disk space, |
| 8 | +which matters given the large size of the embeddings. |
| 9 | + |
| 10 | +There are also new APIs for efficiently sampling embeddings for point data, and |
| 11 | +to generate mosaics for classifiers over ROIs. |
| 12 | + |
| 13 | +Note that there are significant interface changes throughout this release |
| 14 | +compared to 0.6; please read the migration notes below. The library will |
| 15 | +continue to evolve as we add more use cases, so please create issues on |
| 16 | +<https://github.com/ucam-eo/geotessera> with your wishlists! |
| 17 | + |
| 18 | +- **GeoParquet registry support**: Transitioned from text-based manifests to |
| 19 | + Parquet files (`registry.parquet`, `landmasks.parquet`) for all tile metadata. |
| 20 | +- **Remove caching layer for tiles**: All embedding and landmask tiles are |
| 21 | + now directly downloaded to temporary files and only the Parquet registry is |
| 22 | + cached, since users were finding that embeddings storage was being duplicated |
| 23 | + in the old tile cache. This significantly reduces disk usage. |
| 24 | +- **Direct embeddings downloads**: Replaced Pooch with direct downloads |
| 25 | + to temporary files with SHA256 verification. |
| 26 | +- **Lazy iterators** for reducing memory usage for large ROIs. |
| 27 | + |
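The SHA256 verification step mentioned above can be illustrated with the standard library alone. This is a minimal sketch, not the library's implementation: `sha256_of_file` and `verify_download` are hypothetical helper names, and the payload is a stand-in for a downloaded tile.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large tile never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Return True when the file digest matches the hash recorded for it."""
    return sha256_of_file(path) == expected_sha256.lower()


# Stand-in for a downloaded tile: write some bytes to a temporary file.
payload = b"example tile bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = Path(tmp.name)

expected = hashlib.sha256(payload).hexdigest()
ok = verify_download(tmp_path, expected)    # matching hash
bad = verify_download(tmp_path, "0" * 64)   # deliberately wrong hash
tmp_path.unlink()  # remove the temporary file
```

Hashing in chunks keeps memory use flat regardless of tile size, which fits the release's emphasis on large embeddings.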
| 28 | +### CLI Changes |
| 29 | + |
| 30 | +- **New global options**: |
| 31 | + - `--registry-path` - Specify registry.parquet file |
| 32 | + - `--registry-url` - Specify registry URL |
| 33 | + - `--cache-dir` - Control registry cache location (replaces `TESSERA_DATA_DIR`) |
| 34 | + - Removed `--auto-update` and `--manifests-repo-url` |
| 35 | + |
| 36 | +- **Enhanced `info` command**: Shows tiles per year and total landmask counts using fast pandas operations |
| 37 | +- **Enhanced `coverage` command**: Generate a 3D globegl globe with coverage textures for HTML viewing. |
| 38 | +- **New `--dry-run` option for `download` command**: Calculate total download size without downloading |
| 39 | + - Shows file count, total size, number of tiles, year, and format |
| 40 | + - Accounts for existing files (resume capability) - only counts files that would be downloaded |
| 41 | + - For NPY format: calculates exact sizes from registry for embeddings, scales, and landmasks |
| 42 | + - For TIFF format: provides size estimates (4x quantized size due to float32 conversion) |
| 43 | + - Useful for planning downloads and estimating bandwidth/storage requirements |
| 44 | + - Usage: `geotessera download --bbox '...' --dry-run` |
| 45 | + |
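The `--dry-run` accounting described above (skip files that already exist, apply a 4x factor for TIFF) can be sketched as follows. This is an illustration under stated assumptions, not the CLI's code: `estimate_download_bytes` is a hypothetical helper, the sizes are made up, and the coordinate formatting in the file names is assumed.

```python
import tempfile
from pathlib import Path


def estimate_download_bytes(entries, output_dir, tiff=False):
    """Count and sum sizes for files not already present in output_dir.

    entries: iterable of (relative_path, size_in_bytes) pairs.
    tiff: apply the 4x estimate for float32 conversion of quantized data.
    """
    output_dir = Path(output_dir)
    pending = [(rel, size) for rel, size in entries
               if not (output_dir / rel).exists()]
    total = sum(size for _, size in pending)
    if tiff:
        total *= 4
    return len(pending), total


# Made-up sizes for one tile: embedding, scales, and landmask.
entries = [
    ("embeddings/2024/grid_0.15_52.05.npy", 1000),
    ("embeddings/2024/grid_0.15_52.05_scales.npy", 100),
    ("landmasks/landmask_0.15_52.05.tif", 50),
]

with tempfile.TemporaryDirectory() as tmp:
    # Pretend the embedding was downloaded by an earlier, interrupted run.
    existing = Path(tmp) / entries[0][0]
    existing.parent.mkdir(parents=True)
    existing.write_bytes(b"x")

    npy_count, npy_bytes = estimate_download_bytes(entries, tmp)
    tiff_count, tiff_bytes = estimate_download_bytes(
        [("grid_0.15_52.05.tif", 1000)], tmp, tiff=True
    )
```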
| 46 | +### Registry CLI Changes |
| 47 | + |
| 48 | +- **New `export-manifests` command**: Convert Parquet registry files to Pooch-format text manifests for backwards compatibility |
| 49 | + - Reads `registry.parquet` and `landmasks.parquet` files |
| 50 | + - Generates block-based text registry files in `registry/embeddings/` and `registry/landmasks/` subdirectories |
| 51 | + - Creates separate entries for `.npy` and `_scales.npy` files with their respective hashes |
| 52 | + - Useful for maintaining the tessera-manifests repository |
| 53 | + - Usage: `geotessera-registry export-manifests /path/to/v1 --output-dir ~/src/git/ucam-eo/tessera-manifests` |
| 54 | + |
| 55 | +### Infrastructure Improvements |
| 56 | + |
| 57 | +- **CRAM test suite**: Added comprehensive CLI tests using CRAM (Command-line Regression Acceptance Testing) |
| 58 | +- **Dumb terminal support**: Added `TERM=dumb` support for non-interactive environments and CI pipelines |
| 59 | +- **Logging system**: Migrated from print statements to Python's standard `logging` module for better integration |
| 60 | + |
| 61 | +### Breaking Changes |
| 62 | + |
| 63 | +- **NPY Download Format**: `geotessera download --format npy` now saves **quantized** embeddings with scales instead of dequantized embeddings |
| 64 | + - **New structure**: Files saved in `embeddings/{year}/grid_{lon}_{lat}.npy` (quantized) and `_scales.npy` (float32 scales) |
| 65 | + - **Landmasks included**: Saved in `landmasks/landmask_{lon}_{lat}.tif` structure |
| 66 | + - **No JSON metadata**: Removed JSON metadata files (use registry for metadata) |
| 67 | + - **Resume capability**: Can interrupt and restart downloads without re-downloading existing files |
| 68 | + - If you have existing NPY downloads, re-download them with the new version. Downloaded directories can now be reused with `GeoTessera(embeddings_dir=...)` |
| 69 | + |
| 70 | +- **Registry API Changes**: Internal registry methods now return a tuple for better resource management |
| 71 | + - `Registry.fetch()` now returns a `(file_path, needs_cleanup)` tuple instead of just a path |
| 72 | + - `Registry.fetch_landmask()` now returns a `(file_path, needs_cleanup)` tuple instead of just a path |
| 73 | + - These are internal changes - most users won't be affected |
| 74 | + |
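For callers that do touch these internal methods, one way to consume a `(file_path, needs_cleanup)` tuple is a small context manager. The sketch below assumes `needs_cleanup=True` means the caller should delete the file when done; `fake_fetch` is a stand-in for the real fetch call and `fetched` is a hypothetical helper.

```python
import os
import tempfile
from contextlib import contextmanager


def fake_fetch(data: bytes):
    """Stand-in for a fetch call: write a temp file and request cleanup."""
    fd, path = tempfile.mkstemp(suffix=".npy")
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return path, True


@contextmanager
def fetched(result):
    """Yield the path from a (file_path, needs_cleanup) tuple.

    The file is removed on exit only when needs_cleanup is true.
    """
    file_path, needs_cleanup = result
    try:
        yield file_path
    finally:
        if needs_cleanup and os.path.exists(file_path):
            os.remove(file_path)


with fetched(fake_fetch(b"tile")) as path:
    with open(path, "rb") as fh:
        contents = fh.read()
    seen_path = path

still_exists = os.path.exists(seen_path)  # temp file is gone after the block
```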
| 75 | +- **Registry Format Requirements**: Updated schema for Parquet registry files |
| 76 | + - `registry.parquet` now requires both `file_size` and `scales_hash` columns |
| 77 | + - `landmasks.parquet` requires `file_size` column |
| 78 | + - `file_size` is used for accurate download progress reporting with total size |
| 79 | + - `scales_hash` stores SHA256 hash for scales files separately from embedding hash |
| 80 | + - Registry validation will fail if required columns are missing |
| 81 | + - Regenerate registries with latest `geotessera-registry scan` to include new columns |
| 82 | + |
| 83 | +- **Environment variables**: `TESSERA_REGISTRY_DIR` and `TESSERA_DATA_DIR` deprecated in favor of CLI parameters |
| 84 | +- **Registry format**: Completely new backend that migrates from text manifests to GeoParquet. |
| 85 | +- **Cache behavior**: Only the registry is now cached; tile data is not, so clients can manage their own disk usage. |
| 86 | + |
| 87 | +### New API Features |
| 88 | + |
| 89 | +- **`Tiles` class**: New abstraction for working with Tessera tiles |
| 90 | + - Provides unified interface for tile manipulation as either GeoTIFF or dequantized NumPy arrays |
| 91 | + - Simplifies conversion between formats |
| 92 | + - Accessible via `from geotessera.tiles import Tiles` |
| 93 | + |
| 94 | +- **`GeoTessera(embeddings_dir=...)`**: New constructor parameter for local tile reuse |
| 95 | + - Points to directory containing pre-downloaded tiles |
| 96 | + - Expected structure: `embeddings/{year}/grid_{lon}_{lat}.npy` and `_scales.npy`, `landmasks/landmask_{lon}_{lat}.tif` |
| 97 | + - Automatically uses local files when available, downloads only if missing |
| 98 | + |
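The expected directory structure can be made concrete with a short path-building sketch. `expected_tile_paths` is a hypothetical helper, and formatting coordinates with two decimal places is an assumption made for illustration; check real downloads for the exact naming.

```python
from pathlib import Path


def expected_tile_paths(embeddings_dir, year, lon, lat):
    """Build the three paths the layout above implies for one tile."""
    root = Path(embeddings_dir)
    coords = f"{lon:.2f}_{lat:.2f}"  # two-decimal formatting is assumed
    year_dir = root / "embeddings" / str(year)
    return {
        "embedding": year_dir / f"grid_{coords}.npy",
        "scales": year_dir / f"grid_{coords}_scales.npy",
        "landmask": root / "landmasks" / f"landmask_{coords}.tif",
    }


paths = expected_tile_paths("tiles", 2024, 0.15, 52.05)
```

A helper like this is handy for checking whether a directory from an earlier `geotessera download --format npy` run is complete before pointing `embeddings_dir` at it.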
| 99 | +- **`sample_embeddings_at_points(points, year, embeddings_dir=None, refresh=False)`**: Efficient point sampling |
| 100 | + - Extract embedding values at arbitrary lon/lat coordinates |
| 101 | + - Supports multiple input formats: list of tuples, GeoJSON FeatureCollection, GeoPandas GeoDataFrame |
| 102 | + - Automatically groups points by tile for efficient batch processing |
| 103 | + - Optional metadata return (tile info, pixel coords, CRS) |
| 104 | + - Can override instance `embeddings_dir` per call |
| 105 | + - Example: `embeddings = gt.sample_embeddings_at_points([(lon, lat), ...], year=2024)` |
| 106 | + |
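Grouping points by tile, as described above, means each tile only needs to be loaded once however many points fall inside it. The sketch below shows the idea, assuming a 0.1-degree grid whose tile centres sit at offsets of 0.05 (consistent with the `0.15, 52.05` example elsewhere in these notes); `tile_center` and `group_points_by_tile` are hypothetical names, not the library's internals.

```python
import math
from collections import defaultdict

TILES_PER_DEGREE = 10  # assumed 0.1-degree tile grid


def tile_center(lon, lat):
    """Return the centre of the 0.1-degree tile containing (lon, lat)."""
    # round() guards against floating-point noise near a tile boundary.
    ix = math.floor(round(lon * TILES_PER_DEGREE, 9))
    iy = math.floor(round(lat * TILES_PER_DEGREE, 9))
    return ((ix + 0.5) / TILES_PER_DEGREE, (iy + 0.5) / TILES_PER_DEGREE)


def group_points_by_tile(points):
    """Map each tile centre to the indices of the points that fall in it."""
    groups = defaultdict(list)
    for index, (lon, lat) in enumerate(points):
        groups[tile_center(lon, lat)].append(index)
    return dict(groups)


points = [(0.17, 52.04), (0.12, 52.09), (0.21, 52.04)]
groups = group_points_by_tile(points)
```

Storing point indices rather than coordinates makes it easy to write sampled values back in the caller's original order.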
| 107 | +- **`fetch_embedding(..., refresh=False)`**: New parameter to force re-download |
| 108 | + - When `refresh=True`, re-downloads even if local tiles exist in `embeddings_dir` |
| 109 | + - Useful for updating tiles or verifying data integrity |
| 110 | + |
| 111 | +- **New Registry size query methods**: Public API for querying file sizes from registry |
| 112 | + - `registry.get_tile_file_size(year, lon, lat)` - Get size of an embedding tile in bytes |
| 113 | + - `registry.get_landmask_file_size(lon, lat)` - Get size of a landmask tile in bytes |
| 114 | + - `registry.calculate_download_requirements(tiles, output_dir, format_type)` - Calculate total download size for a list of tiles |
| 115 | + - These methods replace direct registry DataFrame access and provide proper error handling |
| 116 | + - Used internally by CLI `--dry-run` option and available for programmatic use |
| 117 | + - Example: `size = gt.registry.get_tile_file_size(2024, 0.15, 52.05)` |
| 118 | + |
| 119 | +- **`embeddings_count(bbox, year)`**: Get count of tiles in a bounding box |
| 120 | + - Returns total number of embedding tiles within a geographic region |
| 121 | + - Useful for planning downloads and estimating processing requirements |
| 122 | + - Example: `count = gt.embeddings_count((min_lon, min_lat, max_lon, max_lat), 2024)` |
| 123 | + |
| 124 | +- **`export_coverage_map(output_file)`**: Export coverage data to JSON |
| 125 | + - Generates global coverage map showing which tiles have embeddings for which years |
| 126 | + - Returns dictionary with tile coverage information |
| 127 | + - Optionally saves to JSON file for use in visualizations |
| 128 | + |
| 129 | +- **`generate_coverage_texture(coverage_data, output_file)`**: Generate coverage texture for globe visualization |
| 130 | + - Creates 3600x1800 pixel equirectangular projection texture |
| 131 | + - Each pixel represents a 0.1-degree tile, colored by coverage status |
| 132 | + - Used with the `coverage` command for 3D globe visualizations, and also available for your own visualizations |
| 133 | + |
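To show how a 3600x1800 equirectangular texture relates to 0.1-degree tiles, the following sketch maps a coordinate to a pixel. It assumes longitude -180 is the left edge and latitude +90 is the top row; `tile_to_pixel` is a hypothetical helper and the library's actual orientation may differ.

```python
import math

WIDTH, HEIGHT = 3600, 1800  # one pixel per 0.1-degree tile


def tile_to_pixel(lon, lat):
    """Map lon/lat to (x, y) in an equirectangular texture, north at the top."""
    x = math.floor(round((lon + 180.0) * 10, 9))
    y = math.floor(round((90.0 - lat) * 10, 9))
    # Clamp so lon=180 and lat=-90 stay inside the image.
    return min(max(x, 0), WIDTH - 1), min(max(y, 0), HEIGHT - 1)


pixel = tile_to_pixel(0.15, 52.05)
top_left = tile_to_pixel(-180.0, 90.0)
bottom_right = tile_to_pixel(180.0, -90.0)
```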
| 134 | +- **`dequantize_embedding(quantized_embedding, scales)`**: Public utility function for dequantization |
| 135 | + - Converts quantized embeddings to float32 by multiplying with scale factors |
| 136 | + - Useful when working directly with downloaded quantized NPY files; for normal usage, prefer the `Tiles` class. |
| 137 | + - Example: `embedding = dequantize_embedding(quantized, scales)` |
| 138 | + |
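The arithmetic behind dequantization can be sketched with NumPy. This is an illustration rather than the library's implementation: it assumes int8 embeddings of shape `(height, width, channels)` and one float32 scale per pixel of shape `(height, width)`, and `dequantize_sketch` is a hypothetical name.

```python
import numpy as np


def dequantize_sketch(quantized: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Multiply quantized values by per-pixel scales, giving float32 output."""
    # Add a trailing axis so each pixel's scale broadcasts over its channels.
    return quantized.astype(np.float32) * scales[..., np.newaxis].astype(np.float32)


# Tiny stand-in tile: 2x2 pixels with 3 channels (real tiles are far larger).
quantized = np.array(
    [[[10, -20, 30], [1, 2, 3]],
     [[0, 0, 0], [-128, 127, 64]]],
    dtype=np.int8,
)
scales = np.array([[0.5, 2.0], [1.0, 0.25]], dtype=np.float32)

embedding = dequantize_sketch(quantized, scales)
```

Casting to float32 before multiplying matters: multiplying int8 values in place would overflow or truncate.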
| 139 | +### Migration Notes |
| 140 | + |
| 141 | +From v0.6.0 to v0.7.0: |
| 142 | +- Update initialization code to use the new `cache_dir` parameter instead of environment variables |
| 143 | +- Remove any custom `TESSERA_DATA_DIR` or `TESSERA_REGISTRY_DIR` environment variable usage |
| 144 | +- Expect reduced disk usage, since tiles are no longer cached, but potentially more downloads. |
| 145 | +- **If using NPY downloads**: Re-download tiles with the new format to get the quantized structure |
| 146 | +- **To reuse downloaded tiles**: Use `GeoTessera(embeddings_dir="path/to/tiles")` when initializing |
| 147 | +- **For point sampling**: Replace manual tile iteration with `sample_embeddings_at_points()` |
| 148 | + |
1 | 149 | ## v0.6.0 (2025-09-15) |
2 | 150 |
3 | 151 | - registry: Add support for a Parquet registry as an alternative source |