Equitable urban transportation applications require high-fidelity digital representations of the built environment: not just streets and sidewalks, but bike lanes, marked and unmarked crossings, curb ramps and cuts, obstructions, traffic signals, signage, street markings, potholes, and more. Direct inspections and manual annotations are prohibitively expensive at scale. Conventional machine learning methods require substantial annotated training data for adequate performance. In this paper, we consider vision-language models as a mechanism for annotating diverse urban features from satellite images, reducing the dependence on human annotation to produce large training sets.
We demonstrate a proof of concept that combines a state-of-the-art vision-language model with variants of a prompting strategy that asks the model to consider segmented elements independently of the original image.
Pipeline of our proposed automated annotation process. Users input a pair of (satellite image, annotation guidance). The image goes through segmentation, filtering, and set-of-mark generation. The marked image and the guidance are then passed to a vision-language model, whose output is post-processed to produce the final annotation. The procedure requires no fine-tuning and can be applied to different features with minimal adjustment to the guidance.
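To make the pipeline concrete, the following is a minimal Python sketch of one plausible realization. The segmentation backend (`segment_fn`) and the vision-language model call (`query_vlm_fn`) are assumed, injected callables rather than the specific tools used in this work, and the filtering thresholds and mark-parsing logic are illustrative only.

```python
import numpy as np
from PIL import Image, ImageDraw


def set_of_mark_overlay(image: Image.Image, masks: list[np.ndarray]) -> Image.Image:
    """Draw a numeric mark at the centroid of each candidate segment."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, mask in enumerate(masks):
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            continue
        draw.text((float(xs.mean()), float(ys.mean())), str(idx), fill="red")
    return marked


def annotate(image: Image.Image, guidance: str, segment_fn, query_vlm_fn) -> np.ndarray:
    """Segment, filter, mark, query the VLM, and post-process into a binary mask."""
    w, h = image.size

    # 1. Segmentation: segment_fn is a hypothetical callable returning a list
    #    of boolean (h, w) masks for candidate regions.
    masks = segment_fn(image)

    # 2. Filtering: drop implausibly tiny or huge segments (thresholds illustrative).
    masks = [m for m in masks if 1e-4 * h * w < m.sum() < 0.1 * h * w]

    # 3. Set-of-mark generation: overlay numeric labels so the model can refer
    #    to each segment independently of the raw pixels.
    marked = set_of_mark_overlay(image, masks)

    # 4. VLM query: query_vlm_fn is a hypothetical callable returning text such
    #    as "Marks 3 and 7 match the guidance."
    reply = query_vlm_fn(marked, guidance)

    # 5. Post-processing: union the masks whose indices the model named.
    selected = {int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()}
    annotation = np.zeros((h, w), dtype=bool)
    for idx in selected:
        if idx < len(masks):
            annotation |= masks[idx].astype(bool)
    return annotation
```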
Experiments on two urban features (stop lines and raised tables) show that while direct zero-shot prompting correctly annotates almost no images, the pre-segmentation strategies achieve intersection-over-union (IoU) scores approaching 40%.
IoU by feature and prompting strategy (SoM = set-of-mark):

| Feature | Direct Prompting | SoM (No Context) | SoM (In Context) | SoM (Combination) |
|---|---|---|---|---|
| Stop Lines | 0.0000 | 0.2483 | 0.3354 | 0.3657 |
| Raised Tables | 0.0190 | 0.3315 | 0.4069 | 0.4189 |
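For reference, IoU compares a predicted annotation mask against a ground-truth mask; a minimal sketch follows. Whether the reported numbers are per-image IoU averaged over the test set or pooled over pixels is not stated here, so the averaging in `mean_iou` is an assumption.

```python
import numpy as np


def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-union of two boolean annotation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: count as a perfect match
    return float(np.logical_and(pred, truth).sum()) / float(union)


def mean_iou(preds: list[np.ndarray], truths: list[np.ndarray]) -> float:
    """Average per-image IoU over a test set (assumed aggregation)."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, truths)]))
```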