Sanctuaria-Gaze is a multimodal dataset featuring egocentric recordings from 40 visits to four architecturally and culturally significant sanctuaries in Italy. Collected using Meta Project Aria Glasses, the dataset includes synchronized RGB videos, raw gaze data, head motion, and environmental point cloud data, totaling over 4 hours of recordings.
Alongside the data, we provide an open-source framework for automatic detection and analysis of Areas of Interest (AOIs), enabling gaze-based research without manual annotation.
The dataset is publicly available on the Hugging Face Hub.
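As a minimal sketch of how the recordings could be fetched locally, the snippet below uses `huggingface_hub.snapshot_download`; the `repo_id` shown is an assumption and should be replaced with the identifier published on the Hub.

```python
# Sketch: download the dataset from the Hugging Face Hub.
# NOTE: the repo_id below is an assumption; replace it with the actual
# dataset identifier listed on the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="aimagelab/Sanctuaria-Gaze",  # hypothetical identifier
    repo_type="dataset",
)
print(f"Dataset downloaded to {local_dir}")
```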
This dataset represents the largest multimodal egocentric dataset focused on religious heritage. Each recording captures a real-world visual exploration of sacred spaces, allowing researchers to analyze human visual attention, spatial navigation, and interaction with architectural and religious elements.
Participants were equipped with Meta Project Aria Glasses, which integrate multiple synchronized sensors.
Each video was captured at a resolution of 1408×1408 pixels and a frame rate of 15 fps.
The following sensor streams were recorded:
| Modality | Sensor | Frequency | Description |
|---|---|---|---|
| RGB Video | RGB camera | 15 Hz | Egocentric view of the environment |
| Gaze | Eye-tracking cameras | 30 Hz | Timestamped gaze coordinates for visual attention analysis |
| Depth & SLAM | SLAM cameras | 15 Hz | Depth estimation and 3D scene reconstruction |
| IMU | Accelerometer / Gyroscope | 1000 Hz / 800 Hz | Head motion and inertial measurements |
| Magnetometer | Magnetic field sensor | 10 Hz | Orientation and spatial context |
| Barometer | Pressure sensor | 50 Hz | Altitude and environmental pressure |
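Because the streams run at different rates (e.g., 30 Hz gaze versus 15 Hz video), downstream analysis typically maps each gaze sample to its nearest RGB frame by timestamp. The sketch below illustrates one way to do this, assuming both streams share a common clock in seconds; it is only an illustration, since the released gaze files already carry a per-sample frame index.

```python
# Sketch: map each gaze sample (30 Hz) to its nearest RGB frame (15 Hz) by
# timestamp. Assumes both streams share a common clock in seconds.
import numpy as np

def nearest_frame_index(gaze_ts: np.ndarray, frame_ts: np.ndarray) -> np.ndarray:
    """For each gaze timestamp, return the index of the closest video frame."""
    idx = np.searchsorted(frame_ts, gaze_ts)   # first frame at/after each sample
    idx = np.clip(idx, 1, len(frame_ts) - 1)
    prev_closer = gaze_ts - frame_ts[idx - 1] < frame_ts[idx] - gaze_ts
    return np.where(prev_closer, idx - 1, idx)

# Example with synthetic timestamps: 15 fps video, 30 Hz gaze.
frame_ts = np.arange(0, 10, 1 / 15)
gaze_ts = np.arange(0, 10, 1 / 30)
print(nearest_frame_index(gaze_ts, frame_ts)[:6])
```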
Each participant was asked to freely explore one of the churches while wearing the glasses, simulating a natural visit.
They were given complete autonomy over their movements and activities, without specific instructions or tasks.
Before every recording, the glasses were calibrated following the official Project Aria calibration procedure to ensure accurate gaze tracking.
To minimize social or environmental interference, each participant explored the church individually, ensuring undisturbed, natural behavior.
- Participants: 20
- Sequences: 40 (10 per sanctuary)
- Average duration: 6.7 minutes
- Total duration: 4.47 hours
- Total frames: 241,355
Along with the dataset, we provide an open-source framework for AOI detection and gaze-based analysis.
- Batch or single-file processing
- IDT-based scanpath generation
- Automatic AOI detection with pretrained models
- Frame extraction and annotation
- Annotated video generation
- Configurable command-line interface
Clone the repository and install dependencies:
```bash
git clone https://github.com/aimagelab/Sanctuaria-Gaze.git
cd Sanctuaria-Gaze
pip install -r requirements.txt
```
Run the tool on a single gaze/video pair or on an entire folder:

```bash
python annotate.py --idt --verbose path/to/subject_gaze.csv path/to/subject.mp4
python annotate.py --idt --verbose path/to/folder/
```

Available options:

- `--idt`: Run IDT scanpath generation
- `--no-extract`: Skip frame extraction
- `--no-annotate`: Skip annotation
- `--no-video`: Skip video creation
- `--idt-dis-threshold FLOAT`: Set IDT dispersion threshold (default: 0.05)
- `--idt-dur-threshold INT`: Set IDT duration threshold (default: 100)
- `--stop-frame INT`: Stop after this frame number
- `--verbose`: Enable verbose logging
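For intuition on what the dispersion and duration thresholds control, below is a minimal, self-contained sketch of the classic I-DT (dispersion-threshold) fixation detection idea. It is not the repository's implementation: inputs are simplified to plain x/y/timestamp arrays (normalized positions, timestamps in seconds), and the default values here are illustrative.

```python
# Minimal I-DT (dispersion-threshold) sketch; not the repository's implementation.
# Gaze positions are assumed normalized in [0, 1]; timestamps are in seconds.
def idt_fixations(xs, ys, ts, dis_threshold=0.05, dur_threshold=0.1):
    """Return a list of (start_time, end_time, centroid_x, centroid_y) fixations."""
    fixations = []
    start, n = 0, len(xs)
    while start < n:
        # Grow the window until it spans at least the minimum duration.
        end = start
        while end < n and ts[end] - ts[start] < dur_threshold:
            end += 1
        if end >= n:
            break
        # Dispersion = (max_x - min_x) + (max_y - min_y) over the window.
        wx, wy = xs[start:end + 1], ys[start:end + 1]
        if (max(wx) - min(wx)) + (max(wy) - min(wy)) <= dis_threshold:
            # Extend the window while dispersion stays under the threshold.
            while end + 1 < n:
                wx, wy = xs[start:end + 2], ys[start:end + 2]
                if (max(wx) - min(wx)) + (max(wy) - min(wy)) > dis_threshold:
                    break
                end += 1
            k = end - start + 1
            cx = sum(xs[start:end + 1]) / k
            cy = sum(ys[start:end + 1]) / k
            fixations.append((ts[start], ts[end], cx, cy))
            start = end + 1
        else:
            # Window too dispersed: drop the first sample and try again.
            start += 1
    return fixations
```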
The input `_gaze.csv` file must include the following columns:

- `gaze_timestamp`
- `world_index`
- `confidence`
- `norm_pos_x`
- `norm_pos_y`

Each row represents a single gaze sample, with normalized gaze positions (`norm_pos_x`, `norm_pos_y`) ranging from 0 to 1, and timestamps (`gaze_timestamp`) in seconds. The `confidence` column indicates the reliability of each gaze point, and `world_index` corresponds to the frame index in the associated video.
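As a quick illustration of this format (assuming the columns listed above and the 1408×1408 RGB resolution), the following snippet loads a gaze file with pandas, checks the expected columns, and converts the normalized positions to pixel coordinates; the confidence cutoff is an arbitrary example value.

```python
# Sketch: load a gaze CSV, validate the expected columns, and convert
# normalized positions to pixel coordinates (assuming 1408x1408 frames).
import pandas as pd

REQUIRED = ["gaze_timestamp", "world_index", "confidence", "norm_pos_x", "norm_pos_y"]
FRAME_SIZE = 1408  # RGB frames are 1408x1408 pixels

gaze = pd.read_csv("data/subject1_gaze.csv")
missing = [c for c in REQUIRED if c not in gaze.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}")

# Optionally drop low-confidence samples (threshold chosen arbitrarily here).
gaze = gaze[gaze["confidence"] >= 0.8]

gaze["pixel_x"] = gaze["norm_pos_x"] * FRAME_SIZE
gaze["pixel_y"] = gaze["norm_pos_y"] * FRAME_SIZE
print(gaze[["gaze_timestamp", "world_index", "pixel_x", "pixel_y"]].head())
```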
The list of object names (classes) used for detection must be specified in a plain text file named `object_classes.txt`.
Each line should contain a single object name. Lines starting with `#` are treated as comments and ignored.
Example `object_classes.txt`:

```text
# Religious and architectural objects
altar
crucifix
pews
pulpit
chalice
...
```
This file should be placed in the project directory.
The detection models (e.g., YOLOv8_World, OWLv2) will automatically load object names from this file at runtime.
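The snippet below is only an illustrative sketch of how such a class file might be parsed and handed to an open-vocabulary detector; it uses the `ultralytics` YOLO-World API as an example, and the checkpoint name and frame path are hypothetical, not necessarily what the framework does internally.

```python
# Sketch: parse object_classes.txt (skipping comments and blank lines) and hand
# the class names to an open-vocabulary detector. Illustrative only.
from ultralytics import YOLOWorld

with open("object_classes.txt", encoding="utf-8") as f:
    classes = [line.strip() for line in f
               if line.strip() and not line.lstrip().startswith("#")]

model = YOLOWorld("yolov8s-world.pt")  # example checkpoint name
model.set_classes(classes)
results = model.predict("frames/frame_000123.jpg")  # hypothetical extracted frame
```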
Suppose you have the following files in `data/`:

- `subject1_gaze.csv`
- `subject1.mp4`
- `subject2_gaze.csv`
- `subject2.mp4`
To process all pairs in the folder:
```bash
python annotate.py --idt data/
```

To process a single pair with verbose output and custom IDT thresholds:

```bash
python annotate.py data/subject1_gaze.csv data/subject1.mp4 --idt --idt-dis-threshold 0.07 --idt-dur-threshold 120 --verbose
```

- Annotated CSV files and processed videos are saved in the output directory.
- Temporary extracted frames and intermediate files are cleaned up automatically.
This project is licensed under the MIT License. See the LICENSE file for details.
If you use the dataset or the tool in your research, please cite:
```bibtex
@article{cartella2025sanctuaria,
  title={Sanctuaria-Gaze: A Multimodal Egocentric Dataset for Human Attention Analysis in Religious Sites},
  author={Cartella, Giuseppe and Cuculo, Vittorio and Cornia, Marcella and Papasidero, Marco and Ruozzi, Federico and Cucchiara, Rita},
  journal={Journal on Computing and Cultural Heritage (JOCCH)},
  year={2025},
  publisher={ACM New York, NY, USA},
  doi={10.1145/3769091}
}
```

This work has been supported by the PNRR project “Italian Strengthening of Esfri RI Resilience (ITSERR)” funded by the European Union - NextGenerationEU (CUP B53C22001770006).

