This document describes the metadata schema for Spatial LibriSpeech. The metadata is organised as columns of a pandas DataFrame. We first describe how to load the dataframe, and then list all its columns.
Once you have downloaded the dataset, you can load the metadata with the following command:
```python
import pandas as pd

metadata = pd.read_parquet("<path_to_dataset>/metadata.parquet")
```

Alternatively, you can download the metadata directly from the internet:

```python
import pandas as pd

metadata = pd.read_parquet("https://docs-assets.developer.apple.com/ml-research/datasets/spatial-librispeech/v1/metadata.parquet")
```

The rest of this document describes the metadata schema for Spatial LibriSpeech, organised by columns in the dataframe.

- `sample_id`
  - Type: `int`
  - Range: [0, 22]
  - Dataloader feature: `spatial_librispeech.Feature.SAMPLE_ID`
  - Numeric identifier for sample. Corresponding audio files will be named `{sample_id:06}.flac`.
- `split`
  - Type: `string`
  - Values: [`train`, `test`]
  - Determines whether the current sample belongs to the train or the test set.
- `lite_version`
  - Type: `boolean`
  - If true, the current sample is part of the lite version of the dataset.
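
As a minimal sketch of how these three columns might be combined, the snippet below selects the training samples of the lite version and derives the audio file name for each one. The small in-memory dataframe only stands in for the real metadata, and the helper name `audio_filename` is hypothetical.

```python
import pandas as pd

# Stand-in for the real metadata; only the columns used here are included.
metadata = pd.DataFrame(
    {
        "sample_id": [0, 1, 2, 3],
        "split": ["train", "train", "test", "train"],
        "lite_version": [True, False, True, True],
    }
)


def audio_filename(sample_id: int) -> str:
    """Hypothetical helper: zero-pad the identifier to six digits."""
    return f"{sample_id:06}.flac"


# Keep only training samples that belong to the lite version.
lite_train = metadata[(metadata["split"] == "train") & metadata["lite_version"]]
filenames = [audio_filename(i) for i in lite_train["sample_id"]]
print(filenames)
```

The same boolean mask can be reused to subset any of the other columns described in this document.
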
- `acoustics/frequency_bins`
  - Type: `numpy.array` of 33 `float`s
  - Unit: hertz
  - Dataloader feature: `spatial_librispeech.Feature.FREQUENCY_BINS`
  - Mean frequency values of the third-octave bins used for all acoustic features.
- `acoustics/c50_db`
  - Type: `numpy.array` of 33 `float`s
  - Unit: decibels
  - Dataloader feature: `spatial_librispeech.Feature.C50_DB`
  - Third-octave values of speech clarity (C50) in decibels.
- `acoustics/drr_db`
  - Type: `numpy.array` of 33 `float`s
  - Unit: decibels
  - Dataloader feature: `spatial_librispeech.Feature.DRR_DB`
  - Third-octave values of direct-to-reverberant ratio (DRR) in decibels.
- `acoustics/edt_ms`
  - Type: `numpy.array` of 33 `float`s
  - Unit: milliseconds
  - Dataloader feature: `spatial_librispeech.Feature.EDT_MS`
  - Third-octave values of early decay time (EDT) in milliseconds.
- `acoustics/t20_ms`
  - Type: `numpy.array` of 33 `float`s
  - Unit: milliseconds
  - Dataloader feature: `spatial_librispeech.Feature.T20_MS`
  - Third-octave values of 20 dB decay duration (T20) in milliseconds.
- `acoustics/t30_ms`
  - Type: `numpy.array` of 33 `float`s
  - Unit: milliseconds
  - Dataloader feature: `spatial_librispeech.Feature.T30_MS`
  - Third-octave values of 30 dB decay duration (T30) in milliseconds.
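
Because every acoustic column holds one array per sample, aligned with `acoustics/frequency_bins`, a common need is to read a single band. The sketch below picks the band closest to 1 kHz. The rows use three made-up bands instead of 33, and `value_at_frequency` is a hypothetical helper, not part of the dataloader.

```python
import numpy as np
import pandas as pd

# Stand-in rows with three bands instead of 33; all values are made up.
metadata = pd.DataFrame(
    {
        "acoustics/frequency_bins": [
            np.array([500.0, 1000.0, 2000.0]),
            np.array([500.0, 1000.0, 2000.0]),
        ],
        "acoustics/t30_ms": [
            np.array([620.0, 540.0, 480.0]),
            np.array([910.0, 850.0, 700.0]),
        ],
    }
)


def value_at_frequency(bins, values, target_hz: float) -> float:
    """Hypothetical helper: return the band value whose bin is closest to target_hz."""
    bins = np.asarray(bins, dtype=float)
    band = int(np.argmin(np.abs(bins - target_hz)))
    return float(np.asarray(values, dtype=float)[band])


# One T30 value per sample, taken from the band nearest 1 kHz.
t30_1khz_ms = [
    value_at_frequency(bins, values, 1000.0)
    for bins, values in zip(
        metadata["acoustics/frequency_bins"], metadata["acoustics/t30_ms"]
    )
]
print(t30_1khz_ms)
```
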
- `audio_info/duration`
  - Type: `float`
  - Unit: seconds
  - Sample duration in seconds.
- `audio_info/frames`
  - Type: `int`
  - Unit: frames
  - Number of frames in the audio sample. Note that the sample rate is 16 kHz.
- `audio_info/size/ambisonics`
  - Type: `int`
  - Unit: bytes
  - Size of speech ambisonics sample in bytes.
- `audio_info/size/noise_ambisonics`
  - Type: `int`
  - Unit: bytes
  - Size of distractor noise ambisonics sample in bytes.
- `audio_info/checksum/ambisonics`
  - Type: `string`
  - Hexadecimal representation of the SHA-256 checksum of the speech ambisonics sample.
- `audio_info/checksum/noise_ambisonics`
  - Type: `string`
  - Hexadecimal representation of the SHA-256 checksum of the distractor noise ambisonics sample.
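
The size and checksum columns can be used to check that a downloaded file is intact. The following is a minimal sketch using only the Python standard library: `matches_metadata` and `duration_from_frames` are hypothetical helpers, and the temporary file holds placeholder bytes rather than real audio.

```python
import hashlib
import tempfile
from pathlib import Path

SAMPLE_RATE_HZ = 16_000  # sample rate stated for audio_info/frames


def matches_metadata(path: Path, expected_size: int, expected_sha256: str) -> bool:
    """Hypothetical helper: compare a file with its size and checksum entries."""
    data = path.read_bytes()
    if len(data) != expected_size:
        return False
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


def duration_from_frames(frames: int) -> float:
    """Duration in seconds implied by a frame count at 16 kHz."""
    return frames / SAMPLE_RATE_HZ


# Demonstration on placeholder bytes written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "000000.flac"
    payload = b"placeholder bytes, not real audio"
    demo_file.write_bytes(payload)
    intact = matches_metadata(
        demo_file, len(payload), hashlib.sha256(payload).hexdigest()
    )
    corrupted = matches_metadata(demo_file, len(payload), "0" * 64)

print(intact, corrupted, duration_from_frames(48_000))
```
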
- `speech/azimuth`
  - Type: `float`
  - Unit: radians
  - Dataloader feature: `spatial_librispeech.Feature.AZIMUTH`
  - Horizontal angle between speech source and microphone array.
- `speech/elevation`
  - Type: `float`
  - Unit: radians
  - Dataloader feature: `spatial_librispeech.Feature.ELEVATION`
  - Vertical angle between speech source and microphone array.
- `speech/distance`
  - Type: `float`
  - Unit: meters
  - Dataloader feature: `spatial_librispeech.Feature.DISTANCE`
  - Distance between speech source and microphone array.
- `speech/speaking_azimuth`
  - Type: `float`
  - Unit: radians
  - Dataloader feature: `spatial_librispeech.Feature.SPEAKING_AZIMUTH`
  - Horizontal rotation of speech source with respect to microphone array. Zero is in direct line of the array.
- `speech/speaking_elevation`
  - Type: `float`
  - Unit: radians
  - Dataloader feature: `spatial_librispeech.Feature.SPEAKING_ELEVATION`
  - Vertical rotation of speech source with respect to microphone array. Zero is in direct line of the array.
- `speech/mrp`
  - Type: `float`
  - Unit: decibels at random active speech level (dB-ASL)
  - Signal power for speech at the mouth reference point (25 mm in front of the lip plane; cf. ITU-T P.58).
- `speech/source_id`
  - Type: `float`
  - Range: [1, 20]
  - Numeric identifier for speech source location in the room.
- `speech/directivity_id`
  - Type: `int`
  - Range: [0, 15]
  - Dataloader feature: `spatial_librispeech.Feature.DIRECTIVITY_ID`
  - Numeric identifier for the different directivity profiles applied to speech.
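
Azimuth, elevation, and distance together describe the source position in spherical coordinates centred on the microphone array. The sketch below converts them to Cartesian coordinates. The axis convention (x forward, y to the left, z up, azimuth measured counter-clockwise from x) is an assumption made for illustration, since this document does not specify one, and `to_cartesian` is a hypothetical helper.

```python
import math


def to_cartesian(azimuth: float, elevation: float, distance: float):
    """Convert angles in radians and a distance in meters to (x, y, z).

    Assumed convention (not specified by the schema): x points forward,
    y to the left, z up, and azimuth grows counter-clockwise from x.
    """
    horizontal = distance * math.cos(elevation)
    x = horizontal * math.cos(azimuth)
    y = horizontal * math.sin(azimuth)
    z = distance * math.sin(elevation)
    return x, y, z


# A source 2 m away, 90 degrees to the side, 30 degrees above the horizon.
x, y, z = to_cartesian(math.pi / 2, math.pi / 6, 2.0)
print(round(x, 3), round(y, 3), round(z, 3))
```

The same conversion applies to `noise/azimuth`, `noise/elevation`, and `noise/distance`, under the same assumed convention.
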
- `speech/librispeech_metadata/book_id`
  - Type: `int`
  - Dataloader feature: `spatial_librispeech.Feature.BOOK_ID`
  - Numeric identifier for book being read.
- `speech/librispeech_metadata/chapter_id`
  - Type: `int`
  - Dataloader feature: `spatial_librispeech.Feature.CHAPTER_ID`
  - Numeric identifier for chapter being read.
- `speech/librispeech_metadata/chapter_title`
  - Type: `string`
  - Name of chapter being read.
- `speech/librispeech_metadata/project_id`
  - Type: `int`
  - Dataloader feature: `spatial_librispeech.Feature.PROJECT_ID`
  - Numeric identifier for project being read.
- `speech/librispeech_metadata/project_title`
  - Type: `string`
  - Name of project being read.
- `speech/librispeech_metadata/reader_id`
  - Type: `int`
  - Numeric identifier for reader.
- `speech/librispeech_metadata/reader_name`
  - Type: `string`
  - Name (or alias) of reader.
- `speech/librispeech_metadata/reader_sex`
  - Type: `string`
  - Dataloader feature: `spatial_librispeech.Feature.READER_SEX`
  - Values: [`m`, `f`]
  - Sex of reader.
- `speech/librispeech_metadata/sequence_number`
  - Type: `int`
  - Dataloader feature: `spatial_librispeech.Feature.SEQUENCE_NUMBER`
  - Sequence identifier for utterance (ordered within the same project, book, and chapter).
- `speech/librispeech_metadata/transcription`
  - Type: `string`
  - Dataloader feature: `spatial_librispeech.Feature.TRANSCRIPTION`
  - Transcription of text being read.
- `speech/librispeech_metadata/subset`
  - Type: `string`
  - Values: [`train-clean-100`, `train-clean-360`, `test-clean`, `test-other`]
  - Original LibriSpeech subset for utterance.
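
Since `sequence_number` orders utterances within the same project, book, and chapter, those columns can be used to put transcriptions back in reading order. The snippet below is a sketch on made-up rows. It assumes nothing about whether every utterance of a chapter is present in the dataset, so gaps in the sequence are possible.

```python
import pandas as pd

PREFIX = "speech/librispeech_metadata/"

# Made-up rows: three utterances from one chapter, stored out of order.
metadata = pd.DataFrame(
    {
        PREFIX + "project_id": [10, 10, 10],
        PREFIX + "book_id": [7, 7, 7],
        PREFIX + "chapter_id": [1, 1, 1],
        PREFIX + "sequence_number": [2, 0, 1],
        PREFIX + "transcription": ["THIRD", "FIRST", "SECOND"],
    }
)

keys = [PREFIX + name for name in ("project_id", "book_id", "chapter_id")]

# Sort by the grouping keys first, then by position within the chapter.
ordered = metadata.sort_values(keys + [PREFIX + "sequence_number"])

# Join the transcriptions of each chapter in reading order.
chapter_text = ordered.groupby(keys)[PREFIX + "transcription"].agg(" ".join)
print(chapter_text.iloc[0])
```
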
- `noise/azimuth`
  - Type: `float`
  - Unit: radians
  - Dataloader feature: `spatial_librispeech.Feature.NOISE_AZIMUTH`
  - Horizontal angle between distractor noise source and microphone array.
- `noise/elevation`
  - Type: `float`
  - Unit: radians
  - Dataloader feature: `spatial_librispeech.Feature.NOISE_ELEVATION`
  - Vertical angle between distractor noise source and microphone array.
- `noise/distance`
  - Type: `float`
  - Unit: meters
  - Dataloader feature: `spatial_librispeech.Feature.NOISE_DISTANCE`
  - Distance between distractor noise source and microphone array.
- `noise/snr`
  - Type: `float`
  - Unit: decibels
  - Dataloader feature: `spatial_librispeech.Feature.SNR_DB`
  - Signal-to-noise ratio for distractor noise in decibels.
- `noise/source_id`
  - Type: `float`
  - Range: [1, 20]
  - Numeric identifier for distractor noise source location in the room.
- `noise/deep_noise_suppression_metadata/filename`
  - Type: `string`
  - Filename of noise in deep noise suppression.
- `noise/deep_noise_suppression_metadata/is_audioset`
  - Type: `boolean`
  - If true, the noise is part of the Audioset dataset; otherwise it is part of the Freesound dataset.
- `noise/deep_noise_suppression_metadata/is_audioset`
  - Type: `string`
  - Comma-separated labels for the noise. For Audioset, you may need to consult a lookup table.
- `noise/deep_noise_suppression_metadata/youtube_id`
  - Type: `string`
  - Only available for Audioset samples. YouTube ID of the video from which the noise was extracted.
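
The noise columns make it straightforward to select samples by noise condition. As a sketch, the snippet below keeps the samples whose signal-to-noise ratio is under a threshold and counts them by noise origin. The 10 dB threshold and the rows are made up for illustration.

```python
import pandas as pd

DNS = "noise/deep_noise_suppression_metadata/"

# Made-up rows; only the columns used here are included.
metadata = pd.DataFrame(
    {
        "noise/snr": [3.5, 12.0, 25.0, 8.0],
        DNS + "is_audioset": [True, False, True, False],
    }
)

SNR_THRESHOLD_DB = 10.0  # arbitrary value chosen for this example

# Samples where the distractor noise is comparatively loud.
low_snr = metadata[metadata["noise/snr"] < SNR_THRESHOLD_DB]

# Translate the boolean flag into the dataset name it stands for.
origin = low_snr[DNS + "is_audioset"].map({True: "Audioset", False: "Freesound"})
counts = origin.value_counts().to_dict()
print(counts)
```
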
- `room/room_id`
  - Type: `int`
  - Dataloader feature: `spatial_librispeech.Feature.ROOM_ID`
  - Numeric identifier for simulated room.
- `room/floor_area`
  - Type: `float`
  - Unit: square meters
  - Room's floor surface area.
- `room/surface_area`
  - Type: `float`
  - Unit: square meters
  - Sum of the areas of all surfaces (floor, walls, ceiling) in the room.
- `room/volume`
  - Type: `float`
  - Unit: cubic meters
  - Room's total volume.
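
The three room measurements can be combined into rough derived quantities. The sketch below estimates a mean ceiling height and the total wall area. Both helpers are hypothetical, and both assume a room with vertical walls and a flat ceiling the same size as the floor, which this document does not guarantee for the simulated rooms.

```python
def mean_ceiling_height(volume_m3: float, floor_area_m2: float) -> float:
    """Hypothetical helper: average height, assuming a flat floor and ceiling."""
    if floor_area_m2 <= 0:
        raise ValueError("floor area must be positive")
    return volume_m3 / floor_area_m2


def wall_area(surface_area_m2: float, floor_area_m2: float) -> float:
    """Hypothetical helper: wall area, assuming the ceiling area equals the floor area."""
    return surface_area_m2 - 2.0 * floor_area_m2


# Made-up shoebox room of 4 m x 5 m x 3 m:
# floor 20 m^2, total surface 94 m^2, volume 60 m^3.
height_m = mean_ceiling_height(60.0, 20.0)
walls_m2 = wall_area(94.0, 20.0)
print(height_m, walls_m2)
```
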