Thank you so much for your wonderful work. Besides #19, I have found another issue with this dataset.
Problem Overview of NavQA dataset
ALL position-related questions in the NavQA dataset have identical ground truth answers, which is a data annotation error that affects evaluation accuracy.
Cause Analysis
- Time Parsing Issue
Human annotations: timestamps such as 7:57:53 (7:57 AM)
Actual caption time range: 10:49:45 to 11:03:17 (10:49 to 11:03 AM)
Parsed timestamps: 1673873873.0 (corresponding to 7:57:53 AM)
Problem: the parsed timestamps fall about 3 hours before the start of the caption time range
- Caption Index Lookup Failure
Because the parsed timestamps precede every caption, np.argmax(diff > 0) - 1 returns -1
All position questions fail to find corresponding captions
context_captions becomes an empty list
- Fallback Mechanism Issue
When context_captions is empty, the code uses a fallback mechanism
All position questions end up using the same caption (index 162)
This caption has position: [-0.39364902000000007, 0.0023255999999999897, -0.011701379999999999]
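The chain above can be reproduced in isolation. This is a minimal sketch, not the repository's code. It assumes that `diff` is the caption times minus the annotated time (the issue only quotes the `np.argmax(diff > 0) - 1` expression), that annotations and captions share a date, and that local time is UTC-5, which is the offset at which 1673873873.0 reads as 7:57:53. The middle caption time is made up; the first and last are the reported range endpoints.

```python
from datetime import datetime, timedelta, timezone

import numpy as np

# Assumption: local time is UTC-5, so 1673873873.0 reads as 07:57:53.
LOCAL_TZ = timezone(timedelta(hours=-5))

parsed_ts = 1673873873.0
parsed_dt = datetime.fromtimestamp(parsed_ts, tz=LOCAL_TZ)

# Caption times spanning the reported range (same date assumed).
# The middle entry is hypothetical filler.
caption_dts = [
    parsed_dt.replace(hour=10, minute=49, second=45),
    parsed_dt.replace(hour=10, minute=56, second=30),
    parsed_dt.replace(hour=11, minute=3, second=17),
]
caption_ts = np.array([dt.timestamp() for dt in caption_dts])

# Gap between the annotated time and the first caption, in seconds.
gap_seconds = float(caption_ts[0] - parsed_ts)

# Assumed definition of diff: caption time minus annotated time.
diff = caption_ts - parsed_ts

# Every caption is later than the annotation, so (diff > 0) is all True,
# np.argmax returns the first index (0), and subtracting 1 gives -1.
caption_index = int(np.argmax(diff > 0)) - 1

print(parsed_dt.strftime("%H:%M:%S"), gap_seconds, caption_index)
```

Under these assumptions the gap is 10312 seconds (just under 3 hours) and the lookup index is -1, matching the behavior described above.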
Data Flow Analysis
    Human CSV annotation → Time parsing → Caption lookup → Position extraction
             ↓                  ↓               ↓                   ↓
    "7:57:53"            → 1673873873.0 → No match       → Fallback to same position
Code Location
File: remembr/scripts/question_scripts/form_question_jsons.py
Function: parse_answer() (lines 60-64):
    elif q_type == 'position':
        if len(context) == 1:
            out_dict = {
                'position': context[0]['position']  # Always same position
            }
Impact
Evaluation accuracy: all position questions share one ground-truth position, so scores on position questions do not reflect whether a model found the correct location
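A quick way to confirm this on the generated question files is to count distinct ground-truth positions. The sketch below is illustrative only: the record layout (`type`, `answer`, `position` keys) and the function name `count_position_answers` are hypothetical and would need to be adapted to the actual JSON schema.

```python
from collections import Counter


def count_position_answers(questions):
    """Count how often each ground-truth position occurs.

    Hypothetical helper: assumes each record is a dict with a "type" key
    and, for position questions, an "answer" dict holding "position".
    """
    return Counter(
        tuple(q["answer"]["position"])
        for q in questions
        if q["type"] == "position"
    )


# Made-up records using the position reported in this issue.
reported = [-0.39364902000000007, 0.0023255999999999897, -0.011701379999999999]
questions = [
    {"type": "position", "answer": {"position": reported}},
    {"type": "position", "answer": {"position": reported}},
    {"type": "time", "answer": {"time": "7:57:53"}},
]

counts = count_position_answers(questions)
if len(counts) == 1 and sum(counts.values()) > 1:
    print("All position questions share one ground-truth position.")
```

A single key in `counts` across many position questions is the symptom described here.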
Current Status
The script runs without errors but resolves every position question to the same caption (index 162), with a position error of about 200 m.
This is a data quality issue that needs to be addressed at the annotation level rather than just the code level.
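Until the annotations are corrected, a lookup that fails loudly would at least surface the mismatch instead of silently falling back. The following is a rough sketch, not the repository's implementation: `find_caption_index` is a hypothetical helper, and it assumes caption start times are sorted in ascending order.

```python
from bisect import bisect_right


def find_caption_index(caption_times, query_time):
    """Return the index of the last caption starting at or before query_time.

    Hypothetical helper. Assumes caption_times is sorted ascending.
    Raises ValueError when the query lies outside the caption range,
    rather than returning -1 or falling back to a default caption.
    """
    if not caption_times:
        raise ValueError("no captions available")
    if query_time < caption_times[0] or query_time > caption_times[-1]:
        raise ValueError(
            f"query time {query_time} is outside the caption range "
            f"[{caption_times[0]}, {caption_times[-1]}]"
        )
    return bisect_right(caption_times, query_time) - 1


# Made-up times in seconds, for illustration only.
captions = [10.0, 20.0, 30.0]
print(find_caption_index(captions, 25.0))

try:
    find_caption_index(captions, 5.0)  # earlier than every caption
except ValueError as err:
    print(f"lookup rejected: {err}")
```

With a guard like this, each mismatched annotation would be reported when the question JSONs are built, which makes the affected rows easy to list for re-annotation.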