The Iceberg test data includes a complex nested reviews array field. This document explains how reviews are generated and how to cross-validate query results against expected counts.
num_reviews_per_document = (document_id % 5) + 1This creates a predictable pattern:
- Document ID 0, 5, 10, 15, ... → 1 review
- Document ID 1, 6, 11, 16, ... → 2 reviews
- Document ID 2, 7, 12, 17, ... → 3 reviews
- Document ID 3, 8, 13, 18, ... → 4 reviews
- Document ID 4, 9, 14, 19, ... → 5 reviews
| Pattern | Review Count | # of Documents | Total Reviews |
|---|---|---|---|
| id % 5 == 0 | 1 | 2,000 | 2,000 |
| id % 5 == 1 | 2 | 2,000 | 4,000 |
| id % 5 == 2 | 3 | 2,000 | 6,000 |
| id % 5 == 3 | 4 | 2,000 | 8,000 |
| id % 5 == 4 | 5 | 2,000 | 10,000 |
| TOTAL | Avg: 3 | 10,000 | 30,000 |
Each review object has this structure:
{
"date": "2025-11-13 01:08:32",
"author": "Alexander Stiedemann",
"ratings": {
"Value": 0,
"Cleanliness": 2,
"Overall": 3,
"Check in / front desk": 1,
"Rooms": 2
}
}-
Date: Weekly intervals starting from 2025-11-13
- Review 0: 2025-11-13 01:08:32
- Review 1: 2025-11-20 01:08:32 (+1 week)
- Review 2: 2025-11-27 01:08:32 (+2 weeks)
- And so on...
-
Author: Deterministic based on
(document_id + review_index)- Cycles through first/last name combinations
- Example: Doc 0 Review 0 → "Alexander Stiedemann"
- Example: Doc 1 Review 1 → "Leoma Kuphal"
-
Ratings: 5 rating categories with values 0-4
- Deterministic based on
(document_id + review_index) % 5 - Each rating field offset by different amounts
- Overall rating is most important for aggregations
- Deterministic based on
SELECT id, name, reviews
FROM iceberg_external
WHERE id IN [0, 1, 2, 3, 4]
ORDER BY idExpected Results:
Doc 0: 1 review (array length = 1)
Doc 1: 2 reviews (array length = 2)
Doc 2: 3 reviews (array length = 3)
Doc 3: 4 reviews (array length = 4)
Doc 4: 5 reviews (array length = 5)
Total: 5 documents, 15 review objects (1+2+3+4+5)
Validation Code:
for row in results:
expected = (row['id'] % 5) + 1
actual = len(row['reviews'])
assert actual == expected, f"Doc {row['id']}: expected {expected}, got {actual}"SELECT h.id, h.name, r.author, r.ratings.Overall as overall_rating
FROM iceberg_external h
UNNEST h.reviews r
WHERE h.id < 10
ORDER BY h.id, r.dateExpected Results:
30 rows total (1+2+3+4+5+1+2+3+4+5)
Doc 0: 1 unnested row
Doc 1: 2 unnested rows
Doc 2: 3 unnested rows
Doc 3: 4 unnested rows
Doc 4: 5 unnested rows
Doc 5: 1 unnested row
Doc 6: 2 unnested rows
Doc 7: 3 unnested rows
Doc 8: 4 unnested rows
Doc 9: 5 unnested rows
Validation Code:
def calculate_expected_review_count(doc_ids):
return sum([(doc_id % 5) + 1 for doc_id in doc_ids])
expected = calculate_expected_review_count(range(10)) # = 30
actual = len(results)
assert actual == expected, f"Expected {expected} unnested reviews, got {actual}"SELECT h.type, COUNT(r) as review_count
FROM iceberg_external h
UNNEST h.reviews r
WHERE h.id < 100
GROUP BY h.type
ORDER BY h.typeExpected Results:
For IDs 0-99 (100 documents):
- Total reviews: sum((i % 5) + 1 for i in range(100)) = 300
By type (type cycles every 5):
- Hotel (id % 5 == 0): 20 docs × 1 review = 20 reviews
- Hostel (id % 5 == 1): 20 docs × 2 reviews = 40 reviews
- Resort (id % 5 == 2): 20 docs × 3 reviews = 60 reviews
- Motel (id % 5 == 3): 20 docs × 4 reviews = 80 reviews
- Inn (id % 5 == 4): 20 docs × 5 reviews = 100 reviews
Total: 20+40+60+80+100 = 300 reviews ✓
SELECT h.id, AVG(r.ratings.Overall) as avg_overall
FROM iceberg_external h
UNNEST h.reviews r
WHERE h.id BETWEEN 0 AND 4
GROUP BY h.id
ORDER BY h.idExpected Results:
Doc 0: 1 review with Overall=3 → avg = 3.0
Doc 1: 2 reviews with Overall=[4,0] → avg = 2.0
Doc 2: 3 reviews with Overall=[0,1,2] → avg = 1.0
Doc 3: 4 reviews with Overall=[1,2,3,4] → avg = 2.5
Doc 4: 5 reviews with Overall=[2,3,4,0,1] → avg = 2.0
The test class provides helper methods for validation:
def _get_expected_reviews_for_doc(self, doc_id):
"""Get expected number of reviews for a specific document ID."""
return (doc_id % 5) + 1
def _calculate_expected_review_count(self, doc_ids):
"""Calculate expected total review count for given document IDs."""
return sum([(doc_id % 5) + 1 for doc_id in doc_ids])SELECT id, ARRAY_LENGTH(reviews) as review_count
FROM iceberg_external
WHERE id < 20
ORDER BY idSELECT h.id, r.author, r.date
FROM iceberg_external h
UNNEST h.reviews r
WHERE h.id < 5
ORDER BY h.id, r.dateSELECT h.id, h.name, r.ratings.Overall
FROM iceberg_external h
UNNEST h.reviews r
WHERE r.ratings.Overall >= 4 AND h.id < 100
ORDER BY h.idSELECT id, name, ARRAY_LENGTH(reviews) as review_count
FROM iceberg_external
WHERE ARRAY_LENGTH(reviews) = 5 AND id < 100
ORDER BY idExpected: IDs 4, 9, 14, 19, 24, 29, ... (20 documents)
If you get unexpected review counts:
- Check document ID range: Make sure you're querying the expected IDs
- Verify UNNEST syntax: Ensure proper UNNEST and alias usage
- Check filter conditions: WHERE clauses can change expected counts
- Validate calculation: Use helper methods to compute expected values
-- This should return exactly 30 reviews for IDs 0-9
SELECT COUNT(*) as total_reviews
FROM iceberg_external h
UNNEST h.reviews r
WHERE h.id < 10If result ≠ 30, check:
- Data generation logic
- Iceberg table creation
- Query syntax
The reviews array provides rich data for testing:
- Deterministic: Predictable counts based on document ID
- Nested structure: Tests complex JSON handling
- Cross-validation: Easy to verify correctness
- Realistic: Mimics real hotel review data
Total across all 10,000 documents: 30,000 review objects