Commit 657dab7

Merge pull request #22 from Bilpapster/compact-to-native
[FEATURE] Handling compact data representations for all connectors with minimum configurations
2 parents 0395e05 + 3847883 commit 657dab7

10 files changed

Lines changed: 923 additions & 19 deletions

File tree

docs/concepts/compact-vs-native-data.rst

Lines changed: 417 additions & 0 deletions
Large diffs are not rendered by default.

docs/concepts/data-quality.rst

Lines changed: 1 addition & 0 deletions
@@ -289,6 +289,7 @@ What's Next?

 Now that you understand how data quality concepts evolve for streaming data:

+- 📊 **Understand data formats**: :doc:`compact-vs-native-data` - How Stream DaQ handles different data representations seamlessly
 - 🪟 **Learn about windowing**: :doc:`stream-windows` - How to make infinite streams manageable
 - 📏 **Explore measures**: :doc:`measures-and-assessments` - The building blocks of Stream DaQ quality checks
 - 💡 **See it in action**: :doc:`../examples/index` - Real-world quality monitoring examples

docs/concepts/index.rst

Lines changed: 8 additions & 0 deletions
@@ -40,6 +40,13 @@ Welcome to the conceptual heart of Stream DaQ! Understanding these core concepts

         **Stream processing principles** - Understand late arrivals, watermarks, and how Stream DaQ handles the complexity of real-time data.

+    .. grid-item-card:: 📊 Compact vs Native Data
+        :link: compact-vs-native-data
+        :link-type: doc
+        :class-header: bg-secondary text-white
+
+        **Data format strategies** - Learn when to use compact vs native formats and how Stream DaQ handles both seamlessly.
+
 The Big Picture
 ---------------------

@@ -168,5 +175,6 @@ Ready to dive deeper? Start with :doc:`data-quality` to understand why streaming
    stream-windows
    measures-and-assessments
    real-time-monitoring
+   compact-vs-native-data

 |made_with_love|

docs/concepts/stream-windows.rst

Lines changed: 1 addition & 0 deletions
@@ -454,6 +454,7 @@ What's Next?

 Now that you understand how to slice infinite streams into manageable windows:

+- 📊 **Understand data formats**: :doc:`compact-vs-native-data` - How different data formats work seamlessly with windows
 - 📏 **Learn about measures**: :doc:`measures-and-assessments` - What to calculate within each window
 - ⚡ **Explore real-time concepts**: :doc:`real-time-monitoring` - Production considerations for windowed monitoring
 - 💡 **See windowing in action**: :doc:`../examples/index` - Real-world windowing patterns

docs/examples/advanced-examples.rst

Lines changed: 133 additions & 0 deletions
@@ -1,6 +1,139 @@
 🧙‍♂️ Advanced Examples
 =============================

+Compact Data Monitoring Example
+--------------------------------
+
+Stream DaQ provides seamless support for compact data formats commonly used in IoT and resource-constrained environments. Instead of manually transforming compact data into individual records, Stream DaQ handles this automatically, allowing you to focus on defining meaningful quality measures.
+
+.. seealso::
+
+    For conceptual background on compact vs native data formats, see :doc:`../concepts/compact-vs-native-data`.
+
+**What makes data "compact"?**
+
+Compact data represents multiple field values in a single record, typically using arrays or lists. This format is prevalent in IoT scenarios because it:
+
+- **Reduces bandwidth usage** by ~60% compared to individual field transmissions
+- **Minimizes storage requirements** on resource-constrained devices
+- **Enables efficient batch transmission** of multiple sensor readings
+- **Optimizes network protocols** for wireless sensor networks
+
+**Common IoT scenarios using compact data:**
+
+- Environmental monitoring stations (temperature, humidity, pressure)
+- Industrial sensor networks (vibration, temperature, speed)
+- Smart building systems (occupancy, air quality, energy usage)
+- Vehicle telemetry (GPS coordinates, speed, fuel consumption, engine metrics)
+
+.. code-block:: python
+
+    # pip install streamdaq
+
+    import pathway as pw
+    from streamdaq import DaQMeasures as dqm
+    from streamdaq import CompactData, Windows, StreamDaQ
+
+    # Configuration for compact IoT sensor data
+    FIELDS_COLUMN = "fields"
+    FIELDS = ["temperature", "humidity", "pressure"]  # IoT sensor measurements
+    VALUES_COLUMN = "values"
+    TIMESTAMP_COLUMN = "timestamp"
+
+    # Example compact data source (simulating IoT sensor network)
+    class CompactDataSource(pw.io.python.ConnectorSubject):
+        """Simulates IoT sensors sending compact data format."""
+        def run(self):
+            nof_fields = len(FIELDS)
+            nof_compact_rows = 5
+            timestamp = value = 0
+            for _ in range(nof_compact_rows):
+                message = {
+                    TIMESTAMP_COLUMN: timestamp,
+                    FIELDS_COLUMN: FIELDS,
+                    VALUES_COLUMN: [value + i for i in range(nof_fields)]
+                }
+                value += len(FIELDS)
+                timestamp += 1
+                self.next(**message)
+
+    # Define schema for compact data structure
+    schema_dict = {
+        TIMESTAMP_COLUMN: int,
+        FIELDS_COLUMN: list[str],
+        VALUES_COLUMN: list[int | None]  # Supports missing values
+    }
+    schema = pw.schema_from_dict(schema_dict)
+
+    # Create compact data stream
+    compact_data_stream = pw.io.python.read(
+        CompactDataSource(),
+        schema=schema,
+    )
+
+    # Configure Stream DaQ for automatic compact data handling
+    daq = StreamDaQ().configure(
+        window=Windows.sliding(duration=3, hop=1, origin=0),
+        source=compact_data_stream,
+        time_column=TIMESTAMP_COLUMN,
+        wait_for_late=1,  # Handle late IoT data arrivals
+
+        # Stream DaQ automatically transforms compact to native format
+        compact_data=CompactData()
+            .with_fields_column(FIELDS_COLUMN)
+            .with_values_column(VALUES_COLUMN)
+            .with_values_dtype(int)
+    )
+
+    # Define quality measures for individual sensor fields
+    # Notice: Direct field access despite compact input format!
+    daq.add(dqm.count('pressure'), name="readings") \
+        .add(dqm.missing_count('temperature') +
+             dqm.missing_count('pressure') +
+             dqm.missing_count('humidity'),
+             assess="<2", name="missing_readings") \
+        .add(dqm.is_frozen('humidity'), name="frozen_humidity_sensor")
+
+    # Start monitoring
+    daq.watch_out()
+
+**Stream DaQ's Automatic Transformation Benefits:**
+
+1. **No Manual Preprocessing**: Stream DaQ internally converts compact data to native format for quality analysis
+2. **Seamless Field Access**: Reference individual fields (``temperature``, ``humidity``, ``pressure``) directly in quality measures
+3. **Missing Value Handling**: Automatic support for ``None`` values common in real-world IoT scenarios
+4. **Type Safety**: Configurable data type handling with validation
+5. **Temporal Alignment**: Proper time-based windowing despite compact input format
+
+**Compact vs Native Data Comparison:**
+
+.. code-block:: json
+
+    // Compact format (1 record):
+    {
+        "timestamp": 1,
+        "fields": ["temperature", "humidity", "pressure"],
+        "values": [23.5, 65.2, 1013.25]
+    }
+
+    // Equivalent native format (1 record, one column per field):
+    {
+        "timestamp": 1, "temperature": 23.5, "humidity": 65.2, "pressure": 1013.25
+    }
+
+**Why This Matters for IoT:**
+
+Without Stream DaQ's automatic handling, you would typically need to:
+
+- Manually unpack compact rows into individual field records
+- Handle missing values and data type conversions
+- Manage temporal alignment across different fields
+- Write custom transformation logic before quality monitoring
+
+Stream DaQ eliminates this preprocessing pipeline, allowing you to focus on defining meaningful quality measures rather than data transformation logic. This is especially valuable in resource-constrained environments where development time and computational efficiency are critical.
+
+For a complete working example with detailed comments, see the ``examples/compact_data.py`` file in the examples directory. To understand the conceptual differences between compact and native data formats, see :doc:`../concepts/compact-vs-native-data`.
+
 Schema Validation Example
 --------------------------
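Note on the transformation described in this file: the unpacking step can be pictured with a minimal sketch in plain Python. This is not Stream DaQ's implementation. `compact_to_native` and its parameters are hypothetical names chosen for illustration, and the sketch assumes a native record carries one column per field, as in the docstring of `examples/compact_data.py`.

```python
from typing import Any


def compact_to_native(
    record: dict[str, Any],
    fields_column: str = "fields",
    values_column: str = "values",
) -> dict[str, Any]:
    """Unpack one compact record into a single native record (one column per field).

    Hypothetical helper for illustration only; not part of the streamdaq API.
    """
    fields = record[fields_column]
    values = record[values_column]
    if len(fields) != len(values):
        raise ValueError("fields and values must have the same length")

    # Keep every column that is not part of the compact encoding (e.g. the timestamp).
    native = {
        key: value
        for key, value in record.items()
        if key not in (fields_column, values_column)
    }
    # Pair each field name with its value; None marks a missing reading.
    native.update(zip(fields, values))
    return native


compact = {
    "timestamp": 1,
    "fields": ["temperature", "humidity", "pressure"],
    "values": [23.5, None, 1013.25],
}
native = compact_to_native(compact)
print(native)
# {'timestamp': 1, 'temperature': 23.5, 'humidity': None, 'pressure': 1013.25}
```

Once records have this shape, a measure such as `dqm.missing_count('humidity')` can address the field as an ordinary column, which is the behavior the documentation above describes.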

docs/examples/basic-examples.rst

Lines changed: 8 additions & 0 deletions
@@ -110,3 +110,11 @@ The trend measure calculates the slope of a linear regression line through the d
 Trend analysis complements traditional min-max and range checks for comprehensive data quality monitoring. While threshold checks validate current values, trend analysis ensures data consistency over time by detecting unexpected patterns or gradual shifts that could indicate sensor drift or measurement errors.

 Luckily, Stream DaQ offers a suite of over 30 data quality measures, including range conformance, profiling statistics, trend analysis and many more - making comprehensive data quality monitoring both powerful and effortless!
+
+**What's Next?**
+
+Ready for more advanced scenarios? Check out:
+
+- 🧙‍♂️ **Advanced Examples**: :doc:`advanced-examples` - Compact data handling, schema validation, and more
+- 📚 **Core Concepts**: :doc:`../concepts/index` - Deep dive into streaming data quality theory
+- 📊 **Data Formats**: :doc:`../concepts/compact-vs-native-data` - Understanding different data representations

examples/compact_data.py

Lines changed: 132 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,132 @@
1+
# pip install streamdaq
2+
3+
import pathway as pw
4+
from streamdaq import DaQMeasures as dqm
5+
from streamdaq import CompactData, Windows, StreamDaQ
6+
7+
# Configuration constants for compact data structure
8+
FIELDS_COLUMN = "fields"
9+
FIELDS = ["temperature", "humidity", "pressure"] # simulating IoT sensor measurements
10+
VALUES_COLUMN = "values"
11+
TIMESTAMP_COLUMN = "timestamp"
12+
13+
14+
# We first need to define a data source sending compact data.
15+
# If you already have one, skip this part!
16+
class CompactDataSource(pw.io.python.ConnectorSubject):
17+
"""
18+
Simulates an IoT sensor network sending compact data format.
19+
20+
Example compact format:
21+
{
22+
"timestamp": 1,
23+
"fields": ["temperature", "humidity", "pressure"],
24+
"values": [23.5, 65.2, 1013.25]
25+
}
26+
27+
vs. traditional native format:
28+
{"timestamp": 1, "temperature": 23.5, "humidity": 65.2, "pressure": 1013.25}
29+
"""
30+
31+
def run(self):
32+
nof_fields = len(FIELDS)
33+
nof_compact_rows = 5 # how many compact data rows to send in this simulation
34+
timestamp = value = 0
35+
for _ in range(nof_compact_rows):
36+
message = {
37+
TIMESTAMP_COLUMN: timestamp,
38+
FIELDS_COLUMN: FIELDS,
39+
VALUES_COLUMN: [value + i for i in range(nof_fields)]
40+
# VALUES_COLUMN: [value + i if (value + i) % 5 > 0 else None for i in range(nof_fields)]
41+
# replace with the above line to make it more spicy by adding a missing reading every five ;)
42+
}
43+
value += len(FIELDS)
44+
timestamp += 1
45+
self.next(**message)
46+
47+
48+
# Define schema for the compact data structure
49+
schema_dict = {
50+
TIMESTAMP_COLUMN: int,
51+
FIELDS_COLUMN: list[str],
52+
VALUES_COLUMN: list[int | None], # Supports missing values (None) for real-world scenarios
53+
}
54+
schema = pw.schema_from_dict(schema_dict)
55+
56+
# Create the compact data stream (simulating IoT sensor network)
57+
compact_data_stream = pw.io.python.read(
58+
CompactDataSource(),
59+
schema=schema,
60+
)
61+
62+
print("The initial data source sends compact data, like this:")
63+
pw.debug.compute_and_print(compact_data_stream)
64+
65+
# If you already have a compact data source, your job starts here!
66+
67+
# Step 1: Configure Stream DaQ for compact data monitoring
68+
# Stream DaQ automatically handles the transformation from compact to native format,
69+
# eliminating the need for manual data preprocessing that would typically require:
70+
# - Unpacking compact rows into individual field records
71+
# - Handling missing values and data type conversions
72+
# - Managing temporal alignment across different fields
73+
daq = StreamDaQ().configure(
74+
window=Windows.sliding(duration=3, hop=1, origin=0), # 3-second sliding window with 1-second hop
75+
source=compact_data_stream,
76+
time_column=TIMESTAMP_COLUMN,
77+
# Just define how your compact data is structured; Stream DaQ takes care of all the rest!
78+
# This CompactData configuration tells Stream DaQ how to interpret your format
79+
compact_data=CompactData()
80+
.with_fields_column(FIELDS_COLUMN)
81+
.with_values_column(VALUES_COLUMN)
82+
.with_values_dtype(int),
83+
)
84+
85+
# Step 2: Define data quality measures for IoT sensor monitoring
86+
# Notice how we can directly reference individual fields (temperature, humidity, pressure)
87+
# even though they arrive in compact format - Stream DaQ handles the unpacking automatically!
88+
daq.add(dqm.count("pressure"), name="readings") \
89+
.add(dqm.missing_count("temperature")
90+
+ dqm.missing_count("pressure")
91+
+ dqm.missing_count("humidity"), # Measures the total missing readings per window in all fields
92+
assess="<2", # We can tolerate at most one missing reading per window
93+
name="missing_readings",
94+
). \
95+
add(dqm.is_frozen("humidity"), name="frozen_humidity_sensor") # Detect stuck humidity sensor
96+
97+
# Complete list of Data Quality Measures (dqm): https://github.com/Bilpapster/stream-DaQ/blob/main/streamdaq/DaQMeasures.py
98+
99+
+# Step 3: Kick off monitoring and let Stream DaQ do the work while you focus on what matters
+daq.watch_out()
+
+# IoT Compact Data Monitoring Benefits:
+#
+# 1. Bandwidth Efficiency:
+#    - Compact format reduces network traffic by ~60% compared to individual field transmissions
+#    - Critical for battery-powered sensors with limited connectivity
+#
+# 2. Automatic Transformation:
+#    - Stream DaQ internally converts compact data to native format for quality analysis
+#    - No manual preprocessing required - just specify the compact data structure
+#    - Handles missing values, data types, and temporal alignment automatically
+#
+# 3. Real-World IoT Scenarios:
+#    - Environmental monitoring stations (temperature, humidity, pressure)
+#    - Industrial sensor networks (vibration, temperature, speed)
+#    - Smart building systems (occupancy, air quality, energy usage)
+#    - Vehicle telemetry (GPS, speed, fuel consumption, engine metrics)
+#
+# 4. Quality Monitoring Without Complexity:
+#    - Apply the same quality measures as native data streams
+#    - Detect sensor failures, missing readings, and data anomalies
+#    - Monitor trends and patterns across multiple sensor types simultaneously
+#
+# Stream DaQ's compact data handling eliminates the typical IoT data preprocessing
+# pipeline, allowing you to focus on defining meaningful quality measures rather
+# than data transformation logic. This is especially valuable in resource-constrained
+# environments where development time and computational efficiency are critical!
+#
+# 📚 Learn More:
+#    - Comprehensive compact data documentation: docs/examples/advanced-examples.rst
+#    - Conceptual background: docs/concepts/compact-vs-native-data.rst
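Note on the bandwidth figure in the trailing comments: the ~60% saving is not derived anywhere in this commit, and the real number depends on field names, value widths, and the wire encoding. One hedged way to estimate it for your own payloads is to compare serialized sizes. The sketch below assumes JSON on the wire and a hypothetical per-field message layout with `field` and `value` keys; neither is a Stream DaQ format.

```python
import json

FIELDS = ["temperature", "humidity", "pressure"]
VALUES = [23.5, 65.2, 1013.25]
TIMESTAMP = 1

# One compact message carrying all readings for this timestamp.
compact = {"timestamp": TIMESTAMP, "fields": FIELDS, "values": VALUES}

# Hypothetical alternative: one message per field, each repeating the timestamp.
per_field = [
    {"timestamp": TIMESTAMP, "field": name, "value": value}
    for name, value in zip(FIELDS, VALUES)
]


def encoded_size(message: dict) -> int:
    """Return the size in bytes of the message as whitespace-free UTF-8 JSON."""
    return len(json.dumps(message, separators=(",", ":")).encode("utf-8"))


compact_bytes = encoded_size(compact)
per_field_bytes = sum(encoded_size(message) for message in per_field)
saving = 1 - compact_bytes / per_field_bytes

print(f"compact: {compact_bytes} B, per-field: {per_field_bytes} B, saving: {saving:.0%}")
```

The sketch counts payload bytes only. Fixed per-message overhead such as protocol headers is paid once for the compact message and once per field otherwise, so it would widen the gap further.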
