
Commit a1598f3

add(capability): Data Processing & ETL (#957)
1 parent 973624e commit a1598f3

3 files changed: +141 -1 lines changed


architectures/rag-llm-app/README.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ This document maps the services used in a Retrieval-Augmented Generation (RAG) L
 
 | **Component** | **Google Cloud** | **Amazon Web Services (AWS)** | **Microsoft Azure** | **CCC Service** |
 | -------------------------- | ------------------------- | ----------------------------- | ------------------------------- | ------------------------------------------------ |
-| **ETL/Data Processing** | Dataflow | Glue, Lambda | Azure Data Factory | Data Processing (Service not yet defined) |
+| **ETL/Data Processing** | Dataflow | Glue | Azure Data Factory | [CCC.ETL](/catalogs/orchestration/etl/) |
 | **Workflow Orchestration** | Cloud Composer (Airflow) | Step Functions | Durable Functions, Logic Apps | Workflow Orchestration (Service not yet defined) |
 | **Chunking & Indexing** | Cloud Composer w/ Airflow | Glue jobs | Data Factory Mapping Data Flows | Data Processing (Service not yet defined) |
 

Lines changed: 116 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,116 @@
+imported-capabilities:
+  - reference-id: CCC
+    entries:
+      - reference-id: CCC.Core.CP01
+        remarks: Encryption in Transit
+      - reference-id: CCC.Core.CP02
+        remarks: Encryption at Rest
+      - reference-id: CCC.Core.CP03
+        remarks: Access Log Publication
+      - reference-id: CCC.Core.CP06
+        remarks: Access Control
+      - reference-id: CCC.Core.CP07
+        remarks: Event Publication
+      - reference-id: CCC.Core.CP09
+        remarks: Metrics Publication
+      - reference-id: CCC.Core.CP10
+        remarks: Log Publication
+      - reference-id: CCC.Core.CP14
+        remarks: API Access
+      - reference-id: CCC.Core.CP18
+        remarks: Resource Versioning
+      - reference-id: CCC.Core.CP20
+        remarks: Resource Tagging
+      - reference-id: CCC.Core.CP23
+        remarks: Network Access Rules
+      - reference-id: CCC.Core.CP28
+        remarks: Command-line Interface
+      - reference-id: CCC.Core.CP29
+        remarks: Active Ingestion
+      - reference-id: CCC.Core.CP31
+        remarks: Elastic Scaling
+
+capabilities:
+  - id: CCC.ETL.CP01
+    title: Batch Processing
+    description: |
+      Supports the processing of bounded (batch) data sources
+      using a consistent programming model or engine.
+
+  - id: CCC.ETL.CP02
+    title: Stream Processing
+    description: |
+      Supports the processing of unbounded (streaming) data sources
+      using a consistent programming model or engine.
+
+  - id: CCC.ETL.CP03
+    title: Schema Evolution
+    description: |
+      Automatically detects source data structures and manages changes in
+      schema (e.g., column additions) over time without pipeline failure.
+
+  - id: CCC.ETL.CP04
+    title: Distributed Data Shuffling
+    description: |
+      Provides an internal service to re-partition and group data across
+      distributed workers for complex operations like joins and aggregations.
+
+  - id: CCC.ETL.CP05
+    title: Windowing and Event-Time Processing
+    description: |
+      Enables grouping of data based on time attributes, supporting tumbling,
+      hopping, and session windows with late-data handling (watermarking).
+
+  - id: CCC.ETL.CP06
+    title: Change Data Capture (CDC) Integration
+    description: |
+      Supports incremental data ingestion by tracking changes in source
+      transaction logs rather than full table scans.
+
+  - id: CCC.ETL.CP07
+    title: Connectivity and Connector Library
+    description: |
+      Provides pre-built, managed connectors for a variety of sources and
+      sinks (e.g., Object Storage, RDBMS, NoSQL, Pub/Sub).
+
+  - id: CCC.ETL.CP08
+    title: Job Bookmarks
+    description: |
+      Persists the state of a processing job (e.g., checkpointing or bookmarks)
+      to ensure exactly-once processing and fault tolerance.
+
+  - id: CCC.ETL.CP09
+    title: Pushdown Optimization
+    description: |
+      The ability to translate transformation logic into the native language
+      of the source or sink (e.g., SQL) to minimize data movement.
+
+  - id: CCC.ETL.CP10
+    title: Visual Orchestration
+    description: |
+      Provides a graphical interface to define dependencies
+      between extraction, transformation, and loading tasks.
+
+  - id: CCC.ETL.CP11
+    title: Data Lineage & Metadata Tracking
+    description: |
+      Captures and exports metadata regarding the data sources,
+      the transformation steps, and the final destination (sink), showing the
+      flow from source to destination for compliance and debugging.
+
+  - id: CCC.ETL.CP12
+    title: User-Defined Function (UDF) Support
+    description: |
+      Allows developers to inject custom logic (Python, Java, SQL) into the
+      managed processing pipeline for complex transformations.
+
+  - id: CCC.ETL.CP13
+    title: Time-Based Job Triggering
+    description: |
+      Supports time-based (cron) mechanisms to initiate data processing workflows.
+
+  - id: CCC.ETL.CP14
+    title: Event-Based Job Triggering
+    description: |
+      Supports event-based (file arrival) mechanisms to initiate
+      data processing workflows.
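
Review note on CCC.ETL.CP05: the description names tumbling windows and watermark-based late-data handling without showing what that behaviour looks like. The following is a minimal Python sketch of one possible interpretation, not the semantics of any particular engine. `WINDOW_SIZE`, `ALLOWED_LATENESS`, and `tumbling_window_counts` are hypothetical names introduced only for illustration; the watermark is assumed to trail the largest event time seen so far by a fixed lag.

```python
from collections import defaultdict

WINDOW_SIZE = 60        # seconds; hypothetical tumbling-window length
ALLOWED_LATENESS = 30   # seconds; hypothetical lag of the watermark


def tumbling_window_counts(events):
    """Count events per tumbling window, dropping events behind the watermark.

    `events` is an iterable of (event_time_seconds, payload) in arrival order.
    Returns (counts keyed by window start, list of dropped late events).
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for event_time, payload in events:
        # Assumption: the watermark trails the largest event time seen so far.
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - ALLOWED_LATENESS

        # Tumbling windows are fixed-size and non-overlapping.
        window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        window_end = window_start + WINDOW_SIZE

        # A window whose end is at or before the watermark is treated as closed.
        if window_end <= watermark:
            dropped.append((event_time, payload))
            continue
        counts[window_start] += 1
    return dict(counts), dropped
```

In this sketch an out-of-order event is still counted while its window is open, and is reported as dropped once the watermark has passed the window end. A real service might route such events to a side output instead of discarding them.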
Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
+metadata:
+  id: CCC.ETL
+  title: CCC ETL
+  description: |
+    Service provides capabilities for extracting, transforming, and loading (ETL) data
+    across diverse sources and sinks. It supports batch and real-time streaming
+    architectures, managed data orchestration (DAGs), and serverless execution
+    engines to process large-scale datasets with built-in fault tolerance.
+  category-ids:
+    - CCC.Pipeline
+  version: ""
+  last-modified: "2026-02-12T00:00:00Z"
+  example-csp-services:
+    - provider: AWS
+      service: AWS Glue
+      url: https://docs.aws.amazon.com/glue/
+    - provider: Azure
+      service: Azure Data Factory
+      url: https://learn.microsoft.com/azure/data-factory/
+    - provider: GCP
+      service: Google Cloud Dataflow
+      url: https://cloud.google.com/dataflow/docs
+  applicability-categories: []
+  mapping-references: []
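
Review note on the "built-in fault tolerance" claim and CCC.ETL.CP08 (Job Bookmarks): a small sketch may help show what persisting job state implies in practice. The Python below is a hedged illustration only. `process_with_bookmark`, the `next_offset` field, and the JSON bookmark file are hypothetical and do not correspond to any CSP API. Note that this sketch yields at-least-once processing: if the job crashes after handling a record but before saving the bookmark, that record is handled again on restart, so exactly-once results additionally require an idempotent or transactional sink.

```python
import json
import os


def process_with_bookmark(records, bookmark_path, handle):
    """Process records after the saved offset, persisting progress as we go.

    Returns the number of records handled in this run.
    """
    start = 0
    if os.path.exists(bookmark_path):
        # Resume from the offset recorded by a previous run.
        with open(bookmark_path) as f:
            start = json.load(f)["next_offset"]

    for offset in range(start, len(records)):
        handle(records[offset])
        # Write to a temporary file and rename it, so a crash cannot
        # leave a half-written bookmark behind.
        tmp_path = bookmark_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_offset": offset + 1}, f)
        os.replace(tmp_path, bookmark_path)

    return len(records) - start
```

A second run over a longer input only handles the records past the saved offset, which is the incremental behaviour the capability describes.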
