67 changes: 65 additions & 2 deletions Readme.md
@@ -7,11 +7,19 @@ Stream Processing with Apache Flink

This repository contains the code for the book **[Stream Processing: Hands-on with Apache Flink](https://leanpub.com/streamprocessingwithapacheflink)**.

> **🚀 Project Status: Ready for Stream Processing!**
>
> ✅ **Infrastructure**: RedPanda (Kafka) + Flink cluster running
> ✅ **Project**: Compiled and packaged
> ✅ **Data Loaded**: 1M+ transactions, 5K+ customers, 4K+ accounts in Kafka topics
> ✅ **Next Step**: Ready to run Flink streaming jobs


### Table of Contents
1. [Environment Setup](#environment-setup)
2. [Project Setup and Data Loading](#project-setup-and-data-loading)
3. [Register UDF](#register-udf)
4. [Deploy a JAR file](#deploy-a-jar-file)


### Environment Setup
@@ -35,6 +43,61 @@ or this command for kafka setup


### Project Setup and Data Loading
This section documents the steps to compile the project and load data into Kafka for stream processing.

#### 1. Compile the Project
The project is a Maven-based Java application. To compile it:

```shell
# Compile only
mvn clean compile

# Compile and create JAR file
mvn clean package
```

#### 2. Load Data into Kafka
After starting the infrastructure (RedPanda/Kafka + Flink), you need to populate the topics with data:

**Run Transactions Producer:**
```shell
mvn exec:java -Dexec.mainClass="io.streamingledger.producers.TransactionsProducer"
```
This will:
- Load 1,000,000 transactions from `/data/transactions.csv`
- Send them to the `transactions` topic
- Use optimized producer settings (batch size: 64 KB, compression: gzip), as sketched below
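
For context, here is a minimal sketch of a producer configured with the settings described above. This is illustrative only, not the repository's actual `TransactionsProducer` code, and the bootstrap address is an assumption you should adjust to your setup:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigSketch {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        // Assumption: adjust the bootstrap address to your RedPanda port mapping
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:19092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // The settings described above: 64 KB batches, gzip compression
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
        return new KafkaProducer<>(props);
    }
}
```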

**Run State Producer:**
```shell
mvn exec:java -Dexec.mainClass="io.streamingledger.producers.StateProducer"
```
This will:
- Load 5,369 customers from `/data/customers.csv` → `customers` topic
- Load 4,500 accounts from `/data/accounts.csv` → `accounts` topic (both loads follow the pattern sketched below)
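
Both producers follow the same read-and-send shape. A hedged sketch of that loop, where the CSV layout and the key choice (first column as record key) are assumptions rather than the book's exact code:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CsvLoaderSketch {
    public static void load(KafkaProducer<String, String> producer,
                            String csvPath, String topic) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(csvPath));
        // Skip the header row; key each record by its first column (assumed to be the ID)
        for (String line : lines.subList(1, lines.size())) {
            String key = line.split(",")[0];
            producer.send(new ProducerRecord<>(topic, key, line));
        }
        producer.flush();
    }
}
```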

#### 3. Verify Data Loading
After running both producers, you should have:
- **`transactions`** topic: 1,000,000 transaction records
- **`customers`** topic: 5,369 customer records
- **`accounts`** topic: 4,500 account records

You can monitor the data flow through the RedPanda Console at `http://localhost:8080`.
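
If you prefer to verify the counts programmatically, a small sketch that sums each topic's end offsets. This equals the record count only for topics that were never compacted or truncated, and the bootstrap address is again an assumption:

```java
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TopicCountsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:19092"); // assumption
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            for (String topic : List.of("transactions", "customers", "accounts")) {
                // Sum the end offsets across all partitions of the topic
                List<TopicPartition> parts = consumer.partitionsFor(topic).stream()
                        .map(p -> new TopicPartition(topic, p.partition()))
                        .collect(Collectors.toList());
                long count = consumer.endOffsets(parts).values().stream()
                        .mapToLong(Long::longValue).sum();
                System.out.printf("%s: %d records%n", topic, count);
            }
        }
    }
}
```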

#### 4. Alternative Ways to Run
```shell
# Option 1: Using the Maven exec plugin (recommended)
mvn exec:java -Dexec.mainClass="io.streamingledger.producers.TransactionsProducer"

# Option 2: Using the compiled classes and copied dependencies
# (run `mvn dependency:copy-dependencies` first so target/dependency is populated)
java -cp "target/classes:target/dependency/*" io.streamingledger.producers.TransactionsProducer

# Option 3: Using the packaged JAR (runs the main class named in the manifest)
java -jar target/spf-0.1.0.jar
```


### Register UDF
```sql
CREATE FUNCTION maskfn AS 'io.streamingledger.udfs.MaskingFn' LANGUAGE JAVA USING JAR '/opt/flink/jars/spf-0.1.0.jar';
```
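
For context, a masking UDF like `io.streamingledger.udfs.MaskingFn` would typically extend Flink's `ScalarFunction`. A hypothetical sketch, with masking logic that is illustrative rather than the book's actual implementation:

```java
package io.streamingledger.udfs;

import org.apache.flink.table.functions.ScalarFunction;

public class MaskingFn extends ScalarFunction {
    // Mask all but the last four characters, e.g. for card or account numbers
    public String eval(String input) {
        if (input == null || input.length() <= 4) {
            return input;
        }
        int masked = input.length() - 4;
        return "*".repeat(masked) + input.substring(masked);
    }
}
```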
3 changes: 1 addition & 2 deletions docker-compose.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
version: "3.7"
services:
redpanda:
command:
@@ -26,7 +25,7 @@ services:
- "19644:9644"
console:
container_name: redpanda-console
image: redpandadata/console:v2.2.3
entrypoint: /bin/sh
command: -c 'echo "$$CONSOLE_CONFIG_FILE" > /tmp/config.yml; /app/console'
environment: