diff --git a/Readme.md b/Readme.md
index 8d34a16..f1e7b98 100644
--- a/Readme.md
+++ b/Readme.md
@@ -7,11 +7,19 @@ Stream Processing with Apache Flink
 This repository contains the code for the book **[Stream Processing: Hands-on with Apache Flink](https://leanpub.com/streamprocessingwithapacheflink)**.
 
+> **🚀 Project Status: Ready for Stream Processing!**
+>
+> ✅ **Infrastructure**: RedPanda (Kafka) + Flink cluster running
+> ✅ **Project**: Compiled and packaged
+> ✅ **Data Loaded**: 1M+ transactions, 5K+ customers, 4K+ accounts in Kafka topics
+> ✅ **Next Step**: Ready to run Flink streaming jobs
+
 ### Table of Contents
 
 1. [Environment Setup](#environment-setup)
-2. [Register UDF](#register-udf)
-3. [Deploy a JAR file](#deploy-a-jar-file)
+2. [Project Setup and Data Loading](#project-setup-and-data-loading)
+3. [Register UDF](#register-udf)
+4. [Deploy a JAR file](#deploy-a-jar-file)
 
 ### Environment Setup
 
@@ -35,6 +43,61 @@ or this command for kafka setup
 ```
 
+### Project Setup and Data Loading
+This section documents the steps to compile the project and load data into Kafka for stream processing.
+
+#### 1. Compile the Project
+The project is a Maven-based Java application. To compile it:
+
+```shell
+# Compile only
+mvn clean compile
+
+# Compile and create the JAR file
+mvn clean package
+```
+
+#### 2. Load Data into Kafka
+After starting the infrastructure (RedPanda/Kafka + Flink), populate the topics with data:
+
+**Run the Transactions Producer:**
+```shell
+mvn exec:java -Dexec.mainClass="io.streamingledger.producers.TransactionsProducer"
+```
+This will:
+- Load 1,000,000 transactions from `/data/transactions.csv`
+- Send them to the `transactions` topic
+- Use optimized producer settings (batch size: 64KB, compression: gzip)
+
+**Run the State Producer:**
+```shell
+mvn exec:java -Dexec.mainClass="io.streamingledger.producers.StateProducer"
+```
+This will:
+- Load 5,369 customers from `/data/customers.csv` → `customers` topic
+- Load 4,500 accounts from `/data/accounts.csv` → `accounts` topic
+
+#### 3. Verify Data Loading
+After running both producers, you should have:
+- **`transactions`** topic: 1,000,000 transaction records
+- **`customers`** topic: 5,369 customer records
+- **`accounts`** topic: 4,500 account records
+
+You can monitor the data flow through the RedPanda Console at `http://localhost:8080`.
+
+#### 4. Alternative Ways to Run
+```shell
+# Option 1: Using the Maven exec plugin (recommended)
+mvn exec:java -Dexec.mainClass="io.streamingledger.producers.TransactionsProducer"
+
+# Option 2: Using the compiled classes (run `mvn dependency:copy-dependencies` first)
+java -cp "target/classes:target/dependency/*" io.streamingledger.producers.TransactionsProducer
+
+# Option 3: Using the packaged JAR
+java -jar target/spf-0.1.0.jar
+```
+
+
 ### Register UDF
 ```shell
 CREATE FUNCTION maskfn AS 'io.streamingledger.udfs.MaskingFn' LANGUAGE JAVA USING JAR '/opt/flink/jars/spf-0.1.0.jar';
diff --git a/docker-compose.yaml b/docker-compose.yaml
index cd55c96..1301e4d 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -1,4 +1,3 @@
-version: "3.7"
 services:
   redpanda:
     command:
@@ -26,7 +25,7 @@ services:
     - "19644:9644"
   console:
     container_name: redpanda-console
-    image: docker.redpanda.com/vectorized/console:v2.2.3
+    image: redpandadata/console:v2.2.3
     entrypoint: /bin/sh
     command: -c 'echo "$$CONSOLE_CONFIG_FILE" > /tmp/config.yml; /app/console'
     environment: