
Apache Kafka Tutorial: Introduction for Beginners


Overview

Apache Kafka has established itself as a cornerstone technology for building real-time data pipelines and streaming applications. This comprehensive guide explores the fundamentals of Kafka, its architecture, core concepts, and practical implementation details to help beginners understand and work with this powerful distributed streaming platform.

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform designed to handle high-throughput, real-time data feeds. Originally developed at LinkedIn to process massive amounts of user interaction data, Kafka has since evolved into a robust solution used by over 80% of Fortune 100 companies[1]. It serves as a central nervous system for modern data-driven organizations, enabling them to collect, store, process, and analyze streams of events at scale.

Kafka provides three key capabilities that make it uniquely powerful:

  • Publishing and subscribing to streams of records (similar to a message queue)

  • Storing streams of records durably and reliably

  • Processing streams of records as they occur or retrospectively[2]

Unlike traditional messaging systems that typically delete messages after consumption, Kafka maintains a configurable retention period for all published records, making it possible to replay data streams and reprocess information when needed. This fundamental design choice enables Kafka to serve both real-time applications and batch processing systems with the same underlying infrastructure.

Core Concepts and Architecture

Understanding Kafka's architecture requires familiarity with several key concepts:

Events

At the heart of Kafka is the concept of an event (also called a message or record). An event represents something that happened in the world - such as a payment transaction, website click, sensor reading, or any other noteworthy occurrence. In Kafka, events are typically represented as key-value pairs, often serialized in formats like JSON, Avro, or Protocol Buffers[1].
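For illustration, here is a minimal sketch of what such an event could look like as a Kafka record in Java. The topic name "orders", the key, and the JSON fields are all invented for this example; any serialization format could be used in place of the JSON string shown here.


import org.apache.kafka.clients.producer.ProducerRecord;

public class EventExample {
    public static void main(String[] args) {
        // Hypothetical "order placed" event: the key identifies the order,
        // the value carries the payload serialized as a JSON string.
        String key = "order-1042";
        String value = "{\"orderId\":\"order-1042\",\"amount\":59.90,\"currency\":\"EUR\"}";

        // In Kafka, this key-value pair becomes one record on the "orders" topic.
        ProducerRecord<String, String> event = new ProducerRecord<>("orders", key, value);
        System.out.println(event);
    }
}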

Topics

Topics function as logical channels or categories to which events are published. They serve as the fundamental organizing principle in Kafka, allowing producers and consumers to focus only on relevant data streams. For example, a retail application might have separate topics for "orders," "inventory-updates," and "user-signups"[1].

Learn More: What is a Kafka Topic? All You Need to Know & Best Practices▸
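Topics are usually created explicitly before use. As a minimal sketch (assuming a broker reachable at localhost:9092; the topic name and partition count are examples), Kafka's AdminClient can create a topic programmatically; the kafka-topics.sh command-line tool can do the same.


import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic with 3 partitions and replication factor 1 (single-broker dev setup).
            NewTopic orders = new NewTopic("orders", 3, (short) 1);
            admin.createTopics(Collections.singletonList(orders)).all().get();
        }
    }
}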

Partitions

Each topic in Kafka is divided into partitions, which are ordered, immutable sequences of records. Partitions serve two critical purposes: they enable parallel processing by allowing multiple consumers to read from a topic simultaneously, and they distribute data across multiple servers for scalability and fault tolerance[8].

When messages have no specified key, they are distributed across partitions in a round-robin fashion. Messages with the same key are guaranteed to be sent to the same partition, ensuring ordered processing within that key[8].

Learn More: What is a Kafka Partition? All You Need to Know & Best Practices▸
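The exact hash used for keyed messages is an implementation detail of the client (the Java producer's default partitioner uses murmur2), but the idea can be sketched roughly as below. The hash function here is deliberately simplified and is not Kafka's real partitioner; it only illustrates why identical keys always land on the same partition.


// Simplified illustration of keyed partitioning: records with the same key
// always map to the same partition, so their relative order is preserved there.
// (Kafka's actual default partitioner uses a murmur2 hash, not String.hashCode.)
public class PartitioningSketch {
    static int choosePartition(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 3;
        for (String key : new String[]{"user-1", "user-2", "user-1"}) {
            System.out.printf("key=%s -> partition %d%n", key, choosePartition(key, partitions));
        }
    }
}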

Brokers

Brokers are the servers that form a Kafka cluster. Each broker hosts some of the partitions from various topics and handles requests from producers, consumers, and other brokers. A Kafka cluster typically consists of multiple brokers for redundancy and load distribution[9].

Learn More: Learn Kafka Broker: Definition & Best Practices▸

Producers and Consumers

Producers are applications that publish events to Kafka topics. They can choose to specify which partition to send messages to or allow Kafka to handle distribution based on the message key[10].

Consumers are applications that subscribe to topics and process the published events. They maintain an offset (position) in each partition they consume, allowing them to control their position in the event stream[10].

Learn More: Apache Kafka Clients: Usage & Best Practices▸

Consumer Groups

Consumer groups allow a group of consumers to collaborate in processing messages from one or more topics. Kafka ensures that each partition is consumed by exactly one consumer in the group, facilitating parallel processing while maintaining ordered delivery within each partition[9].

Learn More: What is Kafka Consumer Group?▸
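One way to see this behavior is to start two copies of the same consumer with an identical group.id: Kafka splits the topic's partitions between them and rebalances when members join or leave. A minimal sketch (group and topic names are illustrative) that logs the partitions assigned to each running instance:


import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class GroupMemberExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group"); // same group.id = same consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Consumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("This instance now owns: " + partitions);
            }
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("Giving up: " + partitions);
            }
        });

        while (true) {
            consumer.poll(Duration.ofMillis(500)); // polling keeps the group membership alive
        }
    }
}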

ZooKeeper and KRaft

Traditionally, Kafka relied on Apache ZooKeeper for cluster coordination, metadata management, and leader election. However, recent versions of Kafka have introduced KRaft (Kafka Raft) mode, which eliminates the ZooKeeper dependency by implementing the coordination layer within Kafka itself[5][6].


How Kafka Works

Kafka's operation can be understood through several key mechanisms:

Message Storage and Retention

Kafka stores all published messages on disk, maintaining them for a configurable retention period regardless of whether they've been consumed. This persistence layer uses a highly efficient append-only log structure, allowing Kafka to deliver high throughput even with modest hardware[2].
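Retention is controlled per topic (with broker-level defaults as a fallback). As a hedged sketch, the topic-level retention.ms setting can be changed through the AdminClient; the 7-day value and the topic name below are examples only.


import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionConfigExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            // Keep records for 7 days (604800000 ms), whether or not they have been consumed.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}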

Distributed and Replicated Design

To ensure fault tolerance, Kafka replicates partition data across multiple brokers. Each partition has one broker designated as the leader, handling all reads and writes, while other brokers maintain replicas that stay in sync with the leader. If a leader fails, one of the in-sync replicas automatically becomes the new leader[9].

Producer and Consumer Operation

When a producer publishes a message, it connects to any broker in the cluster, which acts as a bootstrap server. The broker provides metadata about topic partitions and their leaders, allowing the producer to route subsequent requests directly to the appropriate leaders.

Consumers operate similarly, first connecting to a bootstrap server to discover partition leaders, then establishing connections to those leaders to stream messages. Consumers track their position in each partition using offsets, which they periodically commit back to Kafka to enable resumption after failures[10].

Working with Kafka

Once your Kafka environment is running, you can begin producing and consuming messages:

Producing Messages

The following command starts a console producer that allows sending messages to a topic:


bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic myfirsttopic
>my first message 
>my second message

Alternatively, you can write a producer application using Kafka's client libraries. Here's a simple Java example:


import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        String topicName = "myfirsttopic";

        // Connection and serializer settings: keys and values are plain strings.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        Producer<String, String> producer = new KafkaProducer<>(props);

        // Send ten keyed records; send() is asynchronous and returns immediately.
        for (int i = 0; i < 10; i++) {
            producer.send(new ProducerRecord<>(topicName, "key-" + i, "value-" + i));
        }

        // close() flushes any buffered records before shutting down.
        producer.close();
    }
}

This producer sends ten messages with keys and values to the specified topic[10].
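Because send() is asynchronous, production code usually attaches a callback so failures surface instead of being silently dropped. A hedged variation of the same loop (same topic and local broker assumed) might look like this:


import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class CallbackProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("myfirsttopic", "key-" + i, "value-" + i),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace(); // e.g. retries exhausted or topic missing
                            } else {
                                System.out.printf("wrote to partition %d at offset %d%n",
                                        metadata.partition(), metadata.offset());
                            }
                        });
            }
            producer.flush(); // block until all buffered records have been sent
        }
    }
}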

Consuming Messages

To consume messages using the console consumer:


bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myfirsttopic --from-beginning

For programmatic consumption, here's a simple Java consumer example:


import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        String topicName = "myfirsttopic";

        // Connection, group membership, and deserializer settings.
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Start from the earliest available offset when the group has no committed position yet.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        Consumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList(topicName));

        try {
            // Poll in a loop; each poll returns a batch of records (possibly empty).
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("Offset = %d, Key = %s, Value = %s%n",
                                     record.offset(), record.key(), record.value());
                }
            }
        } finally {
            // Closing leaves the consumer group cleanly and triggers a rebalance.
            consumer.close();
        }
    }
}


This consumer continuously polls for new messages and processes them as they arrive[10].
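With the default settings the consumer commits offsets automatically in the background. For stricter processing guarantees, a common variation is to disable auto-commit and commit only after a batch has been handled. A hedged sketch, reusing the same group and topic names as above:


import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit offsets ourselves

        try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("myfirsttopic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // business logic; commit only after it succeeds
                }
                if (!records.isEmpty()) {
                    consumer.commitSync(); // commit the offsets of everything just processed
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("Offset = %d, Key = %s, Value = %s%n",
                record.offset(), record.key(), record.value());
    }
}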

Best Practices

Implementing Kafka effectively requires adherence to several best practices:

Topic Design

  • Create topics with an appropriate number of partitions based on your throughput needs and consumer parallelism requirements

  • Use descriptive topic names that reflect the data they contain

  • Consider topic compaction for key-based datasets where only the latest value per key is needed (see the sketch after this list)
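For the compaction point above, a hedged sketch of creating a compacted topic through the AdminClient; the topic name, partition count, and use case are illustrative:


import org.apache.kafka.clients.admin.*;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CompactedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Compacted topic: Kafka retains at least the latest record per key,
            // useful for "current state" data such as user profiles or inventory counts.
            NewTopic userProfiles = new NewTopic("user-profiles", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singletonList(userProfiles)).all().get();
        }
    }
}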

Producer Configuration

  • Set appropriate acknowledgment levels (acks) based on your durability requirements (a combined configuration sketch follows this list):

    • acks=0 for maximum throughput with no durability guarantees

    • acks=1 for confirmation from the leader (potential data loss if leader fails)

    • acks=all for confirmation from all in-sync replicas (highest durability)

  • Enable idempotent producers to prevent duplicate messages

  • Configure batch size and linger time to optimize throughput
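A configuration sketch combining the producer recommendations above. The broker address and numeric values are starting points for experimentation, not tuned recommendations:


import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

public class DurableProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // no duplicates on producer retries
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");            // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "32768");        // 32 KB batches
        return props;
    }
}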

Consumer Configuration

  • Design consumer groups carefully, considering throughput requirements and processing semantics

  • Implement proper error handling for consumer applications

  • Manage offsets explicitly for critical applications instead of relying on automatic commits

  • Set appropriate values for max.poll.records and max.poll.interval.ms based on your processing requirements (see the configuration sketch after this list)
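A sketch of the kind of consumer configuration these points describe. The values are illustrative and depend on how long one batch takes to process:


import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;

public class TunedConsumerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");    // commit explicitly after processing
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "200");        // cap the batch returned by each poll()
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000"); // max time allowed between polls (5 min)
        return props;
    }
}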

Monitoring and Maintenance

  • Monitor consumer lag to identify processing bottlenecks (see the sketch after this list)

  • Track broker health metrics including disk usage, CPU, and memory

  • Implement alerting for critical conditions such as under-replicated partitions

  • Regularly review and adjust partition counts as your application scales
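For the consumer-lag point above, a hedged sketch of computing lag with the AdminClient. The group name and broker address are examples; production setups usually export lag continuously through a metrics or monitoring system instead of running ad hoc code like this:


import org.apache.kafka.clients.admin.*;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ConsumerLagExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed so far, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-group").partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets of the same partitions.
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(request).all().get();

            // Lag = how far the group is behind the end of each partition.
            committed.forEach((tp, offset) ->
                    System.out.printf("%s lag = %d%n", tp, ends.get(tp).offset() - offset.offset()));
        }
    }
}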

Use Cases for Kafka

Kafka's versatility makes it suitable for numerous applications:

Real-time Data Pipelines

Kafka excels at moving data between systems in real-time, serving as the backbone for ETL processes, change data capture, and data integration patterns.

Event-Driven Microservices

Organizations use Kafka to facilitate communication between microservices while maintaining loose coupling and enabling system-wide event sourcing.

Stream Processing

Combined with processing frameworks like Kafka Streams or Apache Flink, Kafka enables real-time analytics, complex event processing, and continuous transformation of data streams.

Activity Tracking and Monitoring

Kafka's ability to handle high-volume event streams makes it ideal for collecting user activity data, application metrics, logs, and system telemetry.

Conclusion

Apache Kafka provides a robust foundation for building real-time data systems. Its unique combination of high throughput, scalability, and durability enables applications that were previously impractical with traditional messaging systems.

This tutorial has introduced the fundamental concepts, architecture, and practical aspects of working with Kafka. As you continue exploring Kafka, consider diving deeper into topics like security configuration, advanced stream processing with Kafka Streams, and integration with other data systems.

By mastering Kafka, you'll unlock powerful capabilities for building modern, event-driven applications that can process vast amounts of data with reliability and efficiency.

If you find this content helpful, you might also be interested in our product AutoMQ. AutoMQ is a cloud-native alternative to Kafka that decouples durability to S3 and EBS: 10x cost-effective, no cross-AZ traffic cost, autoscaling in seconds, and single-digit-millisecond latency. AutoMQ is now source-available on GitHub, and large companies worldwide are using it. Check the following case studies to learn more:

References:

  1. What is Apache Kafka?

  2. AWS: What is Apache Kafka?

  3. Best Kafka Tutorials, Examples, and Learning Resources

  4. Apache Kafka in 6 Minutes

  5. Kafka Tutorial for Beginners

  6. Apache Kafka for Beginners: A Comprehensive Guide

  7. Kafka Tutorial - Core Concepts

  8. Apache Kafka Handbook

  9. Setting up a Kafka Cluster

  10. Kafka Architecture Tutorial

  11. An Introduction to Apache Kafka

  12. Get Started with Apache Kafka

  13. Getting Started with Kafka and Java

  14. Apache Kafka Tutorial

  15. Kafka Tutorial

  16. Apache Kafka Tutorial

  17. Kafka for Beginners

  18. Introduction to Kafka

  19. Apache Kafka Crash Course

  20. Step-by-Step Guide to Redpanda Console for Kafka

  21. Kafka Introduction

  22. Getting Started with Kafka

  23. Kafka Beginners Course

  24. Confluent Platform Quickstart

  25. Learn Apache Kafka with Conduktor

  26. Nussknacker Redpanda Integration

  27. Kafka Performance Tuning

  28. Complete Guide to Apache Kafka for Beginners

  29. Kafka Best Practices

  30. Confluent Platform Quickstart Guide

  31. Kafka Post-Deployment

  32. Enterprise Kafka Cluster Strategies and Best Practices

  33. Apache Kafka Fundamentals

  34. Best Practices for Apache Kafka Configuration

  35. Getting Started with Kafka

  36. Apache Kafka for Beginners

  37. Hands-on Kafka: First Steps

  38. Event Streaming with Apache Kafka

  39. LinkedIn Learning: Kafka

  40. Apache Kafka for Beginners: Learn Kafka by Hands-on

  41. Apache Kafka Tutorial for Beginners

  42. Coursera Kafka Courses

  43. Apache Kafka Basics: Beginners Guide

  44. Apache Kafka in 100 Seconds

  45. Kafka Scala Tutorial for Beginners

  46. Apache Kafka Tutorial Playlist

  47. Apache Kafka Events Course

  48. Get Started with Kafka in 5 Simple Steps

  49. Apache Kafka Quickstart

  50. Introduction to Apache Kafka

  51. Kafka on Confluent Cloud

  52. Apache Kafka for Beginners

  53. Kafka Consumer Best Practices

  54. Top 10 Kafka Configuration Tweaks for Performance

  55. Quickstart Guide for Apache Kafka

  56. Kafka Best Practices Guide

  57. Apache Kafka Deep Dive

  58. The Best Apache Kafka Tutorials

  59. Best Kafka Course for Learning

  60. Kafka Cluster Architecture

  61. EP92: Top 5 Kafka use cases
