
Fluent Bit Log Forwarder for Codon HPC Cluster

This guide outlines the process for deploying a Fluent Bit log forwarder on the Codon HPC cluster. The goal is to collect logs generated by SLURM jobs and forward them to a centralized EFK (Elasticsearch, Fluentd, Kibana) stack.

This setup uses a periodic SLURM job, managed by scrontab, to collect and ship logs. This approach is necessary in HPC environments, where users typically lack permission to run persistent daemons on compute nodes; it also aligns with the best practice of keeping those nodes dedicated to scheduled jobs.

High-Level Overview

The log forwarding process works as follows:

  1. Log Generation: SLURM jobs are configured to write their logs to a designated directory on a shared filesystem (/homes/bia_svc/log-forwarder/logs/).
  2. Scheduled Collection: A scrontab job runs periodically (e.g., every 30 minutes), submitting a SLURM batch script (run_demo_cycle.sbatch).
  3. Log Forwarding: This script executes Fluent Bit within a Singularity container. Fluent Bit reads new log entries, tracks its progress to avoid duplication, and sends them to the central Fluentd service.
  4. Centralized Storage & Visualization: The logs are received by the EFK stack, indexed in Elasticsearch, and become available for searching and visualization in Kibana.

This entire process is automated and designed to run with minimal user privileges on the HPC cluster.

1. Prerequisites

Before you begin, ensure you have the following:

  • Singularity/Apptainer: Codon, like many HPC clusters, uses Singularity as a container runtime instead of Docker. You will need to build or pull a Singularity image for Fluent Bit.
  • Centralized Log Storage: Your SLURM jobs should be configured to write logs to a shared filesystem accessible by your forwarding job (e.g., /hps or /nfs).
  • EFK Cluster Endpoint: You need the hostname and port for the HTTP input of your centralized Fluentd/Elasticsearch instance.

To find this endpoint, run

kubectl --kubeconfig=k8s-hh-wp-webadmin-32-conf.yaml -n logging-test get service fluentd-http

and

kubectl --kubeconfig=k8s-hh-wp-webadmin-32-conf.yaml -n logging-test get nodes -o wide

then combine a node's internal IP with the service's NodePort. This is the endpoint Fluent Bit will forward logs to.
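The two lookups above can also be combined into a single step with kubectl's jsonpath output. The jsonpath expressions below assume fluentd-http is a NodePort service with a single port and that any node's internal IP will do; adjust them if your cluster differs:

```shell
# Resolve the Fluentd endpoint in one step. Assumes a single-port
# NodePort service and takes the first node's InternalIP; both are
# assumptions about this cluster's layout.
KUBECONFIG_FILE=k8s-hh-wp-webadmin-32-conf.yaml
PORT=$(kubectl --kubeconfig="$KUBECONFIG_FILE" -n logging-test \
  get service fluentd-http -o jsonpath='{.spec.ports[0].nodePort}')
IP=$(kubectl --kubeconfig="$KUBECONFIG_FILE" -n logging-test \
  get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
echo "Fluentd endpoint: ${IP}:${PORT}"
```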

  • You can view the logs in Kibana's Discover tab. To find the Kibana URL, run

kubectl --kubeconfig=k8s-hh-wp-webadmin-32-conf.yaml -n logging-test get service kibana

and

kubectl --kubeconfig=k8s-hh-wp-webadmin-32-conf.yaml -n logging-test get nodes -o wide

then combine a node's internal IP with the service's NodePort. Make sure the endpoint is kept up to date in the config file.

2. Directory Structure

Organize your files on the shared filesystem as follows. For testing purposes, the files currently live in /homes/bia_svc/log-forwarder.

This structure keeps your configuration, scripts, and log data organized.

/homes/bia_svc/log-forwarder/
├── add_tag.lua             # Lua file for tagging the logs
├── fluent-bit.conf         # Fluent Bit configuration
├── generate_log.sh         # Log generator for testing
├── parsers.conf            # Parser configuration
├── run_demo_cycle.sbatch   # The SLURM submission script
├── fluentbit_data/         # Stores state database for log file offsets (separated by env)
└── logs/                   # Directory where your jobs write their logs
    ├── bia/
    │   ├── dev/
    │   │   └── ingest_run.log
    │   ├── staging/
    │   │   └── ingest_run.log
    │   └── prod/
    │       └── ingest_run.log
    └── empiar/
        ├── dev/
        │   └── some_job.log
        ├── staging/
        │   └── some_job.log
        └── prod/
            └── some_job.log

3. Configuration
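The actual fluent-bit.conf is not reproduced in this guide, but the sketch below shows the general shape such a config might take, based on the behavior described elsewhere in this document: a tail input with an offset database and Exit_On_Eof, a Lua filter driven by add_tag.lua, and an HTTP output to Fluentd. All paths, tag names, and the Lua function name are assumptions, not the production config:

```ini
[SERVICE]
    Flush        5
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name         tail
    # Container-side path, bound to logs/ on the shared filesystem.
    Path         /logs/*/*/*.log
    # Offset database so reruns only ship new entries.
    DB           /fluentbit_data/offsets.db
    # Exit once all files are read, so the SLURM job finishes.
    Exit_On_Eof  true
    Tag          codon.raw

[FILTER]
    # add_tag.lua derives event_tag (e.g. codon.bia.staging) from the
    # log file path; "add_tag" is an assumed function name.
    Name         lua
    Match        *
    script       add_tag.lua
    call         add_tag

[OUTPUT]
    Name         http
    Match        *
    # Fluentd HTTP input: node InternalIP + NodePort found in section 1.
    Host         <NODE_IP>
    Port         <NODE_PORT>
    Format       json
```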

4. Building and Deploying the Log Forwarder

4.1. Building and Testing the Singularity Image

Before submitting the job to SLURM, you must build a Singularity image for Fluent Bit and test it to ensure it functions correctly.

singularity pull docker://fluent/fluent-bit:latest

This command downloads the latest Fluent Bit image from Docker Hub and converts it into a Singularity Image File (.sif) named fluent-bit_latest.sif.
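Before wiring the image into SLURM, a quick smoke test confirms the container runs at all. The binary path inside the official image is an assumption worth double-checking:

```shell
# Smoke-test the freshly pulled image. The binary path
# (/fluent-bit/bin/fluent-bit) is an assumption about the official
# image; inspect with `singularity exec fluent-bit_latest.sif ls
# /fluent-bit/bin` if the call below fails.
singularity exec fluent-bit_latest.sif /fluent-bit/bin/fluent-bit --version

# `singularity pull` derives the .sif name from the image reference:
SIF="$(echo 'docker://fluent/fluent-bit:latest' | sed 's|.*/||; s|:|_|').sif"
echo "$SIF"   # fluent-bit_latest.sif
```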

Note that the deployment is handled by GitLab CI under the section deploy-hpc.

4.2. SLURM Submission Script

To test the forwarder, you must first generate some log data. You can do this by running the generate_log.sh script, which will create or append to a log file in the logs/ directory. After generating logs, submit the sbatch job to send them.

To run the job with the default application (bia) and environment (dev), simply submit the script:

# 1. Generate sample logs
./generate_log.sh -a bia -e dev

# 2. Submit the job to forward the logs
sbatch run_demo_cycle.sbatch
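For reference, run_demo_cycle.sbatch will look roughly like the sketch below: a short SLURM job that binds the shared directories into the container and runs Fluent Bit once. Resource requests, bind paths, and the in-container config path are assumptions, not the production script:

```shell
#!/bin/bash
# Hypothetical sketch of run_demo_cycle.sbatch; all paths and resource
# requests are assumptions about the real script.
#SBATCH --job-name=log-forwarder
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --output=/homes/bia_svc/logs/log-forwarder-%j.out
#SBATCH --error=/homes/bia_svc/logs/log-forwarder-%j.err

BASE=/homes/bia_svc/log-forwarder

# Bind host directories to the container paths fluent-bit.conf expects,
# then run Fluent Bit once; Exit_On_Eof makes it stop at end of input.
singularity exec \
  --bind "$BASE/logs:/logs" \
  --bind "$BASE/fluentbit_data:/fluentbit_data" \
  --bind "$BASE:/config" \
  "$BASE/fluent-bit_latest.sif" \
  /fluent-bit/bin/fluent-bit -c /config/fluent-bit.conf
```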

4.3. Dry Run bia-ingest

This section provides a quick guide to setting up the bia-ingest tool from the fluent-logging branch and performing a test run.

1. Setup and Installation

# 1. Clone the repository
git clone https://github.com/BioImage-Archive/bia-integrator.git
cd bia-integrator

# 2. Switch to the feature branch
git checkout fluent-logging

# 3. Set up a Python environment and install dependencies
python -m venv venv
source venv/bin/activate
pip install poetry
poetry --directory=bia-ingest install

# If pyproject.toml has changed, you may need to update the lock file first:
poetry --directory=bia-ingest lock

2. Running a Dry Run with File Logging

This example executes the ingest logic without saving any data (--dryrun) and directs all detailed logs to a file (--logfile).

poetry --directory=bia-ingest run biaingest ingest --dryrun S-BIAD1285 --logfile /homes/bia_svc/log-forwarder/logs/ingest_run.log

5. Scheduling with scrontab

To run your log forwarding job periodically, use scrontab, SLURM's built-in cron utility.

  • Edit your scrontab:
scrontab -e
  • Add a new cron job: Add a line to the file that defines the schedule and the script to run. The format is similar to a standard Linux crontab. For example, to run the job every 30 minutes:
# Log forwarder (namespace=logging-test)
#SCRON --time=15
#SCRON --mem=500M
*/30 * * * * sbatch /homes/bia_svc/log-forwarder/run_demo_cycle.sbatch
  • Save and exit the editor. Verify the job is scheduled:
scrontab -l

This will list your currently scheduled jobs.

6. Usage

  • Configure Your Jobs: Ensure your primary SLURM jobs write their application logs to the /homes/bia_svc/log-forwarder/logs/ directory.
  • Let it Run: The scrontab scheduler will automatically submit the run_demo_cycle.sbatch job at the specified interval.

Each time it runs, it will:

  • Scan the logs/ directory for new or updated log files.
  • Send any new log entries since its last run to your EFK cluster.
  • Exit gracefully thanks to the Exit_On_Eof flag.

Then one can see the logs in Kibana's Discover tab.

How to View and Filter Logs in Kibana

Once logs are forwarded to the EFK stack, they are indexed in Elasticsearch with a tag that identifies their application and environment. This allows for powerful filtering in Kibana.

  1. Navigate to Kibana: Open your Kibana dashboard using the URL you found in the previous step (http://<NODE_IP>:<NODE_PORT>).
  2. Go to the Discover Tab: This is the primary interface for exploring log data.
  3. Set the Time Range: Make sure the time filter in the top-right corner is set to a range that includes when your logs were generated (e.g., "Last 15 minutes").

Filtering by Application and Environment

The logs forwarded from the HPC contain an event_tag field that identifies their application and environment (e.g., codon.bia.staging). You can use this field in Kibana's Query Language (KQL) to filter your logs.

  • To see all logs from the bia application:
    event_tag : codon.bia.*
  • To see all logs from the staging environment across all applications:
    event_tag : codon.*.staging
  • To see logs for a specific combination, like empiar in prod:
    event_tag : codon.empiar.prod

7. Troubleshooting

  • Check SLURM Output: The /homes/bia_svc/logs/log-forwarder-%j.out and /homes/bia_svc/logs/log-forwarder-%j.err files will contain the output from your submission script, including any errors from SLURM or the singularity command.
  • Fluent Bit Verbosity: If logs are not being sent, temporarily increase the Log_Level in fluent-bit.conf to debug to get more detailed output.
  • Path Mismatches: The most common issue is incorrect paths. Double-check all paths in fluent-bit.conf and run_demo_cycle.sbatch, ensuring the --bind paths for Singularity are correct.