The DataKeeper program keeps the volume of data stored by systems such as DAS (Distributed Acoustic Sensing) at a manageable level by enforcing criteria such as retention periods and other data reduction strategies.
The DAS server stores data in HDF5 format at 10-second intervals, creating a new folder for each day.
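For illustration, each 10-second file can be inspected with standard HDF5 tooling. The following Python sketch lists the contents of the first file in a day folder; the folder path, file extension, and naming scheme are assumptions for illustration, not the DAS server's actual layout.

import h5py
from pathlib import Path

# Hypothetical layout: one folder per day, one HDF5 file per 10-second interval.
day_folder = Path("/data/das/2024-01-01")

for h5_path in sorted(day_folder.glob("*.h5")):
    with h5py.File(h5_path, "r") as f:
        f.visit(print)   # print every group/dataset name without assuming a schema
    break                # inspect only the first file of the day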
The DAS box includes built-in support for automatic data deletion:
- Data is automatically removed after X days (the retention period).
- The retention period is configurable and can be adjusted based on specific needs (a minimal sketch of the idea follows below).
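As a minimal sketch of the retention idea (not the DAS box's actual mechanism), daily folders older than X days could be pruned as follows; the root path and the daily folder naming scheme are assumptions.

import shutil
from datetime import datetime, timedelta
from pathlib import Path

RETENTION_DAYS = 30              # the configurable retention period X
DATA_ROOT = Path("/data/das")    # hypothetical root containing one folder per day

cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)

for day_folder in DATA_ROOT.iterdir():
    if not day_folder.is_dir():
        continue
    try:
        folder_date = datetime.strptime(day_folder.name, "%Y-%m-%d")
    except ValueError:
        continue                   # skip folders that do not follow the daily naming scheme
    if folder_date < cutoff:
        shutil.rmtree(day_folder)  # delete data older than the retention period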
The Data Reduction Policy defines strategies to manage and reduce the volume of stored data:
- Automatic Deletion: Data older than X days is automatically deleted, with configurable deletion strategies.
- Modify Region of Interest (ROI): Allows selection of specific channels, e.g., n[i]...n[j], from the HDF5 file. The selected region is saved into a new HDF5 file for further analysis (a sketch after this list illustrates this together with temporal downsampling).
- Reduce Sampling Rate:
  - Temporal Downsampling: Reduce the data sampling rate in time, using techniques like averaging or summation.
  - Spatial Downsampling: Reduce the number of channels (space), again using aggregation methods like averaging or summation.
- Remove Specific Time Segments: Retain only the data from defined time blocks, for example, between 12:05 – 14:30, focusing on a specific region of interest.
- Event-driven Storage: When an event occurs, data from relevant channels within a ±X km range or ±X seconds around the event is saved for later use.
- Geofence-based Storage: When an AIS (Automatic Identification System) signal is detected within a defined geofenced area, data is retained to ensure critical information isn't lost.
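To make the ROI and downsampling strategies concrete, here is a minimal Python sketch using h5py. The dataset name, the assumed shape (time × channels), the channel range, and the averaging factor are illustrative assumptions and do not reflect DataKeeper's internal implementation or the actual layout of the DAS files.

import h5py

# Illustrative parameters (assumptions, not DataKeeper defaults).
SRC = "das_input.h5"          # one raw DAS file
DST = "das_reduced.h5"        # reduced output file
DATASET = "data"              # placeholder dataset name, assumed shape: (time, channels)
CH_START, CH_END = 100, 200   # region of interest: channels n[i]...n[j]
TIME_FACTOR = 10              # temporal downsampling factor

with h5py.File(SRC, "r") as src, h5py.File(DST, "w") as dst:
    # Modify Region of Interest: keep only the selected channel range.
    roi = src[DATASET][:, CH_START:CH_END]

    # Temporal downsampling: average non-overlapping blocks of TIME_FACTOR samples.
    n_blocks = roi.shape[0] // TIME_FACTOR
    reduced = roi[: n_blocks * TIME_FACTOR].reshape(n_blocks, TIME_FACTOR, -1).mean(axis=1)

    # Save the reduced region into a new HDF5 file for further analysis.
    dst.create_dataset(DATASET, data=reduced, compression="gzip")

Spatial downsampling would follow the same averaging pattern along the channel axis instead of the time axis.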
Please note that Git is required for all installation methods. Additionally, DataKeeper requires either:
- Python 3.9, or
- GLIBC 2.33 or higher.
You can check your GLIBC version with the following command:
ldd --version
You can install DataKeeper using one of the following methods:
Use this method for a quick setup (downloads a prebuilt version):
curl -sSfL https://raw.githubusercontent.com/SUNET/datakeeper/refs/heads/main/deployment/install.sh | sudo sh
Build the project from scratch using Docker:
curl -sSfL https://raw.githubusercontent.com/SUNET/datakeeper/refs/heads/main/deployment/install.sh | bash -s -- --with-build=true
Use this method if you prefer to inspect or modify the installation script before running it.
git clone --depth=1 https://github.com/SUNET/datakeeper.git
cd datakeeper/deployment
chmod +x install.sh
less install.sh # Optional: inspect the script
./install.sh
Or, to build the project from scratch using Docker:
./install.sh --with-build=true
For local development and contributions:
- Install Poetry:
  Follow the official Poetry installation guide if you haven't already, and make sure Python 3.9 is installed on your system.
- Clone the repository and set up the environment:
git clone https://github.com/SUNET/datakeeper.git
cd datakeeper
poetry install
poetry shell
- Run the application:
poetry run datakeeper --help
To install the development tooling, add it to Poetry's dev group:
poetry add pytest nox black mypy --group dev
- pytest: For running tests.
- nox: For automating testing and other development tasks (see the noxfile sketch after this list).
- black: For code formatting.
- mypy: For static type checking.
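As an illustration of how nox can tie these tools together, here is a hypothetical noxfile.py; the session layout and the tests/ path are assumptions and not necessarily what the repository uses.

import nox

@nox.session(python="3.9")
def tests(session):
    # Install the project plus pytest into an isolated virtualenv, then run the suite.
    session.install(".", "pytest")
    session.run("pytest", "tests/", "--verbose")

@nox.session
def lint(session):
    # Formatting and static type checks.
    session.install("black", "mypy")
    session.run("black", "--check", ".")
    session.run("mypy", ".")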
Alternatively, work directly from the cloned project directory:
cd datakeeper
poetry shell
poetry install
python main.py --help
pytest tests/ --verbose -s
After installation, you can inspect the DataKeeper CLI by running:
datakeeper --help
To begin monitoring a folder with a specified configuration, use the following command:
datakeeper schedule --config path/to/config/file
Here’s an example of what the configuration file might look like:
[DATAKEEPER]
LOG_DIRECTORY = /tmp/datakeeper
PLUGIN_DIR = /tmp/datakeeper/datakeeper/policy_system/plugins
POLICY_PATH = /tmp/datakeeper/datakeeper/config/policy.yaml
DB_PATH = /tmp/datakeeper/datakeeper/database/database.sqlite
INIT_FILE_PATH = /tmp/datakeeper/datakeeper/database/init.sql
For an example of policy.yaml, see the example policy file in the repository:
Example Policy on GitHub
To generate test data in the monitored folder, use one of the following commands (the first from a source checkout, the second via the installed CLI):
python main.py generate --format hdf5 --base-dir [folder-path] --create-dir
datakeeper generate --format hdf5 --base-dir [folder-path] --create-dir --num-files [N]
This will create the necessary directories and generate the data in the specified format (hdf5).