08_data_models__persistence_layer

Benedikt Kuehne edited this page Jan 7, 2026 · 1 revision

Chapter 8: Data Models (Persistence Layer)

In Chapter 7: Background Task Execution, we explored how EMBArk efficiently handles demanding tasks behind the scenes, ensuring the user interface remains responsive. But all those tasks – uploading firmware, running analyses, managing users, orchestrating workers – involve creating, reading, updating, and deleting vast amounts of information. Where does EMBArk keep all this data, and how does it ensure everything is organized and connected?

Imagine EMBArk as a super-efficient office. It has many departments (like firmware analysis, user management, and worker orchestration), and each department handles different kinds of documents:

  • User IDs and passwords for the HR department.
  • Firmware files and analysis checklists for the lab.
  • Results reports and vulnerability findings for the security team.
  • Worker schedules and task assignments for the operations team.

The data models and the persistence layer act as this office's meticulously organized filing cabinet. They define the structure of every "document" (piece of information), store each document consistently, and record how related documents link together. This is EMBArk's "memory": it keeps the information about uploaded firmware, analysis results, user accounts, and worker configurations so that it can be saved, retrieved, and linked reliably. Without it, EMBArk would forget everything as soon as it shut down.

Understanding the Key Concepts

To manage its data effectively, EMBArk uses two main ideas:

1. Data Models: "The Blueprints"

Think of a data model as a blueprint or a template for a specific type of information. Just like a blueprint for a house defines how many rooms it has, where the doors are, and what materials to use, a data model defines:

  • What pieces of information are stored (e.g., for a "Firmware Analysis" document, it might store version, start_date, status).
  • What kind of data each piece is (e.g., version is text, start_date is a date and time, status is structured progress data).
  • How different pieces of information are related (e.g., a "Firmware Analysis" document is always linked to a specific "Firmware File" document and a "User" document).

In EMBArk, these blueprints are defined using Django's models.Model classes in Python. Each class represents a type of data EMBArk needs to store.
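
As a framework-free analogy, the blueprint idea can be sketched with a plain Python dataclass. This is not how EMBArk defines its models (those are Django models.Model classes); the class name FirmwareAnalysisBlueprint and the firmware_file_id and user_id fields are hypothetical, chosen only to show field names, field types, and references to other records.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class FirmwareAnalysisBlueprint:
    """Hypothetical stand-in for a model: named fields with declared types."""

    version: str = ""  # text
    start_date: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )  # date and time, filled in when the record is created
    status: dict = field(default_factory=dict)  # structured progress data
    firmware_file_id: Optional[str] = None  # reference to a "Firmware File" record
    user_id: Optional[int] = None  # reference to a "User" record


analysis = FirmwareAnalysisBlueprint(version="1.0.3", user_id=7)
print(analysis.version, analysis.status, analysis.firmware_file_id)
```

A Django model adds what the dataclass lacks: each field also describes a database column, so the same class definition drives both the Python object and the table that stores it.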

2. Persistence Layer: "The Librarian"

The persistence layer is the system that acts like the librarian for our filing cabinet (the database). It handles all the operations needed to:

  • Save new "documents" (create new data records).
  • Retrieve existing "documents" (read data from the database).
  • Update "documents" (change existing data).
  • Delete "documents" (remove data).

When you interact with EMBArk (e.g., upload a firmware), the web application talks to the persistence layer, which then translates those requests into commands for the actual database. This layer ensures data is stored safely and can always be found later.
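
The four operations above can be sketched with Python's built-in sqlite3 module, which makes the underlying database commands visible. This is an illustration under assumptions, not EMBArk's code: the table and column names are invented for the example, and in EMBArk the Django ORM issues the equivalent statements for you (the comments name the usual ORM calls).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE firmware_analysis ("
    "id INTEGER PRIMARY KEY, version TEXT, finished INTEGER)"
)

# Create: roughly what Model.objects.create(...) does in Django
cur = conn.execute(
    "INSERT INTO firmware_analysis (version, finished) VALUES (?, ?)",
    ("1.0.3", 0),
)
analysis_id = cur.lastrowid

# Read: roughly what Model.objects.get(id=...) does
row = conn.execute(
    "SELECT version, finished FROM firmware_analysis WHERE id = ?",
    (analysis_id,),
).fetchone()

# Update: roughly what changing an attribute and calling instance.save() does
conn.execute(
    "UPDATE firmware_analysis SET finished = 1 WHERE id = ?", (analysis_id,)
)
updated = conn.execute(
    "SELECT finished FROM firmware_analysis WHERE id = ?", (analysis_id,)
).fetchone()

# Delete: roughly what instance.delete() does
conn.execute("DELETE FROM firmware_analysis WHERE id = ?", (analysis_id,))
remaining = conn.execute("SELECT COUNT(*) FROM firmware_analysis").fetchone()[0]

print(row, updated, remaining)
```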

Solving Our Use Case: Storing and Connecting a New Firmware Analysis

Let's walk through a concrete example: an analyst uploads a new firmware, and EMBArk starts an analysis. This involves creating several new pieces of information and linking them together.

Here's a simplified look at the kind of data EMBArk needs to store for this scenario:

  • Firmware File: The actual file uploaded.
  • Firmware Analysis: The specific settings for this analysis (version, architecture, who started it).
  • User: Who uploaded the firmware and initiated the analysis.
  • Result: The outcome of the analysis (CVEs found, security features).

These are all distinct "documents," but they are clearly related. The data models define these relationships, and the persistence layer saves them correctly.

The Data Storage Flow: A Simple Sequence

When an analyst uploads a firmware and starts an analysis, here's a simplified sequence of how EMBArk uses its data models and persistence layer:

sequenceDiagram
    participant User
    participant EMBArk Web Server
    participant Data Models (Persistence Layer)
    participant Database

    User->>EMBArk Web Server: Uploads firmware & submits analysis
    EMBArk Web Server->>Data Models (Persistence Layer): Creates new FirmwareFile record
    Data Models (Persistence Layer)->>Database: Saves firmware file details
    Database-->>Data Models (Persistence Layer): Confirmation
    EMBArk Web Server->>Data Models (Persistence Layer): Creates new FirmwareAnalysis record (links to FirmwareFile & User)
    Data Models (Persistence Layer)->>Database: Saves analysis details
    Database-->>Data Models (Persistence Layer): Confirmation
    Note over Data Models (Persistence Layer): EMBA runs, then Importer updates data
    EMBArk Web Server->>Data Models (Persistence Layer): Creates new Result record (links to FirmwareAnalysis)
    Data Models (Persistence Layer)->>Database: Saves analysis results
    Database-->>Data Models (Persistence Layer): Confirmation
    EMBArk Web Server-->>User: Analysis started!

The "Data Models (Persistence Layer)" participant here represents the Python code that defines the models and interacts with the database to save and retrieve data.
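
The sequence above can be traced end to end in a small sketch using the standard-library sqlite3 module. Everything here is an assumption made for illustration: the table layout is a heavily reduced version of the models shown later in this chapter, the User link is omitted, and the file name, the exploit count, and the step that sets finished are made-up values rather than EMBArk's actual import logic.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE firmware_file (id TEXT PRIMARY KEY, file TEXT NOT NULL);
CREATE TABLE firmware_analysis (
    id TEXT PRIMARY KEY,
    firmware_id TEXT REFERENCES firmware_file(id),
    finished INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE result (
    firmware_analysis_id TEXT PRIMARY KEY REFERENCES firmware_analysis(id),
    exploits INTEGER NOT NULL DEFAULT 0
);
""")

# 1. The upload creates a firmware file record.
file_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO firmware_file (id, file) VALUES (?, ?)",
    (file_id, "firmware_files/router.bin"),
)

# 2. Submitting the analysis creates a record that links to the file.
analysis_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO firmware_analysis (id, firmware_id) VALUES (?, ?)",
    (analysis_id, file_id),
)

# 3. After the analysis has run, a result record is linked to the analysis.
conn.execute(
    "INSERT INTO result (firmware_analysis_id, exploits) VALUES (?, ?)",
    (analysis_id, 3),
)
conn.execute("UPDATE firmware_analysis SET finished = 1 WHERE id = ?", (analysis_id,))

# Following the links back: from the result to its analysis to the uploaded file.
summary = conn.execute(
    """
    SELECT f.file, a.finished, r.exploits
    FROM result AS r
    JOIN firmware_analysis AS a ON a.id = r.firmware_analysis_id
    JOIN firmware_file AS f ON f.id = a.firmware_id
    WHERE a.id = ?
    """,
    (analysis_id,),
).fetchone()
print(summary)
```

The final query is the payoff of linking records: starting from any one of them, the others can be found by following the stored identifiers.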

Under the Hood: The Blueprints and the Librarian

Let's dive into some of EMBArk's actual data models and see how they're defined and used. EMBArk is built on Django, which provides a powerful Object-Relational Mapper (ORM) that makes interacting with the database feel like working with Python objects.

1. Defining the Core Data: FirmwareFile and FirmwareAnalysis

These models are the backbone of any analysis.

# Simplified snippet from embark/uploader/models.py

import uuid
from django.db import models
from django.utils import timezone
from users.models import User as Userclass # Import User model

class FirmwareFile(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4)
    file = models.FileField(upload_to='firmware_files/') # Stores the actual file path
    upload_date = models.DateTimeField(default=timezone.now)
    user = models.ForeignKey(Userclass, on_delete=models.SET_NULL, null=True) # Link to the User

    def __str__(self):
        return self.file.name # Display the file name

class FirmwareAnalysis(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4)
    user = models.ForeignKey(Userclass, on_delete=models.SET_NULL, null=True) # Link to the User
    firmware = models.ForeignKey(FirmwareFile, on_delete=models.SET_NULL, null=True) # Link to the FirmwareFile
    firmware_name = models.CharField(max_length=127, default="File unknown")
    version = models.CharField(max_length=127, blank=True)
    start_date = models.DateTimeField(default=timezone.now)
    finished = models.BooleanField(default=False)
    status = models.JSONField(default=dict) # For real-time progress (Chapter 3)

    # Other analysis settings like architecture, scan_modules (Chapter 2)
    # ...

    def __str__(self):
        return f"Analysis {self.id} for {self.firmware_name}"
  • FirmwareFile: This blueprint defines how to store information about an uploaded firmware file.
    • id: A unique identifier (UUID).
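    • file: A reference to the uploaded firmware binary. A FileField stores the file's path in the database, while the file itself is kept in the upload storage under firmware_files/.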
    • user = models.ForeignKey(Userclass, ...): This is a foreign key, a crucial concept! It creates a link to the User model. This means each FirmwareFile record "knows" which User uploaded it. If the User is deleted (on_delete=models.SET_NULL), this link becomes NULL.
  • FirmwareAnalysis: This blueprint holds all the details about one specific analysis run.
    • It also has a ForeignKey to User (who started the analysis).
    • And a ForeignKey to FirmwareFile (which file is being analyzed).
    • status = models.JSONField(default=dict): This field (used in Chapter 3: Real-time Progress Monitoring) stores dynamic, structured data like the current progress percentage.
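
The effect of on_delete=models.SET_NULL and of a JSON status field can be observed in a small standard-library sketch. Two caveats: Django applies on_delete in its own Python code rather than as a database constraint, so the ON DELETE SET NULL clause below only reproduces the visible behaviour; and the table names and the "percentage" key are assumptions made for this example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE app_user (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE firmware_analysis (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES app_user(id) ON DELETE SET NULL,
    status TEXT NOT NULL DEFAULT '{}'
);
""")

conn.execute("INSERT INTO app_user (id, username) VALUES (1, 'analyst')")
conn.execute(
    "INSERT INTO firmware_analysis (id, user_id, status) VALUES (1, 1, ?)",
    (json.dumps({"percentage": 40}),),  # structured status stored as JSON text
)

# Deleting the user keeps the analysis but clears its link to the user.
conn.execute("DELETE FROM app_user WHERE id = 1")

user_id, raw_status = conn.execute(
    "SELECT user_id, status FROM firmware_analysis WHERE id = 1"
).fetchone()
status = json.loads(raw_status)
print(user_id, status)
```

Choosing SET_NULL over CASCADE is a retention decision: the analysis survives its author being removed, at the cost of having to handle records whose user is empty.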

2. Storing Analysis Results: Result, Vulnerability, and SoftwareInfo

After EMBA finishes an analysis, EMBArk stores all the findings in these models.

# Simplified snippet from embark/dashboard/models.py

from django.db import models
import uuid
from uploader.models import FirmwareAnalysis # Import FirmwareAnalysis model

class Vulnerability(models.Model):
    cve = models.CharField(max_length=18, help_text='CVE-XXXX-XXXXXXX')
    info = models.JSONField(null=True)

    def __str__(self):
        return self.cve

class SoftwareInfo(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4)
    name = models.CharField(max_length=256)
    version = models.CharField(max_length=32)
    supplier = models.CharField(max_length=1024)
    # ... other fields like license, cpe, purl ...

    def __str__(self):
        return f"{self.name} {self.version}"

class SoftwareBillOfMaterial(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4)
    meta = models.CharField(max_length=1024)
    component = models.ManyToManyField(SoftwareInfo, blank=True) # Link to many SoftwareInfo objects

    def __str__(self):
        return f"SBOM {self.id}"

class Result(models.Model):
    firmware_analysis = models.OneToOneField(FirmwareAnalysis, on_delete=models.CASCADE, primary_key=True) # One-to-one link
    os_verified = models.CharField(blank=True, null=True, max_length=256)
    cve_critical = models.TextField(default='{}') # Stores JSON string
    cve_high = models.TextField(default='{}')
    exploits = models.IntegerField(default=0)
    vulnerability = models.ManyToManyField(Vulnerability, blank=True) # Link to many Vulnerability objects
    sbom = models.OneToOneField(SoftwareBillOfMaterial, on_delete=models.CASCADE, null=True, blank=True) # Optional One-to-one link

    # Other result fields like canary, relro, no_exec, pie, stripped (Chapter 4)
    # ...

    def __str__(self):
        return f"Results for {self.firmware_analysis.firmware_name}"
  • Vulnerability: A simple blueprint for storing a CVE ID and its details.
  • SoftwareInfo: A blueprint for a single software component (used in SBOM).
  • SoftwareBillOfMaterial: Represents an entire SBOM.
    • component = models.ManyToManyField(SoftwareInfo, ...): This is a many-to-many relationship. An SBOM can have many software components, and a SoftwareInfo item (like "OpenSSL 1.1.1") might appear in many different SBOMs.
  • Result: This is the main summary of the analysis.
    • firmware_analysis = models.OneToOneField(FirmwareAnalysis, ...): This is a one-to-one relationship. Each FirmwareAnalysis can only have one Result summary, and each Result belongs to one FirmwareAnalysis. This is important for linking the summary findings back to the original analysis settings.
    • vulnerability = models.ManyToManyField(Vulnerability, ...): A Result can have many Vulnerability entries, and a Vulnerability (like CVE-2023-1234) might be found in many Results.
    • sbom = models.OneToOneField(SoftwareBillOfMaterial, ...): Optionally links to an SBOM.
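
The one-to-one and many-to-many relationships described above can be made concrete with sqlite3. This is a sketch under assumptions: the table names are invented, and Django's generated tables are named differently. What carries over is the structure. A one-to-one link is a foreign key that is also unique (here, the primary key), and a many-to-many link lives in a separate join table, which Django creates automatically for a ManyToManyField.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE firmware_analysis (id INTEGER PRIMARY KEY);
CREATE TABLE vulnerability (id INTEGER PRIMARY KEY, cve TEXT NOT NULL);
CREATE TABLE result (
    -- one-to-one: the foreign key doubles as the primary key, so it is unique
    firmware_analysis_id INTEGER PRIMARY KEY
        REFERENCES firmware_analysis(id) ON DELETE CASCADE
);
CREATE TABLE result_vulnerability (
    -- many-to-many: one row per (result, vulnerability) pair
    result_id INTEGER NOT NULL
        REFERENCES result(firmware_analysis_id) ON DELETE CASCADE,
    vulnerability_id INTEGER NOT NULL REFERENCES vulnerability(id),
    UNIQUE (result_id, vulnerability_id)
);
""")

conn.executemany("INSERT INTO firmware_analysis (id) VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO result (firmware_analysis_id) VALUES (?)", [(1,), (2,)])
conn.execute("INSERT INTO vulnerability (id, cve) VALUES (1, 'CVE-2023-1234')")

# The same CVE is attached to both results.
conn.executemany(
    "INSERT INTO result_vulnerability (result_id, vulnerability_id) VALUES (?, ?)",
    [(1, 1), (2, 1)],
)
shared = conn.execute(
    "SELECT COUNT(*) FROM result_vulnerability WHERE vulnerability_id = 1"
).fetchone()[0]

# One-to-one: a second result for analysis 1 is rejected.
try:
    conn.execute("INSERT INTO result (firmware_analysis_id) VALUES (1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# CASCADE: deleting analysis 1 removes its result and that result's links.
conn.execute("DELETE FROM firmware_analysis WHERE id = 1")
remaining_results = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
remaining_links = conn.execute("SELECT COUNT(*) FROM result_vulnerability").fetchone()[0]
print(shared, duplicate_rejected, remaining_results, remaining_links)
```

Note that the Vulnerability row itself is untouched by the cascade. Only the link rows disappear, which is why one CVE record can safely be shared across many results.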

3. User & Team Management: User and Team

Users are fundamental to EMBArk, as seen in Chapter 1: User Authentication & Authorization.

# Simplified snippet from embark/users/models.py

from django.db import models
from django.contrib.auth.models import AbstractUser # Django's base user model

class Team(models.Model):
    name = models.CharField(primary_key=True, max_length=150, unique=True)
    is_active = models.BooleanField(default=True)

    def __str__(self):
        return self.name

class User(AbstractUser):
    timezone = models.CharField(max_length=32, default='UTC')
    email = models.EmailField(unique=True, blank=True)
    team = models.ForeignKey(Team, on_delete=models.SET_NULL, null=True, related_name='member_of_team') # Link to Team
    api_key = models.CharField(max_length=64, blank=True, null=True) # User's API key

    class Meta:
        permissions = (
            ("user_permission", "Can access user menues of embark"),
            ("uploader_permission_minimal", "Can access uploader functionalities of embark"),
            # ... many more custom permissions (Chapter 1) ...
        )

    def __str__(self):
        return self.username
  • Team: A blueprint for user teams.
  • User: This model extends Django's built-in AbstractUser, adding custom fields like timezone, api_key, and a ForeignKey to Team. It also defines specific permissions, which are crucial for Chapter 1: User Authentication & Authorization.

4. Worker Node Information: Worker, Configuration, and OrchestratorState

For distributed analysis, EMBArk needs to store information about its worker nodes and how to manage them, as discussed in Chapter 6: Worker Node Orchestration.

# Simplified snippet from embark/workers/models.py

import ipaddress
from django.core.exceptions import ValidationError # Raised by clean() below
from django.db import models
from users.models import User # Import User model

class Configuration(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    name = models.CharField(max_length=150)
    ssh_user = models.CharField(max_length=150)
    ip_range = models.CharField(max_length=20) # e.g., 192.168.1.0/24
    # ... other SSH key fields ...

    def clean(self): # Custom validation
        try:
            ipaddress.ip_network(self.ip_range, strict=False)
        except ValueError as value_error:
            raise ValidationError({"ip_range": f"Invalid IP range: {value_error}"}) from value_error

    def __str__(self):
        return self.name

class Worker(models.Model):
    configurations = models.ManyToManyField(Configuration, blank=True) # Many-to-many link
    ip_address = models.GenericIPAddressField(unique=True)
    name = models.CharField(max_length=100)
    reachable = models.BooleanField(default=False)
    status = models.CharField(max_length=1, default='U') # e.g., 'U'nconfigured, 'C'onfigured
    analysis_id = models.UUIDField(blank=True, null=True) # ID of current analysis running on worker

    def __str__(self):
        return f"{self.name} ({self.ip_address})"

class OrchestratorState(models.Model):
    free_workers = models.ManyToManyField(Worker, related_name='free_workers') # Many-to-many link to free workers
    busy_workers = models.ManyToManyField(Worker, related_name='busy_workers') # Many-to-many link to busy workers
    tasks = models.JSONField(default=list, null=True) # The queue of tasks

    def __str__(self):
        return "Orchestrator State"
  • Configuration: Defines how EMBArk connects to and manages workers (SSH credentials, IP ranges). It has a ForeignKey to User.
  • Worker: Represents an individual worker machine.
    • configurations = models.ManyToManyField(Configuration, ...): A Worker can be managed by multiple Configurations (if, for example, multiple users manage the same worker with different SSH keys).
    • analysis_id: Stores the ID of the FirmwareAnalysis currently running on this worker.
  • OrchestratorState: This model is central to Chapter 6: Worker Node Orchestration. It stores the Orchestrator's current state.
    • free_workers = models.ManyToManyField(Worker, ...): The set of workers currently available.
    • busy_workers = models.ManyToManyField(Worker, ...): The set of workers currently running an analysis.
    • tasks = models.JSONField(...): Stores the queue of tasks that are waiting to be assigned to workers.
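
The check that Configuration.clean() performs on ip_range can be tried on its own, since it relies only on Python's standard ipaddress module. The sketch below is an assumption-laden stand-in: the helper name validate_ip_range is hypothetical, and it raises a plain ValueError where the model raises Django's ValidationError.

```python
import ipaddress


def validate_ip_range(ip_range: str) -> str:
    """Return the normalized network, or raise ValueError if it cannot be parsed."""
    try:
        # strict=False accepts addresses with host bits set, e.g. 192.168.1.17/24
        network = ipaddress.ip_network(ip_range, strict=False)
    except ValueError as value_error:
        raise ValueError(f"Invalid IP range: {value_error}") from value_error
    return str(network)


normalized = validate_ip_range("192.168.1.17/24")

try:
    validate_ip_range("not-a-network")
    rejected = False
except ValueError:
    rejected = True

print(normalized, rejected)
```

With strict=False, an entry whose host bits are set is interpreted as the network that contains it instead of being rejected, which is more forgiving for values typed into a form.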

5. Admin Interface: Managing Data

Django automatically provides an administrative interface where you can view, create, update, and delete these model records. You just need to "register" them.

# Simplified snippet from embark/dashboard/admin.py
from django.contrib import admin
from dashboard.models import Result, Vulnerability, SoftwareInfo, SoftwareBillOfMaterial

admin.site.register(Result)
admin.site.register(Vulnerability)
admin.site.register(SoftwareInfo)
admin.site.register(SoftwareBillOfMaterial)

# Similar registrations in uploader/admin.py, users/admin.py, workers/admin.py

Each admin.site.register() call tells Django to make that model available in its built-in admin dashboard, allowing administrators to easily manage the "documents" in EMBArk's "filing cabinet."

Conclusion

Data Models and the Persistence Layer are the unsung heroes of EMBArk. They provide the structured "memory" for the entire system, defining the blueprints for all information (users, firmware, analysis results, workers) and ensuring this data is consistently saved, retrieved, and linked. By leveraging Django's powerful ORM, EMBArk can manage complex relationships between different types of information, forming the essential backbone for all its operations and reporting.

Now that we understand how EMBArk stores and organizes its data, the final piece of the puzzle is how to get EMBArk up and running in a real-world environment. In the next chapter, we'll cover Chapter 9: Deployment & Environment Setup, where you'll learn how to deploy and configure EMBArk for production use.

