
Conversation


@tchoumi313 tchoumi313 commented Dec 12, 2025

Resolves #2869, resolves #2700
backup-full: creates a complete backup of your CISO Assistant instance, including both the database and all evidence attachments.
restore-full: restores a complete backup of your CISO Assistant instance, including both the database and all evidence attachments.
(screenshot)

Summary by CodeRabbit

  • New Features

    • Attachment export/import (ZIP + streaming) plus new REST endpoints for export, import, metadata, batch download/upload; full-restore supports streaming and partial-success reporting.
    • CLI: backup-full and restore-full commands with resume, batch-size, and optional hash verification.
  • Data integrity

    • SHA-256 attachment hashes added and backfilled to support verification during restore.
  • Documentation

    • CLI docs updated with backup/restore workflows, examples, manifest format, and troubleshooting.
  • Chores

    • Added CLI env template and gitignore entry.


@tchoumi313 tchoumi313 self-assigned this Dec 12, 2025

coderabbitai bot commented Dec 12, 2025

Walkthrough

Adds end-to-end attachment export/import utilities, new serdes API endpoints and views (ZIP and streaming batch formats), CLI streaming backup/restore commands, attachment SHA‑256 support with migration/backfill, and supporting CLI env/gitignore changes.

Changes

Cohort / File(s) Summary
Attachment export/import utils
backend/serdes/export_utils.py
New module adding AttachmentExporter (collects evidence revisions with attachments; packages attachments into ZIP) and AttachmentImporter (validates and extracts attachments from ZIP, supports dry-run, aggregates per-file errors, saves to storage and assigns to revisions).
API views
backend/serdes/views.py
Added APIView subclasses: ExportAttachmentsView, LoadAttachmentsView, FullRestoreView, AttachmentMetadataView, BatchDownloadAttachmentsView, BatchUploadAttachmentsView. They implement ZIP export/import, a streaming batch format (length-prefixed blocks with JSON headers plus raw bytes; see the sketch after this table), permission checks, integration with the DB restore flow, partial-success/multi-status reporting, and logging.
URL routes
backend/serdes/urls.py
Added routes: export-attachments/, full-restore/, attachment-metadata/, batch-download-attachments/, batch-upload-attachments/.
CLI commands & docs
cli/clica.py, cli/README.md
New CLI commands backup_full(dest_dir, batch_size, resume) and restore_full(src_dir, verify_hashes) implementing streaming backups/restores, manifest handling, optional hash verification; README updated to describe streaming protocol, usage, and examples.
Attachment hashing field & migration
backend/core/models.py, backend/core/migrations/0119_add_attachment_hash_to_evidencerevision.py
Added attachment_hash CharField to EvidenceRevision; save() now computes SHA‑256 for attachments (storage-aware, chunked) with safe error handling. Migration adds the field and a RunPython backfill computing SHA‑256 from storage (chunked reads) and a reverse that clears hashes.
CLI env & gitignore
cli/.clica.env.template, cli/.gitignore, cli/.clica.env
Added .clica.env.template with TOKEN, VERIFY_CERTIFICATE, API_URL; added .clica.env to .gitignore; removed concrete .clica.env file from repository.
Misc gitignore
.gitignore
Added ignore entry for /cli/db_backup.
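
For reference, the batch attachment endpoints and the CLI exchange data as a sequence of length-prefixed blocks: a 4-byte big-endian length, a JSON header, then the raw file bytes (the review excerpts further down show struct.pack(">I", total_size) followed by the header and the payload). The sketch below is a minimal, hedged illustration of a writer and parser for that framing; it assumes the JSON header is ASCII and sits immediately before the payload, and the helper names are illustrative rather than taken from the PR.

import io
import json
import struct


def write_block(out, header: dict, payload: bytes) -> None:
    """Append one [4-byte length][JSON header][payload] block to a binary stream."""
    header_bytes = json.dumps(header).encode("utf-8")  # json.dumps defaults to ASCII-only output
    out.write(struct.pack(">I", len(header_bytes) + len(payload)))  # big-endian unsigned 32-bit length
    out.write(header_bytes)
    out.write(payload)


def read_blocks(stream):
    """Yield (header, payload) pairs until the stream is exhausted."""
    decoder = json.JSONDecoder()
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            return  # clean end of stream
        (total_size,) = struct.unpack(">I", prefix)
        block = stream.read(total_size)
        # latin-1 maps every byte to exactly one character, so the character offset
        # returned by raw_decode is also the byte offset where the JSON header ends.
        header, end = decoder.raw_decode(block.decode("latin-1"))
        yield header, block[end:]


if __name__ == "__main__":
    buf = io.BytesIO()
    write_block(buf, {"filename": "report.pdf", "size": 3}, b"abc")  # illustrative header fields
    buf.seek(0)
    for header, payload in read_blocks(buf):
        print(header["filename"], len(payload))  # prints: report.pdf 3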

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CLI as CLI (clica)
    participant ExportAPI as ExportAttachmentsView
    participant Exporter as AttachmentExporter
    participant DB as Database (EvidenceRevision)
    participant Storage as ObjectStorage
    participant RestoreAPI as FullRestoreView
    participant Importer as AttachmentImporter

    CLI->>ExportAPI: GET /serdes/export-attachments/
    ExportAPI->>Exporter: create_attachments_zip()
    Exporter->>DB: collect_all_attachments()
    DB-->>Exporter: queryset of revisions
    Exporter->>Storage: read attachment files
    Storage-->>Exporter: file bytes
    Exporter-->>ExportAPI: ZIP BytesIO
    ExportAPI-->>CLI: ZIP response

    CLI->>RestoreAPI: POST /serdes/full-restore/ (DB dump + attachments stream/ZIP)
    RestoreAPI->>DB: perform DB restore (existing load logic)
    DB-->>RestoreAPI: DB restore result
    RestoreAPI->>Importer: extract_attachments_from_zip(stream or zip, dry_run=false)
    Importer->>Storage: write attachment files (evidence-revisions/...)
    Storage-->>Importer: stored confirmation
    Importer->>DB: assign attachment to EvidenceRevision (save -> attachment_hash computed)
    DB-->>Importer: save confirmations
    Importer-->>RestoreAPI: stats (processed/restored/errors)
    RestoreAPI-->>CLI: consolidated restore response (207 if partial)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus areas:
    • backend/serdes/export_utils.py — ZIP read/write, filename parsing/validation, per-file error resilience.
    • backend/serdes/views.py — streaming length‑prefixed block parsing, multi-status responses, permission checks, and DB/attachment coordination.
    • backend/core/migrations/0119_* & backend/core/models.py — backfill performance, chunked storage reads, and correct/robust hash computation in save().
    • cli/clica.py — streaming protocol, resume logic, manifest correctness and error handling.

Poem

🐰 I packed the files with careful paws tonight,

Zips snug as burrows, hashes shining bright,
I hopped through streams and stitched each byte in place,
Carrots and crumbs now travel safe through space,
Hooray — the rabbit's backup dance takes flight!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 28.57%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title clearly describes the main feature being added: a clica command for full backup/restore operations that includes both database and attachments.
  • Linked Issues check: ✅ Passed. The PR implements full backup/restore functionality addressing both #2869 (restore failures) and #2700 (import errors with user tokens) through comprehensive new export/import utilities and API endpoints with proper error handling.
  • Out of Scope Changes check: ✅ Passed. All changes are directly scoped to implementing the full backup/restore feature, including database export, attachment handling, CLI commands, migrations, and supporting infrastructure changes.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63c2bed and 0a40a6c.

📒 Files selected for processing (1)
  • .gitignore (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
.gitignore (2)

22-22: Standard Python cache pattern.

The __pycache__ entry correctly ignores Python bytecode cache directories, which should not be version-controlled.


23-23: Verify directory naming aligns with PR requirements.

The entry /cli/db_backup ignores the CLI backup directory, which is appropriate. However, the PR objectives reference "a default directory named db", while this entry specifies db_backup. Please confirm that the actual backup implementation matches this directory name and that it aligns with the stated PR requirements.

Optionally, consider adding a trailing / to make the directory-specific intent explicit: /cli/db_backup/




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (1)
cli/README.md (1)

404-419: Examples: consider adding a note about secure storage/permissions for backup dirs.
Given backups can contain sensitive evidence, it’s worth explicitly recommending restrictive permissions (e.g., chmod 700 on the backup directory).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7a141d and 4d42fb4.

📒 Files selected for processing (5)
  • backend/serdes/export_utils.py (1 hunks)
  • backend/serdes/urls.py (1 hunks)
  • backend/serdes/views.py (4 hunks)
  • cli/README.md (2 hunks)
  • cli/clica.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
backend/serdes/urls.py (1)
backend/serdes/views.py (2)
  • ExportAttachmentsView (245-298)
  • FullRestoreView (376-532)
backend/serdes/views.py (3)
backend/serdes/export_utils.py (4)
  • AttachmentExporter (15-82)
  • AttachmentImporter (85-185)
  • create_attachments_zip (61-82)
  • extract_attachments_from_zip (86-185)
backend/iam/models.py (2)
  • has_backup_permission (729-734)
  • username (742-743)
backend/core/utils.py (1)
  • compare_schema_versions (180-266)
cli/clica.py (1)
backend/serdes/views.py (5)
  • get (34-68)
  • get (246-298)
  • post (171-242)
  • post (304-373)
  • post (377-532)
backend/serdes/export_utils.py (2)
backend/core/models.py (2)
  • Evidence (3304-3384)
  • EvidenceRevision (3387-3450)
backend/serdes/views.py (2)
  • get (34-68)
  • get (246-298)
🪛 GitHub Check: CodeQL
backend/serdes/views.py

[warning] 345-350: Information exposure through an exception
Stack trace information flows to this location and may be exposed to an external user.


[warning] 355-359: Information exposure through an exception
Stack trace information flows to this location and may be exposed to an external user.


[warning] 503-507: Information exposure through an exception
Stack trace information flows to this location and may be exposed to an external user.

🪛 markdownlint-cli2 (0.18.1)
cli/README.md

288-288: Blank line inside blockquote

(MD028, no-blanks-blockquote)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build_community_frontend
  • GitHub Check: test (3.12)
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: build (3.12)
🔇 Additional comments (3)
backend/serdes/urls.py (1)

12-21: Routes look correct; confirm whether LoadAttachmentsView should also be exposed.
If LoadAttachmentsView is intended for standalone attachment restores (outside /full-restore/), it likely needs a path("load-attachments/", ...) entry; otherwise consider removing the view to avoid dead code.

cli/clica.py (1)

611-612: Command registration looks good.

backend/serdes/views.py (1)

51-63: Excluding iam.personalaccesstoken from dump/load is the right direction for restore robustness.
This aligns with the “regenerate PAT after restore” workflow and should help with token-related import/restore failures.

Also applies to: 127-136

Comment on lines +26 to +48
    def package_attachments_to_zip(
        self, revisions: QuerySet, zipf: zipfile.ZipFile
    ) -> int:
        count = 0

        for revision in revisions:
            if revision.attachment and default_storage.exists(revision.attachment.name):
                try:
                    with default_storage.open(revision.attachment.name, "rb") as file:
                        file_content = file.read()

                        filename = (
                            f"{revision.evidence_id}_v{revision.version}_"
                            f"{os.path.basename(revision.attachment.name)}"
                        )

                        zip_path = os.path.join(
                            "attachments", "evidence-revisions", filename
                        )

                        zipf.writestr(zip_path, file_content)
                        count += 1


⚠️ Potential issue | 🟠 Major

Exporter: avoid full in-RAM reads + use POSIX zip paths (Windows-safe).

 import io
 import os
+import posixpath
+import shutil
 import zipfile
@@
-                        zip_path = os.path.join(
-                            "attachments", "evidence-revisions", filename
-                        )
-
-                        zipf.writestr(zip_path, file_content)
+                        zip_path = posixpath.join(
+                            "attachments", "evidence-revisions", filename
+                        )
+                        with zipf.open(zip_path, "w") as zf:
+                            shutil.copyfileobj(file, zf, length=1024 * 1024)
                         count += 1

Also applies to: 61-82

🤖 Prompt for AI Agents
In backend/serdes/export_utils.py around lines 26-48 (and similarly 61-82), the
exporter currently reads entire attachment files into memory and builds zip
paths with os.path.join (which yields backslashes on Windows); change to stream
the file into the zip to avoid full in-RAM reads and use POSIX-style zip paths.
Specifically, open the storage file as a binary file-like object and create the
entry in the ZipFile via zipfile.ZipFile.open(zip_path, "w") and copy the file
contents into that entry in chunks (e.g., shutil.copyfileobj) instead of
file.read(); build zip_path using posixpath.join("attachments",
"evidence-revisions", filename) so the stored path uses forward slashes and
apply the same changes to the other block at lines 61-82.
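
For context, the streamed-write pattern suggested above can be exercised on its own. A minimal sketch follows, assuming plain local files instead of Django's default_storage; file names and paths are illustrative, not taken from the PR.

import posixpath
import shutil
import zipfile


def add_file_to_zip(zipf: zipfile.ZipFile, source_path: str, arcname: str) -> None:
    # Forward slashes are required inside ZIP archives regardless of the host OS.
    zip_path = posixpath.join("attachments", "evidence-revisions", arcname)
    # Stream the file into the archive in 1 MB chunks instead of reading it fully into memory.
    with open(source_path, "rb") as src, zipf.open(zip_path, "w") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)


if __name__ == "__main__":
    with zipfile.ZipFile("attachments.zip", "w", compression=zipfile.ZIP_DEFLATED) as zipf:
        add_file_to_zip(zipf, "example.pdf", "example.pdf")  # illustrative input file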

Comment on lines 108 to 170
                for file_path in attachment_files:
                    try:
                        filename = os.path.basename(file_path)
                        parts = filename.split("_", 2)

                        if len(parts) < 3:
                            stats["errors"].append(
                                f"Invalid filename format: {filename}"
                            )
                            continue

                        evidence_id = parts[0]
                        version_str = parts[1]
                        original_filename = parts[2]

                        if not version_str.startswith("v"):
                            stats["errors"].append(
                                f"Invalid version format in: {filename}"
                            )
                            continue

                        version = int(version_str[1:])

                        if not dry_run:
                            # Find the corresponding EvidenceRevision
                            try:
                                revision = EvidenceRevision.objects.get(
                                    evidence_id=evidence_id, version=version
                                )

                                file_content = zipf.read(file_path)

                                storage_path = (
                                    f"evidence-revisions/{evidence_id}/"
                                    f"v{version}/{original_filename}"
                                )

                                saved_path = default_storage.save(
                                    storage_path, io.BytesIO(file_content)
                                )

                                revision.attachment = saved_path
                                revision.save(update_fields=["attachment"])

                                stats["restored"] += 1

                            except EvidenceRevision.DoesNotExist:
                                stats["errors"].append(
                                    f"EvidenceRevision not found: "
                                    f"evidence_id={evidence_id}, version={version}"
                                )
                            except Exception as e:
                                stats["errors"].append(
                                    f"Failed to restore {filename}: {str(e)}"
                                )
                        else:
                            stats["restored"] += 1

                    except Exception as e:
                        stats["errors"].append(
                            f"Error processing {file_path}: {str(e)}"
                        )
                        continue

⚠️ Potential issue | 🟠 Major

Importer: make restores idempotent (avoid default_storage.save auto-renames + orphaned files).
If storage_path already exists, many storage backends will rename on save(), leaving old files behind and changing paths each restore run.

                         if not dry_run:
@@
-                                saved_path = default_storage.save(
-                                    storage_path, io.BytesIO(file_content)
-                                )
+                                # Prefer overwrite semantics for idempotent restores
+                                try:
+                                    if revision.attachment:
+                                        default_storage.delete(revision.attachment.name)
+                                except Exception:
+                                    logger.warning(
+                                        "Failed to delete previous attachment",
+                                        revision_id=revision.id,
+                                        attachment_name=getattr(revision.attachment, "name", None),
+                                    )
+                                try:
+                                    if default_storage.exists(storage_path):
+                                        default_storage.delete(storage_path)
+                                except Exception:
+                                    logger.warning(
+                                        "Failed to delete target path before restore",
+                                        storage_path=storage_path,
+                                    )
+
+                                saved_path = default_storage.save(
+                                    storage_path, io.BytesIO(file_content)
+                                )

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +245 to +299
class ExportAttachmentsView(APIView):
    def get(self, request, *args, **kwargs):
        if not request.user.has_backup_permission:
            logger.warning(
                "Unauthorized attachment export attempt",
                user=request.user.username,
            )
            return Response(status=status.HTTP_403_FORBIDDEN)

        logger.info(
            "Starting attachment export",
            user=request.user.username,
        )

        try:
            exporter = AttachmentExporter()
            zip_buffer, count = exporter.create_attachments_zip()

            if count == 0:
                logger.info("No attachments found to export")
                return Response(
                    {"message": "No attachments found"},
                    status=status.HTTP_404_NOT_FOUND,
                )

            # Create response with ZIP file
            timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            filename = f"ciso-assistant-attachments-{timestamp}.zip"

            response = HttpResponse(
                zip_buffer.getvalue(), content_type="application/zip"
            )
            response["Content-Disposition"] = f'attachment; filename="{filename}"'

            logger.info(
                "Attachment export completed successfully",
                attachments_count=count,
                filename=filename,
                zip_size=len(zip_buffer.getvalue()),
            )

            return response

        except Exception as e:
            logger.error(
                "Attachment export failed",
                user=request.user.username,
                error=str(e),
                exc_info=True,
            )
            return Response(
                {"error": "AttachmentExportFailed"},
                status=status.HTTP_500_INTERNAL_SERVER_ERROR,
            )


⚠️ Potential issue | 🟠 Major

ExportAttachmentsView: watch out for large in-memory ZIP responses.
This returns zip_buffer.getvalue() (full ZIP in RAM). Consider streaming/FileResponse (or at least spooling to a temp file) to prevent server OOM on large evidence sets.

🤖 Prompt for AI Agents
In backend/serdes/views.py around lines 245 to 299, the current implementation
builds the entire ZIP in memory with zip_buffer.getvalue(), risking OOM for
large exports; change to stream the ZIP to the client by writing the archive to
a spooled temporary file (e.g., tempfile.SpooledTemporaryFile with a sensible
max_size) or directly to a StreamingHttpResponse/FileResponse that yields
chunks, set the Content-Disposition and Content-Type headers from the temp
file/stream, ensure the temp file is rewound before streaming, and close/delete
the temp file after the response is sent to avoid retaining large buffers in
memory.
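
One possible shape for that fix, sketched with a spooled temporary file and Django's FileResponse; the standalone helper and the (arcname, file-like) input format are simplifying assumptions, not the PR's actual code.

import tempfile
import zipfile

from django.http import FileResponse


def build_zip_response(filename: str, files) -> FileResponse:
    # Spool the archive in memory up to 50 MB, then transparently spill to a temp file on disk.
    tmp = tempfile.SpooledTemporaryFile(max_size=50 * 1024 * 1024)
    with zipfile.ZipFile(tmp, "w", compression=zipfile.ZIP_DEFLATED) as zipf:
        for arcname, fileobj in files:  # iterable of (zip path, binary file-like) pairs
            with zipf.open(arcname, "w") as dst:
                while chunk := fileobj.read(1024 * 1024):
                    dst.write(chunk)
    tmp.seek(0)  # rewind before streaming
    # FileResponse streams the file in blocks and closes it when the response is consumed,
    # so the full archive never has to sit in a BytesIO buffer.
    return FileResponse(
        tmp,
        as_attachment=True,
        filename=filename,
        content_type="application/zip",
    )

A view could then return build_zip_response(filename, ...) instead of building an HttpResponse from zip_buffer.getvalue().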

Comment on lines 376 to 474
class FullRestoreView(APIView):
    def post(self, request, *args, **kwargs):
        if not request.user.has_backup_permission:
            logger.error("Unauthorized full restore attempt", user=request.user)
            return Response(status=status.HTTP_403_FORBIDDEN)

        logger.info(
            "Starting full restore (database + attachments)",
            user=request.user.username,
        )

        # Get uploaded files from multipart form data
        backup_file = request.FILES.get("backup")
        attachments_file = request.FILES.get("attachments")

        if not backup_file:
            return Response(
                {"error": "NoBackupFileProvided"}, status=status.HTTP_400_BAD_REQUEST
            )

        # Step 1: Restore database backup
        logger.info("Step 1/2: Restoring database backup")

        try:
            data = backup_file.read()
            is_gzip = data.startswith(GZIP_MAGIC_NUMBER)
            full_decompressed_data = gzip.decompress(data) if is_gzip else data

            full_decompressed_data = json.loads(full_decompressed_data)
            metadata, decompressed_data = full_decompressed_data
            metadata = metadata["meta"]

            current_version = VERSION.split("-")[0]
            backup_version = None
            schema_version = 0

            for metadata_part in metadata:
                backup_version = metadata_part.get("media_version")
                schema_version = metadata_part.get("schema_version")
                if backup_version is not None or schema_version is not None:
                    break

            try:
                schema_version_int = int(schema_version)
                compare_schema_versions(schema_version_int, backup_version)
                if backup_version != VERSION:
                    raise ValueError(
                        "The version of the current instance and the one that generated the backup are not the same."
                    )
            except (ValueError, TypeError) as e:
                logger.error(
                    "Invalid schema version format",
                    schema_version=schema_version,
                    exc_info=e,
                )
                return Response(
                    {"error": "InvalidSchemaVersion"},
                    status=status.HTTP_400_BAD_REQUEST,
                )

            is_enterprise = apps.is_installed("enterprise_core")
            if not is_enterprise:
                for obj in decompressed_data:
                    if obj["model"] != "iam.role":
                        continue
                    permissions = obj["fields"]["permissions"]
                    enterprise_perms_indices = [
                        i
                        for i, perm in enumerate(permissions)
                        if perm[1] == "enterprise_core"
                    ]
                    for perm_index in reversed(enterprise_perms_indices):
                        permissions.pop(perm_index)

                decompressed_data = [
                    obj
                    for obj in decompressed_data
                    if obj["model"].split(".", 1)[0] != "enterprise_core"
                ]

            decompressed_data = json.dumps(decompressed_data)

            # Reuse existing load_backup logic
            load_backup_view = LoadBackupView()
            db_response = load_backup_view.load_backup(
                request, decompressed_data, backup_version, current_version
            )

            if db_response.status_code != 200:
                return db_response

            logger.info("Database backup restored successfully")

        except Exception as e:
            logger.error("Database restore failed", exc_info=e)
            return Response(
                {"error": "DatabaseRestoreFailed"},
                status=status.HTTP_500_INTERNAL_SERVER_ERROR,
            )

⚠️ Potential issue | 🟠 Major

FullRestoreView: catch ValidationError from compare_schema_versions and avoid returning exception details.
As written, a schema/version incompatibility can bubble into the outer except Exception and become a 500 (DatabaseRestoreFailed) instead of a 400. Also, returning details: str(e) is a client info-leak.

+from django.core.exceptions import ValidationError
@@
-            except (ValueError, TypeError) as e:
+            except (ValidationError, ValueError, TypeError) as e:
                 logger.error(
                     "Invalid schema version format",
                     schema_version=schema_version,
                     exc_info=e,
                 )
+                if isinstance(e, ValidationError):
+                    # compare_schema_versions raises ValidationError({"error": "<code>"})
+                    code = None
+                    try:
+                        code = e.message_dict.get("error")
+                        if isinstance(code, list):
+                            code = code[0]
+                    except Exception:
+                        code = None
                 return Response(
-                    {"error": "InvalidSchemaVersion"},
+                    {"error": code or "InvalidSchemaVersion"},
                     status=status.HTTP_400_BAD_REQUEST,
                 )
@@
-                return Response(
-                    {
-                        "error": "AttachmentRestoreFailed",
-                        "message": "Database restored but attachments failed",
-                        "details": str(e),
-                    },
-                    status=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                )
+                return Response(
+                    {
+                        "error": "AttachmentRestoreFailed",
+                        "message": "Database restored but attachments failed",
+                    },
+                    status=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                )

Also applies to: 500-509

🤖 Prompt for AI Agents
backend/serdes/views.py around lines 376-474 (and similarly handle 500-509): the
compare_schema_versions call can raise a ValidationError that is currently
falling through to the broad except and resulting in a 500 and/or leaking
exception details; update the code to catch ValidationError (and similar
schema/version errors) separately after compare_schema_versions, log a concise
message without including sensitive exception details, and return a 400 response
like {"error":"InvalidSchemaVersion"}; ensure the outer broad except still
returns a 500 but does not include exception strings in the response body (only
log them if needed).

Comment on lines 450 to 507
@click.command()
@click.option(
    "--dest-dir",
    required=True,
    help="Destination directory to save backup files",
    type=click.Path(file_okay=False, dir_okay=True),
)
def backup_full(dest_dir):
    """Create a full backup including database and attachments"""
    if not TOKEN:
        print(
            "No authentication token available. Please set PAT token in .clica.env.",
            file=sys.stderr,
        )
        sys.exit(1)

    headers = {"Authorization": f"Token {TOKEN}"}
    dest_path = Path(dest_dir)
    dest_path.mkdir(parents=True, exist_ok=True)

    rprint("[bold blue]Exporting database backup...[/bold blue]")
    url = f"{API_URL}/serdes/dump-db/"
    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)

    if res.status_code != 200:
        rprint(
            f"[bold red]Error exporting database: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)

    backup_file = dest_path / "backup.json.gz"
    with open(backup_file, "wb") as f:
        f.write(res.content)
    rprint(f"[green]✓ Database backup saved to {backup_file}[/green]")

    rprint("[bold blue]Exporting attachments...[/bold blue]")
    url = f"{API_URL}/serdes/export-attachments/"
    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)

    if res.status_code == 404:
        rprint("[yellow]No attachments found, skipping attachments backup[/yellow]")
    elif res.status_code != 200:
        rprint(
            f"[bold red]Error exporting attachments: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)
    else:
        attachments_file = dest_path / "attachments.zip"
        with open(attachments_file, "wb") as f:
            f.write(res.content)
        rprint(f"[green]✓ Attachments backup saved to {attachments_file}[/green]")

    rprint("[bold green]Full backup completed successfully![/bold green]")


⚠️ Potential issue | 🟠 Major

backup_full: add timeouts + stream downloads to avoid OOM on large backups.
Current res.content buffers the full DB/ZIP in memory.

 def backup_full(dest_dir):
@@
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        stream=True,
+        timeout=(5, 300),
+    )
@@
-    with open(backup_file, "wb") as f:
-        f.write(res.content)
+    with open(backup_file, "wb") as f:
+        for chunk in res.iter_content(chunk_size=1024 * 1024):
+            if chunk:
+                f.write(chunk)
@@
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        stream=True,
+        timeout=(5, 300),
+    )
@@
-        with open(attachments_file, "wb") as f:
-            f.write(res.content)
+        with open(attachments_file, "wb") as f:
+            for chunk in res.iter_content(chunk_size=1024 * 1024):
+                if chunk:
+                    f.write(chunk)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@click.command()
@click.option(
    "--dest-dir",
    required=True,
    help="Destination directory to save backup files",
    type=click.Path(file_okay=False, dir_okay=True),
)
def backup_full(dest_dir):
    """Create a full backup including database and attachments"""
    if not TOKEN:
        print(
            "No authentication token available. Please set PAT token in .clica.env.",
            file=sys.stderr,
        )
        sys.exit(1)
    headers = {"Authorization": f"Token {TOKEN}"}
    dest_path = Path(dest_dir)
    dest_path.mkdir(parents=True, exist_ok=True)
    rprint("[bold blue]Exporting database backup...[/bold blue]")
    url = f"{API_URL}/serdes/dump-db/"
    res = requests.get(
        url,
        headers=headers,
        verify=VERIFY_CERTIFICATE,
        stream=True,
        timeout=(5, 300),
    )
    if res.status_code != 200:
        rprint(
            f"[bold red]Error exporting database: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)
    backup_file = dest_path / "backup.json.gz"
    with open(backup_file, "wb") as f:
        for chunk in res.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)
    rprint(f"[green]✓ Database backup saved to {backup_file}[/green]")
    rprint("[bold blue]Exporting attachments...[/bold blue]")
    url = f"{API_URL}/serdes/export-attachments/"
    res = requests.get(
        url,
        headers=headers,
        verify=VERIFY_CERTIFICATE,
        stream=True,
        timeout=(5, 300),
    )
    if res.status_code == 404:
        rprint("[yellow]No attachments found, skipping attachments backup[/yellow]")
    elif res.status_code != 200:
        rprint(
            f"[bold red]Error exporting attachments: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)
    else:
        attachments_file = dest_path / "attachments.zip"
        with open(attachments_file, "wb") as f:
            for chunk in res.iter_content(chunk_size=1024 * 1024):
                if chunk:
                    f.write(chunk)
        rprint(f"[green]✓ Attachments backup saved to {attachments_file}[/green]")
    rprint("[bold green]Full backup completed successfully![/bold green]")
🤖 Prompt for AI Agents
In cli/clica.py around lines 450 to 507, replace the current requests.get(...);
writing of res.content with streaming + timeouts: call requests.get with a
timeout (e.g. timeout=(connect_timeout, read_timeout)) and stream=True for both
the DB and attachments endpoints, check the response status code before
streaming, and write the response to file using
res.iter_content(chunk_size=<reasonable size>) inside a with-open, so you never
access res.content; ensure you close the response (or use it as a context
manager) and keep existing headers/verify args and existing 404 and error
handling logic.

Comment on lines 509 to 601
@click.command()
@click.option(
    "--src-dir",
    required=True,
    help="Source directory containing backup files",
    type=click.Path(exists=True, file_okay=False, dir_okay=True),
)
def restore_full(src_dir):
    """Restore a full backup including database and attachments"""
    if not TOKEN:
        print(
            "No authentication token available. Please set PAT token in .clica.env.",
            file=sys.stderr,
        )
        sys.exit(1)

    headers = {"Authorization": f"Token {TOKEN}"}
    src_path = Path(src_dir)

    # Check for required backup file
    backup_file = src_path / "backup.json.gz"
    if not backup_file.exists():
        rprint(
            f"[bold red]Error: backup.json.gz not found in {src_dir}[/bold red]",
            file=sys.stderr,
        )
        sys.exit(1)

    attachments_file = src_path / "attachments.zip"
    has_attachments = attachments_file.exists()

    rprint("[bold blue]Starting full restore (database + attachments)...[/bold blue]")

    # Prepare multipart form data
    files = {
        "backup": ("backup.json.gz", open(backup_file, "rb"), "application/gzip"),
    }

    if has_attachments:
        files["attachments"] = (
            "attachments.zip",
            open(attachments_file, "rb"),
            "application/zip",
        )
        rprint("[dim]Including attachments in restore[/dim]")
    else:
        rprint("[yellow]No attachments.zip found, restoring database only[/yellow]")

    url = f"{API_URL}/serdes/full-restore/"
    try:
        res = requests.post(
            url, headers=headers, files=files, verify=VERIFY_CERTIFICATE
        )
    finally:
        for file_tuple in files.values():
            file_tuple[1].close()

    if res.status_code != 200:
        rprint(
            f"[bold red]Error during restore: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        try:
            error_data = res.json()
            rprint(f"[red]{error_data}[/red]", file=sys.stderr)
        except:
            rprint(res.text, file=sys.stderr)
        sys.exit(1)

    result = res.json()

    rprint(f"[green]✓ Database restored successfully[/green]")

    if has_attachments:
        if "attachments_restored" in result:
            restored = result["attachments_restored"]
            processed = result.get("attachments_processed", restored)
            rprint(
                f"[green]✓ Attachments restored: {restored} of {processed} processed[/green]"
            )

        if result.get("attachment_errors"):
            rprint("[yellow]Errors encountered:[/yellow]")
            for error in result["attachment_errors"]:
                rprint(f" - {error}")

    if result.get("status") == "partial_success":
        rprint("[bold yellow]Full restore completed with warnings[/bold yellow]")
    else:
        rprint("[bold green]Full restore completed successfully![/bold green]")

    rprint("[dim]Note: You may need to regenerate your Personal Access Token[/dim]")


⚠️ Potential issue | 🟡 Minor

restore_full: add timeout; avoid bare except: when parsing error responses.

-        res = requests.post(
-            url, headers=headers, files=files, verify=VERIFY_CERTIFICATE
-        )
+        res = requests.post(
+            url,
+            headers=headers,
+            files=files,
+            verify=VERIFY_CERTIFICATE,
+            timeout=(5, 600),
+        )
@@
-        except:
+        except ValueError:
             rprint(res.text, file=sys.stderr)
🤖 Prompt for AI Agents
In cli/clica.py around lines 509 to 601, the restore_full HTTP call lacks a
timeout and the error parsing uses a bare except; change the requests.post call
to include a timeout (e.g. timeout=REQUEST_TIMEOUT or timeout=30) and keep the
existing headers/verify arguments as-is, and replace the bare except around res.json() with a
specific exception handler (except ValueError as e:) to handle JSON decode
errors and fall back to printing res.text; ensure files are still closed in the
finally block.

@eric-intuitem eric-intuitem self-requested a review December 13, 2025 20:12

@eric-intuitem eric-intuitem left a comment


Nice job. To finish it:

  • unzip the attachments
  • the default directory shall be db (create it if missing)
  • the db should be restored in sqlite3 format (ciso-assistant.sqlite3)


tchoumi313 commented Dec 13, 2025

Nice job. To finish it:

  • unzip the attachments
  • the default directory shall be db (create it if missing)
  • the db should be restored in sqlite3 format (ciso-assistant.sqlite3)

Thanks!

  • Okay.
  • That's the backup default directory, right? Okay.
  • The restore doesn't care about the database system behind it; it just flushes and then restores the data (just like the UI restore flow). Whether the CISO Assistant instance uses sqlite3 or postgres, the result will be the same (database fully restored).


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/core/models.py (1)

3381-3385: Use stored attachment_hash in Evidence property and avoid redundant logger initialization.

Evidence.attachment_hash (lines 3381–3385) currently reads the entire attachment file in a single operation using .read(), ignoring the new EvidenceRevision.attachment_hash stored field that is now computed efficiently with chunked I/O. Update the property to:

  1. Return the stored hash from last_revision.attachment_hash when available
  2. Fall back to chunked computation only for legacy rows without a stored hash
 @property
 def attachment_hash(self):
-    if not self.last_revision or not self.last_revision.attachment:
-        return None
-    return hashlib.sha256(self.last_revision.attachment.read()).hexdigest()
+    last = self.last_revision
+    if not last:
+        return None
+
+    # Prefer stored hash on the revision if present
+    if getattr(last, "attachment_hash", None):
+        return last.attachment_hash
+
+    # Fallback for legacy rows: compute on the fly (streamed)
+    if not last.attachment:
+        return None
+
+    hash_obj = hashlib.sha256()
+    for chunk in last.attachment.chunks():
+        hash_obj.update(chunk)
+    return hash_obj.hexdigest()

Also, in EvidenceRevision.save() around line 3487: remove the logger re-instantiation in the except block. A module-level logger is already imported; reuse it instead of calling get_logger(__name__) inside the exception handler.

♻️ Duplicate comments (4)
cli/clica.py (2)

488-505: backup_full still buffers DB in memory and lacks timeouts on heavy HTTP calls.

  • requests.get for /serdes/dump-db/ (Line 491) downloads the entire DB backup into memory via res.content, and has no timeout.
  • requests.get for /serdes/attachment-metadata/ (Line 527) and requests.post for /serdes/batch-download-attachments/ (Line 580) also have no timeouts.
  • For large instances this can hang indefinitely and/or OOM the CLI process.

Consider:

  • Adding stream=True + a reasonable timeout to the DB export and attachments batch calls.
  • Writing the DB backup to disk via iter_content instead of res.content.
  • Adding timeouts to metadata pagination requests.

Example for the DB export section:

-    url = f"{API_URL}/serdes/dump-db/"
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
-
-    if res.status_code != 200:
-        rprint(
-            f"[bold red]Error exporting database: {res.status_code}[/bold red]",
-            file=sys.stderr,
-        )
-        rprint(res.text, file=sys.stderr)
-        sys.exit(1)
-
-    backup_file = dest_path / "backup.json.gz"
-    with open(backup_file, "wb") as f:
-        f.write(res.content)
+    url = f"{API_URL}/serdes/dump-db/"
+    with requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        stream=True,
+        timeout=(5, 300),
+    ) as res:
+        if res.status_code != 200:
+            rprint(
+                f"[bold red]Error exporting database: {res.status_code}[/bold red]",
+                file=sys.stderr,
+            )
+            rprint(res.text, file=sys.stderr)
+            sys.exit(1)
+
+        backup_file = dest_path / "backup.json.gz"
+        with open(backup_file, "wb") as f:
+            for chunk in res.iter_content(chunk_size=1024 * 1024):
+                if chunk:
+                    f.write(chunk)

And for metadata + batch download:

-        res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+        res = requests.get(
+            url,
+            headers=headers,
+            verify=VERIFY_CERTIFICATE,
+            timeout=(5, 60),
+        )
@@
-            res = requests.post(
+            res = requests.post(
                 url,
                 headers=headers,
                 json={"revision_ids": batch_ids},
                 verify=VERIFY_CERTIFICATE,
-                stream=True,
+                stream=True,
+                timeout=(5, 600),
             )

Also applies to: 521-545, 570-588


710-801: restore_full: high memory usage when building attachments.dat, missing timeout, and bare except on JSON errors.

  1. Memory/OOM risk
    restore_full reads all attachment files into memory and concatenates them before writing to the temp file:

    body_parts = []
    ...
    file_bytes = f.read()
    ...
    body_parts.append(struct.pack(">I", total_size))
    body_parts.append(header_bytes)
    body_parts.append(file_bytes)
    ...
    temp_attachments.write(b"".join(body_parts))

    For large backups this is effectively O(total_attachments_size) RAM and can easily OOM.

    You can avoid this by streaming per file and using Path.stat() (or os.path.getsize) to compute the length prefix:

-        # Build streaming data for ALL attachments
-        body_parts = []
-
-        for file_info in to_upload:
-            # Read file
-            with open(file_info["path"], "rb") as f:
-                file_bytes = f.read()
-
-            # Build JSON header
-            header = {
-                "id": file_info["id"],
-                "evidence_id": file_info["evidence_id"],
-                "version": file_info["version"],
-                "filename": file_info["filename"],
-                "hash": file_info["hash"],
-                "size": len(file_bytes),
-            }
-            header_bytes = json.dumps(header).encode("utf-8")
-
-            # Calculate total size: header + file
-            total_size = len(header_bytes) + len(file_bytes)
-
-            # Add: [4-byte length][header][file]
-            body_parts.append(struct.pack(">I", total_size))
-            body_parts.append(header_bytes)
-            body_parts.append(file_bytes)
-
-        # Create temporary file for attachments binary data
-        temp_attachments = tempfile.NamedTemporaryFile(suffix=".dat", delete=False)
-        temp_attachments.write(b"".join(body_parts))
-        temp_attachments.close()
-        attachments_data = temp_attachments.name
-
-        rprint(
-            f"[dim]Prepared {len(to_upload)} attachments for upload ({sum(len(p) for p in body_parts) / 1024 / 1024:.1f} MB)[/dim]"
-        )
+        temp_attachments = tempfile.NamedTemporaryFile(suffix=".dat", delete=False)
+        total_bytes = 0
+
+        for file_info in to_upload:
+            file_path = file_info["path"]
+            file_size = file_path.stat().st_size
+
+            header = {
+                "id": file_info["id"],
+                "evidence_id": file_info["evidence_id"],
+                "version": file_info["version"],
+                "filename": file_info["filename"],
+                "hash": file_info["hash"],
+                "size": file_size,
+            }
+            header_bytes = json.dumps(header).encode("utf-8")
+            total_size = len(header_bytes) + file_size
+
+            # [4-byte length][header][file]
+            temp_attachments.write(struct.pack(">I", total_size))
+            temp_attachments.write(header_bytes)
+            with open(file_path, "rb") as f:
+                for chunk in iter(lambda: f.read(1024 * 1024), b""):
+                    temp_attachments.write(chunk)
+            total_bytes += total_size
+
+        temp_attachments.flush()
+        temp_attachments.close()
+        attachments_data = temp_attachments.name
+
+        rprint(
+            f"[dim]Prepared {len(to_upload)} attachments for upload ({total_bytes / 1024 / 1024:.1f} MB)[/dim]"
+        )
    
    
    
  2. No timeout on restore request
    The requests.post to /serdes/full-restore/ (Line 821) has no timeout; a stuck backend could hang the CLI indefinitely. Add a reasonable timeout:

-    res = requests.post(
-        url,
-        headers=headers,
-        files=files,
-        verify=VERIFY_CERTIFICATE,
-    )
+    res = requests.post(
+        url,
+        headers=headers,
+        files=files,
+        verify=VERIFY_CERTIFICATE,
+        timeout=(5, 600),
+    )
    
    
    
  3. Bare except when parsing error response
    The error path still uses a bare except:

    try:
        error_data = res.json()
        rprint(f"[red]{error_data}[/red]", file=sys.stderr)
    except:
        rprint(res.text, file=sys.stderr)

    This can mask unrelated errors and makes debugging harder. Narrow it to JSON decode errors:

-    try:
-        error_data = res.json()
-        rprint(f"[red]{error_data}[/red]", file=sys.stderr)
-    except:
-        rprint(res.text, file=sys.stderr)
+    try:
+        error_data = res.json()
+        rprint(f"[red]{error_data}[/red]", file=sys.stderr)
+    except ValueError:
+        # Not JSON, fallback to raw text
+        rprint(res.text, file=sys.stderr)
    
    
    

These changes will make full restores more robust on large datasets and avoid CLI hangs or silent masking of non‑JSON errors.

Also applies to: 802-872

backend/serdes/views.py (2)

279-281: Consider streaming for large ZIP exports.

Returning zip_buffer.getvalue() loads the entire ZIP into memory. For instances with many/large attachments, consider using StreamingHttpResponse or a spooled temp file to avoid OOM.

This was flagged in a previous review. The current implementation works but may not scale for very large attachment sets.


439-455: Catch ValidationError from compare_schema_versions.

compare_schema_versions raises ValidationError for schema version mismatches (per backend/core/utils.py lines 179-265), but only ValueError and TypeError are caught here. This causes version mismatches to bubble up to the generic exception handler at line 490, returning a 500 instead of a proper 400 with the error code.

+from django.core.exceptions import ValidationError
 ...
             try:
                 schema_version_int = int(schema_version)
                 compare_schema_versions(schema_version_int, backup_version)
                 if backup_version != VERSION:
                     raise ValueError(
                         "The version of the current instance and the one that generated the backup are not the same."
                     )
-            except (ValueError, TypeError) as e:
+            except (ValidationError, ValueError, TypeError) as e:
                 logger.error(
                     "Invalid schema version format",
                     schema_version=schema_version,
                     exc_info=e,
                 )
+                error_code = "InvalidSchemaVersion"
+                if isinstance(e, ValidationError):
+                    try:
+                        error_code = e.message_dict.get("error", error_code)
+                        if isinstance(error_code, list):
+                            error_code = error_code[0]
+                    except Exception:
+                        pass
                 return Response(
-                    {"error": "InvalidSchemaVersion"},
+                    {"error": error_code},
                     status=status.HTTP_400_BAD_REQUEST,
                 )
🧹 Nitpick comments (6)
cli/.clica.env.template (1)

1-3: VERIFY_CERTIFICATE default here weakens TLS verification compared to code default.

The CLI defaults VERIFY_CERTIFICATE to true when unset, but this template sets it to false, which will disable TLS verification for anyone copying it as‑is. Consider either:

  • defaulting to true in the template and documenting how to turn it off for local/self‑signed setups, or
  • adding a strong warning comment in this template about using false only for trusted local environments.
cli/README.md (1)

319-322: Fix blank line between blockquotes (MD028).

There's a blank line between the WARNING and TIP blockquotes which triggers markdownlint MD028. Remove the blank line at line 320.

 > [!WARNING]
 > Restoring a backup will **replace all existing data** in your CISO Assistant instance. Make sure you have a current backup before performing a restore operation.
-
 > [!TIP]
backend/serdes/export_utils.py (3)

34-46: Use POSIX paths for ZIP entries and consider streaming writes.

os.path.join produces backslashes on Windows, which violates the ZIP specification requiring forward slashes. Also, file.read() loads the entire file into memory.

+import posixpath
+import shutil
 ...
                 try:
                     with default_storage.open(revision.attachment.name, "rb") as file:
-                        file_content = file.read()
-
                         filename = (
                             f"{revision.evidence_id}_v{revision.version}_"
                             f"{os.path.basename(revision.attachment.name)}"
                         )
 
-                        zip_path = os.path.join(
+                        zip_path = posixpath.join(
                             "attachments", "evidence-revisions", filename
                         )
 
-                        zipf.writestr(zip_path, file_content)
+                        with zipf.open(zip_path, "w") as zf:
+                            shutil.copyfileobj(file, zf, length=1024 * 1024)
                         count += 1

145-150: Make restore idempotent by deleting existing files before save.

default_storage.save() may auto-rename files if the path exists, causing orphaned files and path drift on repeated restores.

                             storage_path = (
                                 f"evidence-revisions/{evidence_id}/"
                                 f"v{version}/{original_filename}"
                             )
 
+                            # Delete existing file for idempotent restore
+                            if revision.attachment:
+                                try:
+                                    default_storage.delete(revision.attachment.name)
+                                except Exception:
+                                    pass
+                            if default_storage.exists(storage_path):
+                                try:
+                                    default_storage.delete(storage_path)
+                                except Exception:
+                                    pass
+
                             saved_path = default_storage.save(
                                 storage_path, io.BytesIO(file_content)
                             )

159-162: Avoid exposing raw exception details in API responses.

str(e) may contain sensitive internal information. Log the full exception server-side and return a sanitized message.

                             except Exception as e:
+                                logger.error(
+                                    "Failed to restore attachment",
+                                    filename=filename,
+                                    evidence_id=evidence_id,
+                                    version=version,
+                                    error=str(e),
+                                    exc_info=True,
+                                )
                                 stats["errors"].append(
-                                    f"Failed to restore {filename}: {str(e)}"
+                                    f"Failed to restore {filename}"
                                 )
backend/serdes/views.py (1)

870-884: File is fully loaded into memory before streaming.

Despite the comment "Read file in chunks", the entire file is buffered in file_buffer before yielding. For large files, this defeats the streaming benefit. However, fixing this requires knowing total_size upfront for the length prefix, which complicates true streaming.

One approach: compute size via default_storage.size() first, then stream chunks directly:

file_size = default_storage.size(revision.attachment.name)
total_size = len(header_bytes) + file_size
yield struct.pack(">I", total_size)
yield header_bytes
with default_storage.open(revision.attachment.name, "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        yield chunk
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d42fb4 and 346b00a.

📒 Files selected for processing (9)
  • backend/core/migrations/0119_add_attachment_hash_to_evidencerevision.py (1 hunks)
  • backend/core/models.py (3 hunks)
  • backend/serdes/export_utils.py (1 hunks)
  • backend/serdes/urls.py (1 hunks)
  • backend/serdes/views.py (4 hunks)
  • cli/.clica.env.template (1 hunks)
  • cli/.gitignore (1 hunks)
  • cli/README.md (2 hunks)
  • cli/clica.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cli/.gitignore
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-19T08:46:34.495Z
Learnt from: Mohamed-Hacene
Repo: intuitem/ciso-assistant-community PR: 2550
File: backend/core/serializers.py:1277-1292
Timestamp: 2025-09-19T08:46:34.495Z
Learning: In EvidenceWriteSerializer, the link and attachment fields are read-only, so updates to these fields should be done through EvidenceRevisionWriteSerializer to maintain proper versioning.

Applied to files:

  • backend/core/models.py
📚 Learning: 2025-09-19T09:08:52.306Z
Learnt from: Mohamed-Hacene
Repo: intuitem/ciso-assistant-community PR: 2550
File: backend/core/migrations/0098_evidence_lifecycle_data_migration.py:21-28
Timestamp: 2025-09-19T09:08:52.306Z
Learning: EvidenceRevision.save() method automatically sets is_published from the parent Evidence object and handles folder assignment from the evidence's folder with proper safety checks.

Applied to files:

  • backend/core/models.py
🧬 Code graph analysis (4)
backend/core/models.py (1)
backend/core/utils.py (1)
  • sha256 (37-41)
backend/serdes/views.py (4)
backend/serdes/export_utils.py (4)
  • AttachmentExporter (15-82)
  • AttachmentImporter (85-196)
  • create_attachments_zip (61-82)
  • extract_attachments_from_zip (86-196)
backend/core/views.py (5)
  • get (1005-1012)
  • get (6597-6598)
  • attachment (7156-7186)
  • attachment (7203-7228)
  • filter (190-216)
backend/core/models.py (21)
  • EvidenceRevision (3388-3508)
  • filename (3356-3361)
  • filename (3489-3490)
  • attachment_hash (3382-3385)
  • delete (971-984)
  • delete (5189-5196)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/core/utils.py (2)
  • compare_schema_versions (180-266)
  • sha256 (37-41)
backend/core/migrations/0119_add_attachment_hash_to_evidencerevision.py (2)
backend/core/models.py (18)
  • EvidenceRevision (3388-3508)
  • update (874-886)
  • attachment_hash (3382-3385)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/core/utils.py (1)
  • sha256 (37-41)
backend/serdes/urls.py (1)
backend/serdes/views.py (5)
  • ExportAttachmentsView (250-303)
  • FullRestoreView (387-703)
  • AttachmentMetadataView (706-793)
  • BatchDownloadAttachmentsView (796-917)
  • BatchUploadAttachmentsView (920-1131)
🪛 markdownlint-cli2 (0.18.1)
cli/README.md

321-321: Blank line inside blockquote

(MD028, no-blanks-blockquote)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build_community_frontend
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: test (3.12)
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: build (3.12)
  • GitHub Check: Analyze (python)
🔇 Additional comments (5)
backend/core/migrations/0119_add_attachment_hash_to_evidencerevision.py (1)

8-43: Backfill migration looks well-designed.

The chunked reading (1MB), iterator with chunk_size=100, and per-record error handling ensure the migration runs safely on large datasets without OOM. Progress logging every 100 records is helpful for monitoring.

One minor consideration: if a storage backend is slow or unavailable during migration, this could take a long time or fail silently for many records. Consider adding a summary warning if errors > 0 at the end.
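
For illustration, a minimal sketch of such a chunked backfill with a final error summary — hypothetical helper names, assuming the app label is core and the field is attachment_hash, not the actual migration code:

import hashlib

def backfill_attachment_hashes(apps, schema_editor):
    """Backfill SHA-256 hashes for existing attachments, reading in 1 MB chunks (sketch)."""
    from django.core.files.storage import default_storage

    EvidenceRevision = apps.get_model("core", "EvidenceRevision")
    errors = 0

    for revision in EvidenceRevision.objects.exclude(attachment="").iterator(chunk_size=100):
        try:
            digest = hashlib.sha256()
            with default_storage.open(revision.attachment.name, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            revision.attachment_hash = digest.hexdigest()
            revision.save(update_fields=["attachment_hash"])
        except Exception:
            errors += 1  # keep going; one unreadable file should not abort the migration

    if errors:
        print(f"Attachment hash backfill finished with {errors} errors")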

backend/serdes/urls.py (1)

12-37: URL patterns are well-structured.

The new endpoints follow REST conventions and are logically grouped. The comment on line 22 helps clarify the streaming batch endpoints section.

cli/README.md (1)

225-268: Documentation for backup_full is comprehensive and clear.

The explanation of the streaming approach, resume capability, and hash verification is well-documented. The technical details section provides useful context for understanding the protocol format.

backend/serdes/views.py (2)

60-64: Good fix: Excluding PAT from backup/restore.

Adding "iam.personalaccesstoken" to the exclude list addresses issue #2700 where restoring a backup with personal authentication tokens caused IntegrityError: NOT NULL constraint failed.


306-372: LoadAttachmentsView implementation looks correct.

Proper permission check, file type validation, and partial success handling with 207 status for mixed results. Error details are logged server-side while the response provides a sanitized count-only summary.
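
As a reference, the partial-success pattern boils down to something like this in DRF (illustrative helper, not the actual view code):

from rest_framework import status
from rest_framework.response import Response

def build_import_response(imported: int, errors: list[str]) -> Response:
    """Return 200 on full success, 207 Multi-Status when some files failed (sketch)."""
    if not errors:
        return Response({"imported": imported}, status=status.HTTP_200_OK)
    # Error details are logged server-side; clients only receive counts.
    return Response(
        {"imported": imported, "failed": len(errors)},
        status=status.HTTP_207_MULTI_STATUS,
    )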

Comment on lines +521 to +545
# Fetch all attachment metadata from API (paginated)
rprint("[dim]Fetching attachment metadata from server...[/dim]")
all_metadata = []
url = f"{API_URL}/serdes/attachment-metadata/"

while url:
    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)

    if res.status_code != 200:
        rprint(
            f"[bold red]Error fetching metadata: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)

    data = res.json()
    all_metadata.extend(data["results"])
    url = data.get("next")
    if url and not url.startswith("http"):
        # Convert relative URL to absolute
        url = f"{API_URL}/serdes/attachment-metadata/{url}"

rprint(f"[cyan]Found {len(all_metadata)} total attachments[/cyan]")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the relevant code section
fd -e py -p "cli/clica.py" | head -1 | xargs -I {} sh -c 'wc -l {} && echo "---" && sed -n "515,550p" {}'

Repository: intuitem/ciso-assistant-community

Length of output: 1434


🏁 Script executed:

# Check for API_URL definition
rg "API_URL\s*=" cli/clica.py -B 2 -A 2

Repository: intuitem/ciso-assistant-community

Length of output: 289


🏁 Script executed:

# Look for other pagination patterns in the codebase
rg "\.get\(.*next" --type py -A 3 -B 1

Repository: intuitem/ciso-assistant-community

Length of output: 1388


🏁 Script executed:

# Check if there are tests for this pagination logic
fd -e py -p "test" | xargs rg "attachment-metadata|pagination" 2>/dev/null

Repository: intuitem/ciso-assistant-community

Length of output: 61


🏁 Script executed:

# Find the attachment-metadata endpoint in the backend
rg "attachment-metadata" --type py | head -20

Repository: intuitem/ciso-assistant-community

Length of output: 331


🏁 Script executed:

# Look for DRF pagination configuration
rg "DEFAULT_PAGINATION_CLASS|pagination" backend/ --type py -B 2 -A 2 | head -40

Repository: intuitem/ciso-assistant-community

Length of output: 1067


🏁 Script executed:

# Search for the serializer or view handling attachment-metadata
fd -e py -p "backend" | xargs rg "attachment.metadata|AttachmentMetadata" | head -20

Repository: intuitem/ciso-assistant-community

Length of output: 476


🏁 Script executed:

# Check Django REST Framework settings
fd -e py -p "backend.*settings" | xargs cat 2>/dev/null | grep -A 10 -B 2 "REST_FRAMEWORK"

Repository: intuitem/ciso-assistant-community

Length of output: 1957


🏁 Script executed:

# Read the custom pagination class
cat -n backend/core/pagination.py

Repository: intuitem/ciso-assistant-community

Length of output: 919


🏁 Script executed:

# Check the AttachmentMetadataView implementation
rg "class AttachmentMetadataView" backend/serdes/views.py -A 50

Repository: intuitem/ciso-assistant-community

Length of output: 1890


🏁 Script executed:

# Get the complete AttachmentMetadataView to see how it returns paginated data
rg "class AttachmentMetadataView" backend/serdes/views.py -A 100 | head -80

Repository: intuitem/ciso-assistant-community

Length of output: 3012


🏁 Script executed:

# Check what the paginator.get_paginated_response returns
rg "get_paginated_response" backend/serdes/views.py -B 3 -A 3

Repository: intuitem/ciso-assistant-community

Length of output: 59


URL reconstruction logic doesn't safely handle absolute-path pagination links.

The code assumes next is either an absolute URL or query-only (?page=2), but if the backend ever returns an absolute path like /serdes/attachment-metadata/?page=2, the concatenation creates a malformed URL:

url = f"{API_URL}/serdes/attachment-metadata/{url}"  # → {API_URL}/serdes/attachment-metadata//serdes/attachment-metadata/?page=2

Use urllib.parse.urljoin or check if the URL starts with / to handle both cases correctly.
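
A small sketch of the safer reconstruction (illustrative helper, not the current clica.py code):

from urllib.parse import urljoin

def next_page_url(api_url: str, next_link: str | None) -> str | None:
    """Normalize DRF pagination links to absolute URLs (sketch)."""
    if not next_link:
        return None
    if next_link.startswith("http"):
        return next_link
    if next_link.startswith("/"):
        # Absolute path: join against the API base.
        return api_url.rstrip("/") + next_link
    # Query-only or relative link: resolve against the endpoint.
    return urljoin(api_url.rstrip("/") + "/serdes/attachment-metadata/", next_link)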

🤖 Prompt for AI Agents
In cli/clica.py around lines 521 to 545, the pagination URL reconstruction
blindly prepends API_URL plus the path, which breaks when the server returns an
absolute-path next value (e.g. "/serdes/attachment-metadata/?page=2"); update
the logic to build the next URL safely by using urllib.parse.urljoin(API_URL,
next) or by checking for a leading slash and concatenating appropriately (i.e.,
if next startswith "http" use it as-is, elif next.startswith("/") use
API_URL.rstrip("/") + next, else join the base path and next), so both
relative-query and absolute-path pagination links produce valid absolute URLs.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

♻️ Duplicate comments (4)
backend/serdes/export_utils.py (2)

26-59: [Duplicate] Exporter: still reading full files into memory and using OS-specific path separators.

The past review comment on lines 26-48 identified that this code reads entire attachment files into memory (line 35) and uses os.path.join (lines 42-44) which produces backslashes on Windows. These issues remain unaddressed and can cause out-of-memory errors for large attachments and cross-platform compatibility issues.

Apply the previously suggested fix:

 import io
 import os
+import posixpath
+import shutil
 import zipfile
                 try:
                     with default_storage.open(revision.attachment.name, "rb") as file:
-                        file_content = file.read()
-
                         filename = (
                             f"{revision.evidence_id}_v{revision.version}_"
                             f"{os.path.basename(revision.attachment.name)}"
                         )
 
-                        zip_path = os.path.join(
+                        zip_path = posixpath.join(
                             "attachments", "evidence-revisions", filename
                         )
 
-                        zipf.writestr(zip_path, file_content)
+                        with zipf.open(zip_path, "w") as zf:
+                            shutil.copyfileobj(file, zf, length=1024 * 1024)
                         count += 1

108-174: [Duplicate] Importer: restore is not idempotent and reads full files into memory.

The past review comment on lines 108-174 identified two issues that remain unaddressed:

  1. Non-idempotent restores: At line 145, default_storage.save() will auto-rename if the path exists, leaving orphaned files and changing paths on each restore run.
  2. Full in-memory reads: Line 138 reads the entire file content into memory with zipf.read().

Apply the previously suggested fix for idempotency:

                         if not dry_run:
                             # Find the corresponding EvidenceRevision
                             try:
                                 revision = EvidenceRevision.objects.get(
                                     evidence_id=evidence_id, version=version
                                 )
 
                                 file_content = zipf.read(file_path)
 
                                 storage_path = (
                                     f"evidence-revisions/{evidence_id}/"
                                     f"v{version}/{original_filename}"
                                 )
 
+                                # Prefer overwrite semantics for idempotent restores
+                                try:
+                                    if revision.attachment:
+                                        default_storage.delete(revision.attachment.name)
+                                except Exception:
+                                    logger.warning(
+                                        "Failed to delete previous attachment",
+                                        revision_id=revision.id,
+                                        attachment_name=getattr(revision.attachment, "name", None),
+                                    )
+                                try:
+                                    if default_storage.exists(storage_path):
+                                        default_storage.delete(storage_path)
+                                except Exception:
+                                    logger.warning(
+                                        "Failed to delete target path before restore",
+                                        storage_path=storage_path,
+                                    )
+
                                 saved_path = default_storage.save(
                                     storage_path, io.BytesIO(file_content)
                                 )

For the memory issue, consider streaming from the ZIP directly to storage to avoid loading entire files into RAM.

backend/serdes/views.py (2)

250-303: [Duplicate] ExportAttachmentsView: large in-memory ZIP can cause OOM.

The past review comment identified that returning zip_buffer.getvalue() (line 280) loads the entire ZIP into memory, risking out-of-memory errors for large evidence sets. This issue remains unaddressed.

Consider using a SpooledTemporaryFile or streaming response:

+import tempfile
+from django.http import FileResponse
         try:
             exporter = AttachmentExporter()
-            zip_buffer, count = exporter.create_attachments_zip()
+            revisions = exporter.collect_all_attachments()
 
-            if count == 0:
+            if revisions.count() == 0:
                 logger.info("No attachments found to export")
                 return Response(
                     {"message": "No attachments found"},
                     status=status.HTTP_404_NOT_FOUND,
                 )
 
             # Create response with ZIP file
             timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
             filename = f"ciso-assistant-attachments-{timestamp}.zip"
 
-            response = HttpResponse(
-                zip_buffer.getvalue(), content_type="application/zip"
-            )
+            # Use temp file to avoid OOM
+            temp_file = tempfile.SpooledTemporaryFile(max_size=10*1024*1024)
+            with zipfile.ZipFile(temp_file, "w", zipfile.ZIP_DEFLATED) as zipf:
+                count = exporter.package_attachments_to_zip(revisions, zipf)
+            temp_file.seek(0)
+            
+            response = FileResponse(
+                temp_file, content_type="application/zip", as_attachment=True
+            )
             response["Content-Disposition"] = f'attachment; filename="{filename}"'

439-455: [Duplicate] FullRestoreView: catch ValidationError from compare_schema_versions.

The past review comment noted that compare_schema_versions (line 441) can raise ValidationError on schema incompatibility, but the current code only catches ValueError and TypeError (line 446). This means validation errors will bubble to the outer exception handler and return an HTTP 500 instead of the more appropriate HTTP 400.

Apply the suggested fix:

+from django.core.exceptions import ValidationError
             try:
                 schema_version_int = int(schema_version)
                 compare_schema_versions(schema_version_int, backup_version)
                 if backup_version != VERSION:
                     raise ValueError(
                         "The version of the current instance and the one that generated the backup are not the same."
                     )
-            except (ValueError, TypeError) as e:
+            except (ValidationError, ValueError, TypeError) as e:
                 logger.error(
                     "Invalid schema version format",
                     schema_version=schema_version,
                     exc_info=e,
                 )
                 return Response(
                     {"error": "InvalidSchemaVersion"},
                     status=status.HTTP_400_BAD_REQUEST,
                 )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 346b00a and 23d41a5.

📒 Files selected for processing (2)
  • backend/serdes/export_utils.py (1 hunks)
  • backend/serdes/views.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
backend/serdes/export_utils.py (2)
backend/core/models.py (19)
  • Evidence (3305-3385)
  • EvidenceRevision (3388-3508)
  • filename (3356-3361)
  • filename (3489-3490)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/serdes/views.py (3)
  • get (39-73)
  • get (251-303)
  • get (712-793)
backend/serdes/views.py (5)
backend/serdes/export_utils.py (4)
  • AttachmentExporter (15-82)
  • AttachmentImporter (85-194)
  • create_attachments_zip (61-82)
  • extract_attachments_from_zip (86-194)
backend/core/models.py (19)
  • EvidenceRevision (3388-3508)
  • filename (3356-3361)
  • filename (3489-3490)
  • attachment_hash (3382-3385)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/core/utils.py (2)
  • compare_schema_versions (180-266)
  • sha256 (37-41)
backend/serdes/serializers.py (1)
  • LoadBackupSerializer (20-24)
backend/iam/models.py (1)
  • has_backup_permission (729-734)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build_community_frontend
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: build (3.12)
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: test (3.12)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
backend/serdes/views.py (3)

60-60: Good fix for personal authentication token issue.

Excluding iam.personalaccesstoken from the backup export addresses issue #2700 where importing a database with personal authentication tokens caused integrity constraint failures.

Based on PR objectives: This change resolves the reported issue where exports containing users with personal authentication tokens would fail to import with an IntegrityError.


306-384: LoadAttachmentsView: good error handling and sanitization.

The error handling properly avoids exposing raw exception details to clients (lines 360, 382) while logging full details server-side with exc_info=True. The use of HTTP 207 for partial success is appropriate.


1098-1118: BatchUploadAttachmentsView: proper error sanitization.

The error handling properly sanitizes exception details before returning to clients (lines 1100, 1115) while logging full details with exc_info=True. This addresses the previous security concerns about information exposure.

Comment on lines +870 to +884
# Read file in chunks
file_buffer = io.BytesIO()
with default_storage.open(revision.attachment.name, "rb") as f:
    for chunk in iter(lambda: f.read(10240 * 1024), b""):
        file_buffer.write(chunk)

file_bytes = file_buffer.getvalue()

# Calculate total size: header_length + header + file
total_size = len(header_bytes) + len(file_bytes)

# Yield: [4-byte total size][header bytes][file bytes]
yield struct.pack(">I", total_size)
yield header_bytes
yield file_bytes
⚠️ Potential issue | 🟠 Major

BatchDownloadAttachmentsView: avoid buffering entire file before streaming.

Lines 870-876 read the entire attachment into an io.BytesIO buffer before yielding, which defeats the purpose of streaming and can cause memory issues for large files.

Stream the file content directly:

-                    # Read file in chunks
-                    file_buffer = io.BytesIO()
+                    # Calculate file size first
+                    file_size = default_storage.size(revision.attachment.name)
+                    
+                    # Calculate total size: header + file
+                    total_size = len(header_bytes) + file_size
+                    
+                    # Yield: [4-byte total size][header bytes]
+                    yield struct.pack(">I", total_size)
+                    yield header_bytes
+                    
+                    # Stream file in chunks
                     with default_storage.open(revision.attachment.name, "rb") as f:
-                        for chunk in iter(lambda: f.read(10240 * 1024), b""):
-                            file_buffer.write(chunk)
-
-                    file_bytes = file_buffer.getvalue()
-
-                    # Calculate total size: header_length + header + file
-                    total_size = len(header_bytes) + len(file_bytes)
-
-                    # Yield: [4-byte total size][header bytes][file bytes]
-                    yield struct.pack(">I", total_size)
-                    yield header_bytes
-                    yield file_bytes
+                        for chunk in iter(lambda: f.read(1024 * 1024), b""):
+                            yield chunk
🤖 Prompt for AI Agents
In backend/serdes/views.py around lines 870 to 884, the code currently reads the
entire attachment into an io.BytesIO before yielding which uses excessive
memory; instead obtain the remote file size via
default_storage.size(revision.attachment.name), compute total_size =
len(header_bytes) + file_size, yield the 4-byte total size and header_bytes,
then open the file with default_storage.open(revision.attachment.name, "rb") and
stream its contents by iterating over f.read(CHUNK_SIZE) and yielding each chunk
(no in-memory buffering); remove the temporary BytesIO and ensure CHUNK_SIZE
matches previous value (e.g., 10*1024*1024).

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 3

♻️ Duplicate comments (6)
cli/clica.py (3)

488-504: Database backup still buffers entire response in memory and lacks timeout.

The res.content approach loads the entire database backup into memory before writing. For large instances this can cause OOM. Additionally, no timeout is specified.

Apply streaming and timeout as suggested in the previous review:

-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        stream=True,
+        timeout=(5, 300),
+    )
 
     if res.status_code != 200:
         ...
 
     backup_file = dest_path / "backup.json.gz"
     with open(backup_file, "wb") as f:
-        f.write(res.content)
+        for chunk in res.iter_content(chunk_size=1024 * 1024):
+            if chunk:
+                f.write(chunk)

537-553: Missing timeout on metadata fetch; URL reconstruction still flawed.

The pagination requests lack timeouts. Additionally, the URL handling logic doesn't correctly handle absolute-path responses (e.g., /serdes/attachment-metadata/?page=2).

+from urllib.parse import urljoin
+
 while url:
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE, timeout=(5, 60))
     ...
     url = data.get("next")
-    if url and not url.startswith("http"):
-        # Convert relative URL to absolute
-        url = f"{API_URL}/serdes/attachment-metadata/{url}"
+    if url and not url.startswith("http"):
+        url = urljoin(API_URL + "/", url)

845-871: Restore request lacks timeout; bare except: catches too broadly.

These issues were flagged in the previous review and remain unaddressed.

     res = requests.post(
         url,
         headers=headers,
         files=files,
         verify=VERIFY_CERTIFICATE,
+        timeout=(5, 600),
     )
 ...
     try:
         error_data = res.json()
         rprint(f"[red]{error_data}[/red]", file=sys.stderr)
-    except:
+    except ValueError:
         rprint(res.text, file=sys.stderr)
backend/serdes/views.py (3)

279-288: Memory concern with full ZIP in RAM remains unaddressed.

The zip_buffer.getvalue() approach loads the entire ZIP into memory, which risks OOM on large attachment sets. This was flagged in a previous review.


870-884: File buffering defeats streaming purpose - previously flagged.

The entire attachment is read into file_buffer BytesIO before yielding, which defeats the purpose of streaming and can cause memory issues for large files. This was flagged in a previous review with a suggested fix to calculate file size first and stream directly.


439-455: ValidationError from compare_schema_versions not caught.

The compare_schema_versions function raises ValidationError for version mismatches (e.g., {"error": "backupGreaterVersionError"}), but only ValueError and TypeError are caught here. A version mismatch will fall through to the outer except Exception at line 490 and return a generic DatabaseRestoreFailed (500) instead of the appropriate InvalidSchemaVersion (400).

Apply this diff to handle ValidationError:

 from core.utils import compare_schema_versions
+from django.core.exceptions import ValidationError
             try:
                 schema_version_int = int(schema_version)
                 compare_schema_versions(schema_version_int, backup_version)
                 if backup_version != VERSION:
                     raise ValueError(
                         "The version of the current instance and the one that generated the backup are not the same."
                     )
-            except (ValueError, TypeError) as e:
+            except ValidationError as e:
+                # Handle version validation errors from compare_schema_versions
+                error_code = "InvalidSchemaVersion"
+                if hasattr(e, 'message_dict') and 'error' in e.message_dict:
+                    error_code = e.message_dict['error']
+                    if isinstance(error_code, list):
+                        error_code = error_code[0]
+                logger.error(
+                    "Schema version validation failed",
+                    schema_version=schema_version,
+                    error_code=error_code,
+                )
+                return Response(
+                    {"error": error_code},
+                    status=status.HTTP_400_BAD_REQUEST,
+                )
+            except (ValueError, TypeError) as e:
                 logger.error(
                     "Invalid schema version format",
                     schema_version=schema_version,
                     exc_info=e,
                 )
                 return Response(
                     {"error": "InvalidSchemaVersion"},
                     status=status.HTTP_400_BAD_REQUEST,
                 )
🧹 Nitpick comments (4)
cli/clica.py (1)

632-647: Header parsing is inefficient but functional.

The byte-by-byte JSON parsing attempts up to 1024 decode operations per block. This works but is O(n²) in the worst case.

If the protocol can be modified, consider having the backend send the header length as a prefix (similar to the block length) to enable direct parsing. This would be a coordinated change with the backend.
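
A sketch of what that coordinated change could look like — an inner header-length prefix so the client splits header from payload in one step (hypothetical framing, not the current protocol):

import json
import struct

def pack_block_body(header: dict, payload: bytes) -> bytes:
    """Block body as [4-byte header length][header JSON][payload]; the outer
    [4-byte total size] prefix of the existing protocol stays unchanged."""
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack(">I", len(header_bytes)) + header_bytes + payload

def unpack_block_body(body: bytes) -> tuple[dict, bytes]:
    """Split header and payload with a single json.loads, no trial decoding."""
    (header_len,) = struct.unpack(">I", body[:4])
    header = json.loads(body[4 : 4 + header_len].decode("utf-8"))
    return header, body[4 + header_len :]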

backend/serdes/views.py (3)

502-667: Consider extracting shared attachment processing logic.

The attachment restoration code here (binary parsing, header validation, hash verification, storage save) is largely duplicated in BatchUploadAttachmentsView. Extracting this into a utility function would reduce maintenance burden and ensure consistent behavior.


749-778: Per-item storage I/O may impact performance on large result sets.

Each iteration calls default_storage.size() and default_storage.exists(), which may be slow depending on the storage backend (especially for S3/remote storage). For pagination with many items, consider caching file sizes in the model or limiting default page size.


980-988: Header parsing is functional but inefficient.

The loop attempts to parse progressively longer byte sequences as JSON, making up to n parse attempts for a header of length n (quadratic work overall). Consider including a header length prefix in the binary format for more efficient parsing in future versions.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23d41a5 and b64efb2.

📒 Files selected for processing (3)
  • backend/serdes/views.py (4 hunks)
  • cli/.clica.env (0 hunks)
  • cli/clica.py (3 hunks)
💤 Files with no reviewable changes (1)
  • cli/.clica.env
🧰 Additional context used
🧬 Code graph analysis (1)
backend/serdes/views.py (4)
backend/serdes/export_utils.py (4)
  • AttachmentExporter (15-82)
  • AttachmentImporter (85-194)
  • create_attachments_zip (61-82)
  • extract_attachments_from_zip (86-194)
backend/core/views.py (5)
  • get (1005-1012)
  • get (6597-6598)
  • attachment (7156-7186)
  • attachment (7203-7228)
  • filter (190-216)
backend/core/models.py (21)
  • EvidenceRevision (3388-3508)
  • filename (3356-3361)
  • filename (3489-3490)
  • attachment_hash (3382-3385)
  • delete (971-984)
  • delete (5189-5196)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/core/utils.py (2)
  • compare_schema_versions (180-266)
  • sha256 (37-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: test (3.12)
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: build (3.12)
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: build_community_frontend
  • GitHub Check: Analyze (python)
🔇 Additional comments (8)
cli/clica.py (2)

2-7: LGTM!

The new standard library imports are appropriate for the streaming protocol and integrity verification features.


906-907: LGTM!

Commands are properly registered following the existing pattern.

backend/serdes/views.py (6)

1-33: LGTM!

The imports are appropriately organized and all added modules are utilized by the new views.


60-60: Good fix for Issue #2700.

Excluding iam.personalaccesstoken from the backup prevents the IntegrityError: NOT NULL constraint failed: global_settings_globalsettings.folder_id that occurred when importing databases containing personal access tokens.


136-136: Consistent exclusion during import.

Correctly mirrors the export exclusion to ensure personal access tokens are excluded from the restore process.


354-363: Error handling looks appropriate.

Error details are logged server-side while clients receive sanitized messages. The "Check server logs for details" guidance is appropriate for operators.


1097-1117: Error handling improvements look good.

Exception details are now sanitized - using generic messages like "Error processing block" and "An internal error has occurred" instead of exposing str(e). Full details are appropriately logged server-side.


947-950: Reading the entire request body into memory may cause issues with very large uploads, but Django's DATA_UPLOAD_MAX_MEMORY_SIZE setting (2.5 MB by default) already protects against this: when request.body is accessed, Django enforces a size limit computed against the total request size excluding file-upload data. For file uploads, Django's MemoryFileUploadHandler and TemporaryFileUploadHandler read small files into memory and spill large ones to disk. However, if this code processes raw (non-multipart) request bodies or batches that bypass the standard upload handlers, verify that adequate size limits are configured for your use case, and consider whether FILE_UPLOAD_MAX_MEMORY_SIZE or DATA_UPLOAD_MAX_MEMORY_SIZE needs adjustment based on typical batch sizes.
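
If larger raw bodies are expected, the relevant knobs are plain settings values, e.g. (illustrative numbers, not project defaults):

# settings.py (sketch) — raise limits only as far as the deployment actually needs.
DATA_UPLOAD_MAX_MEMORY_SIZE = 50 * 1024 * 1024   # cap on raw request bodies read via request.body
FILE_UPLOAD_MAX_MEMORY_SIZE = 10 * 1024 * 1024   # above this, uploads spill to temporary files on disk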

Comment on lines +597 to +605
# Request batch download
url = f"{API_URL}/serdes/batch-download-attachments/"
res = requests.post(
    url,
    headers=headers,
    json={"revision_ids": batch_ids},
    verify=VERIFY_CERTIFICATE,
    stream=True,
)
⚠️ Potential issue | 🟡 Minor

Add timeout to batch download request.

The batch download uses streaming correctly but lacks a timeout, which can cause the CLI to hang indefinitely on network issues.

 res = requests.post(
     url,
     headers=headers,
     json={"revision_ids": batch_ids},
     verify=VERIFY_CERTIFICATE,
     stream=True,
+    timeout=(5, 300),
 )
🤖 Prompt for AI Agents
In cli/clica.py around lines 597 to 605, the requests.post call used for the
batch download is missing a timeout which can cause the CLI to hang; add a
timeout parameter (e.g., timeout=30 or a configurable constant) to the
requests.post call and update error handling to catch requests.Timeout and
requests.RequestException so the CLI fails fast and reports a clear error
instead of hanging.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

♻️ Duplicate comments (5)
cli/clica.py (5)

597-605: Add timeout to batch download request.

The batch download request lacks a timeout, which can cause the CLI to hang indefinitely on network issues.

Apply this diff:

 # Request batch download
 url = f"{API_URL}/serdes/batch-download-attachments/"
 res = requests.post(
     url,
     headers=headers,
     json={"revision_ids": batch_ids},
     verify=VERIFY_CERTIFICATE,
     stream=True,
+    timeout=(5, 300),
 )

488-504: Add timeout and stream the database backup to avoid OOM.

The database export uses res.content, which buffers the entire backup in memory before writing to disk. For large databases, this can cause OOM errors.

Apply this diff to stream the response and add a timeout:

 rprint("[bold blue]Step 1/2: Exporting database backup...[/bold blue]")
 url = f"{API_URL}/serdes/dump-db/"
-res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+res = requests.get(
+    url,
+    headers=headers,
+    verify=VERIFY_CERTIFICATE,
+    stream=True,
+    timeout=(5, 300),
+)

 if res.status_code != 200:
     rprint(
         f"[bold red]Error exporting database: {res.status_code}[/bold red]",
         file=sys.stderr,
     )
     rprint(res.text, file=sys.stderr)
     sys.exit(1)

 backup_file = dest_path / "backup.json.gz"
 with open(backup_file, "wb") as f:
-    f.write(res.content)
+    for chunk in res.iter_content(chunk_size=1024 * 1024):
+        if chunk:
+            f.write(chunk)

841-867: Add timeout to restore request and use specific exception handler.

The restore POST request lacks a timeout (which can cause indefinite hangs), and the bare except: clause at line 865 catches all exceptions including system exits.

Apply this diff:

     try:
         res = requests.post(
             url,
             headers=headers,
             files=files,
             verify=VERIFY_CERTIFICATE,
+            timeout=(5, 600),
         )
     finally:
         # Close file handles
         for file_tuple in files.values():
             file_tuple[1].close()

         # Clean up temp file
         if attachments_data:
             Path(attachments_data).unlink()

     if res.status_code != 200:
         rprint(
             f"[bold red]Error during restore: {res.status_code}[/bold red]",
             file=sys.stderr,
         )
         try:
             error_data = res.json()
             rprint(f"[red]{error_data}[/red]", file=sys.stderr)
-        except:
+        except (ValueError, json.JSONDecodeError):
             rprint(res.text, file=sys.stderr)

537-554: Add timeout to metadata requests and fix URL reconstruction for absolute paths.

The metadata fetch lacks a timeout and the URL reconstruction at lines 551-553 doesn't handle absolute-path pagination links (e.g., /serdes/attachment-metadata/?page=2), which would produce a malformed URL.

Apply this diff:

 while url:
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        timeout=(5, 60),
+    )

     if res.status_code != 200:
         rprint(
             f"[bold red]Error fetching metadata: {res.status_code}[/bold red]",
             file=sys.stderr,
         )
         rprint(res.text, file=sys.stderr)
         sys.exit(1)

     data = res.json()
     all_metadata.extend(data["results"])
     url = data.get("next")
-    if url and not url.startswith("http"):
-        # Convert relative URL to absolute
-        url = f"{API_URL}/serdes/attachment-metadata/{url}"
+    if url:
+        if not url.startswith("http"):
+            if url.startswith("/"):
+                # Absolute path: join with base URL
+                url = API_URL.rstrip("/") + url
+            else:
+                # Relative query string: append to base path
+                url = f"{API_URL}/serdes/attachment-metadata/{url}"

789-820: Stream attachment files to temp file to avoid loading large files into memory.

The current code reads each attachment file fully into memory (line 796: file_bytes = f.read()) before writing to the temp file. For large attachments, this can cause OOM errors.

Apply this diff to stream files in chunks:

             # Stream to temp file with reasonable buffering
             temp_attachments = tempfile.NamedTemporaryFile(
                 suffix=".dat", delete=False, buffering=8 * 1024 * 1024
             )  # 8MB buffer

             for file_info in to_upload:
-                with open(file_info["path"], "rb") as f:
-                    file_bytes = f.read()
-
-                # Build header
+                # Get file size without reading entire file
+                file_size = file_info["path"].stat().st_size
+                
                 header = {
                     "id": file_info["id"],
                     "evidence_id": file_info["evidence_id"],
                     "version": file_info["version"],
                     "filename": file_info["filename"],
                     "hash": file_info["hash"],
-                    "size": len(file_bytes),
+                    "size": file_size,
                 }
                 header_bytes = json.dumps(header).encode("utf-8")
-                total_size = len(header_bytes) + len(file_bytes)
+                total_size = len(header_bytes) + file_size

                 # Write directly (OS will buffer intelligently)
                 temp_attachments.write(struct.pack(">I", total_size))
                 temp_attachments.write(header_bytes)
-                temp_attachments.write(file_bytes)
+                
+                # Stream file content in chunks
+                with open(file_info["path"], "rb") as f:
+                    for chunk in iter(lambda: f.read(10 * 1024 * 1024), b""):
+                        temp_attachments.write(chunk)
🧹 Nitpick comments (2)
cli/clica.py (2)

461-465: Consider adding batch size validation.

The CLI accepts a batch size but doesn't validate it against the backend's limit (default 200). Users won't discover an oversized batch until the request fails at runtime.

Consider adding a note in the help text or validating the batch size:

 @click.option(
     "--batch-size",
     default=200,
-    help="Number of files to download per batch (default: 200)",
+    help="Number of files to download per batch (default: 200, max: 200)",
     type=int,
 )

Or add runtime validation:

if batch_size > 200:
    rprint("[yellow]Warning: Batch size exceeds server limit (200), using 200[/yellow]")
    batch_size = 200

471-472: Consider adding a --quiet option for automation.

For scripted/automated backups, the rich formatting and progress messages may not be desired.

Add a quiet option:

 @click.option(
     "--resume/--no-resume",
     default=True,
     help="Resume from existing manifest (default: True)",
 )
-def backup_full(dest_dir, batch_size, resume):
+@click.option(
+    "--quiet",
+    is_flag=True,
+    default=False,
+    help="Suppress progress output (for automation)",
+)
+def backup_full(dest_dir, batch_size, resume, quiet):
     """Create a full backup including database and attachments using streaming"""

Then conditionally print progress messages based on the quiet flag.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b64efb2 and 63c2bed.

📒 Files selected for processing (1)
  • cli/clica.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cli/clica.py (2)
backend/serdes/views.py (8)
  • get (39-73)
  • get (251-303)
  • get (712-793)
  • post (176-247)
  • post (309-384)
  • post (398-703)
  • post (803-917)
  • post (927-1133)
backend/core/utils.py (1)
  • sha256 (37-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build (3.12)
  • GitHub Check: test (3.12)
  • GitHub Check: build_community_frontend
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
cli/clica.py (3)

762-773: LGTM: Hash verification streams files efficiently.

The hash verification correctly streams files in chunks and compares against the manifest, avoiding memory issues with large attachments.
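
That pattern reduces to a small helper like this (hypothetical names; clica.py's actual implementation may differ):

import hashlib
from pathlib import Path

def verify_file_hash(path: Path, expected_sha256: str, chunk_size: int = 1024 * 1024) -> bool:
    """Stream the file and compare its SHA-256 against the manifest value."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256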


509-531: LGTM: Resume logic correctly validates existing files.

The manifest loading opens the file in read mode and verifies that each downloaded file actually exists on disk before including it in the resume state, preventing issues with incomplete or missing downloads.
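
A sketch of that resume check (illustrative manifest keys, not necessarily the exact schema the CLI writes):

import json
from pathlib import Path

def load_completed_ids(manifest_path: Path, attachments_dir: Path) -> set[str]:
    """Resume state: only count manifest entries whose file is actually on disk."""
    if not manifest_path.exists():
        return set()
    with open(manifest_path, "r", encoding="utf-8") as f:
        manifest = json.load(f)
    return {
        entry["id"]
        for entry in manifest.get("downloaded", [])
        if (attachments_dir / entry["filename"]).exists()
    }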


615-680: LGTM: Streaming protocol correctly handles chunked transfers.

The binary protocol implementation properly:

  • Buffers chunks until complete blocks are available
  • Bounds header parsing to prevent excessive iteration
  • Handles decode errors gracefully
  • Updates the manifest incrementally
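
Roughly, that client-side loop reduces to the following (a simplified sketch with illustrative names, not the exact clica.py code):

import json
import struct

def iter_blocks(chunks, max_header_len=1024):
    """Reassemble [4-byte size][JSON header][payload] blocks from an HTTP chunk stream."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while len(buffer) >= 4:
            (total_size,) = struct.unpack(">I", buffer[:4])
            if len(buffer) < 4 + total_size:
                break  # wait for the rest of this block
            body = buffer[4 : 4 + total_size]
            buffer = buffer[4 + total_size :]
            # The header length is not encoded, so probe growing prefixes (bounded).
            for end in range(1, min(max_header_len, len(body)) + 1):
                try:
                    header = json.loads(body[:end].decode("utf-8"))
                except (ValueError, UnicodeDecodeError):
                    continue
                yield header, body[end:]
                break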


Development

Successfully merging this pull request may close these issues.

  • Restore is not possible
  • Import database fails if a user has a Personal Authentication Token set
