
Conversation


@tchoumi313 tchoumi313 commented Dec 12, 2025

Resolves #2869, resolves #2700
backup-full: creates a complete backup of your CISO Assistant instance, including both the database and all evidence attachments.
restore-full: restores a complete backup of your CISO Assistant instance, including both the database and all evidence attachments.
(screenshot)

Summary by CodeRabbit

  • New Features

    • Attachment export/import (ZIP + streaming) plus new REST endpoints for export, import, metadata, batch download/upload; full-restore supports streaming and partial-success reporting.
    • CLI: backup-full and restore-full commands with resume, batch-size, and optional hash verification.
  • Data integrity

    • SHA-256 attachment hashes added and backfilled to support verification during restore.
  • Documentation

    • CLI docs updated with backup/restore workflows, examples, manifest format, and troubleshooting.
  • Chores

    • Added CLI env template and gitignore entry.


@tchoumi313 tchoumi313 self-assigned this Dec 12, 2025

coderabbitai bot commented Dec 12, 2025

Walkthrough

Adds end-to-end attachment export/import utilities, new serdes API endpoints and views (ZIP and streaming batch formats), CLI streaming backup/restore commands, attachment SHA‑256 support with migration/backfill, and supporting CLI env/gitignore changes.

Changes

Cohort / File(s) Summary
Attachment export/import utils
backend/serdes/export_utils.py
New module adding AttachmentExporter (collects evidence revisions with attachments; packages attachments into ZIP) and AttachmentImporter (validates and extracts attachments from ZIP, supports dry-run, aggregates per-file errors, saves to storage and assigns to revisions).
API views
backend/serdes/views.py
Added APIView subclasses: ExportAttachmentsView, LoadAttachmentsView, FullRestoreView, AttachmentMetadataView, BatchDownloadAttachmentsView, BatchUploadAttachmentsView. They implement ZIP export/import, a streaming batch format (length-prefixed blocks with JSON headers plus raw bytes; see the sketch after this table), permission checks, integration with the DB restore flow, partial-success/multi-status reporting, and logging.
URL routes
backend/serdes/urls.py
Added routes: export-attachments/, full-restore/, attachment-metadata/, batch-download-attachments/, batch-upload-attachments/.
CLI commands & docs
cli/clica.py, cli/README.md
New CLI commands backup_full(dest_dir, batch_size, resume) and restore_full(src_dir, verify_hashes) implementing streaming backups/restores, manifest handling, optional hash verification; README updated to describe streaming protocol, usage, and examples.
Attachment hashing field & migration
backend/core/models.py, backend/core/migrations/0119_add_attachment_hash_to_evidencerevision.py
Added attachment_hash CharField to EvidenceRevision; save() now computes SHA‑256 for attachments (storage-aware, chunked) with safe error handling. Migration adds the field and a RunPython backfill computing SHA‑256 from storage (chunked reads) and a reverse that clears hashes.
CLI env & gitignore
cli/.clica.env.template, cli/.gitignore, cli/.clica.env
Added .clica.env.template with TOKEN, VERIFY_CERTIFICATE, API_URL; added .clica.env to .gitignore; removed concrete .clica.env file from repository.
Misc gitignore
.gitignore
Added ignore entry for /cli/db_backup.
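
For reference, the batch attachment endpoints and the CLI exchange data as a sequence of length-prefixed blocks: a 4-byte big-endian length, a JSON header, then the raw file bytes (the review excerpts further down show struct.pack(">I", total_size) followed by the header and the payload). The sketch below is a minimal, hedged illustration of a writer and parser for that framing; it assumes the JSON header is ASCII and sits immediately before the payload, and the helper names are illustrative rather than taken from the PR.

import io
import json
import struct


def write_block(out, header: dict, payload: bytes) -> None:
    """Append one [4-byte length][JSON header][payload] block to a binary stream."""
    header_bytes = json.dumps(header).encode("utf-8")  # json.dumps defaults to ASCII-only output
    out.write(struct.pack(">I", len(header_bytes) + len(payload)))  # big-endian unsigned 32-bit length
    out.write(header_bytes)
    out.write(payload)


def read_blocks(stream):
    """Yield (header, payload) pairs until the stream is exhausted."""
    decoder = json.JSONDecoder()
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            return  # clean end of stream
        (total_size,) = struct.unpack(">I", prefix)
        block = stream.read(total_size)
        # latin-1 maps every byte to exactly one character, so the character offset
        # returned by raw_decode is also the byte offset where the JSON header ends.
        header, end = decoder.raw_decode(block.decode("latin-1"))
        yield header, block[end:]


if __name__ == "__main__":
    buf = io.BytesIO()
    write_block(buf, {"filename": "report.pdf", "size": 3}, b"abc")  # illustrative header fields
    buf.seek(0)
    for header, payload in read_blocks(buf):
        print(header["filename"], len(payload))  # prints: report.pdf 3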

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CLI as CLI (clica)
    participant ExportAPI as ExportAttachmentsView
    participant Exporter as AttachmentExporter
    participant DB as Database (EvidenceRevision)
    participant Storage as ObjectStorage
    participant RestoreAPI as FullRestoreView
    participant Importer as AttachmentImporter

    CLI->>ExportAPI: GET /serdes/export-attachments/
    ExportAPI->>Exporter: create_attachments_zip()
    Exporter->>DB: collect_all_attachments()
    DB-->>Exporter: queryset of revisions
    Exporter->>Storage: read attachment files
    Storage-->>Exporter: file bytes
    Exporter-->>ExportAPI: ZIP BytesIO
    ExportAPI-->>CLI: ZIP response

    CLI->>RestoreAPI: POST /serdes/full-restore/ (DB dump + attachments stream/ZIP)
    RestoreAPI->>DB: perform DB restore (existing load logic)
    DB-->>RestoreAPI: DB restore result
    RestoreAPI->>Importer: extract_attachments_from_zip(stream or zip, dry_run=false)
    Importer->>Storage: write attachment files (evidence-revisions/...)
    Storage-->>Importer: stored confirmation
    Importer->>DB: assign attachment to EvidenceRevision (save -> attachment_hash computed)
    DB-->>Importer: save confirmations
    Importer-->>RestoreAPI: stats (processed/restored/errors)
    RestoreAPI-->>CLI: consolidated restore response (207 if partial)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus areas:
    • backend/serdes/export_utils.py — ZIP read/write, filename parsing/validation, per-file error resilience.
    • backend/serdes/views.py — streaming length‑prefixed block parsing, multi-status responses, permission checks, and DB/attachment coordination.
    • backend/core/migrations/0119_* & backend/core/models.py — backfill performance, chunked storage reads, and correct/robust hash computation in save().
    • cli/clica.py — streaming protocol, resume logic, manifest correctness and error handling.

Poem

🐰 I packed the files with careful paws tonight,

Zips snug as burrows, hashes shining bright,
I hopped through streams and stitched each byte in place,
Carrots and crumbs now travel safe through space,
Hooray — the rabbit's backup dance takes flight!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 28.57%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title clearly describes the main feature being added: a clica command for full backup/restore operations that includes both database and attachments.
  • Linked Issues check: ✅ Passed. The PR implements full backup/restore functionality addressing both #2869 (restore failures) and #2700 (import errors with user tokens) through comprehensive new export/import utilities and API endpoints with proper error handling.
  • Out of Scope Changes check: ✅ Passed. All changes are directly scoped to implementing the full backup/restore feature, including database export, attachment handling, CLI commands, migrations, and supporting infrastructure changes.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63c2bed and 0a40a6c.

📒 Files selected for processing (1)
  • .gitignore (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
.gitignore (2)

22-22: Standard Python cache pattern.

The __pycache__ entry correctly ignores Python bytecode cache directories, which should not be version-controlled.


23-23: Verify directory naming aligns with PR requirements.

The entry /cli/db_backup ignores the CLI backup directory, which is appropriate. However, the PR objectives reference "a default directory named db", while this entry specifies db_backup. Please confirm that the actual backup implementation matches this directory name and that it aligns with the stated PR requirements.

Optionally, consider adding a trailing / to make the directory-specific intent explicit: /cli/db_backup/




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (1)
cli/README.md (1)

404-419: Examples: consider adding a note about secure storage/permissions for backup dirs.
Given backups can contain sensitive evidence, it’s worth explicitly recommending restrictive permissions (e.g., chmod 700 on the backup directory).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7a141d and 4d42fb4.

📒 Files selected for processing (5)
  • backend/serdes/export_utils.py (1 hunks)
  • backend/serdes/urls.py (1 hunks)
  • backend/serdes/views.py (4 hunks)
  • cli/README.md (2 hunks)
  • cli/clica.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
backend/serdes/urls.py (1)
backend/serdes/views.py (2)
  • ExportAttachmentsView (245-298)
  • FullRestoreView (376-532)
backend/serdes/views.py (3)
backend/serdes/export_utils.py (4)
  • AttachmentExporter (15-82)
  • AttachmentImporter (85-185)
  • create_attachments_zip (61-82)
  • extract_attachments_from_zip (86-185)
backend/iam/models.py (2)
  • has_backup_permission (729-734)
  • username (742-743)
backend/core/utils.py (1)
  • compare_schema_versions (180-266)
cli/clica.py (1)
backend/serdes/views.py (5)
  • get (34-68)
  • get (246-298)
  • post (171-242)
  • post (304-373)
  • post (377-532)
backend/serdes/export_utils.py (2)
backend/core/models.py (2)
  • Evidence (3304-3384)
  • EvidenceRevision (3387-3450)
backend/serdes/views.py (2)
  • get (34-68)
  • get (246-298)
🪛 GitHub Check: CodeQL
backend/serdes/views.py

[warning] 345-350: Information exposure through an exception
Stack trace information flows to this location and may be exposed to an external user.


[warning] 355-359: Information exposure through an exception
Stack trace information flows to this location and may be exposed to an external user.


[warning] 503-507: Information exposure through an exception
Stack trace information flows to this location and may be exposed to an external user.

🪛 markdownlint-cli2 (0.18.1)
cli/README.md

288-288: Blank line inside blockquote

(MD028, no-blanks-blockquote)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build_community_frontend
  • GitHub Check: test (3.12)
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: build (3.12)
🔇 Additional comments (3)
backend/serdes/urls.py (1)

12-21: Routes look correct; confirm whether LoadAttachmentsView should also be exposed.
If LoadAttachmentsView is intended for standalone attachment restores (outside /full-restore/), it likely needs a path("load-attachments/", ...) entry; otherwise consider removing the view to avoid dead code.

cli/clica.py (1)

611-612: Command registration looks good.

backend/serdes/views.py (1)

51-63: Excluding iam.personalaccesstoken from dump/load is the right direction for restore robustness.
This aligns with the “regenerate PAT after restore” workflow and should help with token-related import/restore failures.

Also applies to: 127-136

Comment on lines +26 to +48
    def package_attachments_to_zip(
        self, revisions: QuerySet, zipf: zipfile.ZipFile
    ) -> int:
        count = 0

        for revision in revisions:
            if revision.attachment and default_storage.exists(revision.attachment.name):
                try:
                    with default_storage.open(revision.attachment.name, "rb") as file:
                        file_content = file.read()

                        filename = (
                            f"{revision.evidence_id}_v{revision.version}_"
                            f"{os.path.basename(revision.attachment.name)}"
                        )

                        zip_path = os.path.join(
                            "attachments", "evidence-revisions", filename
                        )

                        zipf.writestr(zip_path, file_content)
                        count += 1


⚠️ Potential issue | 🟠 Major

Exporter: avoid full in-RAM reads + use POSIX zip paths (Windows-safe).

 import io
 import os
+import posixpath
+import shutil
 import zipfile
@@
-                        zip_path = os.path.join(
-                            "attachments", "evidence-revisions", filename
-                        )
-
-                        zipf.writestr(zip_path, file_content)
+                        zip_path = posixpath.join(
+                            "attachments", "evidence-revisions", filename
+                        )
+                        with zipf.open(zip_path, "w") as zf:
+                            shutil.copyfileobj(file, zf, length=1024 * 1024)
                         count += 1

Also applies to: 61-82

🤖 Prompt for AI Agents
In backend/serdes/export_utils.py around lines 26-48 (and similarly 61-82), the
exporter currently reads entire attachment files into memory and builds zip
paths with os.path.join (which yields backslashes on Windows); change to stream
the file into the zip to avoid full in-RAM reads and use POSIX-style zip paths.
Specifically, open the storage file as a binary file-like object and create the
entry in the ZipFile via zipfile.ZipFile.open(zip_path, "w") and copy the file
contents into that entry in chunks (e.g., shutil.copyfileobj) instead of
file.read(); build zip_path using posixpath.join("attachments",
"evidence-revisions", filename) so the stored path uses forward slashes and
apply the same changes to the other block at lines 61-82.
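
For context, the streamed-write pattern suggested above can be exercised on its own. A minimal sketch follows, assuming plain local files instead of Django's default_storage; file names and paths are illustrative, not taken from the PR.

import posixpath
import shutil
import zipfile


def add_file_to_zip(zipf: zipfile.ZipFile, source_path: str, arcname: str) -> None:
    # Forward slashes are required inside ZIP archives regardless of the host OS.
    zip_path = posixpath.join("attachments", "evidence-revisions", arcname)
    # Stream the file into the archive in 1 MB chunks instead of reading it fully into memory.
    with open(source_path, "rb") as src, zipf.open(zip_path, "w") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)


if __name__ == "__main__":
    with zipfile.ZipFile("attachments.zip", "w", compression=zipfile.ZIP_DEFLATED) as zipf:
        add_file_to_zip(zipf, "example.pdf", "example.pdf")  # illustrative input file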

Comment on lines 108 to 170
                for file_path in attachment_files:
                    try:
                        filename = os.path.basename(file_path)
                        parts = filename.split("_", 2)

                        if len(parts) < 3:
                            stats["errors"].append(
                                f"Invalid filename format: {filename}"
                            )
                            continue

                        evidence_id = parts[0]
                        version_str = parts[1]
                        original_filename = parts[2]

                        if not version_str.startswith("v"):
                            stats["errors"].append(
                                f"Invalid version format in: {filename}"
                            )
                            continue

                        version = int(version_str[1:])

                        if not dry_run:
                            # Find the corresponding EvidenceRevision
                            try:
                                revision = EvidenceRevision.objects.get(
                                    evidence_id=evidence_id, version=version
                                )

                                file_content = zipf.read(file_path)

                                storage_path = (
                                    f"evidence-revisions/{evidence_id}/"
                                    f"v{version}/{original_filename}"
                                )

                                saved_path = default_storage.save(
                                    storage_path, io.BytesIO(file_content)
                                )

                                revision.attachment = saved_path
                                revision.save(update_fields=["attachment"])

                                stats["restored"] += 1

                            except EvidenceRevision.DoesNotExist:
                                stats["errors"].append(
                                    f"EvidenceRevision not found: "
                                    f"evidence_id={evidence_id}, version={version}"
                                )
                            except Exception as e:
                                stats["errors"].append(
                                    f"Failed to restore {filename}: {str(e)}"
                                )
                        else:
                            stats["restored"] += 1

                    except Exception as e:
                        stats["errors"].append(
                            f"Error processing {file_path}: {str(e)}"
                        )
                        continue

⚠️ Potential issue | 🟠 Major

Importer: make restores idempotent (avoid default_storage.save auto-renames + orphaned files).
If storage_path already exists, many storage backends will rename on save(), leaving old files behind and changing paths each restore run.

                         if not dry_run:
@@
-                                saved_path = default_storage.save(
-                                    storage_path, io.BytesIO(file_content)
-                                )
+                                # Prefer overwrite semantics for idempotent restores
+                                try:
+                                    if revision.attachment:
+                                        default_storage.delete(revision.attachment.name)
+                                except Exception:
+                                    logger.warning(
+                                        "Failed to delete previous attachment",
+                                        revision_id=revision.id,
+                                        attachment_name=getattr(revision.attachment, "name", None),
+                                    )
+                                try:
+                                    if default_storage.exists(storage_path):
+                                        default_storage.delete(storage_path)
+                                except Exception:
+                                    logger.warning(
+                                        "Failed to delete target path before restore",
+                                        storage_path=storage_path,
+                                    )
+
+                                saved_path = default_storage.save(
+                                    storage_path, io.BytesIO(file_content)
+                                )

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +245 to +299
class ExportAttachmentsView(APIView):
    def get(self, request, *args, **kwargs):
        if not request.user.has_backup_permission:
            logger.warning(
                "Unauthorized attachment export attempt",
                user=request.user.username,
            )
            return Response(status=status.HTTP_403_FORBIDDEN)

        logger.info(
            "Starting attachment export",
            user=request.user.username,
        )

        try:
            exporter = AttachmentExporter()
            zip_buffer, count = exporter.create_attachments_zip()

            if count == 0:
                logger.info("No attachments found to export")
                return Response(
                    {"message": "No attachments found"},
                    status=status.HTTP_404_NOT_FOUND,
                )

            # Create response with ZIP file
            timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            filename = f"ciso-assistant-attachments-{timestamp}.zip"

            response = HttpResponse(
                zip_buffer.getvalue(), content_type="application/zip"
            )
            response["Content-Disposition"] = f'attachment; filename="{filename}"'

            logger.info(
                "Attachment export completed successfully",
                attachments_count=count,
                filename=filename,
                zip_size=len(zip_buffer.getvalue()),
            )

            return response

        except Exception as e:
            logger.error(
                "Attachment export failed",
                user=request.user.username,
                error=str(e),
                exc_info=True,
            )
            return Response(
                {"error": "AttachmentExportFailed"},
                status=status.HTTP_500_INTERNAL_SERVER_ERROR,
            )


⚠️ Potential issue | 🟠 Major

ExportAttachmentsView: watch out for large in-memory ZIP responses.
This returns zip_buffer.getvalue() (full ZIP in RAM). Consider streaming/FileResponse (or at least spooling to a temp file) to prevent server OOM on large evidence sets.

🤖 Prompt for AI Agents
In backend/serdes/views.py around lines 245 to 299, the current implementation
builds the entire ZIP in memory with zip_buffer.getvalue(), risking OOM for
large exports; change to stream the ZIP to the client by writing the archive to
a spooled temporary file (e.g., tempfile.SpooledTemporaryFile with a sensible
max_size) or directly to a StreamingHttpResponse/FileResponse that yields
chunks, set the Content-Disposition and Content-Type headers from the temp
file/stream, ensure the temp file is rewound before streaming, and close/delete
the temp file after the response is sent to avoid retaining large buffers in
memory.
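
One possible shape for that fix, sketched with a spooled temporary file and Django's FileResponse; the standalone helper and the (arcname, file-like) input format are simplifying assumptions, not the PR's actual code.

import tempfile
import zipfile

from django.http import FileResponse


def build_zip_response(filename: str, files) -> FileResponse:
    # Spool the archive in memory up to 50 MB, then transparently spill to a temp file on disk.
    tmp = tempfile.SpooledTemporaryFile(max_size=50 * 1024 * 1024)
    with zipfile.ZipFile(tmp, "w", compression=zipfile.ZIP_DEFLATED) as zipf:
        for arcname, fileobj in files:  # iterable of (zip path, binary file-like) pairs
            with zipf.open(arcname, "w") as dst:
                while chunk := fileobj.read(1024 * 1024):
                    dst.write(chunk)
    tmp.seek(0)  # rewind before streaming
    # FileResponse streams the file in blocks and closes it when the response is consumed,
    # so the full archive never has to sit in a BytesIO buffer.
    return FileResponse(
        tmp,
        as_attachment=True,
        filename=filename,
        content_type="application/zip",
    )

A view could then return build_zip_response(filename, ...) instead of building an HttpResponse from zip_buffer.getvalue().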

Comment on lines 376 to 474
class FullRestoreView(APIView):
    def post(self, request, *args, **kwargs):
        if not request.user.has_backup_permission:
            logger.error("Unauthorized full restore attempt", user=request.user)
            return Response(status=status.HTTP_403_FORBIDDEN)

        logger.info(
            "Starting full restore (database + attachments)",
            user=request.user.username,
        )

        # Get uploaded files from multipart form data
        backup_file = request.FILES.get("backup")
        attachments_file = request.FILES.get("attachments")

        if not backup_file:
            return Response(
                {"error": "NoBackupFileProvided"}, status=status.HTTP_400_BAD_REQUEST
            )

        # Step 1: Restore database backup
        logger.info("Step 1/2: Restoring database backup")

        try:
            data = backup_file.read()
            is_gzip = data.startswith(GZIP_MAGIC_NUMBER)
            full_decompressed_data = gzip.decompress(data) if is_gzip else data

            full_decompressed_data = json.loads(full_decompressed_data)
            metadata, decompressed_data = full_decompressed_data
            metadata = metadata["meta"]

            current_version = VERSION.split("-")[0]
            backup_version = None
            schema_version = 0

            for metadata_part in metadata:
                backup_version = metadata_part.get("media_version")
                schema_version = metadata_part.get("schema_version")
                if backup_version is not None or schema_version is not None:
                    break

            try:
                schema_version_int = int(schema_version)
                compare_schema_versions(schema_version_int, backup_version)
                if backup_version != VERSION:
                    raise ValueError(
                        "The version of the current instance and the one that generated the backup are not the same."
                    )
            except (ValueError, TypeError) as e:
                logger.error(
                    "Invalid schema version format",
                    schema_version=schema_version,
                    exc_info=e,
                )
                return Response(
                    {"error": "InvalidSchemaVersion"},
                    status=status.HTTP_400_BAD_REQUEST,
                )

            is_enterprise = apps.is_installed("enterprise_core")
            if not is_enterprise:
                for obj in decompressed_data:
                    if obj["model"] != "iam.role":
                        continue
                    permissions = obj["fields"]["permissions"]
                    enterprise_perms_indices = [
                        i
                        for i, perm in enumerate(permissions)
                        if perm[1] == "enterprise_core"
                    ]
                    for perm_index in reversed(enterprise_perms_indices):
                        permissions.pop(perm_index)

                decompressed_data = [
                    obj
                    for obj in decompressed_data
                    if obj["model"].split(".", 1)[0] != "enterprise_core"
                ]

            decompressed_data = json.dumps(decompressed_data)

            # Reuse existing load_backup logic
            load_backup_view = LoadBackupView()
            db_response = load_backup_view.load_backup(
                request, decompressed_data, backup_version, current_version
            )

            if db_response.status_code != 200:
                return db_response

            logger.info("Database backup restored successfully")

        except Exception as e:
            logger.error("Database restore failed", exc_info=e)
            return Response(
                {"error": "DatabaseRestoreFailed"},
                status=status.HTTP_500_INTERNAL_SERVER_ERROR,
            )

⚠️ Potential issue | 🟠 Major

FullRestoreView: catch ValidationError from compare_schema_versions and avoid returning exception details.
As written, a schema/version incompatibility can bubble into the outer except Exception and become a 500 (DatabaseRestoreFailed) instead of a 400. Also, returning details: str(e) is a client info-leak.

+from django.core.exceptions import ValidationError
@@
-            except (ValueError, TypeError) as e:
+            except (ValidationError, ValueError, TypeError) as e:
                 logger.error(
                     "Invalid schema version format",
                     schema_version=schema_version,
                     exc_info=e,
                 )
+                if isinstance(e, ValidationError):
+                    # compare_schema_versions raises ValidationError({"error": "<code>"})
+                    code = None
+                    try:
+                        code = e.message_dict.get("error")
+                        if isinstance(code, list):
+                            code = code[0]
+                    except Exception:
+                        code = None
                 return Response(
-                    {"error": "InvalidSchemaVersion"},
+                    {"error": code or "InvalidSchemaVersion"},
                     status=status.HTTP_400_BAD_REQUEST,
                 )
@@
-                return Response(
-                    {
-                        "error": "AttachmentRestoreFailed",
-                        "message": "Database restored but attachments failed",
-                        "details": str(e),
-                    },
-                    status=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                )
+                return Response(
+                    {
+                        "error": "AttachmentRestoreFailed",
+                        "message": "Database restored but attachments failed",
+                    },
+                    status=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                )

Also applies to: 500-509

🤖 Prompt for AI Agents
backend/serdes/views.py around lines 376-474 (and similarly handle 500-509): the
compare_schema_versions call can raise a ValidationError that is currently
falling through to the broad except and resulting in a 500 and/or leaking
exception details; update the code to catch ValidationError (and similar
schema/version errors) separately after compare_schema_versions, log a concise
message without including sensitive exception details, and return a 400 response
like {"error":"InvalidSchemaVersion"}; ensure the outer broad except still
returns a 500 but does not include exception strings in the response body (only
log them if needed).

Comment on lines 450 to 507
@click.command()
@click.option(
    "--dest-dir",
    required=True,
    help="Destination directory to save backup files",
    type=click.Path(file_okay=False, dir_okay=True),
)
def backup_full(dest_dir):
    """Create a full backup including database and attachments"""
    if not TOKEN:
        print(
            "No authentication token available. Please set PAT token in .clica.env.",
            file=sys.stderr,
        )
        sys.exit(1)

    headers = {"Authorization": f"Token {TOKEN}"}
    dest_path = Path(dest_dir)
    dest_path.mkdir(parents=True, exist_ok=True)

    rprint("[bold blue]Exporting database backup...[/bold blue]")
    url = f"{API_URL}/serdes/dump-db/"
    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)

    if res.status_code != 200:
        rprint(
            f"[bold red]Error exporting database: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)

    backup_file = dest_path / "backup.json.gz"
    with open(backup_file, "wb") as f:
        f.write(res.content)
    rprint(f"[green]✓ Database backup saved to {backup_file}[/green]")

    rprint("[bold blue]Exporting attachments...[/bold blue]")
    url = f"{API_URL}/serdes/export-attachments/"
    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)

    if res.status_code == 404:
        rprint("[yellow]No attachments found, skipping attachments backup[/yellow]")
    elif res.status_code != 200:
        rprint(
            f"[bold red]Error exporting attachments: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)
    else:
        attachments_file = dest_path / "attachments.zip"
        with open(attachments_file, "wb") as f:
            f.write(res.content)
        rprint(f"[green]✓ Attachments backup saved to {attachments_file}[/green]")

    rprint("[bold green]Full backup completed successfully![/bold green]")


⚠️ Potential issue | 🟠 Major

backup_full: add timeouts + stream downloads to avoid OOM on large backups.
Current res.content buffers the full DB/ZIP in memory.

 def backup_full(dest_dir):
@@
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        stream=True,
+        timeout=(5, 300),
+    )
@@
-    with open(backup_file, "wb") as f:
-        f.write(res.content)
+    with open(backup_file, "wb") as f:
+        for chunk in res.iter_content(chunk_size=1024 * 1024):
+            if chunk:
+                f.write(chunk)
@@
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        stream=True,
+        timeout=(5, 300),
+    )
@@
-        with open(attachments_file, "wb") as f:
-            f.write(res.content)
+        with open(attachments_file, "wb") as f:
+            for chunk in res.iter_content(chunk_size=1024 * 1024):
+                if chunk:
+                    f.write(chunk)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@click.command()
@click.option(
    "--dest-dir",
    required=True,
    help="Destination directory to save backup files",
    type=click.Path(file_okay=False, dir_okay=True),
)
def backup_full(dest_dir):
    """Create a full backup including database and attachments"""
    if not TOKEN:
        print(
            "No authentication token available. Please set PAT token in .clica.env.",
            file=sys.stderr,
        )
        sys.exit(1)
    headers = {"Authorization": f"Token {TOKEN}"}
    dest_path = Path(dest_dir)
    dest_path.mkdir(parents=True, exist_ok=True)
    rprint("[bold blue]Exporting database backup...[/bold blue]")
    url = f"{API_URL}/serdes/dump-db/"
    res = requests.get(
        url,
        headers=headers,
        verify=VERIFY_CERTIFICATE,
        stream=True,
        timeout=(5, 300),
    )
    if res.status_code != 200:
        rprint(
            f"[bold red]Error exporting database: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)
    backup_file = dest_path / "backup.json.gz"
    with open(backup_file, "wb") as f:
        for chunk in res.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)
    rprint(f"[green]✓ Database backup saved to {backup_file}[/green]")
    rprint("[bold blue]Exporting attachments...[/bold blue]")
    url = f"{API_URL}/serdes/export-attachments/"
    res = requests.get(
        url,
        headers=headers,
        verify=VERIFY_CERTIFICATE,
        stream=True,
        timeout=(5, 300),
    )
    if res.status_code == 404:
        rprint("[yellow]No attachments found, skipping attachments backup[/yellow]")
    elif res.status_code != 200:
        rprint(
            f"[bold red]Error exporting attachments: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)
    else:
        attachments_file = dest_path / "attachments.zip"
        with open(attachments_file, "wb") as f:
            for chunk in res.iter_content(chunk_size=1024 * 1024):
                if chunk:
                    f.write(chunk)
        rprint(f"[green]✓ Attachments backup saved to {attachments_file}[/green]")
    rprint("[bold green]Full backup completed successfully![/bold green]")
🤖 Prompt for AI Agents
In cli/clica.py around lines 450 to 507, replace the current requests.get(...);
writing of res.content with streaming + timeouts: call requests.get with a
timeout (e.g. timeout=(connect_timeout, read_timeout)) and stream=True for both
the DB and attachments endpoints, check the response status code before
streaming, and write the response to file using
res.iter_content(chunk_size=<reasonable size>) inside a with-open, so you never
access res.content; ensure you close the response (or use it as a context
manager) and keep existing headers/verify args and existing 404 and error
handling logic.

Comment on lines 509 to 601
@click.command()
@click.option(
    "--src-dir",
    required=True,
    help="Source directory containing backup files",
    type=click.Path(exists=True, file_okay=False, dir_okay=True),
)
def restore_full(src_dir):
    """Restore a full backup including database and attachments"""
    if not TOKEN:
        print(
            "No authentication token available. Please set PAT token in .clica.env.",
            file=sys.stderr,
        )
        sys.exit(1)

    headers = {"Authorization": f"Token {TOKEN}"}
    src_path = Path(src_dir)

    # Check for required backup file
    backup_file = src_path / "backup.json.gz"
    if not backup_file.exists():
        rprint(
            f"[bold red]Error: backup.json.gz not found in {src_dir}[/bold red]",
            file=sys.stderr,
        )
        sys.exit(1)

    attachments_file = src_path / "attachments.zip"
    has_attachments = attachments_file.exists()

    rprint("[bold blue]Starting full restore (database + attachments)...[/bold blue]")

    # Prepare multipart form data
    files = {
        "backup": ("backup.json.gz", open(backup_file, "rb"), "application/gzip"),
    }

    if has_attachments:
        files["attachments"] = (
            "attachments.zip",
            open(attachments_file, "rb"),
            "application/zip",
        )
        rprint("[dim]Including attachments in restore[/dim]")
    else:
        rprint("[yellow]No attachments.zip found, restoring database only[/yellow]")

    url = f"{API_URL}/serdes/full-restore/"
    try:
        res = requests.post(
            url, headers=headers, files=files, verify=VERIFY_CERTIFICATE
        )
    finally:
        for file_tuple in files.values():
            file_tuple[1].close()

    if res.status_code != 200:
        rprint(
            f"[bold red]Error during restore: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        try:
            error_data = res.json()
            rprint(f"[red]{error_data}[/red]", file=sys.stderr)
        except:
            rprint(res.text, file=sys.stderr)
        sys.exit(1)

    result = res.json()

    rprint(f"[green]✓ Database restored successfully[/green]")

    if has_attachments:
        if "attachments_restored" in result:
            restored = result["attachments_restored"]
            processed = result.get("attachments_processed", restored)
            rprint(
                f"[green]✓ Attachments restored: {restored} of {processed} processed[/green]"
            )

        if result.get("attachment_errors"):
            rprint("[yellow]Errors encountered:[/yellow]")
            for error in result["attachment_errors"]:
                rprint(f" - {error}")

    if result.get("status") == "partial_success":
        rprint("[bold yellow]Full restore completed with warnings[/bold yellow]")
    else:
        rprint("[bold green]Full restore completed successfully![/bold green]")

    rprint("[dim]Note: You may need to regenerate your Personal Access Token[/dim]")


⚠️ Potential issue | 🟡 Minor

restore_full: add timeout; avoid bare except: when parsing error responses.

-        res = requests.post(
-            url, headers=headers, files=files, verify=VERIFY_CERTIFICATE
-        )
+        res = requests.post(
+            url,
+            headers=headers,
+            files=files,
+            verify=VERIFY_CERTIFICATE,
+            timeout=(5, 600),
+        )
@@
-        except:
+        except ValueError:
             rprint(res.text, file=sys.stderr)
🤖 Prompt for AI Agents
In cli/clica.py around lines 509 to 601, the restore_full HTTP call lacks a
timeout and the error parsing uses a bare except; change the requests.post call
to include a timeout (e.g. timeout=REQUEST_TIMEOUT or timeout=30) and keep the
existing headers/verify arguments as-is, and replace the bare except around res.json() with a
specific exception handler (except ValueError as e:) to handle JSON decode
errors and fall back to printing res.text; ensure files are still closed in the
finally block.

@eric-intuitem eric-intuitem self-requested a review December 13, 2025 20:12

@eric-intuitem eric-intuitem left a comment


Nice job. To finish it:

  • unzip the attachments
  • the default directory shall be db (create it if missing)
  • the db should be restored in sqlite3 format (ciso-assistant.sqlite3)


tchoumi313 commented Dec 13, 2025

Nice job. To finish it:

  • unzip the attachments
  • the default directory shall be db (create it if missing)
  • the db should be restored in sqlite3 format (ciso-assistant.sqlite3)

Thanks!

  • Okay.
  • That's the backup default directory, right? Okay.
  • The restore doesn't care about the database system behind it; it just flushes and then restores the data (just like the UI restore flow). Whether the CISO Assistant instance uses sqlite3 or postgres, the result will be the same (database fully restored).


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/core/models.py (1)

3381-3385: Use stored attachment_hash in Evidence property and avoid redundant logger initialization.

Evidence.attachment_hash (lines 3381–3385) currently reads the entire attachment file in a single operation using .read(), ignoring the new EvidenceRevision.attachment_hash stored field that is now computed efficiently with chunked I/O. Update the property to:

  1. Return the stored hash from last_revision.attachment_hash when available
  2. Fall back to chunked computation only for legacy rows without a stored hash
 @property
 def attachment_hash(self):
-    if not self.last_revision or not self.last_revision.attachment:
-        return None
-    return hashlib.sha256(self.last_revision.attachment.read()).hexdigest()
+    last = self.last_revision
+    if not last:
+        return None
+
+    # Prefer stored hash on the revision if present
+    if getattr(last, "attachment_hash", None):
+        return last.attachment_hash
+
+    # Fallback for legacy rows: compute on the fly (streamed)
+    if not last.attachment:
+        return None
+
+    hash_obj = hashlib.sha256()
+    for chunk in last.attachment.chunks():
+        hash_obj.update(chunk)
+    return hash_obj.hexdigest()

Also, in EvidenceRevision.save() around line 3487: remove the logger re-instantiation in the except block. A module-level logger is already imported; reuse it instead of calling get_logger(__name__) inside the exception handler.

♻️ Duplicate comments (4)
cli/clica.py (2)

488-505: backup_full still buffers DB in memory and lacks timeouts on heavy HTTP calls.

  • requests.get for /serdes/dump-db/ (Line 491) downloads the entire DB backup into memory via res.content, and has no timeout.
  • requests.get for /serdes/attachment-metadata/ (Line 527) and requests.post for /serdes/batch-download-attachments/ (Line 580) also have no timeouts.
  • For large instances this can hang indefinitely and/or OOM the CLI process.

Consider:

  • Adding stream=True + a reasonable timeout to the DB export and attachments batch calls.
  • Writing the DB backup to disk via iter_content instead of res.content.
  • Adding timeouts to metadata pagination requests.

Example for the DB export section:

-    url = f"{API_URL}/serdes/dump-db/"
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
-
-    if res.status_code != 200:
-        rprint(
-            f"[bold red]Error exporting database: {res.status_code}[/bold red]",
-            file=sys.stderr,
-        )
-        rprint(res.text, file=sys.stderr)
-        sys.exit(1)
-
-    backup_file = dest_path / "backup.json.gz"
-    with open(backup_file, "wb") as f:
-        f.write(res.content)
+    url = f"{API_URL}/serdes/dump-db/"
+    with requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        stream=True,
+        timeout=(5, 300),
+    ) as res:
+        if res.status_code != 200:
+            rprint(
+                f"[bold red]Error exporting database: {res.status_code}[/bold red]",
+                file=sys.stderr,
+            )
+            rprint(res.text, file=sys.stderr)
+            sys.exit(1)
+
+        backup_file = dest_path / "backup.json.gz"
+        with open(backup_file, "wb") as f:
+            for chunk in res.iter_content(chunk_size=1024 * 1024):
+                if chunk:
+                    f.write(chunk)

And for metadata + batch download:

-        res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+        res = requests.get(
+            url,
+            headers=headers,
+            verify=VERIFY_CERTIFICATE,
+            timeout=(5, 60),
+        )
@@
-            res = requests.post(
+            res = requests.post(
                 url,
                 headers=headers,
                 json={"revision_ids": batch_ids},
                 verify=VERIFY_CERTIFICATE,
-                stream=True,
+                stream=True,
+                timeout=(5, 600),
             )

Also applies to: 521-545, 570-588


710-801: restore_full: high memory usage when building attachments.dat, missing timeout, and bare except on JSON errors.

  1. Memory/OOM risk
    restore_full reads all attachment files into memory and concatenates them before writing to the temp file:

    body_parts = []
    ...
    file_bytes = f.read()
    ...
    body_parts.append(struct.pack(">I", total_size))
    body_parts.append(header_bytes)
    body_parts.append(file_bytes)
    ...
    temp_attachments.write(b"".join(body_parts))

    For large backups this is effectively O(total_attachments_size) RAM and can easily OOM.

    You can avoid this by streaming per file and using Path.stat() (or os.path.getsize) to compute the length prefix:

-        # Build streaming data for ALL attachments
-        body_parts = []
-
-        for file_info in to_upload:
-            # Read file
-            with open(file_info["path"], "rb") as f:
-                file_bytes = f.read()
-
-            # Build JSON header
-            header = {
-                "id": file_info["id"],
-                "evidence_id": file_info["evidence_id"],
-                "version": file_info["version"],
-                "filename": file_info["filename"],
-                "hash": file_info["hash"],
-                "size": len(file_bytes),
-            }
-            header_bytes = json.dumps(header).encode("utf-8")
-
-            # Calculate total size: header + file
-            total_size = len(header_bytes) + len(file_bytes)
-
-            # Add: [4-byte length][header][file]
-            body_parts.append(struct.pack(">I", total_size))
-            body_parts.append(header_bytes)
-            body_parts.append(file_bytes)
-
-        # Create temporary file for attachments binary data
-        temp_attachments = tempfile.NamedTemporaryFile(suffix=".dat", delete=False)
-        temp_attachments.write(b"".join(body_parts))
-        temp_attachments.close()
-        attachments_data = temp_attachments.name
-
-        rprint(
-            f"[dim]Prepared {len(to_upload)} attachments for upload ({sum(len(p) for p in body_parts) / 1024 / 1024:.1f} MB)[/dim]"
-        )
+        temp_attachments = tempfile.NamedTemporaryFile(suffix=".dat", delete=False)
+        total_bytes = 0
+
+        for file_info in to_upload:
+            file_path = file_info["path"]
+            file_size = file_path.stat().st_size
+
+            header = {
+                "id": file_info["id"],
+                "evidence_id": file_info["evidence_id"],
+                "version": file_info["version"],
+                "filename": file_info["filename"],
+                "hash": file_info["hash"],
+                "size": file_size,
+            }
+            header_bytes = json.dumps(header).encode("utf-8")
+            total_size = len(header_bytes) + file_size
+
+            # [4-byte length][header][file]
+            temp_attachments.write(struct.pack(">I", total_size))
+            temp_attachments.write(header_bytes)
+            with open(file_path, "rb") as f:
+                for chunk in iter(lambda: f.read(1024 * 1024), b""):
+                    temp_attachments.write(chunk)
+            total_bytes += total_size
+
+        temp_attachments.flush()
+        temp_attachments.close()
+        attachments_data = temp_attachments.name
+
+        rprint(
+            f"[dim]Prepared {len(to_upload)} attachments for upload ({total_bytes / 1024 / 1024:.1f} MB)[/dim]"
+        )
    
    
    
  2. No timeout on restore request
    The requests.post to /serdes/full-restore/ (Line 821) has no timeout; a stuck backend could hang the CLI indefinitely. Add a reasonable timeout:

-    res = requests.post(
-        url,
-        headers=headers,
-        files=files,
-        verify=VERIFY_CERTIFICATE,
-    )
+    res = requests.post(
+        url,
+        headers=headers,
+        files=files,
+        verify=VERIFY_CERTIFICATE,
+        timeout=(5, 600),
+    )
    
    
    
  3. Bare except when parsing error response
    The error path still uses a bare except:

    try:
        error_data = res.json()
        rprint(f"[red]{error_data}[/red]", file=sys.stderr)
    except:
        rprint(res.text, file=sys.stderr)

    This can mask unrelated errors and makes debugging harder. Narrow it to JSON decode errors:

-    try:
-        error_data = res.json()
-        rprint(f"[red]{error_data}[/red]", file=sys.stderr)
-    except:
-        rprint(res.text, file=sys.stderr)
+    try:
+        error_data = res.json()
+        rprint(f"[red]{error_data}[/red]", file=sys.stderr)
+    except ValueError:
+        # Not JSON, fallback to raw text
+        rprint(res.text, file=sys.stderr)
    
    
    

These changes will make full restores more robust on large datasets and avoid CLI hangs or silent masking of non‑JSON errors.

Also applies to: 802-872

backend/serdes/views.py (2)

279-281: Consider streaming for large ZIP exports.

Returning zip_buffer.getvalue() loads the entire ZIP into memory. For instances with many/large attachments, consider using StreamingHttpResponse or a spooled temp file to avoid OOM.

This was flagged in a previous review. The current implementation works but may not scale for very large attachment sets.


439-455: Catch ValidationError from compare_schema_versions.

compare_schema_versions raises ValidationError for schema version mismatches (per backend/core/utils.py lines 179-265), but only ValueError and TypeError are caught here. This causes version mismatches to bubble up to the generic exception handler at line 490, returning a 500 instead of a proper 400 with the error code.

+from django.core.exceptions import ValidationError
 ...
             try:
                 schema_version_int = int(schema_version)
                 compare_schema_versions(schema_version_int, backup_version)
                 if backup_version != VERSION:
                     raise ValueError(
                         "The version of the current instance and the one that generated the backup are not the same."
                     )
-            except (ValueError, TypeError) as e:
+            except (ValidationError, ValueError, TypeError) as e:
                 logger.error(
                     "Invalid schema version format",
                     schema_version=schema_version,
                     exc_info=e,
                 )
+                error_code = "InvalidSchemaVersion"
+                if isinstance(e, ValidationError):
+                    try:
+                        error_code = e.message_dict.get("error", error_code)
+                        if isinstance(error_code, list):
+                            error_code = error_code[0]
+                    except Exception:
+                        pass
                 return Response(
-                    {"error": "InvalidSchemaVersion"},
+                    {"error": error_code},
                     status=status.HTTP_400_BAD_REQUEST,
                 )
🧹 Nitpick comments (6)
cli/.clica.env.template (1)

1-3: VERIFY_CERTIFICATE default here weakens TLS verification compared to code default.

The CLI defaults VERIFY_CERTIFICATE to true when unset, but this template sets it to false, which will disable TLS verification for anyone copying it as‑is. Consider either:

  • defaulting to true in the template and documenting how to turn it off for local/self‑signed setups, or
  • adding a strong warning comment in this template about using false only for trusted local environments.
cli/README.md (1)

319-322: Fix blank line between blockquotes (MD028).

There's a blank line between the WARNING and TIP blockquotes which triggers markdownlint MD028. Remove the blank line at line 320.

 > [!WARNING]
 > Restoring a backup will **replace all existing data** in your CISO Assistant instance. Make sure you have a current backup before performing a restore operation.
-
 > [!TIP]
backend/serdes/export_utils.py (3)

34-46: Use POSIX paths for ZIP entries and consider streaming writes.

os.path.join produces backslashes on Windows, which violates the ZIP specification requiring forward slashes. Also, file.read() loads the entire file into memory.

+import posixpath
+import shutil
 ...
                 try:
                     with default_storage.open(revision.attachment.name, "rb") as file:
-                        file_content = file.read()
-
                         filename = (
                             f"{revision.evidence_id}_v{revision.version}_"
                             f"{os.path.basename(revision.attachment.name)}"
                         )
 
-                        zip_path = os.path.join(
+                        zip_path = posixpath.join(
                             "attachments", "evidence-revisions", filename
                         )
 
-                        zipf.writestr(zip_path, file_content)
+                        with zipf.open(zip_path, "w") as zf:
+                            shutil.copyfileobj(file, zf, length=1024 * 1024)
                         count += 1

145-150: Make restore idempotent by deleting existing files before save.

default_storage.save() may auto-rename files if the path exists, causing orphaned files and path drift on repeated restores.

                             storage_path = (
                                 f"evidence-revisions/{evidence_id}/"
                                 f"v{version}/{original_filename}"
                             )
 
+                            # Delete existing file for idempotent restore
+                            if revision.attachment:
+                                try:
+                                    default_storage.delete(revision.attachment.name)
+                                except Exception:
+                                    pass
+                            if default_storage.exists(storage_path):
+                                try:
+                                    default_storage.delete(storage_path)
+                                except Exception:
+                                    pass
+
                             saved_path = default_storage.save(
                                 storage_path, io.BytesIO(file_content)
                             )

159-162: Avoid exposing raw exception details in API responses.

str(e) may contain sensitive internal information. Log the full exception server-side and return a sanitized message.

                             except Exception as e:
+                                logger.error(
+                                    "Failed to restore attachment",
+                                    filename=filename,
+                                    evidence_id=evidence_id,
+                                    version=version,
+                                    error=str(e),
+                                    exc_info=True,
+                                )
                                 stats["errors"].append(
-                                    f"Failed to restore {filename}: {str(e)}"
+                                    f"Failed to restore {filename}"
                                 )
backend/serdes/views.py (1)

870-884: File is fully loaded into memory before streaming.

Despite the comment "Read file in chunks", the entire file is buffered in file_buffer before yielding. For large files, this defeats the streaming benefit. However, fixing this requires knowing total_size upfront for the length prefix, which complicates true streaming.

One approach: compute size via default_storage.size() first, then stream chunks directly:

file_size = default_storage.size(revision.attachment.name)
total_size = len(header_bytes) + file_size
yield struct.pack(">I", total_size)
yield header_bytes
with default_storage.open(revision.attachment.name, "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        yield chunk
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d42fb4 and 346b00a.

📒 Files selected for processing (9)
  • backend/core/migrations/0119_add_attachment_hash_to_evidencerevision.py (1 hunks)
  • backend/core/models.py (3 hunks)
  • backend/serdes/export_utils.py (1 hunks)
  • backend/serdes/urls.py (1 hunks)
  • backend/serdes/views.py (4 hunks)
  • cli/.clica.env.template (1 hunks)
  • cli/.gitignore (1 hunks)
  • cli/README.md (2 hunks)
  • cli/clica.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cli/.gitignore
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-19T08:46:34.495Z
Learnt from: Mohamed-Hacene
Repo: intuitem/ciso-assistant-community PR: 2550
File: backend/core/serializers.py:1277-1292
Timestamp: 2025-09-19T08:46:34.495Z
Learning: In EvidenceWriteSerializer, the link and attachment fields are read-only, so updates to these fields should be done through EvidenceRevisionWriteSerializer to maintain proper versioning.

Applied to files:

  • backend/core/models.py
📚 Learning: 2025-09-19T09:08:52.306Z
Learnt from: Mohamed-Hacene
Repo: intuitem/ciso-assistant-community PR: 2550
File: backend/core/migrations/0098_evidence_lifecycle_data_migration.py:21-28
Timestamp: 2025-09-19T09:08:52.306Z
Learning: EvidenceRevision.save() method automatically sets is_published from the parent Evidence object and handles folder assignment from the evidence's folder with proper safety checks.

Applied to files:

  • backend/core/models.py
🧬 Code graph analysis (4)
backend/core/models.py (1)
backend/core/utils.py (1)
  • sha256 (37-41)
backend/serdes/views.py (4)
backend/serdes/export_utils.py (4)
  • AttachmentExporter (15-82)
  • AttachmentImporter (85-196)
  • create_attachments_zip (61-82)
  • extract_attachments_from_zip (86-196)
backend/core/views.py (5)
  • get (1005-1012)
  • get (6597-6598)
  • attachment (7156-7186)
  • attachment (7203-7228)
  • filter (190-216)
backend/core/models.py (21)
  • EvidenceRevision (3388-3508)
  • filename (3356-3361)
  • filename (3489-3490)
  • attachment_hash (3382-3385)
  • delete (971-984)
  • delete (5189-5196)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/core/utils.py (2)
  • compare_schema_versions (180-266)
  • sha256 (37-41)
backend/core/migrations/0119_add_attachment_hash_to_evidencerevision.py (2)
backend/core/models.py (18)
  • EvidenceRevision (3388-3508)
  • update (874-886)
  • attachment_hash (3382-3385)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/core/utils.py (1)
  • sha256 (37-41)
backend/serdes/urls.py (1)
backend/serdes/views.py (5)
  • ExportAttachmentsView (250-303)
  • FullRestoreView (387-703)
  • AttachmentMetadataView (706-793)
  • BatchDownloadAttachmentsView (796-917)
  • BatchUploadAttachmentsView (920-1131)
🪛 markdownlint-cli2 (0.18.1)
cli/README.md

321-321: Blank line inside blockquote

(MD028, no-blanks-blockquote)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build_community_frontend
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: test (3.12)
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: build (3.12)
  • GitHub Check: Analyze (python)
🔇 Additional comments (5)
backend/core/migrations/0119_add_attachment_hash_to_evidencerevision.py (1)

8-43: Backfill migration looks well-designed.

The chunked reading (1MB), iterator with chunk_size=100, and per-record error handling ensure the migration runs safely on large datasets without OOM. Progress logging every 100 records is helpful for monitoring.

One minor consideration: if a storage backend is slow or unavailable during migration, this could take a long time or fail silently for many records. Consider adding a summary warning if errors > 0 at the end.
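
For illustration, a minimal sketch of such a chunked backfill with a final error summary — hypothetical helper names, assuming the app label is core and the field is attachment_hash, not the actual migration code:

import hashlib

def backfill_attachment_hashes(apps, schema_editor):
    """Backfill SHA-256 hashes for existing attachments, reading in 1 MB chunks (sketch)."""
    from django.core.files.storage import default_storage

    EvidenceRevision = apps.get_model("core", "EvidenceRevision")
    errors = 0

    for revision in EvidenceRevision.objects.exclude(attachment="").iterator(chunk_size=100):
        try:
            digest = hashlib.sha256()
            with default_storage.open(revision.attachment.name, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            revision.attachment_hash = digest.hexdigest()
            revision.save(update_fields=["attachment_hash"])
        except Exception:
            errors += 1  # keep going; one unreadable file should not abort the migration

    if errors:
        print(f"Attachment hash backfill finished with {errors} errors")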

backend/serdes/urls.py (1)

12-37: URL patterns are well-structured.

The new endpoints follow REST conventions and are logically grouped. The comment on line 22 helps clarify the streaming batch endpoints section.

cli/README.md (1)

225-268: Documentation for backup_full is comprehensive and clear.

The explanation of the streaming approach, resume capability, and hash verification is well-documented. The technical details section provides useful context for understanding the protocol format.

backend/serdes/views.py (2)

60-64: Good fix: Excluding PAT from backup/restore.

Adding "iam.personalaccesstoken" to the exclude list addresses issue #2700 where restoring a backup with personal authentication tokens caused IntegrityError: NOT NULL constraint failed.


306-372: LoadAttachmentsView implementation looks correct.

Proper permission check, file type validation, and partial success handling with 207 status for mixed results. Error details are logged server-side while the response provides a sanitized count-only summary.
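
As a reference, the partial-success pattern boils down to something like this in DRF (illustrative helper, not the actual view code):

from rest_framework import status
from rest_framework.response import Response

def build_import_response(imported: int, errors: list[str]) -> Response:
    """Return 200 on full success, 207 Multi-Status when some files failed (sketch)."""
    if not errors:
        return Response({"imported": imported}, status=status.HTTP_200_OK)
    # Error details are logged server-side; clients only receive counts.
    return Response(
        {"imported": imported, "failed": len(errors)},
        status=status.HTTP_207_MULTI_STATUS,
    )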

Comment on lines +521 to +545
# Fetch all attachment metadata from API (paginated)
rprint("[dim]Fetching attachment metadata from server...[/dim]")
all_metadata = []
url = f"{API_URL}/serdes/attachment-metadata/"

while url:
    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)

    if res.status_code != 200:
        rprint(
            f"[bold red]Error fetching metadata: {res.status_code}[/bold red]",
            file=sys.stderr,
        )
        rprint(res.text, file=sys.stderr)
        sys.exit(1)

    data = res.json()
    all_metadata.extend(data["results"])
    url = data.get("next")
    if url and not url.startswith("http"):
        # Convert relative URL to absolute
        url = f"{API_URL}/serdes/attachment-metadata/{url}"

rprint(f"[cyan]Found {len(all_metadata)} total attachments[/cyan]")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the relevant code section
fd -e py -p "cli/clica.py" | head -1 | xargs -I {} sh -c 'wc -l {} && echo "---" && sed -n "515,550p" {}'

Repository: intuitem/ciso-assistant-community

Length of output: 1434


🏁 Script executed:

# Check for API_URL definition
rg "API_URL\s*=" cli/clica.py -B 2 -A 2

Repository: intuitem/ciso-assistant-community

Length of output: 289


🏁 Script executed:

# Look for other pagination patterns in the codebase
rg "\.get\(.*next" --type py -A 3 -B 1

Repository: intuitem/ciso-assistant-community

Length of output: 1388


🏁 Script executed:

# Check if there are tests for this pagination logic
fd -e py -p "test" | xargs rg "attachment-metadata|pagination" 2>/dev/null

Repository: intuitem/ciso-assistant-community

Length of output: 61


🏁 Script executed:

# Find the attachment-metadata endpoint in the backend
rg "attachment-metadata" --type py | head -20

Repository: intuitem/ciso-assistant-community

Length of output: 331


🏁 Script executed:

# Look for DRF pagination configuration
rg "DEFAULT_PAGINATION_CLASS|pagination" backend/ --type py -B 2 -A 2 | head -40

Repository: intuitem/ciso-assistant-community

Length of output: 1067


🏁 Script executed:

# Search for the serializer or view handling attachment-metadata
fd -e py -p "backend" | xargs rg "attachment.metadata|AttachmentMetadata" | head -20

Repository: intuitem/ciso-assistant-community

Length of output: 476


🏁 Script executed:

# Check Django REST Framework settings
fd -e py -p "backend.*settings" | xargs cat 2>/dev/null | grep -A 10 -B 2 "REST_FRAMEWORK"

Repository: intuitem/ciso-assistant-community

Length of output: 1957


🏁 Script executed:

# Read the custom pagination class
cat -n backend/core/pagination.py

Repository: intuitem/ciso-assistant-community

Length of output: 919


🏁 Script executed:

# Check the AttachmentMetadataView implementation
rg "class AttachmentMetadataView" backend/serdes/views.py -A 50

Repository: intuitem/ciso-assistant-community

Length of output: 1890


🏁 Script executed:

# Get the complete AttachmentMetadataView to see how it returns paginated data
rg "class AttachmentMetadataView" backend/serdes/views.py -A 100 | head -80

Repository: intuitem/ciso-assistant-community

Length of output: 3012


🏁 Script executed:

# Check what the paginator.get_paginated_response returns
rg "get_paginated_response" backend/serdes/views.py -B 3 -A 3

Repository: intuitem/ciso-assistant-community

Length of output: 59


URL reconstruction logic doesn't safely handle absolute-path pagination links.

The code assumes next is either an absolute URL or query-only (?page=2), but if the backend ever returns an absolute path like /serdes/attachment-metadata/?page=2, the concatenation creates a malformed URL:

url = f"{API_URL}/serdes/attachment-metadata/{url}"  # → {API_URL}/serdes/attachment-metadata//serdes/attachment-metadata/?page=2

Use urllib.parse.urljoin or check if the URL starts with / to handle both cases correctly.
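
A small sketch of the safer reconstruction (illustrative helper, not the current clica.py code):

from urllib.parse import urljoin

def next_page_url(api_url: str, next_link: str | None) -> str | None:
    """Normalize DRF pagination links to absolute URLs (sketch)."""
    if not next_link:
        return None
    if next_link.startswith("http"):
        return next_link
    if next_link.startswith("/"):
        # Absolute path: join against the API base.
        return api_url.rstrip("/") + next_link
    # Query-only or relative link: resolve against the endpoint.
    return urljoin(api_url.rstrip("/") + "/serdes/attachment-metadata/", next_link)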

🤖 Prompt for AI Agents
In cli/clica.py around lines 521 to 545, the pagination URL reconstruction
blindly prepends API_URL plus the path, which breaks when the server returns an
absolute-path next value (e.g. "/serdes/attachment-metadata/?page=2"); update
the logic to build the next URL safely by using urllib.parse.urljoin(API_URL,
next) or by checking for a leading slash and concatenating appropriately (i.e.,
if next startswith "http" use it as-is, elif next.startswith("/") use
API_URL.rstrip("/") + next, else join the base path and next), so both
relative-query and absolute-path pagination links produce valid absolute URLs.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

♻️ Duplicate comments (4)
backend/serdes/export_utils.py (2)

26-59: [Duplicate] Exporter: still reading full files into memory and using OS-specific path separators.

The past review comment on lines 26-48 identified that this code reads entire attachment files into memory (line 35) and uses os.path.join (lines 42-44) which produces backslashes on Windows. These issues remain unaddressed and can cause out-of-memory errors for large attachments and cross-platform compatibility issues.

Apply the previously suggested fix:

 import io
 import os
+import posixpath
+import shutil
 import zipfile
                 try:
                     with default_storage.open(revision.attachment.name, "rb") as file:
-                        file_content = file.read()
-
                         filename = (
                             f"{revision.evidence_id}_v{revision.version}_"
                             f"{os.path.basename(revision.attachment.name)}"
                         )
 
-                        zip_path = os.path.join(
+                        zip_path = posixpath.join(
                             "attachments", "evidence-revisions", filename
                         )
 
-                        zipf.writestr(zip_path, file_content)
+                        with zipf.open(zip_path, "w") as zf:
+                            shutil.copyfileobj(file, zf, length=1024 * 1024)
                         count += 1

108-174: [Duplicate] Importer: restore is not idempotent and reads full files into memory.

The past review comment on lines 108-174 identified two issues that remain unaddressed:

  1. Non-idempotent restores: At line 145, default_storage.save() will auto-rename if the path exists, leaving orphaned files and changing paths on each restore run.
  2. Full in-memory reads: Line 138 reads the entire file content into memory with zipf.read().

Apply the previously suggested fix for idempotency:

                         if not dry_run:
                             # Find the corresponding EvidenceRevision
                             try:
                                 revision = EvidenceRevision.objects.get(
                                     evidence_id=evidence_id, version=version
                                 )
 
                                 file_content = zipf.read(file_path)
 
                                 storage_path = (
                                     f"evidence-revisions/{evidence_id}/"
                                     f"v{version}/{original_filename}"
                                 )
 
+                                # Prefer overwrite semantics for idempotent restores
+                                try:
+                                    if revision.attachment:
+                                        default_storage.delete(revision.attachment.name)
+                                except Exception:
+                                    logger.warning(
+                                        "Failed to delete previous attachment",
+                                        revision_id=revision.id,
+                                        attachment_name=getattr(revision.attachment, "name", None),
+                                    )
+                                try:
+                                    if default_storage.exists(storage_path):
+                                        default_storage.delete(storage_path)
+                                except Exception:
+                                    logger.warning(
+                                        "Failed to delete target path before restore",
+                                        storage_path=storage_path,
+                                    )
+
                                 saved_path = default_storage.save(
                                     storage_path, io.BytesIO(file_content)
                                 )

For the memory issue, consider streaming from the ZIP directly to storage to avoid loading entire files into RAM.

backend/serdes/views.py (2)

250-303: [Duplicate] ExportAttachmentsView: large in-memory ZIP can cause OOM.

The past review comment identified that returning zip_buffer.getvalue() (line 280) loads the entire ZIP into memory, risking out-of-memory errors for large evidence sets. This issue remains unaddressed.

Consider using a SpooledTemporaryFile or streaming response:

+import tempfile
+from django.http import FileResponse
         try:
             exporter = AttachmentExporter()
-            zip_buffer, count = exporter.create_attachments_zip()
+            revisions = exporter.collect_all_attachments()
 
-            if count == 0:
+            if revisions.count() == 0:
                 logger.info("No attachments found to export")
                 return Response(
                     {"message": "No attachments found"},
                     status=status.HTTP_404_NOT_FOUND,
                 )
 
             # Create response with ZIP file
             timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
             filename = f"ciso-assistant-attachments-{timestamp}.zip"
 
-            response = HttpResponse(
-                zip_buffer.getvalue(), content_type="application/zip"
-            )
+            # Use temp file to avoid OOM
+            temp_file = tempfile.SpooledTemporaryFile(max_size=10*1024*1024)
+            with zipfile.ZipFile(temp_file, "w", zipfile.ZIP_DEFLATED) as zipf:
+                count = exporter.package_attachments_to_zip(revisions, zipf)
+            temp_file.seek(0)
+            
+            response = FileResponse(
+                temp_file, content_type="application/zip", as_attachment=True
+            )
             response["Content-Disposition"] = f'attachment; filename="{filename}"'

439-455: [Duplicate] FullRestoreView: catch ValidationError from compare_schema_versions.

The past review comment noted that compare_schema_versions (line 441) can raise ValidationError on schema incompatibility, but the current code only catches ValueError and TypeError (line 446). This means validation errors will bubble to the outer exception handler and return an HTTP 500 instead of the more appropriate HTTP 400.

Apply the suggested fix:

+from django.core.exceptions import ValidationError
             try:
                 schema_version_int = int(schema_version)
                 compare_schema_versions(schema_version_int, backup_version)
                 if backup_version != VERSION:
                     raise ValueError(
                         "The version of the current instance and the one that generated the backup are not the same."
                     )
-            except (ValueError, TypeError) as e:
+            except (ValidationError, ValueError, TypeError) as e:
                 logger.error(
                     "Invalid schema version format",
                     schema_version=schema_version,
                     exc_info=e,
                 )
                 return Response(
                     {"error": "InvalidSchemaVersion"},
                     status=status.HTTP_400_BAD_REQUEST,
                 )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 346b00a and 23d41a5.

📒 Files selected for processing (2)
  • backend/serdes/export_utils.py (1 hunks)
  • backend/serdes/views.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
backend/serdes/export_utils.py (2)
backend/core/models.py (19)
  • Evidence (3305-3385)
  • EvidenceRevision (3388-3508)
  • filename (3356-3361)
  • filename (3489-3490)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/serdes/views.py (3)
  • get (39-73)
  • get (251-303)
  • get (712-793)
backend/serdes/views.py (5)
backend/serdes/export_utils.py (4)
  • AttachmentExporter (15-82)
  • AttachmentImporter (85-194)
  • create_attachments_zip (61-82)
  • extract_attachments_from_zip (86-194)
backend/core/models.py (19)
  • EvidenceRevision (3388-3508)
  • filename (3356-3361)
  • filename (3489-3490)
  • attachment_hash (3382-3385)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/core/utils.py (2)
  • compare_schema_versions (180-266)
  • sha256 (37-41)
backend/serdes/serializers.py (1)
  • LoadBackupSerializer (20-24)
backend/iam/models.py (1)
  • has_backup_permission (729-734)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build_community_frontend
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: build (3.12)
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: test (3.12)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
backend/serdes/views.py (3)

60-60: Good fix for personal authentication token issue.

Excluding iam.personalaccesstoken from the backup export addresses issue #2700 where importing a database with personal authentication tokens caused integrity constraint failures.

Based on PR objectives: This change resolves the reported issue where exports containing users with personal authentication tokens would fail to import with an IntegrityError.


306-384: LoadAttachmentsView: good error handling and sanitization.

The error handling properly avoids exposing raw exception details to clients (lines 360, 382) while logging full details server-side with exc_info=True. The use of HTTP 207 for partial success is appropriate.


1098-1118: BatchUploadAttachmentsView: proper error sanitization.

The error handling properly sanitizes exception details before returning to clients (lines 1100, 1115) while logging full details with exc_info=True. This addresses the previous security concerns about information exposure.

Comment on lines +870 to +884
# Read file in chunks
file_buffer = io.BytesIO()
with default_storage.open(revision.attachment.name, "rb") as f:
    for chunk in iter(lambda: f.read(10240 * 1024), b""):
        file_buffer.write(chunk)

file_bytes = file_buffer.getvalue()

# Calculate total size: header_length + header + file
total_size = len(header_bytes) + len(file_bytes)

# Yield: [4-byte total size][header bytes][file bytes]
yield struct.pack(">I", total_size)
yield header_bytes
yield file_bytes
⚠️ Potential issue | 🟠 Major

BatchDownloadAttachmentsView: avoid buffering entire file before streaming.

Lines 870-876 read the entire attachment into an io.BytesIO buffer before yielding, which defeats the purpose of streaming and can cause memory issues for large files.

Stream the file content directly:

-                    # Read file in chunks
-                    file_buffer = io.BytesIO()
+                    # Calculate file size first
+                    file_size = default_storage.size(revision.attachment.name)
+                    
+                    # Calculate total size: header + file
+                    total_size = len(header_bytes) + file_size
+                    
+                    # Yield: [4-byte total size][header bytes]
+                    yield struct.pack(">I", total_size)
+                    yield header_bytes
+                    
+                    # Stream file in chunks
                     with default_storage.open(revision.attachment.name, "rb") as f:
-                        for chunk in iter(lambda: f.read(10240 * 1024), b""):
-                            file_buffer.write(chunk)
-
-                    file_bytes = file_buffer.getvalue()
-
-                    # Calculate total size: header_length + header + file
-                    total_size = len(header_bytes) + len(file_bytes)
-
-                    # Yield: [4-byte total size][header bytes][file bytes]
-                    yield struct.pack(">I", total_size)
-                    yield header_bytes
-                    yield file_bytes
+                        for chunk in iter(lambda: f.read(1024 * 1024), b""):
+                            yield chunk
🤖 Prompt for AI Agents
In backend/serdes/views.py around lines 870 to 884, the code currently reads the
entire attachment into an io.BytesIO before yielding which uses excessive
memory; instead obtain the remote file size via
default_storage.size(revision.attachment.name), compute total_size =
len(header_bytes) + file_size, yield the 4-byte total size and header_bytes,
then open the file with default_storage.open(revision.attachment.name, "rb") and
stream its contents by iterating over f.read(CHUNK_SIZE) and yielding each chunk
(no in-memory buffering); remove the temporary BytesIO and ensure CHUNK_SIZE
matches previous value (e.g., 10*1024*1024).

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 3

♻️ Duplicate comments (6)
cli/clica.py (3)

488-504: Database backup still buffers entire response in memory and lacks timeout.

The res.content approach loads the entire database backup into memory before writing. For large instances this can cause OOM. Additionally, no timeout is specified.

Apply streaming and timeout as suggested in the previous review:

-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        stream=True,
+        timeout=(5, 300),
+    )
 
     if res.status_code != 200:
         ...
 
     backup_file = dest_path / "backup.json.gz"
     with open(backup_file, "wb") as f:
-        f.write(res.content)
+        for chunk in res.iter_content(chunk_size=1024 * 1024):
+            if chunk:
+                f.write(chunk)

537-553: Missing timeout on metadata fetch; URL reconstruction still flawed.

The pagination requests lack timeouts. Additionally, the URL handling logic doesn't correctly handle absolute-path responses (e.g., /serdes/attachment-metadata/?page=2).

+from urllib.parse import urljoin
+
 while url:
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE, timeout=(5, 60))
     ...
     url = data.get("next")
-    if url and not url.startswith("http"):
-        # Convert relative URL to absolute
-        url = f"{API_URL}/serdes/attachment-metadata/{url}"
+    if url and not url.startswith("http"):
+        url = urljoin(API_URL + "/", url)

845-871: Restore request lacks timeout; bare except: catches too broadly.

These issues were flagged in the previous review and remain unaddressed.

     res = requests.post(
         url,
         headers=headers,
         files=files,
         verify=VERIFY_CERTIFICATE,
+        timeout=(5, 600),
     )
 ...
     try:
         error_data = res.json()
         rprint(f"[red]{error_data}[/red]", file=sys.stderr)
-    except:
+    except ValueError:
         rprint(res.text, file=sys.stderr)
backend/serdes/views.py (3)

279-288: Memory concern with full ZIP in RAM remains unaddressed.

The zip_buffer.getvalue() approach loads the entire ZIP into memory, which risks OOM on large attachment sets. This was flagged in a previous review.


870-884: File buffering defeats streaming purpose - previously flagged.

The entire attachment is read into file_buffer BytesIO before yielding, which defeats the purpose of streaming and can cause memory issues for large files. This was flagged in a previous review with a suggested fix to calculate file size first and stream directly.


439-455: ValidationError from compare_schema_versions not caught.

The compare_schema_versions function raises ValidationError for version mismatches (e.g., {"error": "backupGreaterVersionError"}), but only ValueError and TypeError are caught here. A version mismatch will fall through to the outer except Exception at line 490 and return a generic DatabaseRestoreFailed (500) instead of the appropriate InvalidSchemaVersion (400).

Apply this diff to handle ValidationError:

 from core.utils import compare_schema_versions
+from django.core.exceptions import ValidationError
             try:
                 schema_version_int = int(schema_version)
                 compare_schema_versions(schema_version_int, backup_version)
                 if backup_version != VERSION:
                     raise ValueError(
                         "The version of the current instance and the one that generated the backup are not the same."
                     )
-            except (ValueError, TypeError) as e:
+            except ValidationError as e:
+                # Handle version validation errors from compare_schema_versions
+                error_code = "InvalidSchemaVersion"
+                if hasattr(e, 'message_dict') and 'error' in e.message_dict:
+                    error_code = e.message_dict['error']
+                    if isinstance(error_code, list):
+                        error_code = error_code[0]
+                logger.error(
+                    "Schema version validation failed",
+                    schema_version=schema_version,
+                    error_code=error_code,
+                )
+                return Response(
+                    {"error": error_code},
+                    status=status.HTTP_400_BAD_REQUEST,
+                )
+            except (ValueError, TypeError) as e:
                 logger.error(
                     "Invalid schema version format",
                     schema_version=schema_version,
                     exc_info=e,
                 )
                 return Response(
                     {"error": "InvalidSchemaVersion"},
                     status=status.HTTP_400_BAD_REQUEST,
                 )
🧹 Nitpick comments (4)
cli/clica.py (1)

632-647: Header parsing is inefficient but functional.

The byte-by-byte JSON parsing attempts up to 1024 decode operations per block. This works but is O(n²) in the worst case.

If the protocol can be modified, consider having the backend send the header length as a prefix (similar to the block length) to enable direct parsing. This would be a coordinated change with the backend.
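
A sketch of what that coordinated change could look like — an inner header-length prefix so the client splits header from payload in one step (hypothetical framing, not the current protocol):

import json
import struct

def pack_block_body(header: dict, payload: bytes) -> bytes:
    """Block body as [4-byte header length][header JSON][payload]; the outer
    [4-byte total size] prefix of the existing protocol stays unchanged."""
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack(">I", len(header_bytes)) + header_bytes + payload

def unpack_block_body(body: bytes) -> tuple[dict, bytes]:
    """Split header and payload with a single json.loads, no trial decoding."""
    (header_len,) = struct.unpack(">I", body[:4])
    header = json.loads(body[4 : 4 + header_len].decode("utf-8"))
    return header, body[4 + header_len :]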

backend/serdes/views.py (3)

502-667: Consider extracting shared attachment processing logic.

The attachment restoration code here (binary parsing, header validation, hash verification, storage save) is largely duplicated in BatchUploadAttachmentsView. Extracting this into a utility function would reduce maintenance burden and ensure consistent behavior.


749-778: Per-item storage I/O may impact performance on large result sets.

Each iteration calls default_storage.size() and default_storage.exists(), which may be slow depending on the storage backend (especially for S3/remote storage). For pagination with many items, consider caching file sizes in the model or limiting default page size.


980-988: Header parsing is functional but inefficient.

The loop attempts to parse progressively longer byte sequences as JSON, making up to n parse attempts for a header of length n (quadratic work overall). Consider including a header length prefix in the binary format for more efficient parsing in future versions.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23d41a5 and b64efb2.

📒 Files selected for processing (3)
  • backend/serdes/views.py (4 hunks)
  • cli/.clica.env (0 hunks)
  • cli/clica.py (3 hunks)
💤 Files with no reviewable changes (1)
  • cli/.clica.env
🧰 Additional context used
🧬 Code graph analysis (1)
backend/serdes/views.py (4)
backend/serdes/export_utils.py (4)
  • AttachmentExporter (15-82)
  • AttachmentImporter (85-194)
  • create_attachments_zip (61-82)
  • extract_attachments_from_zip (86-194)
backend/core/views.py (5)
  • get (1005-1012)
  • get (6597-6598)
  • attachment (7156-7186)
  • attachment (7203-7228)
  • filter (190-216)
backend/core/models.py (21)
  • EvidenceRevision (3388-3508)
  • filename (3356-3361)
  • filename (3489-3490)
  • attachment_hash (3382-3385)
  • delete (971-984)
  • delete (5189-5196)
  • save (1654-1662)
  • save (1778-1781)
  • save (2779-2781)
  • save (3340-3342)
  • save (3431-3487)
  • save (3662-3666)
  • save (3876-3900)
  • save (4261-4263)
  • save (4398-4401)
  • save (4469-4471)
  • save (5198-5228)
  • save (5351-5357)
  • save (6295-6315)
  • save (6850-6863)
  • save (7023-7039)
backend/core/utils.py (2)
  • compare_schema_versions (180-266)
  • sha256 (37-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: test (3.12)
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: build (3.12)
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: build_community_frontend
  • GitHub Check: Analyze (python)
🔇 Additional comments (8)
cli/clica.py (2)

2-7: LGTM!

The new standard library imports are appropriate for the streaming protocol and integrity verification features.


906-907: LGTM!

Commands are properly registered following the existing pattern.

backend/serdes/views.py (6)

1-33: LGTM!

The imports are appropriately organized and all added modules are utilized by the new views.


60-60: Good fix for Issue #2700.

Excluding iam.personalaccesstoken from the backup prevents the IntegrityError: NOT NULL constraint failed: global_settings_globalsettings.folder_id that occurred when importing databases containing personal access tokens.


136-136: Consistent exclusion during import.

Correctly mirrors the export exclusion to ensure personal access tokens are excluded from the restore process.


354-363: Error handling looks appropriate.

Error details are logged server-side while clients receive sanitized messages. The "Check server logs for details" guidance is appropriate for operators.


1097-1117: Error handling improvements look good.

Exception details are now sanitized - using generic messages like "Error processing block" and "An internal error has occurred" instead of exposing str(e). Full details are appropriately logged server-side.


947-950: Reading the entire request body into memory may cause issues with very large uploads, but Django's DATA_UPLOAD_MAX_MEMORY_SIZE setting (2.5 MB by default) already protects against this: when request.body is accessed, Django enforces a size limit computed against the total request size excluding file-upload data. For file uploads, Django's MemoryFileUploadHandler and TemporaryFileUploadHandler read small files into memory and spill large ones to disk. However, if this code processes raw (non-multipart) request bodies or batches that bypass the standard upload handlers, verify that adequate size limits are configured for your use case, and consider whether FILE_UPLOAD_MAX_MEMORY_SIZE or DATA_UPLOAD_MAX_MEMORY_SIZE needs adjustment based on typical batch sizes.
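
If larger raw bodies are expected, the relevant knobs are plain settings values, e.g. (illustrative numbers, not project defaults):

# settings.py (sketch) — raise limits only as far as the deployment actually needs.
DATA_UPLOAD_MAX_MEMORY_SIZE = 50 * 1024 * 1024   # cap on raw request bodies read via request.body
FILE_UPLOAD_MAX_MEMORY_SIZE = 10 * 1024 * 1024   # above this, uploads spill to temporary files on disk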

Comment on lines +597 to +605
# Request batch download
url = f"{API_URL}/serdes/batch-download-attachments/"
res = requests.post(
    url,
    headers=headers,
    json={"revision_ids": batch_ids},
    verify=VERIFY_CERTIFICATE,
    stream=True,
)
⚠️ Potential issue | 🟡 Minor

Add timeout to batch download request.

The batch download uses streaming correctly but lacks a timeout, which can cause the CLI to hang indefinitely on network issues.

 res = requests.post(
     url,
     headers=headers,
     json={"revision_ids": batch_ids},
     verify=VERIFY_CERTIFICATE,
     stream=True,
+    timeout=(5, 300),
 )
🤖 Prompt for AI Agents
In cli/clica.py around lines 597 to 605, the requests.post call used for the
batch download is missing a timeout which can cause the CLI to hang; add a
timeout parameter (e.g., timeout=30 or a configurable constant) to the
requests.post call and update error handling to catch requests.Timeout and
requests.RequestException so the CLI fails fast and reports a clear error
instead of hanging.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

♻️ Duplicate comments (5)
cli/clica.py (5)

597-605: Add timeout to batch download request.

The batch download request lacks a timeout, which can cause the CLI to hang indefinitely on network issues.

Apply this diff:

 # Request batch download
 url = f"{API_URL}/serdes/batch-download-attachments/"
 res = requests.post(
     url,
     headers=headers,
     json={"revision_ids": batch_ids},
     verify=VERIFY_CERTIFICATE,
     stream=True,
+    timeout=(5, 300),
 )

488-504: Add timeout and stream the database backup to avoid OOM.

The database export uses res.content, which buffers the entire backup in memory before writing to disk. For large databases, this can cause OOM errors.

Apply this diff to stream the response and add a timeout:

 rprint("[bold blue]Step 1/2: Exporting database backup...[/bold blue]")
 url = f"{API_URL}/serdes/dump-db/"
-res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+res = requests.get(
+    url,
+    headers=headers,
+    verify=VERIFY_CERTIFICATE,
+    stream=True,
+    timeout=(5, 300),
+)

 if res.status_code != 200:
     rprint(
         f"[bold red]Error exporting database: {res.status_code}[/bold red]",
         file=sys.stderr,
     )
     rprint(res.text, file=sys.stderr)
     sys.exit(1)

 backup_file = dest_path / "backup.json.gz"
 with open(backup_file, "wb") as f:
-    f.write(res.content)
+    for chunk in res.iter_content(chunk_size=1024 * 1024):
+        if chunk:
+            f.write(chunk)

841-867: Add timeout to restore request and use specific exception handler.

The restore POST request lacks a timeout (which can cause indefinite hangs), and the bare except: clause at line 865 catches all exceptions including system exits.

Apply this diff:

     try:
         res = requests.post(
             url,
             headers=headers,
             files=files,
             verify=VERIFY_CERTIFICATE,
+            timeout=(5, 600),
         )
     finally:
         # Close file handles
         for file_tuple in files.values():
             file_tuple[1].close()

         # Clean up temp file
         if attachments_data:
             Path(attachments_data).unlink()

     if res.status_code != 200:
         rprint(
             f"[bold red]Error during restore: {res.status_code}[/bold red]",
             file=sys.stderr,
         )
         try:
             error_data = res.json()
             rprint(f"[red]{error_data}[/red]", file=sys.stderr)
-        except:
+        except (ValueError, json.JSONDecodeError):
             rprint(res.text, file=sys.stderr)

537-554: Add timeout to metadata requests and fix URL reconstruction for absolute paths.

The metadata fetch lacks a timeout and the URL reconstruction at lines 551-553 doesn't handle absolute-path pagination links (e.g., /serdes/attachment-metadata/?page=2), which would produce a malformed URL.

Apply this diff:

 while url:
-    res = requests.get(url, headers=headers, verify=VERIFY_CERTIFICATE)
+    res = requests.get(
+        url,
+        headers=headers,
+        verify=VERIFY_CERTIFICATE,
+        timeout=(5, 60),
+    )

     if res.status_code != 200:
         rprint(
             f"[bold red]Error fetching metadata: {res.status_code}[/bold red]",
             file=sys.stderr,
         )
         rprint(res.text, file=sys.stderr)
         sys.exit(1)

     data = res.json()
     all_metadata.extend(data["results"])
     url = data.get("next")
-    if url and not url.startswith("http"):
-        # Convert relative URL to absolute
-        url = f"{API_URL}/serdes/attachment-metadata/{url}"
+    if url:
+        if not url.startswith("http"):
+            if url.startswith("/"):
+                # Absolute path: join with base URL
+                url = API_URL.rstrip("/") + url
+            else:
+                # Relative query string: append to base path
+                url = f"{API_URL}/serdes/attachment-metadata/{url}"

789-820: Stream attachment files to temp file to avoid loading large files into memory.

The current code reads each attachment file fully into memory (line 796: file_bytes = f.read()) before writing to the temp file. For large attachments, this can cause OOM errors.

Apply this diff to stream files in chunks:

             # Stream to temp file with reasonable buffering
             temp_attachments = tempfile.NamedTemporaryFile(
                 suffix=".dat", delete=False, buffering=8 * 1024 * 1024
             )  # 8MB buffer

             for file_info in to_upload:
-                with open(file_info["path"], "rb") as f:
-                    file_bytes = f.read()
-
-                # Build header
+                # Get file size without reading entire file
+                file_size = file_info["path"].stat().st_size
+                
                 header = {
                     "id": file_info["id"],
                     "evidence_id": file_info["evidence_id"],
                     "version": file_info["version"],
                     "filename": file_info["filename"],
                     "hash": file_info["hash"],
-                    "size": len(file_bytes),
+                    "size": file_size,
                 }
                 header_bytes = json.dumps(header).encode("utf-8")
-                total_size = len(header_bytes) + len(file_bytes)
+                total_size = len(header_bytes) + file_size

                 # Write directly (OS will buffer intelligently)
                 temp_attachments.write(struct.pack(">I", total_size))
                 temp_attachments.write(header_bytes)
-                temp_attachments.write(file_bytes)
+                
+                # Stream file content in chunks
+                with open(file_info["path"], "rb") as f:
+                    for chunk in iter(lambda: f.read(10 * 1024 * 1024), b""):
+                        temp_attachments.write(chunk)
🧹 Nitpick comments (2)
cli/clica.py (2)

461-465: Consider adding batch size validation.

The CLI accepts a batch size but doesn't validate it against the backend's limit (default 200). Users won't discover an oversized batch until the request fails at runtime.

Consider adding a note in the help text or validating the batch size:

 @click.option(
     "--batch-size",
     default=200,
-    help="Number of files to download per batch (default: 200)",
+    help="Number of files to download per batch (default: 200, max: 200)",
     type=int,
 )

Or add runtime validation:

if batch_size > 200:
    rprint("[yellow]Warning: Batch size exceeds server limit (200), using 200[/yellow]")
    batch_size = 200

471-472: Consider adding a --quiet option for automation.

For scripted/automated backups, the rich formatting and progress messages may not be desired.

Add a quiet option:

 @click.option(
     "--resume/--no-resume",
     default=True,
     help="Resume from existing manifest (default: True)",
 )
-def backup_full(dest_dir, batch_size, resume):
+@click.option(
+    "--quiet",
+    is_flag=True,
+    default=False,
+    help="Suppress progress output (for automation)",
+)
+def backup_full(dest_dir, batch_size, resume, quiet):
     """Create a full backup including database and attachments using streaming"""

Then conditionally print progress messages based on the quiet flag.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b64efb2 and 63c2bed.

📒 Files selected for processing (1)
  • cli/clica.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cli/clica.py (2)
backend/serdes/views.py (8)
  • get (39-73)
  • get (251-303)
  • get (712-793)
  • post (176-247)
  • post (309-384)
  • post (398-703)
  • post (803-917)
  • post (927-1133)
backend/core/utils.py (1)
  • sha256 (37-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build (3.12)
  • GitHub Check: test (3.12)
  • GitHub Check: build_community_frontend
  • GitHub Check: build_enterprise_frontend
  • GitHub Check: startup-functional-test (3.12)
  • GitHub Check: startup-docker-compose-test
  • GitHub Check: enterprise-startup-functional-test (3.12)
  • GitHub Check: enterprise-startup-docker-compose-test
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
cli/clica.py (3)

762-773: LGTM: Hash verification streams files efficiently.

The hash verification correctly streams files in chunks and compares against the manifest, avoiding memory issues with large attachments.
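
That pattern reduces to a small helper like this (hypothetical names; clica.py's actual implementation may differ):

import hashlib
from pathlib import Path

def verify_file_hash(path: Path, expected_sha256: str, chunk_size: int = 1024 * 1024) -> bool:
    """Stream the file and compare its SHA-256 against the manifest value."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256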


509-531: LGTM: Resume logic correctly validates existing files.

The manifest loading opens the file in read mode and verifies that each downloaded file actually exists on disk before including it in the resume state, preventing issues with incomplete or missing downloads.
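
A sketch of that resume check (illustrative manifest keys, not necessarily the exact schema the CLI writes):

import json
from pathlib import Path

def load_completed_ids(manifest_path: Path, attachments_dir: Path) -> set[str]:
    """Resume state: only count manifest entries whose file is actually on disk."""
    if not manifest_path.exists():
        return set()
    with open(manifest_path, "r", encoding="utf-8") as f:
        manifest = json.load(f)
    return {
        entry["id"]
        for entry in manifest.get("downloaded", [])
        if (attachments_dir / entry["filename"]).exists()
    }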


615-680: LGTM: Streaming protocol correctly handles chunked transfers.

The binary protocol implementation properly:

  • Buffers chunks until complete blocks are available
  • Bounds header parsing to prevent excessive iteration
  • Handles decode errors gracefully
  • Updates the manifest incrementally
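
Roughly, that client-side loop reduces to the following (a simplified sketch with illustrative names, not the exact clica.py code):

import json
import struct

def iter_blocks(chunks, max_header_len=1024):
    """Reassemble [4-byte size][JSON header][payload] blocks from an HTTP chunk stream."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while len(buffer) >= 4:
            (total_size,) = struct.unpack(">I", buffer[:4])
            if len(buffer) < 4 + total_size:
                break  # wait for the rest of this block
            body = buffer[4 : 4 + total_size]
            buffer = buffer[4 + total_size :]
            # The header length is not encoded, so probe growing prefixes (bounded).
            for end in range(1, min(max_header_len, len(body)) + 1):
                try:
                    header = json.loads(body[:end].decode("utf-8"))
                except (ValueError, UnicodeDecodeError):
                    continue
                yield header, body[end:]
                break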


Development

Successfully merging this pull request may close these issues.

  • Restore is not possible
  • Import database fails if a user has a Personal Authentication Token set
