JIM-45: New script to replace datastreams. #30

chrismacdonaldw · 2025-06-26T01:53:57Z

Summary by CodeRabbit

Refactor
- Improved XML handling for datastream updates, resulting in better formatting and reliability.
- Streamlined command-line interface with clearer argument names and options.
- Enhanced error messages for easier troubleshooting.
- Output now uses UTF-8 encoding with proper XML declaration.

coderabbitai · 2025-06-26T01:54:15Z

Walkthrough

The script was refactored to use Python's built-in xml.etree.ElementTree instead of lxml for XML processing. Multiple functions were consolidated into a single function, and the command-line interface was updated with clearer argument names and improved error handling. The XML output formatting and datastream update logic were also revised.
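As a rough illustration of the standard-library approach described in the walkthrough, the sketch below parses a FOXML fragment with `xml.etree.ElementTree` and looks up a datastream by ID. The sample document and the `find_datastream` helper are assumptions made for this example, not code taken from the PR; only the `NAMESPACES` mapping appears in the reviewed file.

```python
import xml.etree.ElementTree as ET

# Namespace map matching the NAMESPACES constant quoted in the review.
NAMESPACES = {"foxml": "info:fedora/fedora-system:def/foxml#"}

# Hypothetical minimal FOXML document, for illustration only.
SAMPLE_FOXML = """<foxml:digitalObject
    xmlns:foxml="info:fedora/fedora-system:def/foxml#" PID="demo:1">
  <foxml:datastream ID="MODS" CONTROL_GROUP="X" STATE="A" VERSIONABLE="true"/>
  <foxml:datastream ID="OBJ" CONTROL_GROUP="M" STATE="A" VERSIONABLE="true"/>
</foxml:digitalObject>
"""


def find_datastream(root, dsid):
    """Return the foxml:datastream element with the given ID, or None."""
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    return root.find(f"foxml:datastream[@ID='{dsid}']", NAMESPACES)


root = ET.fromstring(SAMPLE_FOXML)
obj = find_datastream(root, "OBJ")
print(obj.get("CONTROL_GROUP"))
```

Passing the prefix map to `find` keeps the query readable without depending on lxml.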

Changes

File(s)	Change Summary
scripts/datastream_updater.py	Replaced `lxml` with `xml.etree.ElementTree`, consolidated multiple helper functions into `update_foxml_datastream`, updated CLI arguments, improved error handling and XML formatting, removed separate base64 encoding and XML namespace registration functions.

Poem

A hop and a skip through XML trees,
No more lxml, just built-ins with ease.
One function to rule the datastream's fate,
Arguments clearer, formatting first-rate.
With base64 lines tidy and neat,
This rabbit’s update is quite the treat! 🐇✨

Warning

Review ran into problems

🔥 Problems

Errors were encountered while retrieving linked issues.

Errors (1)

JIRA integration encountered authorization issues. Please disconnect and reconnect the integration in the CodeRabbit UI.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

scripts/datastream_updater.py (1)

17-128: Consider refactoring this function to reduce complexity.

The function exceeds recommended complexity thresholds with 7 parameters, 34 local variables, and 69 statements. This makes it harder to maintain and test.

Consider extracting these responsibilities into separate functions:

File reading and Base64 encoding

Datastream element creation/finding

Version element creation with attributes

Binary content formatting

This would improve readability, testability, and adherence to the Single Responsibility Principle.

Would you like me to propose a refactored version that breaks this down into smaller, focused functions?
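One possible shape for the split suggested above, as a sketch only. The helper names are hypothetical, and the attribute sets (`STATE`, `VERSIONABLE`, the `DSID.N` version ID scheme) are assumptions based on common FOXML usage rather than on the PR's actual code.

```python
import base64
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
NAMESPACES = {"foxml": FOXML_NS}


def read_and_encode(content_file):
    """File reading and Base64 encoding: return (encoded text, size in bytes)."""
    with open(content_file, "rb") as f:
        data = f.read()
    return base64.b64encode(data).decode("ascii"), len(data)


def find_or_create_datastream(root, dsid, control_group):
    """Datastream element creation/finding."""
    ds = root.find(f"foxml:datastream[@ID='{dsid}']", NAMESPACES)
    if ds is None:
        # Assumed default attributes for a new datastream.
        ds = ET.SubElement(root, f"{{{FOXML_NS}}}datastream", {
            "ID": dsid,
            "CONTROL_GROUP": control_group,
            "STATE": "A",
            "VERSIONABLE": "true",
        })
    return ds


def add_version(ds, dsid, label, mimetype, size):
    """Version element creation with attributes."""
    index = len(ds.findall("foxml:datastreamVersion", NAMESPACES))
    return ET.SubElement(ds, f"{{{FOXML_NS}}}datastreamVersion", {
        "ID": f"{dsid}.{index}",  # assumed numbering scheme
        "LABEL": label,
        "MIMETYPE": mimetype,
        "SIZE": str(size),
    })


def set_binary_content(version, encoded):
    """Binary content attachment (line wrapping omitted in this sketch)."""
    content = ET.SubElement(version, f"{{{FOXML_NS}}}binaryContent")
    content.text = encoded
    return content
```

Each helper can then be tested on its own, and the top-level function reduces to calling them in sequence.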

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb0b497 and 9b4fbda.

📒 Files selected for processing (1)

scripts/datastream_updater.py (1 hunks)

🧰 Additional context used

🪛 Flake8 (7.2.0)

scripts/datastream_updater.py

[error] 12-12: expected 2 blank lines, found 1

(E302)

[error] 17-17: expected 2 blank lines, found 1

(E302)

[error] 82-82: indentation is not a multiple of 4

(E111)

[error] 82-82: over-indented

(E117)

[error] 84-84: indentation is not a multiple of 4

(E111)

[error] 84-84: over-indented

(E117)

🪛 Pylint (3.3.7)

scripts/datastream_updater.py

[refactor] 17-17: Too many arguments (7/5)

(R0913)

[refactor] 17-17: Too many positional arguments (7/5)

(R0917)

[refactor] 17-17: Too many local variables (34/15)

(R0914)

[refactor] 17-17: Too many statements (69/50)

(R0915)

🔇 Additional comments (2)

scripts/datastream_updater.py (2)

111-118: Well-implemented Base64 formatting.

The code correctly formats Base64 content with 76-character lines and precise indentation, maintaining consistency with FOXML formatting standards.
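For reference, wrapping Base64 text at 76 characters with a fixed indent can be sketched as follows. The function name `wrap_base64` and the indent argument are assumptions for illustration; the script's own indentation logic is not reproduced here.

```python
import base64


def wrap_base64(data, indent="", width=76):
    """Encode bytes as Base64 and split the text into indented lines."""
    encoded = base64.b64encode(data).decode("ascii")
    # Slice the encoded string into chunks of at most `width` characters.
    lines = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    return "\n".join(indent + line for line in lines)


wrapped = wrap_base64(b"x" * 100, indent="    ")
print(wrapped)
```

Decoders that follow the MIME convention ignore the line breaks, so the wrapped text still round-trips once whitespace is stripped.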

31-33: Good error handling implementation.

The code properly handles various error scenarios with informative messages:

File existence validation

XML parsing errors

File writing errors

This provides a good user experience with clear error reporting.

Also applies to: 44-49, 122-127
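The three error paths listed above could look roughly like this. `load_foxml` and `save_foxml` are hypothetical helper names; the message wording loosely follows the `print` calls visible in the diff.

```python
import os
import xml.etree.ElementTree as ET


def load_foxml(input_path):
    """Return a parsed ElementTree, or None after printing an error."""
    if not os.path.exists(input_path):
        print(f"Error: FOXML file not found at '{input_path}'")
        return None
    try:
        return ET.parse(input_path)
    except ET.ParseError as e:
        print(f"Error parsing XML file '{input_path}': {e}")
        return None


def save_foxml(tree, output_path):
    """Write the tree as UTF-8 with an XML declaration; return True on success."""
    try:
        tree.write(output_path, encoding="utf-8", xml_declaration=True)
    except OSError as e:
        print(f"Error writing XML file '{output_path}': {e}")
        return False
    return True
```

Returning `None` or `False` mirrors the early `return` style in the diff; raising exceptions instead would be an equally reasonable design.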

scripts/datastream_updater.py

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

scripts/datastream_updater.py (1)

1-11: Add missing blank lines to comply with PEP 8.

The import section and namespace definitions look good functionally, but there are missing blank lines as flagged by static analysis.

Apply this diff to fix the formatting:
 import mimetypes
 from datetime import datetime
 import xml.etree.ElementTree as ET
+
+
 
 NAMESPACES = {
     'foxml': 'info:fedora/fedora-system:def/foxml#'
 }
+
+

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b4fbda and 7f7ae75.

📒 Files selected for processing (1)

scripts/datastream_updater.py (1 hunks)

🧰 Additional context used

🪛 Flake8 (7.2.0)

scripts/datastream_updater.py

[error] 12-12: expected 2 blank lines, found 1

(E302)

[error] 17-17: expected 2 blank lines, found 1

(E302)

🪛 Pylint (3.3.7)

scripts/datastream_updater.py

[refactor] 17-17: Too many arguments (7/5)

(R0913)

[refactor] 17-17: Too many positional arguments (7/5)

(R0917)

[refactor] 17-17: Too many local variables (34/15)

(R0914)

[refactor] 17-17: Too many statements (69/50)

(R0915)

🔇 Additional comments (7)

scripts/datastream_updater.py (7)

12-16: LGTM! Clean namespace registration helper.

The namespace registration function is well-implemented and properly documented. This will ensure clean XML output with proper namespace prefixes.
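A minimal sketch of what such a helper might look like, assuming it simply registers each entry of `NAMESPACES`; the body of the PR's `register_namespaces()` is not shown in this review, so this is a guess at its shape. Without registration, ElementTree serializes unregistered namespaces with generated prefixes such as `ns0`.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {"foxml": "info:fedora/fedora-system:def/foxml#"}


def register_namespaces():
    """Register each prefix so serialized output uses it (assumed behavior)."""
    for prefix, uri in NAMESPACES.items():
        # register_namespace affects serialization globally for this process.
        ET.register_namespace(prefix, uri)


register_namespaces()
element = ET.Element(f"{{{NAMESPACES['foxml']}}}datastream", {"ID": "OBJ"})
serialized = ET.tostring(element, encoding="unicode")
print(serialized)
```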

86-108: LGTM! Proper timestamp and attribute handling.

The timestamp generation with millisecond precision and the attribute setup for the datastream version are well-implemented. The MIME type auto-detection fallback is a nice touch.
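The two behaviors praised here can be sketched as below. Both function names, the trailing `Z`, and the `application/octet-stream` default are assumptions for this example; the script may format or fall back differently.

```python
import mimetypes
from datetime import datetime, timezone


def foxml_timestamp(now=None):
    """Format a UTC timestamp with millisecond precision."""
    now = now or datetime.now(timezone.utc)
    # strftime has no millisecond directive, so derive it from microseconds.
    return now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"


def resolve_mimetype(path, explicit=None, default="application/octet-stream"):
    """Prefer an explicit MIME type, else guess from the file extension."""
    if explicit:
        return explicit
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or default


print(foxml_timestamp())
print(resolve_mimetype("notes.txt"))
```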

109-121: Good base64 content formatting with proper indentation.

The implementation correctly splits base64 content into 76-character lines and applies proper indentation to preserve FOXML formatting style. This addresses the formatting requirements well.

122-128: LGTM! Proper error handling for file operations.

The XML writing with UTF-8 encoding and XML declaration is correct, and the error handling provides clear feedback to users.
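To see what the UTF-8 output with a declaration looks like, the following sketch serializes a small tree to an in-memory buffer. The `serialize` helper and the `demo:1` PID are hypothetical; `ElementTree.write` with `encoding` and `xml_declaration` is the standard-library call.

```python
import io
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"


def serialize(tree):
    """Serialize an ElementTree to UTF-8 bytes with an XML declaration."""
    buffer = io.BytesIO()
    tree.write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()


ET.register_namespace("foxml", FOXML_NS)
root = ET.Element(f"{{{FOXML_NS}}}digitalObject", {"PID": "demo:1"})  # hypothetical PID
output = serialize(ET.ElementTree(root))
first_line = output.split(b"\n", 1)[0]
print(first_line.decode("utf-8"))
```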

130-172: Well-structured CLI with clear argument names.

The command-line interface improvements are excellent - the argument names are more explicit (--input-foxml, --output-foxml, --file) and the help text is comprehensive. The control group validation with choices is a good addition.
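A sketch of a CLI along these lines, using `argparse`. Only `--input-foxml`, `--output-foxml`, `--file`, and the use of `choices` for the control group are mentioned in the review; the remaining option names, defaults, and the exact set of choices are assumptions.

```python
import argparse


def build_parser():
    """Build an argument parser resembling the CLI described in the review."""
    parser = argparse.ArgumentParser(
        description="Add or replace a datastream in a FOXML file.")
    parser.add_argument("--input-foxml", required=True,
                        help="Path to the source FOXML file.")
    parser.add_argument("--output-foxml", required=True,
                        help="Path to save the modified FOXML file.")
    parser.add_argument("--file", required=True,
                        help="Path to the file containing the new content.")
    # The options below are hypothetical names for the remaining parameters.
    parser.add_argument("--dsid", required=True,
                        help="Datastream ID, e.g. OBJ or MODS.")
    parser.add_argument("--label", default=None)
    parser.add_argument("--mimetype", default=None)
    parser.add_argument("--control-group", choices=["M", "X"], default="M",
                        help="Control group for the datastream.")
    return parser


args = build_parser().parse_args([
    "--input-foxml", "in.xml", "--output-foxml", "out.xml",
    "--file", "scan.pdf", "--dsid", "OBJ",
])
print(args.input_foxml, args.control_group)
```

With `choices`, an unsupported control group is rejected by `argparse` before any file is touched.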

174-182: LGTM! Clean function call with keyword arguments.

The function call properly uses keyword arguments for clarity, making it easy to understand what each parameter represents.

51-85: Verify the indentation fixes were applied correctly.

The indentation on lines 82 and 84 appears to have been corrected from the previous review, but I want to confirm the formatting is consistent throughout this section.
#!/bin/bash
# Description: Check indentation consistency in the datastream handling section
# Expected: All lines should use 4-space indentation consistently

# Check for indentation issues around lines 82-84
sed -n '80,86p' scripts/datastream_updater.py | cat -A

coderabbitai · 2025-06-26T02:02:37Z

scripts/datastream_updater.py

+def update_foxml_datastream(input_path, output_path, dsid, content_file, label, mimetype, control_group):
    """
-    Compresses and encodes the binary data from the given file path.
+    Adds or replaces a datastream in a FOXML file with Base64 encoded content,
+    with precise indentation and multi-line formatting that preserves the original document's style.

    Args:
-      file_path (str): The path to the file containing the binary data.
-
-    Returns:
-      tuple: A tuple containing the indented base64-encoded data and the original size of the binary data.
+        input_path (str): Path to the source FOXML file.
+        output_path (str): Path to save the modified FOXML file.
+        dsid (str): The ID of the datastream to add/update (e.g., 'OBJ', 'MODS').
+        content_file (str): Path to the file containing the new content.
+        label (str): The label for the new datastream version.
+        mimetype (str): The MIME type of the content file.
+        control_group (str): The control group for the datastream (e.g., 'M', 'X').
    """
-    with open(file_path, "rb") as f_in:
-        binary_data = f_in.read()
-        original_size = len(binary_data)
-        base64_data = base64.b64encode(binary_data)
-        base64_lines = [
-            base64_data[i : i + 80].decode("utf-8")
-            for i in range(0, len(base64_data), 80)
-        ]
-        indented_base64 = "\n              ".join(base64_lines)
-        return indented_base64, original_size
-
-
-def register_namespaces(xml_path):
-    """
-    Registers XML namespaces from the given XML file.
-
-    Args:
-      xml_path (str): The path to the XML file.
+    if not os.path.exists(content_file):
+        print(f"Error: Content file not found at '{content_file}'")
+        return

-    Raises:
-      Exception: If there is an error registering the namespaces.
-    """
-    try:
-        namespaces = dict(
-            [node for _, node in ET.iterparse(xml_path, events=["start-ns"])]
-        )
-        for ns in namespaces:
-            ET.register_namespace(ns, namespaces[ns])
-    except Exception as e:
-        logging.error(f"Error registering namespaces: {e}")
-        raise
-
-
-def add_datastream_version(
-    xml_path, dsid, base64_data, original_size, mimetype, label=None
-):
-    """
-    Adds a new version of a datastream to an XML file.
+    print(f"Reading content from '{content_file}'...")
+    with open(content_file, 'rb') as f:
+        binary_content_bytes = f.read()
+
+    encoded_content_string = base64.b64encode(binary_content_bytes).decode('ascii')
+    content_size = os.path.getsize(content_file)
+    print(f"Content read successfully. Original size: {content_size} bytes.")

-    Args:
-      xml_path (str): The path to the XML file.
-      dsid (str): The ID of the datastream.
-      base64_data (str): The base64-encoded content of the datastream.
-      original_size (int): The original size of the datastream in bytes.
-      mimetype (str): The MIME type of the datastream.
-      label (str, optional): The label for the datastream version. If not provided, a default label will be used.
-
-    Returns:
-      str: The XML string with the new datastream version added.
-
-    Raises:
-      ET.ParseError: If there is an error parsing the XML file.
-      Exception: If there is an error creating the XML string.
-    """
+    register_namespaces()
    try:
-        root = ET.parse(xml_path).getroot()
+        tree = ET.parse(input_path)
+        root = tree.getroot()
    except ET.ParseError as e:
-        logging.exception(f"XML parsing error: {e}")
+        print(f"Error parsing XML file '{input_path}': {e}")
        return



🛠️ Refactor suggestion

Consider refactoring this function due to complexity.

The function signature and initial file handling logic are correct, but static analysis indicates this function has too many arguments (7/5), local variables (34/15), and statements (69/50), making it difficult to maintain.

Consider breaking this large function into smaller, focused functions:

File reading and encoding logic

XML parsing and datastream lookup

Datastream creation/update logic

XML output writing

This would improve readability, testability, and maintainability.
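Separately from splitting the function, the argument count flagged by Pylint (7/5) could be reduced by grouping the datastream parameters into one object. The `DatastreamSpec` class and `plan_update` function below are hypothetical and only illustrate the idea; the field names follow the docstring in the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatastreamSpec:
    """Bundle of the per-datastream parameters from the original signature."""
    dsid: str
    content_file: str
    label: str
    mimetype: str
    control_group: str = "M"  # assumed default


def plan_update(input_path, output_path, spec):
    """Three parameters instead of seven; returns a summary of the update."""
    return (f"{input_path} -> {output_path}: datastream {spec.dsid} "
            f"({spec.mimetype}, control group {spec.control_group}) "
            f"from {spec.content_file}")


spec = DatastreamSpec(dsid="OBJ", content_file="scan.pdf",
                      label="Scanned page", mimetype="application/pdf")
print(plan_update("in.xml", "out.xml", spec))
```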

🧰 Tools

🪛 Flake8 (7.2.0)

[error] 17-17: expected 2 blank lines, found 1

(E302)

🪛 Pylint (3.3.7)

[refactor] 17-17: Too many arguments (7/5)

(R0913)

[refactor] 17-17: Too many positional arguments (7/5)

(R0917)

[refactor] 17-17: Too many local variables (34/15)

(R0914)

[refactor] 17-17: Too many statements (69/50)

(R0915)

🤖 Prompt for AI Agents

In scripts/datastream_updater.py around lines 17 to 50, the update_foxml_datastream function is too complex with excessive arguments, local variables, and statements. Refactor by splitting it into smaller functions: one for reading and base64 encoding the content file, another for parsing the XML and locating the datastream, a third for creating or updating the datastream element, and a final one for writing the modified XML output. This modular approach will reduce complexity and improve readability and maintainability.

New script to replace datastreams.

9b4fbda

chrismacdonaldw added the "patch" label (Backwards compatible bug fixes.) Jun 26, 2025

coderabbitai bot reviewed Jun 26, 2025

scripts/datastream_updater.py (outdated, resolved)

Fix some bad indentation.

7f7ae75

coderabbitai bot reviewed Jun 26, 2025

JojoVes approved these changes Jun 26, 2025

chrismacdonaldw merged commit c76bf27 into main Jun 26, 2025
2 checks passed

chrismacdonaldw deleted the JIM-45 branch June 26, 2025 17:37
