Skip to content

Conversation

@chrismacdonaldw
Copy link
Contributor

@chrismacdonaldw chrismacdonaldw commented Jun 26, 2025

Summary by CodeRabbit

  • Refactor
    • Improved XML handling for datastream updates, resulting in better formatting and reliability.
    • Streamlined command-line interface with clearer argument names and options.
    • Enhanced error messages for easier troubleshooting.
    • Output now uses UTF-8 encoding with proper XML declaration.

@chrismacdonaldw chrismacdonaldw added the patch Backwards compatible bug fixes. label Jun 26, 2025
@coderabbitai
Copy link

coderabbitai bot commented Jun 26, 2025

Walkthrough

The script was refactored to use Python's built-in xml.etree.ElementTree instead of lxml for XML processing. Multiple functions were consolidated into a single function, and the command-line interface was updated with clearer argument names and improved error handling. The XML output formatting and datastream update logic were also revised.

Changes

File(s) Change Summary
scripts/datastream_updater.py Replaced lxml with xml.etree.ElementTree, consolidated multiple helper functions into update_foxml_datastream, updated CLI arguments, improved error handling and XML formatting, removed separate base64 encoding and XML namespace registration functions.

Poem

A hop and a skip through XML trees,
No more lxml, just built-ins with ease.
One function to rule the datastream's fate,
Arguments clearer, formatting first-rate.
With base64 lines tidy and neat,
This rabbit’s update is quite the treat! 🐇✨

Warning

Review ran into problems

🔥 Problems

Errors were encountered while retrieving linked issues.

Errors (1)
  • JIRA integration encountered authorization issues. Please disconnect and reconnect the integration in the CodeRabbit UI.
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/datastream_updater.py (1)

17-128: Consider refactoring this function to reduce complexity.

The function exceeds recommended complexity thresholds with 7 parameters, 34 local variables, and 69 statements. This makes it harder to maintain and test.

Consider extracting these responsibilities into separate functions:

  • File reading and Base64 encoding
  • Datastream element creation/finding
  • Version element creation with attributes
  • Binary content formatting

This would improve readability, testability, and adherence to the Single Responsibility Principle.

Would you like me to propose a refactored version that breaks this down into smaller, focused functions?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb0b497 and 9b4fbda.

📒 Files selected for processing (1)
  • scripts/datastream_updater.py (1 hunks)
🧰 Additional context used
🪛 Flake8 (7.2.0)
scripts/datastream_updater.py

[error] 12-12: expected 2 blank lines, found 1

(E302)


[error] 17-17: expected 2 blank lines, found 1

(E302)


[error] 82-82: indentation is not a multiple of 4

(E111)


[error] 82-82: over-indented

(E117)


[error] 84-84: indentation is not a multiple of 4

(E111)


[error] 84-84: over-indented

(E117)

🪛 Pylint (3.3.7)
scripts/datastream_updater.py

[refactor] 17-17: Too many arguments (7/5)

(R0913)


[refactor] 17-17: Too many positional arguments (7/5)

(R0917)


[refactor] 17-17: Too many local variables (34/15)

(R0914)


[refactor] 17-17: Too many statements (69/50)

(R0915)

🔇 Additional comments (2)
scripts/datastream_updater.py (2)

111-118: Well-implemented Base64 formatting.

The code correctly formats Base64 content with 76-character lines and precise indentation, maintaining consistency with FOXML formatting standards.


31-33: Good error handling implementation.

The code properly handles various error scenarios with informative messages:

  • File existence validation
  • XML parsing errors
  • File writing errors

This provides a good user experience with clear error reporting.

Also applies to: 44-49, 122-127

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/datastream_updater.py (1)

1-11: Add missing blank lines to comply with PEP 8.

The import section and namespace definitions look good functionally, but there are missing blank lines as flagged by static analysis.

Apply this diff to fix the formatting:

 import mimetypes
 from datetime import datetime
 import xml.etree.ElementTree as ET
+
+
 
 NAMESPACES = {
     'foxml': 'info:fedora/fedora-system:def/foxml#'
 }
+
+
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b4fbda and 7f7ae75.

📒 Files selected for processing (1)
  • scripts/datastream_updater.py (1 hunks)
🧰 Additional context used
🪛 Flake8 (7.2.0)
scripts/datastream_updater.py

[error] 12-12: expected 2 blank lines, found 1

(E302)


[error] 17-17: expected 2 blank lines, found 1

(E302)

🪛 Pylint (3.3.7)
scripts/datastream_updater.py

[refactor] 17-17: Too many arguments (7/5)

(R0913)


[refactor] 17-17: Too many positional arguments (7/5)

(R0917)


[refactor] 17-17: Too many local variables (34/15)

(R0914)


[refactor] 17-17: Too many statements (69/50)

(R0915)

🔇 Additional comments (7)
scripts/datastream_updater.py (7)

12-16: LGTM! Clean namespace registration helper.

The namespace registration function is well-implemented and properly documented. This will ensure clean XML output with proper namespace prefixes.


86-108: LGTM! Proper timestamp and attribute handling.

The timestamp generation with millisecond precision and the attribute setup for the datastream version are well-implemented. The MIME type auto-detection fallback is a nice touch.


109-121: Good base64 content formatting with proper indentation.

The implementation correctly splits base64 content into 76-character lines and applies proper indentation to preserve FOXML formatting style. This addresses the formatting requirements well.


122-128: LGTM! Proper error handling for file operations.

The XML writing with UTF-8 encoding and XML declaration is correct, and the error handling provides clear feedback to users.


130-172: Well-structured CLI with clear argument names.

The command-line interface improvements are excellent - the argument names are more explicit (--input-foxml, --output-foxml, --file) and the help text is comprehensive. The control group validation with choices is a good addition.


174-182: LGTM! Clean function call with keyword arguments.

The function call properly uses keyword arguments for clarity, making it easy to understand what each parameter represents.


51-85: Verify the indentation fixes were applied correctly.

The indentation on lines 82 and 84 appears to have been corrected from the previous review, but I want to confirm the formatting is consistent throughout this section.

#!/bin/bash
# Description: Check indentation consistency in the datastream handling section
# Expected: All lines should use 4-space indentation consistently

# Check for indentation issues around lines 82-84
sed -n '80,86p' scripts/datastream_updater.py | cat -A

Comment on lines +17 to 50
def update_foxml_datastream(input_path, output_path, dsid, content_file, label, mimetype, control_group):
"""
Compresses and encodes the binary data from the given file path.
Adds or replaces a datastream in a FOXML file with Base64 encoded content,
with precise indentation and multi-line formatting that preserves the original document's style.
Args:
file_path (str): The path to the file containing the binary data.
Returns:
tuple: A tuple containing the indented base64-encoded data and the original size of the binary data.
input_path (str): Path to the source FOXML file.
output_path (str): Path to save the modified FOXML file.
dsid (str): The ID of the datastream to add/update (e.g., 'OBJ', 'MODS').
content_file (str): Path to the file containing the new content.
label (str): The label for the new datastream version.
mimetype (str): The MIME type of the content file.
control_group (str): The control group for the datastream (e.g., 'M', 'X').
"""
with open(file_path, "rb") as f_in:
binary_data = f_in.read()
original_size = len(binary_data)
base64_data = base64.b64encode(binary_data)
base64_lines = [
base64_data[i : i + 80].decode("utf-8")
for i in range(0, len(base64_data), 80)
]
indented_base64 = "\n ".join(base64_lines)
return indented_base64, original_size


def register_namespaces(xml_path):
"""
Registers XML namespaces from the given XML file.
Args:
xml_path (str): The path to the XML file.
if not os.path.exists(content_file):
print(f"Error: Content file not found at '{content_file}'")
return

Raises:
Exception: If there is an error registering the namespaces.
"""
try:
namespaces = dict(
[node for _, node in ET.iterparse(xml_path, events=["start-ns"])]
)
for ns in namespaces:
ET.register_namespace(ns, namespaces[ns])
except Exception as e:
logging.error(f"Error registering namespaces: {e}")
raise


def add_datastream_version(
xml_path, dsid, base64_data, original_size, mimetype, label=None
):
"""
Adds a new version of a datastream to an XML file.
print(f"Reading content from '{content_file}'...")
with open(content_file, 'rb') as f:
binary_content_bytes = f.read()

encoded_content_string = base64.b64encode(binary_content_bytes).decode('ascii')
content_size = os.path.getsize(content_file)
print(f"Content read successfully. Original size: {content_size} bytes.")

Args:
xml_path (str): The path to the XML file.
dsid (str): The ID of the datastream.
base64_data (str): The base64-encoded content of the datastream.
original_size (int): The original size of the datastream in bytes.
mimetype (str): The MIME type of the datastream.
label (str, optional): The label for the datastream version. If not provided, a default label will be used.
Returns:
str: The XML string with the new datastream version added.
Raises:
ET.ParseError: If there is an error parsing the XML file.
Exception: If there is an error creating the XML string.
"""
register_namespaces()
try:
root = ET.parse(xml_path).getroot()
tree = ET.parse(input_path)
root = tree.getroot()
except ET.ParseError as e:
logging.exception(f"XML parsing error: {e}")
print(f"Error parsing XML file '{input_path}': {e}")
return

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider refactoring this function due to complexity.

The function signature and initial file handling logic are correct, but static analysis indicates this function has too many arguments (7/5), local variables (34/15), and statements (69/50), making it difficult to maintain.

Consider breaking this large function into smaller, focused functions:

  • File reading and encoding logic
  • XML parsing and datastream lookup
  • Datastream creation/update logic
  • XML output writing

This would improve readability, testability, and maintainability.

🧰 Tools
🪛 Flake8 (7.2.0)

[error] 17-17: expected 2 blank lines, found 1

(E302)

🪛 Pylint (3.3.7)

[refactor] 17-17: Too many arguments (7/5)

(R0913)


[refactor] 17-17: Too many positional arguments (7/5)

(R0917)


[refactor] 17-17: Too many local variables (34/15)

(R0914)


[refactor] 17-17: Too many statements (69/50)

(R0915)

🤖 Prompt for AI Agents
In scripts/datastream_updater.py around lines 17 to 50, the
update_foxml_datastream function is too complex with excessive arguments, local
variables, and statements. Refactor by splitting it into smaller functions: one
for reading and base64 encoding the content file, another for parsing the XML
and locating the datastream, a third for creating or updating the datastream
element, and a final one for writing the modified XML output. This modular
approach will reduce complexity and improve readability and maintainability.

@chrismacdonaldw chrismacdonaldw merged commit c76bf27 into main Jun 26, 2025
2 checks passed
@chrismacdonaldw chrismacdonaldw deleted the JIM-45 branch June 26, 2025 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

patch Backwards compatible bug fixes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants