-
Notifications
You must be signed in to change notification settings - Fork 2
JIM-45: New script to replace datastreams. #30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe script was refactored to use Python's built-in Changes
Poem
Warning Review ran into problems🔥 ProblemsErrors were encountered while retrieving linked issues. Errors (1)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/datastream_updater.py (1)
17-128: Consider refactoring this function to reduce complexity.The function exceeds recommended complexity thresholds with 7 parameters, 34 local variables, and 69 statements. This makes it harder to maintain and test.
Consider extracting these responsibilities into separate functions:
- File reading and Base64 encoding
- Datastream element creation/finding
- Version element creation with attributes
- Binary content formatting
This would improve readability, testability, and adherence to the Single Responsibility Principle.
Would you like me to propose a refactored version that breaks this down into smaller, focused functions?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/datastream_updater.py(1 hunks)
🧰 Additional context used
🪛 Flake8 (7.2.0)
scripts/datastream_updater.py
[error] 12-12: expected 2 blank lines, found 1
(E302)
[error] 17-17: expected 2 blank lines, found 1
(E302)
[error] 82-82: indentation is not a multiple of 4
(E111)
[error] 82-82: over-indented
(E117)
[error] 84-84: indentation is not a multiple of 4
(E111)
[error] 84-84: over-indented
(E117)
🪛 Pylint (3.3.7)
scripts/datastream_updater.py
[refactor] 17-17: Too many arguments (7/5)
(R0913)
[refactor] 17-17: Too many positional arguments (7/5)
(R0917)
[refactor] 17-17: Too many local variables (34/15)
(R0914)
[refactor] 17-17: Too many statements (69/50)
(R0915)
🔇 Additional comments (2)
scripts/datastream_updater.py (2)
111-118: Well-implemented Base64 formatting.The code correctly formats Base64 content with 76-character lines and precise indentation, maintaining consistency with FOXML formatting standards.
31-33: Good error handling implementation.The code properly handles various error scenarios with informative messages:
- File existence validation
- XML parsing errors
- File writing errors
This provides a good user experience with clear error reporting.
Also applies to: 44-49, 122-127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/datastream_updater.py (1)
1-11: Add missing blank lines to comply with PEP 8.The import section and namespace definitions look good functionally, but there are missing blank lines as flagged by static analysis.
Apply this diff to fix the formatting:
import mimetypes from datetime import datetime import xml.etree.ElementTree as ET + + NAMESPACES = { 'foxml': 'info:fedora/fedora-system:def/foxml#' } + +
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/datastream_updater.py(1 hunks)
🧰 Additional context used
🪛 Flake8 (7.2.0)
scripts/datastream_updater.py
[error] 12-12: expected 2 blank lines, found 1
(E302)
[error] 17-17: expected 2 blank lines, found 1
(E302)
🪛 Pylint (3.3.7)
scripts/datastream_updater.py
[refactor] 17-17: Too many arguments (7/5)
(R0913)
[refactor] 17-17: Too many positional arguments (7/5)
(R0917)
[refactor] 17-17: Too many local variables (34/15)
(R0914)
[refactor] 17-17: Too many statements (69/50)
(R0915)
🔇 Additional comments (7)
scripts/datastream_updater.py (7)
12-16: LGTM! Clean namespace registration helper.The namespace registration function is well-implemented and properly documented. This will ensure clean XML output with proper namespace prefixes.
86-108: LGTM! Proper timestamp and attribute handling.The timestamp generation with millisecond precision and the attribute setup for the datastream version are well-implemented. The MIME type auto-detection fallback is a nice touch.
109-121: Good base64 content formatting with proper indentation.The implementation correctly splits base64 content into 76-character lines and applies proper indentation to preserve FOXML formatting style. This addresses the formatting requirements well.
122-128: LGTM! Proper error handling for file operations.The XML writing with UTF-8 encoding and XML declaration is correct, and the error handling provides clear feedback to users.
130-172: Well-structured CLI with clear argument names.The command-line interface improvements are excellent - the argument names are more explicit (
--input-foxml,--output-foxml,--file) and the help text is comprehensive. The control group validation with choices is a good addition.
174-182: LGTM! Clean function call with keyword arguments.The function call properly uses keyword arguments for clarity, making it easy to understand what each parameter represents.
51-85: Verify the indentation fixes were applied correctly.The indentation on lines 82 and 84 appears to have been corrected from the previous review, but I want to confirm the formatting is consistent throughout this section.
#!/bin/bash # Description: Check indentation consistency in the datastream handling section # Expected: All lines should use 4-space indentation consistently # Check for indentation issues around lines 82-84 sed -n '80,86p' scripts/datastream_updater.py | cat -A
| def update_foxml_datastream(input_path, output_path, dsid, content_file, label, mimetype, control_group): | ||
| """ | ||
| Compresses and encodes the binary data from the given file path. | ||
| Adds or replaces a datastream in a FOXML file with Base64 encoded content, | ||
| with precise indentation and multi-line formatting that preserves the original document's style. | ||
| Args: | ||
| file_path (str): The path to the file containing the binary data. | ||
| Returns: | ||
| tuple: A tuple containing the indented base64-encoded data and the original size of the binary data. | ||
| input_path (str): Path to the source FOXML file. | ||
| output_path (str): Path to save the modified FOXML file. | ||
| dsid (str): The ID of the datastream to add/update (e.g., 'OBJ', 'MODS'). | ||
| content_file (str): Path to the file containing the new content. | ||
| label (str): The label for the new datastream version. | ||
| mimetype (str): The MIME type of the content file. | ||
| control_group (str): The control group for the datastream (e.g., 'M', 'X'). | ||
| """ | ||
| with open(file_path, "rb") as f_in: | ||
| binary_data = f_in.read() | ||
| original_size = len(binary_data) | ||
| base64_data = base64.b64encode(binary_data) | ||
| base64_lines = [ | ||
| base64_data[i : i + 80].decode("utf-8") | ||
| for i in range(0, len(base64_data), 80) | ||
| ] | ||
| indented_base64 = "\n ".join(base64_lines) | ||
| return indented_base64, original_size | ||
|
|
||
|
|
||
| def register_namespaces(xml_path): | ||
| """ | ||
| Registers XML namespaces from the given XML file. | ||
| Args: | ||
| xml_path (str): The path to the XML file. | ||
| if not os.path.exists(content_file): | ||
| print(f"Error: Content file not found at '{content_file}'") | ||
| return | ||
|
|
||
| Raises: | ||
| Exception: If there is an error registering the namespaces. | ||
| """ | ||
| try: | ||
| namespaces = dict( | ||
| [node for _, node in ET.iterparse(xml_path, events=["start-ns"])] | ||
| ) | ||
| for ns in namespaces: | ||
| ET.register_namespace(ns, namespaces[ns]) | ||
| except Exception as e: | ||
| logging.error(f"Error registering namespaces: {e}") | ||
| raise | ||
|
|
||
|
|
||
| def add_datastream_version( | ||
| xml_path, dsid, base64_data, original_size, mimetype, label=None | ||
| ): | ||
| """ | ||
| Adds a new version of a datastream to an XML file. | ||
| print(f"Reading content from '{content_file}'...") | ||
| with open(content_file, 'rb') as f: | ||
| binary_content_bytes = f.read() | ||
|
|
||
| encoded_content_string = base64.b64encode(binary_content_bytes).decode('ascii') | ||
| content_size = os.path.getsize(content_file) | ||
| print(f"Content read successfully. Original size: {content_size} bytes.") | ||
|
|
||
| Args: | ||
| xml_path (str): The path to the XML file. | ||
| dsid (str): The ID of the datastream. | ||
| base64_data (str): The base64-encoded content of the datastream. | ||
| original_size (int): The original size of the datastream in bytes. | ||
| mimetype (str): The MIME type of the datastream. | ||
| label (str, optional): The label for the datastream version. If not provided, a default label will be used. | ||
| Returns: | ||
| str: The XML string with the new datastream version added. | ||
| Raises: | ||
| ET.ParseError: If there is an error parsing the XML file. | ||
| Exception: If there is an error creating the XML string. | ||
| """ | ||
| register_namespaces() | ||
| try: | ||
| root = ET.parse(xml_path).getroot() | ||
| tree = ET.parse(input_path) | ||
| root = tree.getroot() | ||
| except ET.ParseError as e: | ||
| logging.exception(f"XML parsing error: {e}") | ||
| print(f"Error parsing XML file '{input_path}': {e}") | ||
| return | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider refactoring this function due to complexity.
The function signature and initial file handling logic are correct, but static analysis indicates this function has too many arguments (7/5), local variables (34/15), and statements (69/50), making it difficult to maintain.
Consider breaking this large function into smaller, focused functions:
- File reading and encoding logic
- XML parsing and datastream lookup
- Datastream creation/update logic
- XML output writing
This would improve readability, testability, and maintainability.
🧰 Tools
🪛 Flake8 (7.2.0)
[error] 17-17: expected 2 blank lines, found 1
(E302)
🪛 Pylint (3.3.7)
[refactor] 17-17: Too many arguments (7/5)
(R0913)
[refactor] 17-17: Too many positional arguments (7/5)
(R0917)
[refactor] 17-17: Too many local variables (34/15)
(R0914)
[refactor] 17-17: Too many statements (69/50)
(R0915)
🤖 Prompt for AI Agents
In scripts/datastream_updater.py around lines 17 to 50, the
update_foxml_datastream function is too complex with excessive arguments, local
variables, and statements. Refactor by splitting it into smaller functions: one
for reading and base64 encoding the content file, another for parsing the XML
and locating the datastream, a third for creating or updating the datastream
element, and a final one for writing the modified XML output. This modular
approach will reduce complexity and improve readability and maintainability.
Summary by CodeRabbit