-
-
Notifications
You must be signed in to change notification settings - Fork 402
PICARD-2750: Handle invalid LINK frames when saving MP3 files #2630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
PICARD-2750: Handle invalid LINK frames when saving MP3 files #2630
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it should work, but I think the code can be simplified (too much code redundancy, too long methods).
Also please fix linting errors.
MP3 files with LINK frames that do not follow the ID3 specification cause mutagen exceptions when saving with ID3 v2.3 format. This fix handles invalid LINK frames to prevent errors during save operations.
6eee65a
to
ee68982
Compare
Thanks for your insights, @zas! I think that I've addressed your concerns for redundancy, as well fixed linting errors! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks much better after recent changes, few more comments though.
for frame_id, frame in malformed_link_frames: | ||
for url in self._extract_and_convert_link_frame(frame, frame_id): | ||
metadata.add('user_website', url) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I miss something but can't this method be simplified even more:
def _process_link_frames_on_load(self, tags, metadata):
"""Process malformed LINK frames during initial file load and add them as website URLs."""
for frame_id, frame in list(tags.items()):
if self._is_malformed_frame(frame_id, frame):
for url in self._extract_and_convert_link_frame(frame, frame_id):
metadata.add('user_website', url)
|
||
return confirmed_problematic_frames | ||
|
||
def _extract_and_convert_link_frame(self, frame, frame_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minor thing, but everywhere we have frame_id, frame
but here, perhaps swap arguments.
to_add = [] | ||
confirmed_problematic_frames = [] | ||
extracted_urls = [] | ||
for frame_id, frame in list(tags.items()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to convert tags.items()
to list
here?
Elsewhere tags
are modified during the loop, but not here.
@@ -1035,6 +1162,53 @@ def _remove_other_supported_tag(self, tags, real_name): | |||
"""Remove other supported tag from ID3 frames.""" | |||
del tags[real_name] | |||
|
|||
def _contains_url_pattern(self, text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it doesn't depend on self
it could be made a simple utility function at module level:
def _contains_url_pattern(text):
"""Check if the given text contains URL-like patterns."""
return any(term in text for term in ('http', 'www.', '://'))
It would shorten calls:
if part.strip() and self._contains_url_pattern(part):
would become:
if part.strip() and _contains_url_pattern(part):
If you prefer to keep it at class level, make it a static method at least (using @staticmethod
decoration), but I think the code would be more readable with shorter calls.
for frame_id in list(tags.keys()): | ||
if not (frame_id.isalnum() and frame_id.isupper()): | ||
log.warning("Removing invalid frame ID: %r", frame_id) | ||
del tags[frame_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phw Perhaps we should ask user about what do to in this case? Or is it ok to just delete invalid frames and move on?
We could also have an option for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An option seems a bit cleaner to me so that the user has some measure of control, but I don't have a feel for how common this situation is encountered. It may not be worth having another setting (that most users won't understand).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this could be an option. The question is however what we do, if the user disables the removal? We can't keep it as it is, as it will fail saving.
I actually also don't quite see why this code is part of the PR, it seems to try to solve an entirely different problem. Are there cases where the frame IDs are invalid like this?
|
||
log.warning("Could not extract URL from malformed LINK frame %r", frame_id) | ||
|
||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty broad exception capture. What exceptions do you anticipate could occur (and where), and could they be specifically captured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree fully - this is way way way way way way way too broad.
Overall I would really like to see all exception handling like this removed, and instead for all things that would throw exceptions to be handled properly in normal code, leaving exceptions to be exactly that i.e. exceptions!!
combined_url = ''.join(url_parts).strip(':;,. \t\n\r') | ||
|
||
if combined_url: | ||
url_match = re.search(r'(?:https?:?(?:/+|\\+)|www\.)[a-zA-Z0-9][-a-zA-Z0-9\.]+\.[a-zA-Z]{2,}(?:/[^\s:;,]*)?', combined_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question because I'm really not sure, but would it be more efficient if the search/match string was pre-compiled as a constant?
parts = frame_data.split('\x00') | ||
for part in parts: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment: What you have currently is likely clearer to read and understand, but since you only use parts
in the for
line, it could be shortened to for part in frame_data.split('\x00'):
.
"""Extract URL from a malformed LINK frame and convert it to a WXXX frame. Handles cases where URLs | ||
are split between frame ID and data or separated by null bytes (common for archive.org)""" | ||
|
||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does everything need to be included within the try:
block?
if self._is_malformed_frame(frame_id, frame): | ||
invalid_id = True | ||
for url in self._extract_and_convert_link_frame(frame, frame_id): | ||
log.debug("Successfully converted malformed LINK frame %r to URL: %r", frame_id, url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an unusual enough situation that it might warrant logging at one level above debug
(i.e. info
)?
invalid_id = False | ||
if '\x00' in frame_id: | ||
invalid_id = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be shortened to:
invalid_id = False | |
if '\x00' in frame_id: | |
invalid_id = True | |
invalid_id = '\x00' in frame_id |
for frame_id in list(tags.keys()): | ||
if not (frame_id.isalnum() and frame_id.isupper()): | ||
log.warning("Removing invalid frame ID: %r", frame_id) | ||
del tags[frame_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An option seems a bit cleaner to me so that the user has some measure of control, but I don't have a feel for how common this situation is encountered. It may not be worth having another setting (that most users won't understand).
@@ -170,6 +170,7 @@ class ID3File(File): | |||
'TSOT': 'titlesort', | |||
'WCOP': 'license', | |||
'WOAR': 'website', | |||
'WXXX': 'user_website', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to standardize on the use of this tag, is it something that should be documented in https://picard-docs.musicbrainz.org/en/appendices/tag_mapping.html and/or https://picard-docs.musicbrainz.org/en/variables/variables_other_information.html or should we remain silent about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this? I mean it creates complications and the best would to avoid it totally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is problematic for me for several reasons:
-
Perhaps I don't understand exactly what cases this is supposed to fix. Is it just the LINK tag, or all URL tags (W000-WZZZ + LINK according to the ID3 v2.4 spec)? Why does it process every tag and not just these?
-
Is this simply an issue over the misplacing of a single 0x00 byte or something more complicated. I cannot understand how it needs 200 lines of new code (which is c. 15% of the size of the existing file) just to handle a simple issue over a mis-placed 0x00 byte.
-
I have a real problem about the use of exceptions to handle failures in the new code rather than avoiding them by checks in the code, leaving exceptions to indicate that something unexpected has happened as a genuine exception to what is expected to happen. But if you are going to use exception handling, it needs to be confined to as few lines as possible and catch THE specific exception that is expected and not just any old exception that might happen.
I cannot really review the bulk of the code until points 1. and 2. are addressed - because this seems to be a massive sledgehammer for a very tiny nut.
|
||
try: | ||
self._sanitize_id3_frames(tags) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I have never liked the idea of using global exceptions as it can hide all sorts of unexpected issues.
Personally I would prefer that the
_sanitize_id3_frames
code to check for things that would cause exceptions and logs them. -
I think the methods should be
_sanitize_link_frames
since this is in id3.py (so id3 is implied) and it doesn't santitize any frames other than link frames.
tags.save(filename, v2_version=3, v1=v1, v23_sep=separator) | ||
try: | ||
tags.save(filename, v2_version=3, v1=v1, v23_sep=separator) | ||
except ValueError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike the idea of this try/except too.
Surely this is simply another form of sanitization that should be done before attempting to save the file.
|
||
log.warning("Could not extract URL from malformed LINK frame %r", frame_id) | ||
|
||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree fully - this is way way way way way way way too broad.
Overall I would really like to see all exception handling like this removed, and instead for all things that would throw exceptions to be handled properly in normal code, leaving exceptions to be exactly that i.e. exceptions!!
@@ -136,6 +136,7 @@ | |||
'totaltracks': N_('Total Tracks'), | |||
'tracknumber': N_('Track Number'), | |||
'website': N_('Artist Website'), | |||
'user_website': N_('User Defined Website'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about all the other non-ID3 formats - how will they handle this completely new tag?????
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this needs to be defined. We shouldn't just write USER_WEBSITE
to Vorbis tags, this seems odd. I actually wouldn't expect a fix for the LINK frame issue to introduce a new tag. If we add a new tag it should be well handled from the start. If in doubt I would rather not write a new tag to some formats initially until we have figured out a proper mapping.
else: | ||
tags.update_to_v24() | ||
tags.save(filename, v2_version=4, v1=v1) | ||
|
||
def _sanitize_id3_frames(self, tags): | ||
"""This method attempts to fix various issues with ID3 frames: | ||
1. Handle invalid LINK frames (convert to WXXX if possible) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I don't understand exactly what cases this is supposed to fix - so it might be helpful if it was spelled out in the comments here.
Perhaps the most obvious confusion is that this PR talks specifically about issues with the LINK tag (see section 4.20 of the ID3 2.4 spec but appears to be looking at all tags and not just LINK tags (and not even just tags starting with W (see section 4.3 of the ID3 2.4 spec which might potentially have this issue).
I would like to see the specific tags to be sanitized listed here and the specific issues and fixes spelled out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. I'd also prefer if this would fix the specific case we have.
Also something is strange here: For some reason this issue seems only to happen when saving ID3 v2.3, but not witth v2.4 (at least if what I wrote on the ticket is still true). Looking at the specs for "Linked information" in both ID3 v2.3 and v2.4 I don't really understand why it should be. The specs are very similar. According to the problem description (files containing a LINK frame without frame ID, only the null byte) the LINK frame would be invalid for both.
So maybe mutagen does something different here, but for some reason can successfully save it to ID3 v2.4.
@@ -170,6 +170,7 @@ class ID3File(File): | |||
'TSOT': 'titlesort', | |||
'WCOP': 'license', | |||
'WOAR': 'website', | |||
'WXXX': 'user_website', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for frame_id in list(tags.keys()): | ||
if not (frame_id.isalnum() and frame_id.isupper()): | ||
log.warning("Removing invalid frame ID: %r", frame_id) | ||
del tags[frame_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this could be an option. The question is however what we do, if the user disables the removal? We can't keep it as it is, as it will fail saving.
I actually also don't quite see why this code is part of the PR, it seems to try to solve an entirely different problem. Are there cases where the frame IDs are invalid like this?
else: | ||
tags.update_to_v24() | ||
tags.save(filename, v2_version=4, v1=v1) | ||
|
||
def _sanitize_id3_frames(self, tags): | ||
"""This method attempts to fix various issues with ID3 frames: | ||
1. Handle invalid LINK frames (convert to WXXX if possible) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. I'd also prefer if this would fix the specific case we have.
Also something is strange here: For some reason this issue seems only to happen when saving ID3 v2.3, but not witth v2.4 (at least if what I wrote on the ticket is still true). Looking at the specs for "Linked information" in both ID3 v2.3 and v2.4 I don't really understand why it should be. The specs are very similar. According to the problem description (files containing a LINK frame without frame ID, only the null byte) the LINK frame would be invalid for both.
So maybe mutagen does something different here, but for some reason can successfully save it to ID3 v2.4.
@@ -136,6 +136,7 @@ | |||
'totaltracks': N_('Total Tracks'), | |||
'tracknumber': N_('Track Number'), | |||
'website': N_('Artist Website'), | |||
'user_website': N_('User Defined Website'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this needs to be defined. We shouldn't just write USER_WEBSITE
to Vorbis tags, this seems odd. I actually wouldn't expect a fix for the LINK frame issue to introduce a new tag. If we add a new tag it should be well handled from the start. If in doubt I would rather not write a new tag to some formats initially until we have figured out a proper mapping.
@francescouteiro any update on this? |
Very sorry for the delay! Last week was my exams' week, so I just started looking into all the things mentioned here recently! I'll address them between today and this Friday! |
I'm again very sorry for the delay! I've been out of Wi-Fi at my address since Apr. 24 (plus no mobile data remaining), and with the recent power outage in Portugal it's still not solved. Thankfully, my data plan renewed today, so I can get back to this at 100%! TL;DR: This issue has not been forgotten or abandoned; I am still working on it! |
@Sophist-UK waves from the Algarve. As a fellow Portuguese resident, I can confirm just how big an impact the power outage has had. We also had an internet outage of over a week after a huge storm when multiple lightening strikes took out the radio broadband tower, and high winds took out multiple telegraph poles carrying the fibre optics. |
Summary
This is a…
Description: MP3 files with LINK frames that do not follow the ID3 specification cause mutagen exceptions when saving with ID3 v2.3 format. This fix handles invalid LINK frames to prevent errors during save operations.
Problem
When processing MP3 files containing LINK frames that don't follow the ID3 specification, Picard returns an error of "Invalid frame ID" exception when saving with ID3 v2.3 format enabled.
Files from archive.org (like https://archive.org/details/yallcomesoundrec00page) store a null byte followed by a URL in the LINK frame instead of the required frame ID followed by URL. This invalid format causes mutagen to raise an exception during the ID3 v2.3 downgrade process.
Current workarounds include setting ID3 format to v2.4 or enabling "Clear existing tags" to remove the problematic frames. This fix proposes a handle for malformed LINK frames, preventing exceptions when saving with ID3 v2.3.
Solution
This fix addresses the exception by detecting and handling malformed LINK frames in MP3 files during the ID3 v2.3 saving process. When an invalid LINK frame is encountered (containing a null byte followed by URL instead of proper frame ID), it's transformed into a WXXX frame (user-defined website metadata/label).
This approach preserves the URL data from the malformed LINK frames by converting them to the appropriate frame type for URLs, rather than simply discarding the information. The conversion only happens when saving in ID3 v2.3 format, which would otherwise cause errors with these malformed frames.
While not the most comprehensive solution, this approach specifically targets the common issue with archive.org files without requiring UI changes or affecting performance. A more robust solution as suggested in the community discussion could involve user prompts to review and decide on invalid tag data, but that would be a larger feature enhancement rather than a focused bug fix.
This implementation prioritizes stability for users of archive.org files while intelligently preserving the URL data in a proper ID3 frame format.
Action
Additional actions required: