Skip to content

PICARD-2750: Handle invalid LINK frames when saving MP3 files #2630

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

francescouteiro
Copy link

Summary

  • This is a…

    • Bug fix
    • Feature addition
    • Refactoring
    • Minor / simple change (like a typo)
    • Other
  • Description: MP3 files with LINK frames that do not follow the ID3 specification cause mutagen exceptions when saving with ID3 v2.3 format. This fix handles invalid LINK frames to prevent errors during save operations.

Problem

When processing MP3 files containing LINK frames that don't follow the ID3 specification, Picard returns an error of "Invalid frame ID" exception when saving with ID3 v2.3 format enabled.

Files from archive.org (like https://archive.org/details/yallcomesoundrec00page) store a null byte followed by a URL in the LINK frame instead of the required frame ID followed by URL. This invalid format causes mutagen to raise an exception during the ID3 v2.3 downgrade process.

Current workarounds include setting ID3 format to v2.4 or enabling "Clear existing tags" to remove the problematic frames. This fix proposes a handle for malformed LINK frames, preventing exceptions when saving with ID3 v2.3.

Solution

This fix addresses the exception by detecting and handling malformed LINK frames in MP3 files during the ID3 v2.3 saving process. When an invalid LINK frame is encountered (containing a null byte followed by URL instead of proper frame ID), it's transformed into a WXXX frame (user-defined website metadata/label).

This approach preserves the URL data from the malformed LINK frames by converting them to the appropriate frame type for URLs, rather than simply discarding the information. The conversion only happens when saving in ID3 v2.3 format, which would otherwise cause errors with these malformed frames.

While not the most comprehensive solution, this approach specifically targets the common issue with archive.org files without requiring UI changes or affecting performance. A more robust solution as suggested in the community discussion could involve user prompts to review and decide on invalid tag data, but that would be a larger feature enhancement rather than a focused bug fix.

This implementation prioritizes stability for users of archive.org files while intelligently preserving the URL data in a proper ID3 frame format.

Action

Additional actions required:

  • Update Picard documentation (please include a reference to this PR)
  • Other (please specify below)

Copy link
Collaborator

@zas zas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it should work, but I think the code can be simplified (too much code redundancy, too long methods).

Also please fix linting errors.

@zas zas requested a review from phw April 7, 2025 14:04
MP3 files with LINK frames that do not follow the ID3 specification
cause mutagen exceptions when saving with ID3 v2.3 format. This fix
handles invalid LINK frames to prevent errors during save operations.
@francescouteiro francescouteiro force-pushed the picard-2750-fix-invalid-link-frame branch from 6eee65a to ee68982 Compare April 7, 2025 14:37
@francescouteiro
Copy link
Author

Thanks for your insights, @zas! I think that I've addressed your concerns for redundancy, as well fixed linting errors!

@francescouteiro francescouteiro requested a review from zas April 7, 2025 14:43
Copy link
Collaborator

@zas zas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks much better after recent changes, few more comments though.

for frame_id, frame in malformed_link_frames:
for url in self._extract_and_convert_link_frame(frame, frame_id):
metadata.add('user_website', url)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I miss something but can't this method be simplified even more:

    def _process_link_frames_on_load(self, tags, metadata):
        """Process malformed LINK frames during initial file load and add them as website URLs."""

        for frame_id, frame in list(tags.items()):
            if self._is_malformed_frame(frame_id, frame):
                for url in self._extract_and_convert_link_frame(frame, frame_id):
                    metadata.add('user_website', url)


return confirmed_problematic_frames

def _extract_and_convert_link_frame(self, frame, frame_id):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor thing, but everywhere we have frame_id, frame but here, perhaps swap arguments.

to_add = []
confirmed_problematic_frames = []
extracted_urls = []
for frame_id, frame in list(tags.items()):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to convert tags.items() to list here?
Elsewhere tags are modified during the loop, but not here.

@@ -1035,6 +1162,53 @@ def _remove_other_supported_tag(self, tags, real_name):
"""Remove other supported tag from ID3 frames."""
del tags[real_name]

def _contains_url_pattern(self, text):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it doesn't depend on self it could be made a simple utility function at module level:

def _contains_url_pattern(text):
     """Check if the given text contains URL-like patterns."""
    return any(term in text for term in ('http', 'www.', '://'))

It would shorten calls:

if part.strip() and self._contains_url_pattern(part):

would become:

if part.strip() and _contains_url_pattern(part):

If you prefer to keep it at class level, make it a static method at least (using @staticmethod decoration), but I think the code would be more readable with shorter calls.

for frame_id in list(tags.keys()):
if not (frame_id.isalnum() and frame_id.isupper()):
log.warning("Removing invalid frame ID: %r", frame_id)
del tags[frame_id]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phw Perhaps we should ask user about what do to in this case? Or is it ok to just delete invalid frames and move on?
We could also have an option for that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An option seems a bit cleaner to me so that the user has some measure of control, but I don't have a feel for how common this situation is encountered. It may not be worth having another setting (that most users won't understand).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this could be an option. The question is however what we do, if the user disables the removal? We can't keep it as it is, as it will fail saving.

I actually also don't quite see why this code is part of the PR, it seems to try to solve an entirely different problem. Are there cases where the frame IDs are invalid like this?

@zas zas requested a review from rdswift April 8, 2025 07:43

log.warning("Could not extract URL from malformed LINK frame %r", frame_id)

except Exception as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty broad exception capture. What exceptions do you anticipate could occur (and where), and could they be specifically captured?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree fully - this is way way way way way way way too broad.

Overall I would really like to see all exception handling like this removed, and instead for all things that would throw exceptions to be handled properly in normal code, leaving exceptions to be exactly that i.e. exceptions!!

combined_url = ''.join(url_parts).strip(':;,. \t\n\r')

if combined_url:
url_match = re.search(r'(?:https?:?(?:/+|\\+)|www\.)[a-zA-Z0-9][-a-zA-Z0-9\.]+\.[a-zA-Z]{2,}(?:/[^\s:;,]*)?', combined_url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question because I'm really not sure, but would it be more efficient if the search/match string was pre-compiled as a constant?

Comment on lines +719 to +720
parts = frame_data.split('\x00')
for part in parts:
Copy link
Collaborator

@rdswift rdswift Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: What you have currently is likely clearer to read and understand, but since you only use parts in the for line, it could be shortened to for part in frame_data.split('\x00'):.

"""Extract URL from a malformed LINK frame and convert it to a WXXX frame. Handles cases where URLs
are split between frame ID and data or separated by null bytes (common for archive.org)"""

try:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does everything need to be included within the try: block?

if self._is_malformed_frame(frame_id, frame):
invalid_id = True
for url in self._extract_and_convert_link_frame(frame, frame_id):
log.debug("Successfully converted malformed LINK frame %r to URL: %r", frame_id, url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an unusual enough situation that it might warrant logging at one level above debug (i.e. info)?

Comment on lines +657 to +659
invalid_id = False
if '\x00' in frame_id:
invalid_id = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be shortened to:

Suggested change
invalid_id = False
if '\x00' in frame_id:
invalid_id = True
invalid_id = '\x00' in frame_id

for frame_id in list(tags.keys()):
if not (frame_id.isalnum() and frame_id.isupper()):
log.warning("Removing invalid frame ID: %r", frame_id)
del tags[frame_id]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An option seems a bit cleaner to me so that the user has some measure of control, but I don't have a feel for how common this situation is encountered. It may not be worth having another setting (that most users won't understand).

@@ -170,6 +170,7 @@ class ID3File(File):
'TSOT': 'titlesort',
'WCOP': 'license',
'WOAR': 'website',
'WXXX': 'user_website',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to standardize on the use of this tag, is it something that should be documented in https://picard-docs.musicbrainz.org/en/appendices/tag_mapping.html and/or https://picard-docs.musicbrainz.org/en/variables/variables_other_information.html or should we remain silent about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really happy with the name here. Also not sure if this should be written to other file formats just like this. Maybe just use url? And we should consider how this can be mapped to other formats or prevent writing it there.

@zas @rdswift What are your thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this? I mean it creates complications and the best would to avoid it totally.

@francescouteiro
Copy link
Author

Thanks a lot for the insightful observations, @zas & @rdswift ! I’ll take some time to work through these suggestions and update the PR accordingly as soon as I can!

Copy link
Contributor

@Sophist-UK Sophist-UK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is problematic for me for several reasons:

  1. Perhaps I don't understand exactly what cases this is supposed to fix. Is it just the LINK tag, or all URL tags (W000-WZZZ + LINK according to the ID3 v2.4 spec)? Why does it process every tag and not just these?

  2. Is this simply an issue over the misplacing of a single 0x00 byte or something more complicated. I cannot understand how it needs 200 lines of new code (which is c. 15% of the size of the existing file) just to handle a simple issue over a mis-placed 0x00 byte.

  3. I have a real problem about the use of exceptions to handle failures in the new code rather than avoiding them by checks in the code, leaving exceptions to indicate that something unexpected has happened as a genuine exception to what is expected to happen. But if you are going to use exception handling, it needs to be confined to as few lines as possible and catch THE specific exception that is expected and not just any old exception that might happen.

I cannot really review the bulk of the code until points 1. and 2. are addressed - because this seems to be a massive sledgehammer for a very tiny nut.


try:
self._sanitize_id3_frames(tags)
except Exception as e:
Copy link
Contributor

@Sophist-UK Sophist-UK Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I have never liked the idea of using global exceptions as it can hide all sorts of unexpected issues.

    Personally I would prefer that the _sanitize_id3_frames code to check for things that would cause exceptions and logs them.

  2. I think the methods should be _sanitize_link_frames since this is in id3.py (so id3 is implied) and it doesn't santitize any frames other than link frames.

tags.save(filename, v2_version=3, v1=v1, v23_sep=separator)
try:
tags.save(filename, v2_version=3, v1=v1, v23_sep=separator)
except ValueError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike the idea of this try/except too.

Surely this is simply another form of sanitization that should be done before attempting to save the file.


log.warning("Could not extract URL from malformed LINK frame %r", frame_id)

except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree fully - this is way way way way way way way too broad.

Overall I would really like to see all exception handling like this removed, and instead for all things that would throw exceptions to be handled properly in normal code, leaving exceptions to be exactly that i.e. exceptions!!

@@ -136,6 +136,7 @@
'totaltracks': N_('Total Tracks'),
'tracknumber': N_('Track Number'),
'website': N_('Artist Website'),
'user_website': N_('User Defined Website'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about all the other non-ID3 formats - how will they handle this completely new tag?????

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this needs to be defined. We shouldn't just write USER_WEBSITE to Vorbis tags, this seems odd. I actually wouldn't expect a fix for the LINK frame issue to introduce a new tag. If we add a new tag it should be well handled from the start. If in doubt I would rather not write a new tag to some formats initially until we have figured out a proper mapping.

else:
tags.update_to_v24()
tags.save(filename, v2_version=4, v1=v1)

def _sanitize_id3_frames(self, tags):
"""This method attempts to fix various issues with ID3 frames:
1. Handle invalid LINK frames (convert to WXXX if possible)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I don't understand exactly what cases this is supposed to fix - so it might be helpful if it was spelled out in the comments here.

Perhaps the most obvious confusion is that this PR talks specifically about issues with the LINK tag (see section 4.20 of the ID3 2.4 spec but appears to be looking at all tags and not just LINK tags (and not even just tags starting with W (see section 4.3 of the ID3 2.4 spec which might potentially have this issue).

I would like to see the specific tags to be sanitized listed here and the specific issues and fixes spelled out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. I'd also prefer if this would fix the specific case we have.

Also something is strange here: For some reason this issue seems only to happen when saving ID3 v2.3, but not witth v2.4 (at least if what I wrote on the ticket is still true). Looking at the specs for "Linked information" in both ID3 v2.3 and v2.4 I don't really understand why it should be. The specs are very similar. According to the problem description (files containing a LINK frame without frame ID, only the null byte) the LINK frame would be invalid for both.

So maybe mutagen does something different here, but for some reason can successfully save it to ID3 v2.4.

@@ -170,6 +170,7 @@ class ID3File(File):
'TSOT': 'titlesort',
'WCOP': 'license',
'WOAR': 'website',
'WXXX': 'user_website',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really happy with the name here. Also not sure if this should be written to other file formats just like this. Maybe just use url? And we should consider how this can be mapped to other formats or prevent writing it there.

@zas @rdswift What are your thoughts?

for frame_id in list(tags.keys()):
if not (frame_id.isalnum() and frame_id.isupper()):
log.warning("Removing invalid frame ID: %r", frame_id)
del tags[frame_id]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this could be an option. The question is however what we do, if the user disables the removal? We can't keep it as it is, as it will fail saving.

I actually also don't quite see why this code is part of the PR, it seems to try to solve an entirely different problem. Are there cases where the frame IDs are invalid like this?

else:
tags.update_to_v24()
tags.save(filename, v2_version=4, v1=v1)

def _sanitize_id3_frames(self, tags):
"""This method attempts to fix various issues with ID3 frames:
1. Handle invalid LINK frames (convert to WXXX if possible)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. I'd also prefer if this would fix the specific case we have.

Also something is strange here: For some reason this issue seems only to happen when saving ID3 v2.3, but not witth v2.4 (at least if what I wrote on the ticket is still true). Looking at the specs for "Linked information" in both ID3 v2.3 and v2.4 I don't really understand why it should be. The specs are very similar. According to the problem description (files containing a LINK frame without frame ID, only the null byte) the LINK frame would be invalid for both.

So maybe mutagen does something different here, but for some reason can successfully save it to ID3 v2.4.

@@ -136,6 +136,7 @@
'totaltracks': N_('Total Tracks'),
'tracknumber': N_('Track Number'),
'website': N_('Artist Website'),
'user_website': N_('User Defined Website'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this needs to be defined. We shouldn't just write USER_WEBSITE to Vorbis tags, this seems odd. I actually wouldn't expect a fix for the LINK frame issue to introduce a new tag. If we add a new tag it should be well handled from the start. If in doubt I would rather not write a new tag to some formats initially until we have figured out a proper mapping.

@zas
Copy link
Collaborator

zas commented Apr 23, 2025

@francescouteiro any update on this?

@francescouteiro
Copy link
Author

@francescouteiro any update on this?

Very sorry for the delay! Last week was my exams' week, so I just started looking into all the things mentioned here recently! I'll address them between today and this Friday!

@francescouteiro
Copy link
Author

I'm again very sorry for the delay! I've been out of Wi-Fi at my address since Apr. 24 (plus no mobile data remaining), and with the recent power outage in Portugal it's still not solved. Thankfully, my data plan renewed today, so I can get back to this at 100%! TL;DR: This issue has not been forgotten or abandoned; I am still working on it!

@Sophist-UK
Copy link
Contributor

@francescouteiro

@Sophist-UK waves from the Algarve.

As a fellow Portuguese resident, I can confirm just how big an impact the power outage has had. We also had an internet outage of over a week after a huge storm when multiple lightening strikes took out the radio broadband tower, and high winds took out multiple telegraph poles carrying the fibre optics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants