Skip to content

PICARD-2750: Handle invalid LINK frames when saving MP3 files #2630

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
191 changes: 189 additions & 2 deletions picard/formats/id3.py
Original file line number Diff line number Diff line change
Expand Up @@ -170,6 +170,7 @@ class ID3File(File):
'TSOT': 'titlesort',
'WCOP': 'license',
'WOAR': 'website',
'WXXX': 'user_website',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to standardize on the use of this tag, is it something that should be documented in https://picard-docs.musicbrainz.org/en/appendices/tag_mapping.html and/or https://picard-docs.musicbrainz.org/en/variables/variables_other_information.html or should we remain silent about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really happy with the name here. Also not sure if this should be written to other file formats just like this. Maybe just use url? And we should consider how this can be mapped to other formats or prevent writing it there.

@zas @rdswift What are your thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this? I mean it creates complications and the best would to avoid it totally.

'COMM': 'comment',
'TOAL': 'originalalbum',
'TOPE': 'originalartist',
Expand Down Expand Up @@ -300,6 +301,8 @@ def _load(self, filename):
for frame in tags.values():
self._process_frame(frame, metadata, config_params)

self._process_link_frames_on_load(tags, metadata)

if 'date' in metadata:
self._sanitize_date(metadata)

Expand Down Expand Up @@ -614,6 +617,12 @@ def _get_tags(self, filename):

def _save_tags(self, tags, filename):
config = get_config()

try:
self._sanitize_id3_frames(tags)
except Exception as e:
Copy link
Contributor

@Sophist-UK Sophist-UK Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I have never liked the idea of using global exceptions as it can hide all sorts of unexpected issues.

    Personally I would prefer that the _sanitize_id3_frames code to check for things that would cause exceptions and logs them.

  2. I think the methods should be _sanitize_link_frames since this is in id3.py (so id3 is implied) and it doesn't santitize any frames other than link frames.

log.error("Error sanitizing ID3 frames: %s", e)

if config.setting['write_id3v1']:
v1 = 2
else:
Expand All @@ -622,11 +631,129 @@ def _save_tags(self, tags, filename):
if config.setting['write_id3v23']:
tags.update_to_v23()
separator = config.setting['id3v23_join_with']
tags.save(filename, v2_version=3, v1=v1, v23_sep=separator)
try:
tags.save(filename, v2_version=3, v1=v1, v23_sep=separator)
except ValueError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike the idea of this try/except too.

Surely this is simply another form of sanitization that should be done before attempting to save the file.

for frame_id in list(tags.keys()):
if not (frame_id.isalnum() and frame_id.isupper()):
log.warning("Removing invalid frame ID: %r", frame_id)
del tags[frame_id]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phw Perhaps we should ask user about what do to in this case? Or is it ok to just delete invalid frames and move on?
We could also have an option for that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An option seems a bit cleaner to me so that the user has some measure of control, but I don't have a feel for how common this situation is encountered. It may not be worth having another setting (that most users won't understand).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this could be an option. The question is however what we do, if the user disables the removal? We can't keep it as it is, as it will fail saving.

I actually also don't quite see why this code is part of the PR, it seems to try to solve an entirely different problem. Are there cases where the frame IDs are invalid like this?

tags.save(filename, v2_version=3, v1=v1, v23_sep=separator)
else:
tags.update_to_v24()
tags.save(filename, v2_version=4, v1=v1)

def _sanitize_id3_frames(self, tags):
"""This method attempts to fix various issues with ID3 frames:
1. Handle invalid LINK frames (convert to WXXX if possible)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I don't understand exactly what cases this is supposed to fix - so it might be helpful if it was spelled out in the comments here.

Perhaps the most obvious confusion is that this PR talks specifically about issues with the LINK tag (see section 4.20 of the ID3 2.4 spec but appears to be looking at all tags and not just LINK tags (and not even just tags starting with W (see section 4.3 of the ID3 2.4 spec which might potentially have this issue).

I would like to see the specific tags to be sanitized listed here and the specific issues and fixes spelled out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. I'd also prefer if this would fix the specific case we have.

Also something is strange here: For some reason this issue seems only to happen when saving ID3 v2.3, but not witth v2.4 (at least if what I wrote on the ticket is still true). Looking at the specs for "Linked information" in both ID3 v2.3 and v2.4 I don't really understand why it should be. The specs are very similar. According to the problem description (files containing a LINK frame without frame ID, only the null byte) the LINK frame would be invalid for both.

So maybe mutagen does something different here, but for some reason can successfully save it to ID3 v2.4.

2. Remove null bytes and other illegal characters from frame IDs
Note: In the future, we might consider prompting users about invalid frames."""
to_remove = []
to_add = []
confirmed_problematic_frames = []
extracted_urls = []
for frame_id, frame in list(tags.items()):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to convert tags.items() to list here?
Elsewhere tags are modified during the loop, but not here.

try:
invalid_id = False
if '\x00' in frame_id:
invalid_id = True
Comment on lines +657 to +659
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be shortened to:

Suggested change
invalid_id = False
if '\x00' in frame_id:
invalid_id = True
invalid_id = '\x00' in frame_id

if self._is_malformed_frame(frame_id, frame):
invalid_id = True
for url in self._extract_and_convert_link_frame(frame, frame_id):
log.debug("Successfully converted malformed LINK frame %r to URL: %r", frame_id, url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an unusual enough situation that it might warrant logging at one level above debug (i.e. info)?

extracted_urls.append(url)
wxxx_frame = mutagen.id3.WXXX(encoding=Id3Encoding.LATIN1, desc="URL from malformed LINK frame", url=url)
to_add.append(wxxx_frame)

if invalid_id:
confirmed_problematic_frames.append(frame_id)
to_remove.append(frame_id)

except Exception as e:
log.error("Error processing frame %r: %s", frame_id, e)

for frame_id in to_remove:
del tags[frame_id]
for frame in to_add:
tags.add(frame)

if extracted_urls and hasattr(self, 'metadata') and hasattr(self, 'orig_metadata'):
if 'user_website' not in self.metadata:
self.metadata['user_website'] = extracted_urls
if 'user_website' not in self.orig_metadata:
self.orig_metadata['user_website'] = extracted_urls

return confirmed_problematic_frames

def _extract_and_convert_link_frame(self, frame, frame_id):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor thing, but everywhere we have frame_id, frame but here, perhaps swap arguments.

"""Extract URL from a malformed LINK frame and convert it to a WXXX frame. Handles cases where URLs
are split between frame ID and data or separated by null bytes (common for archive.org)"""

try:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does everything need to be included within the try: block?

url_parts = []
if hasattr(frame, 'data') and frame.data:
frame_data = frame.data.decode('latin1', errors='ignore')
if '\x00' in frame_data:
parts = frame_data.split('\x00', 1)
# Likely valid LINK frame if the first part looks like a URL and second part doesn't contain URL patterns
if (parts[0] and self._contains_url_pattern(parts[0])
and len(parts) > 1 and not self._contains_url_pattern(parts[1])):
log.debug("Skipping valid LINK frame with URL %r", parts[0])
return

if ':' in frame_id:
frame_id_parts = frame_id.split(':', 1)
if frame_id_parts[0] == 'LINK':
# Standard case where everything after LINK: is part of the URL
url_parts.append(frame_id_parts[1])
elif any(proto in frame_id_parts[0] for proto in ('LINKhttp', 'LINKhtt', 'LINKwww')):
# Case where part of the URL protocol is in the frame_id
prefix = 'LINK'
protocol_part = frame_id_parts[0][len(prefix):]
url_parts.append(protocol_part + ':' + frame_id_parts[1])

if hasattr(frame, 'data') and frame.data:
frame_data = frame.data.decode('latin1', errors='ignore').strip()
if frame_data:
if '\x00' in frame_data:
parts = frame_data.split('\x00')
for part in parts:
Comment on lines +719 to +720
Copy link
Collaborator

@rdswift rdswift Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: What you have currently is likely clearer to read and understand, but since you only use parts in the for line, it could be shortened to for part in frame_data.split('\x00'):.

if part.strip() and self._contains_url_pattern(part):
url_parts.append(part.strip())
elif self._contains_url_pattern(frame_data):
url_parts.append(frame_data)

combined_url = ''.join(url_parts).strip(':;,. \t\n\r')

if combined_url:
url_match = re.search(r'(?:https?:?(?:/+|\\+)|www\.)[a-zA-Z0-9][-a-zA-Z0-9\.]+\.[a-zA-Z]{2,}(?:/[^\s:;,]*)?', combined_url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question because I'm really not sure, but would it be more efficient if the search/match string was pre-compiled as a constant?

if url_match:
raw_url = url_match.group(0)

if raw_url.startswith('www.'):
final_url = 'http://' + raw_url
elif raw_url.startswith(('http://', 'https://')):
final_url = raw_url
else:
final_url = re.sub(r'^htt:?p', 'http', raw_url)
if not final_url.startswith(('http://', 'https://')):
if final_url.startswith(':'):
final_url = 'http' + final_url
elif final_url.startswith('//'):
final_url = 'http:' + final_url
elif final_url.startswith('p://'):
final_url = 'http://' + final_url[4:]
else:
final_url = 'http://' + final_url

yield final_url
return

log.warning("Could not extract URL from malformed LINK frame %r", frame_id)

except Exception as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty broad exception capture. What exceptions do you anticipate could occur (and where), and could they be specifically captured?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree fully - this is way way way way way way way too broad.

Overall I would really like to see all exception handling like this removed, and instead for all things that would throw exceptions to be handled properly in normal code, leaving exceptions to be exactly that i.e. exceptions!!

log.error("Failed to process LINK frame %r: %s", frame_id, e)

def format_specific_metadata(self, metadata, tag, settings=None):
if not settings:
settings = get_config().setting
Expand Down Expand Up @@ -1035,6 +1162,53 @@ def _remove_other_supported_tag(self, tags, real_name):
"""Remove other supported tag from ID3 frames."""
del tags[real_name]

def _contains_url_pattern(self, text):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it doesn't depend on self it could be made a simple utility function at module level:

def _contains_url_pattern(text):
     """Check if the given text contains URL-like patterns."""
    return any(term in text for term in ('http', 'www.', '://'))

It would shorten calls:

if part.strip() and self._contains_url_pattern(part):

would become:

if part.strip() and _contains_url_pattern(part):

If you prefer to keep it at class level, make it a static method at least (using @staticmethod decoration), but I think the code would be more readable with shorter calls.

"""Check if the given text contains URL-like patterns."""
return any(term in text for term in ('http', 'www.', '://'))

def _is_malformed_frame(self, frame_id, frame):
"""Check if a LINK frame is malformed (contains URL in wrong place)."""
if not frame_id.startswith('LINK'):
return False

if ':' in frame_id and self._contains_url_pattern(frame_id):
return True

if hasattr(frame, 'data') and frame.data:
try:
frame_data = frame.data.decode('latin1', errors='ignore')

if frame_data.startswith('\x00') and self._contains_url_pattern(frame_data[1:]):
return True

if '\x00' in frame_data:
parts = frame_data.split('\x00', 1)
if len(parts) > 1 and self._contains_url_pattern(parts[1]):
return True

if self._contains_url_pattern(frame_data) and '\x00' not in frame_data:
return True

except Exception:
return True

return False

def _process_link_frames_on_load(self, tags, metadata):
"""Process malformed LINK frames during initial file load and add them as website URLs."""

malformed_link_frames = []
for frame_id, frame in list(tags.items()):
if self._is_malformed_frame(frame_id, frame):
malformed_link_frames.append((frame_id, frame))

if not malformed_link_frames:
return

for frame_id, frame in malformed_link_frames:
for url in self._extract_and_convert_link_frame(frame, frame_id):
metadata.add('user_website', url)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I miss something but can't this method be simplified even more:

    def _process_link_frames_on_load(self, tags, metadata):
        """Process malformed LINK frames during initial file load and add them as website URLs."""

        for frame_id, frame in list(tags.items()):
            if self._is_malformed_frame(frame_id, frame):
                for url in self._extract_and_convert_link_frame(frame, frame_id):
                    metadata.add('user_website', url)


class MP3File(ID3File):

Expand Down Expand Up @@ -1080,10 +1254,23 @@ def _get_tags(self, filename):

def _save_tags(self, tags, filename):
config = get_config()

try:
self._sanitize_id3_frames(tags)
except Exception as e:
log.error("Error sanitizing ID3 frames: %s", e)

if config.setting['write_id3v23']:
compatid3.update_to_v23(tags)
separator = config.setting['id3v23_join_with']
tags.save(filename, v2_version=3, v23_sep=separator)
try:
tags.save(filename, v2_version=3, v23_sep=separator)
except ValueError:
for frame_id in list(tags.keys()):
if not (frame_id.isalnum() and frame_id.isupper()):
log.warning("Removing invalid frame ID: %r", frame_id)
del tags[frame_id]
tags.save(filename, v2_version=3, v23_sep=separator)
else:
tags.update_to_v24()
tags.save(filename, v2_version=4)
Expand Down
1 change: 1 addition & 0 deletions picard/util/tags.py
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,7 @@
'totaltracks': N_('Total Tracks'),
'tracknumber': N_('Track Number'),
'website': N_('Artist Website'),
'user_website': N_('User Defined Website'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about all the other non-ID3 formats - how will they handle this completely new tag?????

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this needs to be defined. We shouldn't just write USER_WEBSITE to Vorbis tags, this seems odd. I actually wouldn't expect a fix for the LINK frame issue to introduce a new tag. If we add a new tag it should be well handled from the start. If in doubt I would rather not write a new tag to some formats initially until we have figured out a proper mapping.

'work': N_('Work'),
'writer': N_('Writer'),
}
Expand Down
65 changes: 65 additions & 0 deletions test/formats/test_id3.py
Original file line number Diff line number Diff line change
Expand Up @@ -625,6 +625,71 @@ def test_releasedate_v24(self):
raw_metadata = load_raw(self.filename)
self.assertEqual(metadata['releasedate'], raw_metadata['TDRL'])

@skipUnlessTestfile
def test_malformed_link_url_in_data(self):
filename = self.copy_file_tmp(os.path.join('test', 'data', 'test.mp3'), '.mp3')
tags = mutagen.id3.ID3()
try:
tags.load(filename)
except mutagen.id3.ID3NoHeaderError:
pass
frame = mutagen.id3.Frames['LINK'](data=b'\x00http://example.org/null')
tags.add(frame)
tags.save(filename)
metadata = load_metadata(filename)
self.assertIn('user_website', metadata)
website_url = metadata['user_website']
self.assertTrue('http://example.org/null' in website_url)
save_metadata(filename, metadata)
raw_metadata = load_raw(filename)
has_url_frame = False
for key in raw_metadata:
if key.startswith('WXXX:') or key == 'WXXX':
has_url_frame = True
self.assertTrue(website_url in key or website_url == raw_metadata[key])
break
self.assertTrue(has_url_frame, "No URL frame (WXXX) found in raw metadata")
self.assertFalse(any(k.startswith('LINK') for k in raw_metadata))

@skipUnlessTestfile
def test_archive_org_link_pattern(self):
filename = self.copy_file_tmp(os.path.join('test', 'data', 'test.mp3'), '.mp3')

tags = mutagen.id3.ID3()
try:
tags.load(filename)
except mutagen.id3.ID3NoHeaderError:
pass

frame = mutagen.id3.Frames['LINK'](data=b'p://www.archive.org/details/test_item')
tags.add(frame)
tags.save(filename)

metadata = load_metadata(filename)
self.assertIn('user_website', metadata, "URL was not extracted from archive.org pattern")
website_url = metadata['user_website']
self.assertIn('archive.org', website_url, "URL doesn't contain archive.org domain")
self.assertIn('details/test_item', website_url, "URL path was not correctly extracted")

save_metadata(filename, metadata)
raw_metadata = load_raw(filename)
has_url_frame = False

for key in raw_metadata:
if key.startswith('WXXX:') or key == 'WXXX':
has_url_frame = True
frame = raw_metadata[key]
if hasattr(frame, 'url'):
self.assertIn('archive.org', frame.url)
self.assertIn('details/test_item', frame.url)
else:
self.assertIn('archive.org', frame)
self.assertIn('details/test_item', frame)
break

self.assertTrue(has_url_frame, "No URL frame (WXXX) found - LINK frame wasn't converted")
self.assertFalse(any(k.startswith('LINK') for k in raw_metadata), "LINK frame was not removed after conversion")


class MP3Test(CommonId3Tests.Id3TestCase):
testfile = 'test.mp3'
Expand Down
Loading