-
-
Notifications
You must be signed in to change notification settings - Fork 402
PICARD-2750: Handle invalid LINK frames when saving MP3 files #2630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
@@ -170,6 +170,7 @@ class ID3File(File): | |||||||||
'TSOT': 'titlesort', | ||||||||||
'WCOP': 'license', | ||||||||||
'WOAR': 'website', | ||||||||||
'WXXX': 'user_website', | ||||||||||
'COMM': 'comment', | ||||||||||
'TOAL': 'originalalbum', | ||||||||||
'TOPE': 'originalartist', | ||||||||||
|
@@ -300,6 +301,8 @@ def _load(self, filename): | |||||||||
for frame in tags.values(): | ||||||||||
self._process_frame(frame, metadata, config_params) | ||||||||||
|
||||||||||
self._process_link_frames_on_load(tags, metadata) | ||||||||||
|
||||||||||
if 'date' in metadata: | ||||||||||
self._sanitize_date(metadata) | ||||||||||
|
||||||||||
|
@@ -614,6 +617,12 @@ def _get_tags(self, filename): | |||||||||
|
||||||||||
def _save_tags(self, tags, filename): | ||||||||||
config = get_config() | ||||||||||
|
||||||||||
try: | ||||||||||
self._sanitize_id3_frames(tags) | ||||||||||
except Exception as e: | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||||||||||
log.error("Error sanitizing ID3 frames: %s", e) | ||||||||||
|
||||||||||
if config.setting['write_id3v1']: | ||||||||||
v1 = 2 | ||||||||||
else: | ||||||||||
|
@@ -622,11 +631,129 @@ def _save_tags(self, tags, filename): | |||||||||
if config.setting['write_id3v23']: | ||||||||||
tags.update_to_v23() | ||||||||||
separator = config.setting['id3v23_join_with'] | ||||||||||
tags.save(filename, v2_version=3, v1=v1, v23_sep=separator) | ||||||||||
try: | ||||||||||
tags.save(filename, v2_version=3, v1=v1, v23_sep=separator) | ||||||||||
except ValueError: | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I dislike the idea of this try/except too. Surely this is simply another form of sanitization that should be done before attempting to save the file. |
||||||||||
for frame_id in list(tags.keys()): | ||||||||||
if not (frame_id.isalnum() and frame_id.isupper()): | ||||||||||
log.warning("Removing invalid frame ID: %r", frame_id) | ||||||||||
del tags[frame_id] | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @phw Perhaps we should ask user about what do to in this case? Or is it ok to just delete invalid frames and move on? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. An option seems a bit cleaner to me so that the user has some measure of control, but I don't have a feel for how common this situation is encountered. It may not be worth having another setting (that most users won't understand). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that this could be an option. The question is however what we do, if the user disables the removal? We can't keep it as it is, as it will fail saving. I actually also don't quite see why this code is part of the PR, it seems to try to solve an entirely different problem. Are there cases where the frame IDs are invalid like this? |
||||||||||
tags.save(filename, v2_version=3, v1=v1, v23_sep=separator) | ||||||||||
else: | ||||||||||
tags.update_to_v24() | ||||||||||
tags.save(filename, v2_version=4, v1=v1) | ||||||||||
|
||||||||||
def _sanitize_id3_frames(self, tags): | ||||||||||
"""This method attempts to fix various issues with ID3 frames: | ||||||||||
1. Handle invalid LINK frames (convert to WXXX if possible) | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps I don't understand exactly what cases this is supposed to fix - so it might be helpful if it was spelled out in the comments here. Perhaps the most obvious confusion is that this PR talks specifically about issues with the LINK tag (see section 4.20 of the ID3 2.4 spec but appears to be looking at all tags and not just LINK tags (and not even just tags starting with W (see section 4.3 of the ID3 2.4 spec which might potentially have this issue). I would like to see the specific tags to be sanitized listed here and the specific issues and fixes spelled out. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I agree. I'd also prefer if this would fix the specific case we have. Also something is strange here: For some reason this issue seems only to happen when saving ID3 v2.3, but not witth v2.4 (at least if what I wrote on the ticket is still true). Looking at the specs for "Linked information" in both ID3 v2.3 and v2.4 I don't really understand why it should be. The specs are very similar. According to the problem description (files containing a LINK frame without frame ID, only the null byte) the LINK frame would be invalid for both. So maybe mutagen does something different here, but for some reason can successfully save it to ID3 v2.4. |
||||||||||
2. Remove null bytes and other illegal characters from frame IDs | ||||||||||
Note: In the future, we might consider prompting users about invalid frames.""" | ||||||||||
to_remove = [] | ||||||||||
to_add = [] | ||||||||||
confirmed_problematic_frames = [] | ||||||||||
extracted_urls = [] | ||||||||||
for frame_id, frame in list(tags.items()): | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a reason to convert |
||||||||||
try: | ||||||||||
invalid_id = False | ||||||||||
if '\x00' in frame_id: | ||||||||||
invalid_id = True | ||||||||||
Comment on lines
+657
to
+659
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this could be shortened to:
Suggested change
|
||||||||||
if self._is_malformed_frame(frame_id, frame): | ||||||||||
invalid_id = True | ||||||||||
for url in self._extract_and_convert_link_frame(frame, frame_id): | ||||||||||
log.debug("Successfully converted malformed LINK frame %r to URL: %r", frame_id, url) | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this an unusual enough situation that it might warrant logging at one level above |
||||||||||
extracted_urls.append(url) | ||||||||||
wxxx_frame = mutagen.id3.WXXX(encoding=Id3Encoding.LATIN1, desc="URL from malformed LINK frame", url=url) | ||||||||||
to_add.append(wxxx_frame) | ||||||||||
|
||||||||||
if invalid_id: | ||||||||||
confirmed_problematic_frames.append(frame_id) | ||||||||||
to_remove.append(frame_id) | ||||||||||
|
||||||||||
except Exception as e: | ||||||||||
log.error("Error processing frame %r: %s", frame_id, e) | ||||||||||
|
||||||||||
for frame_id in to_remove: | ||||||||||
del tags[frame_id] | ||||||||||
for frame in to_add: | ||||||||||
tags.add(frame) | ||||||||||
|
||||||||||
if extracted_urls and hasattr(self, 'metadata') and hasattr(self, 'orig_metadata'): | ||||||||||
if 'user_website' not in self.metadata: | ||||||||||
self.metadata['user_website'] = extracted_urls | ||||||||||
if 'user_website' not in self.orig_metadata: | ||||||||||
self.orig_metadata['user_website'] = extracted_urls | ||||||||||
|
||||||||||
return confirmed_problematic_frames | ||||||||||
|
||||||||||
def _extract_and_convert_link_frame(self, frame, frame_id): | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just a minor thing, but everywhere we have |
||||||||||
"""Extract URL from a malformed LINK frame and convert it to a WXXX frame. Handles cases where URLs | ||||||||||
are split between frame ID and data or separated by null bytes (common for archive.org)""" | ||||||||||
|
||||||||||
try: | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does everything need to be included within the |
||||||||||
url_parts = [] | ||||||||||
if hasattr(frame, 'data') and frame.data: | ||||||||||
frame_data = frame.data.decode('latin1', errors='ignore') | ||||||||||
if '\x00' in frame_data: | ||||||||||
parts = frame_data.split('\x00', 1) | ||||||||||
# Likely valid LINK frame if the first part looks like a URL and second part doesn't contain URL patterns | ||||||||||
if (parts[0] and self._contains_url_pattern(parts[0]) | ||||||||||
and len(parts) > 1 and not self._contains_url_pattern(parts[1])): | ||||||||||
log.debug("Skipping valid LINK frame with URL %r", parts[0]) | ||||||||||
return | ||||||||||
|
||||||||||
if ':' in frame_id: | ||||||||||
frame_id_parts = frame_id.split(':', 1) | ||||||||||
if frame_id_parts[0] == 'LINK': | ||||||||||
# Standard case where everything after LINK: is part of the URL | ||||||||||
url_parts.append(frame_id_parts[1]) | ||||||||||
elif any(proto in frame_id_parts[0] for proto in ('LINKhttp', 'LINKhtt', 'LINKwww')): | ||||||||||
# Case where part of the URL protocol is in the frame_id | ||||||||||
prefix = 'LINK' | ||||||||||
protocol_part = frame_id_parts[0][len(prefix):] | ||||||||||
url_parts.append(protocol_part + ':' + frame_id_parts[1]) | ||||||||||
|
||||||||||
if hasattr(frame, 'data') and frame.data: | ||||||||||
frame_data = frame.data.decode('latin1', errors='ignore').strip() | ||||||||||
if frame_data: | ||||||||||
if '\x00' in frame_data: | ||||||||||
parts = frame_data.split('\x00') | ||||||||||
for part in parts: | ||||||||||
Comment on lines
+719
to
+720
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Minor comment: What you have currently is likely clearer to read and understand, but since you only use |
||||||||||
if part.strip() and self._contains_url_pattern(part): | ||||||||||
url_parts.append(part.strip()) | ||||||||||
elif self._contains_url_pattern(frame_data): | ||||||||||
url_parts.append(frame_data) | ||||||||||
|
||||||||||
combined_url = ''.join(url_parts).strip(':;,. \t\n\r') | ||||||||||
|
||||||||||
if combined_url: | ||||||||||
url_match = re.search(r'(?:https?:?(?:/+|\\+)|www\.)[a-zA-Z0-9][-a-zA-Z0-9\.]+\.[a-zA-Z]{2,}(?:/[^\s:;,]*)?', combined_url) | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just a question because I'm really not sure, but would it be more efficient if the search/match string was pre-compiled as a constant? |
||||||||||
if url_match: | ||||||||||
raw_url = url_match.group(0) | ||||||||||
|
||||||||||
if raw_url.startswith('www.'): | ||||||||||
final_url = 'http://' + raw_url | ||||||||||
elif raw_url.startswith(('http://', 'https://')): | ||||||||||
final_url = raw_url | ||||||||||
else: | ||||||||||
final_url = re.sub(r'^htt:?p', 'http', raw_url) | ||||||||||
if not final_url.startswith(('http://', 'https://')): | ||||||||||
if final_url.startswith(':'): | ||||||||||
final_url = 'http' + final_url | ||||||||||
elif final_url.startswith('//'): | ||||||||||
final_url = 'http:' + final_url | ||||||||||
elif final_url.startswith('p://'): | ||||||||||
final_url = 'http://' + final_url[4:] | ||||||||||
else: | ||||||||||
final_url = 'http://' + final_url | ||||||||||
|
||||||||||
yield final_url | ||||||||||
return | ||||||||||
|
||||||||||
log.warning("Could not extract URL from malformed LINK frame %r", frame_id) | ||||||||||
|
||||||||||
except Exception as e: | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a pretty broad exception capture. What exceptions do you anticipate could occur (and where), and could they be specifically captured? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree fully - this is way way way way way way way too broad. Overall I would really like to see all exception handling like this removed, and instead for all things that would throw exceptions to be handled properly in normal code, leaving exceptions to be exactly that i.e. exceptions!! |
||||||||||
log.error("Failed to process LINK frame %r: %s", frame_id, e) | ||||||||||
|
||||||||||
def format_specific_metadata(self, metadata, tag, settings=None): | ||||||||||
if not settings: | ||||||||||
settings = get_config().setting | ||||||||||
|
@@ -1035,6 +1162,53 @@ def _remove_other_supported_tag(self, tags, real_name): | |||||||||
"""Remove other supported tag from ID3 frames.""" | ||||||||||
del tags[real_name] | ||||||||||
|
||||||||||
def _contains_url_pattern(self, text): | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since it doesn't depend on def _contains_url_pattern(text):
"""Check if the given text contains URL-like patterns."""
return any(term in text for term in ('http', 'www.', '://')) It would shorten calls: if part.strip() and self._contains_url_pattern(part): would become: if part.strip() and _contains_url_pattern(part): If you prefer to keep it at class level, make it a static method at least (using |
||||||||||
"""Check if the given text contains URL-like patterns.""" | ||||||||||
return any(term in text for term in ('http', 'www.', '://')) | ||||||||||
|
||||||||||
def _is_malformed_frame(self, frame_id, frame): | ||||||||||
"""Check if a LINK frame is malformed (contains URL in wrong place).""" | ||||||||||
if not frame_id.startswith('LINK'): | ||||||||||
return False | ||||||||||
|
||||||||||
if ':' in frame_id and self._contains_url_pattern(frame_id): | ||||||||||
return True | ||||||||||
|
||||||||||
if hasattr(frame, 'data') and frame.data: | ||||||||||
try: | ||||||||||
frame_data = frame.data.decode('latin1', errors='ignore') | ||||||||||
|
||||||||||
if frame_data.startswith('\x00') and self._contains_url_pattern(frame_data[1:]): | ||||||||||
return True | ||||||||||
|
||||||||||
if '\x00' in frame_data: | ||||||||||
parts = frame_data.split('\x00', 1) | ||||||||||
if len(parts) > 1 and self._contains_url_pattern(parts[1]): | ||||||||||
return True | ||||||||||
|
||||||||||
if self._contains_url_pattern(frame_data) and '\x00' not in frame_data: | ||||||||||
return True | ||||||||||
|
||||||||||
except Exception: | ||||||||||
return True | ||||||||||
|
||||||||||
return False | ||||||||||
|
||||||||||
def _process_link_frames_on_load(self, tags, metadata): | ||||||||||
"""Process malformed LINK frames during initial file load and add them as website URLs.""" | ||||||||||
|
||||||||||
malformed_link_frames = [] | ||||||||||
for frame_id, frame in list(tags.items()): | ||||||||||
if self._is_malformed_frame(frame_id, frame): | ||||||||||
malformed_link_frames.append((frame_id, frame)) | ||||||||||
|
||||||||||
if not malformed_link_frames: | ||||||||||
return | ||||||||||
|
||||||||||
for frame_id, frame in malformed_link_frames: | ||||||||||
for url in self._extract_and_convert_link_frame(frame, frame_id): | ||||||||||
metadata.add('user_website', url) | ||||||||||
|
||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps I miss something but can't this method be simplified even more: def _process_link_frames_on_load(self, tags, metadata):
"""Process malformed LINK frames during initial file load and add them as website URLs."""
for frame_id, frame in list(tags.items()):
if self._is_malformed_frame(frame_id, frame):
for url in self._extract_and_convert_link_frame(frame, frame_id):
metadata.add('user_website', url) |
||||||||||
|
||||||||||
class MP3File(ID3File): | ||||||||||
|
||||||||||
|
@@ -1080,10 +1254,23 @@ def _get_tags(self, filename): | |||||||||
|
||||||||||
def _save_tags(self, tags, filename): | ||||||||||
config = get_config() | ||||||||||
|
||||||||||
try: | ||||||||||
self._sanitize_id3_frames(tags) | ||||||||||
except Exception as e: | ||||||||||
log.error("Error sanitizing ID3 frames: %s", e) | ||||||||||
|
||||||||||
if config.setting['write_id3v23']: | ||||||||||
compatid3.update_to_v23(tags) | ||||||||||
separator = config.setting['id3v23_join_with'] | ||||||||||
tags.save(filename, v2_version=3, v23_sep=separator) | ||||||||||
try: | ||||||||||
tags.save(filename, v2_version=3, v23_sep=separator) | ||||||||||
except ValueError: | ||||||||||
for frame_id in list(tags.keys()): | ||||||||||
if not (frame_id.isalnum() and frame_id.isupper()): | ||||||||||
log.warning("Removing invalid frame ID: %r", frame_id) | ||||||||||
del tags[frame_id] | ||||||||||
tags.save(filename, v2_version=3, v23_sep=separator) | ||||||||||
else: | ||||||||||
tags.update_to_v24() | ||||||||||
tags.save(filename, v2_version=4) | ||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -136,6 +136,7 @@ | |
'totaltracks': N_('Total Tracks'), | ||
'tracknumber': N_('Track Number'), | ||
'website': N_('Artist Website'), | ||
'user_website': N_('User Defined Website'), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about all the other non-ID3 formats - how will they handle this completely new tag????? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, this needs to be defined. We shouldn't just write |
||
'work': N_('Work'), | ||
'writer': N_('Writer'), | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to standardize on the use of this tag, is it something that should be documented in https://picard-docs.musicbrainz.org/en/appendices/tag_mapping.html and/or https://picard-docs.musicbrainz.org/en/variables/variables_other_information.html or should we remain silent about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really happy with the name here. Also not sure if this should be written to other file formats just like this. Maybe just use
url
? And we should consider how this can be mapped to other formats or prevent writing it there.@zas @rdswift What are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this? I mean it creates complications and the best would to avoid it totally.