handle_ref() cannot handle &gid= #2263

Closed
@ix5

(This popped up while helping ramonsuarez on his bug #2258.)
When you have a link formatted like this: http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=&discussionID=12741155&gid=87954&trk=EML_anet_qa_ttle-0Pt79xs2RVr6JBpnsJt7dBpSBA, the &gid= part makes the Pelican HTMLParser hiccup when truncating for feeds:

CRITICAL: ValueError: substring not found
Traceback (most recent call last):
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/utils.py", line 556, in handle_entityref
    codepoint = html_entities.name2codepoint[name]
KeyError: 'gid'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/me/data/pip/bin/pelican", line 11, in <module>
    sys.exit(main())
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/__init__.py", line 487, in main
    pelican.run()
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/__init__.py", line 179, in run
    p.generate_output(writer)
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/generators.py", line 599, in generate_output
    self.generate_feeds(writer)
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/generators.py", line 300, in generate_feeds
    self.settings['FEED_ALL_ATOM'])
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/writers.py", line 123, in write_feed
    self._add_item_to_the_feed(feed, elements[i])
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/writers.py", line 52, in _add_item_to_the_feed
    description = item.summary
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/contents.py", line 310, in summary
    return self.get_summary(self.get_siteurl())
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/utils.py", line 173, in __call__
    value = self.func(*args)
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/contents.py", line 306, in get_summary
    self.settings['SUMMARY_MAX_LENGTH'])
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/utils.py", line 583, in truncate_html_words
    truncator.feed(s)
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/utils.py", line 484, in feed
    HTMLParser.feed(self, *args, **kwargs)
  File "/usr/lib/python3.6/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib/python3.6/html/parser.py", line 219, in goahead
    self.handle_entityref(name)
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/utils.py", line 558, in handle_entityref
    self.handle_ref('')
  File "/home/me/data/pip/lib/python3.6/site-packages/pelican/utils.py", line 543, in handle_ref
    ref_end = self.rawdata.index(';', offset) + 1
ValueError: substring not found

It looks like the issue was introduced in 9d0804de7: When truncating, consider hypens, apostrophes and HTML entities.

As I do not fully understand this, @andreacorbellini do you think this simple change to use find() instead of index() in handle_ref() is sufficient? I have been able to get pelican to work using this, but I don’t know whether this is the right approach.

And is &gid a protected codepoint somehow?

Maybe @mosra, since you're working on something related to unescaping right now, you could have a look at this?
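
For reference, here is a minimal sketch of what seems to be going on, using only the standard library. It assumes the truncator runs the parser with convert_charrefs=False, which is what makes HTMLParser call handle_entityref at all. The names EntityRecorder, ref_end_with_index and ref_end_with_find are made up for illustration and are not Pelican code. As far as I can tell, &gid is not special: the parser reports any &name in text as an entity reference, and the KeyError just means 'gid' is not a known entity name.

```python
from html.parser import HTMLParser


class EntityRecorder(HTMLParser):
    """Hypothetical helper: records every name passed to handle_entityref."""

    def __init__(self):
        # With convert_charrefs=False the parser reports entity references
        # through handle_entityref instead of decoding them itself.
        super().__init__(convert_charrefs=False)
        self.names = []

    def handle_entityref(self, name):
        self.names.append(name)


def ref_end_with_index(rawdata, offset):
    # Same call as the failing line in the traceback: str.index raises
    # ValueError when no ';' follows the offset.
    return rawdata.index(';', offset) + 1


def ref_end_with_find(rawdata, offset):
    # Illustrative alternative: str.find returns -1 instead of raising,
    # so the "no semicolon" case can be handled explicitly.
    pos = rawdata.find(';', offset)
    return None if pos == -1 else pos + 1


# A shortened version of the query string from this report, as plain text.
text = "groupAnswers?discussionID=12741155&gid=87954&trk=EML"

recorder = EntityRecorder()
recorder.feed(text)
recorder.close()
print(recorder.names)  # 'gid' is reported even though no ';' follows it

try:
    ref_end_with_index(text, 0)
except ValueError as exc:
    print("index():", exc)

print("find():", ref_end_with_find(text, 0))
```

One caveat with simply swapping index() for find(): find() returns -1 when there is no semicolon, so find(';', offset) + 1 evaluates to 0 instead of raising. That avoids the crash, but whether 0 is a sensible ref_end for handle_ref() is a separate question.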
