Strip HTML from event descriptions before insertion #276
benthamite wants to merge 7 commits into kidd:master
Google Calendar API returns event descriptions as HTML, containing tags like <a>, <br>, <html-blob>, <u>, <ul>/<li>, and HTML entities like &amp;, &nbsp;, etc. These were inserted raw into the :org-gcal: drawer, producing malformed content in Org files. Add `org-gcal--strip-html' to convert HTML descriptions to plain text and apply it when binding the `desc' variable in `org-gcal--update-entry'. Fixes kidd#258.
@benthamite I'm not necessarily opposed to something like this, but I want to allow people to edit their event descriptions locally, including the HTML. Perhaps a per-headline property could be set in the Org file on the events that you want to strip HTML from.
telotortium left a comment:
See my comments - also merge the latest master into this branch.
Fair. How about a user option,
It should probably at least be a per-calendar option, with a global user-customizable default, so that shared calendars that you import are not corrupted.
@benthamite Also merge latest master - it has some fixes to the CI.
Add `org-gcal-strip-html-descriptions' (boolean, default nil) as the global default, and `org-gcal-strip-html-descriptions-overrides' (alist of calendar-id to boolean) for per-calendar overrides. This allows users to strip HTML globally while preserving it for specific shared calendars, as requested in PR review.
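Done. I've added two user options:
- org-gcal-strip-html-descriptions (boolean, default nil): the global default
- org-gcal-strip-html-descriptions-overrides (alist of calendar-id to boolean): per-calendar overrides

This way users can e.g. strip HTML globally while preserving it for specific shared calendars:

    (setopt org-gcal-strip-html-descriptions t)
    (setopt org-gcal-strip-html-descriptions-overrides
            '(("shared-calendar@group.calendar.google.com" . nil)))

Also confirmed the branch is up to date with latest master (merged in …)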
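The lookup implied by these two options (a per-calendar override that falls back to a global default) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `my/` names are hypothetical stand-ins and are not taken from the PR's diff, and `personal@example.com` is a made-up calendar id.

```elisp
;; Hypothetical sketch; the `my/' names are illustrative only.
(defvar my/strip-html-default nil
  "Global default: non-nil means strip HTML from descriptions.")

(defvar my/strip-html-overrides nil
  "Alist of (CALENDAR-ID . BOOLEAN) per-calendar overrides.")

(defun my/strip-html-p (calendar-id)
  "Return non-nil if HTML should be stripped for CALENDAR-ID."
  (let ((override (assoc calendar-id my/strip-html-overrides)))
    ;; Test the cons cell, not its cdr: an explicit (ID . nil) entry
    ;; must win over a non-nil global default.
    (if override
        (cdr override)
      my/strip-html-default)))

;; Usage: strip globally, but preserve HTML for one shared calendar.
(let ((my/strip-html-default t)
      (my/strip-html-overrides
       '(("shared-calendar@group.calendar.google.com" . nil))))
  (list (my/strip-html-p "shared-calendar@group.calendar.google.com")
        (my/strip-html-p "personal@example.com")))
;; => (nil t)
```

Checking whether an entry exists, rather than whether its value is non-nil, is what lets a `nil` override disable stripping for a single calendar.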
@benthamite Could you add tests please?
- org-gcal-test--strip-html: unit tests for the HTML-to-text conversion (tags, entities, list items, blank line collapsing)
- org-gcal-test--strip-html-p: predicate tests covering global default, per-calendar enable override, and per-calendar disable override
- org-gcal-test--update-entry-strip-html: integration test verifying HTML is converted when stripping is enabled
- org-gcal-test--update-entry-preserve-html: integration test verifying HTML is preserved when stripping is disabled (default)
- org-gcal-test--update-entry-strip-html-per-calendar: integration test verifying per-calendar override takes effect
The regex <[^>]+> also matches Org timestamps like <2019-10-06 Sun 17:00-21:00>. Use </?[a-zA-Z][^>]*> instead, which only matches actual HTML tags.
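The difference between the two patterns can be checked directly with `string-match-p`. The snippet below is only an illustrative check, not code from the PR; the timestamp is the one quoted above and the tag samples are made up.

```elisp
;; Illustrative check only; the tag samples are made up for this sketch.
(let ((broad "<[^>]+>")
      (narrow "</?[a-zA-Z][^>]*>")
      (timestamp "<2019-10-06 Sun 17:00-21:00>"))
  (list (string-match-p broad timestamp)    ; 0: the timestamp would be stripped
        (string-match-p narrow timestamp)   ; nil: the timestamp is left alone
        (string-match-p narrow "<br>")      ; 0: opening tag still matches
        (string-match-p narrow "</a>")))    ; 0: closing tag still matches
;; => (0 nil 0 0)
```

The narrower pattern requires a letter (optionally preceded by `/`) right after `<`, so anything that starts with a digit, such as an active Org timestamp, is not treated as a tag.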
Added tests and merged latest master:
Also merged latest master (picks up PR #279).
@telotortium, please let me know if there is anything else you’d like me to do.
Summary
Google Calendar API returns event descriptions as HTML, containing tags like `<a>`, `<br>`, `<html-blob>`, `<u>`, `<ul>`/`<li>`, and HTML entities like `&amp;`, `&nbsp;`, etc. These are currently inserted raw into the `:org-gcal:` drawer, producing malformed content in Org files.

This causes problems for tools that parse Org files (e.g. org-roam interprets `https://meet.google.com/abc</a>` as a valid link, inserting a broken path into its database).

Changes
- `org-gcal--strip-html` function that converts HTML descriptions to plain text:
  - `<br>` → newlines
  - `<li>` → `\n- ` (Org list items)
  - HTML entities decoded (`&amp;`, `&lt;`, `&gt;`, `&nbsp;`, `&quot;`, and the apostrophe entity)
- Applied when binding `desc` in `org-gcal--update-entry`

Round-trip safety
Descriptions posted back to Google Calendar via `org-gcal-post-at-point` are read back from the Org file as plain text, which Google Calendar accepts. No formatting is lost that wasn't already lost by storing raw HTML in an Org file.

Example
Before:
After:
Fixes #258.
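For readers who want a feel for the conversion listed under Changes, here is a minimal sketch of an HTML-to-text pass along those lines. It is not the PR's implementation: `my/strip-html` is a hypothetical name, the exact entity set (including `&#39;` for the apostrophe) and the rule order are assumptions, and the sample description is made up.

```elisp
(require 'subr-x)  ; for `string-trim' on older Emacs versions

;; Hypothetical sketch; not the function added by the PR.
(defun my/strip-html (html)
  "Return HTML converted to plain text."
  (let ((text html))
    (dolist (rule '(("<br[ \t]*/?>" . "\n")              ; <br> -> newline
                    ("<li\\(?:[ \t][^>]*\\)?>" . "\n- ") ; <li> -> Org list item
                    ("</?[a-zA-Z][^>]*>" . "")           ; drop remaining tags
                    ("&nbsp;" . " ")
                    ("&lt;" . "<")
                    ("&gt;" . ">")
                    ("&quot;" . "\"")
                    ("&#39;" . "'")
                    ("&amp;" . "&")                      ; last, so &amp;lt; stays &lt;
                    ("\n\\(?:[ \t]*\n\\)+" . "\n\n")))   ; collapse blank lines
      ;; FIXEDCASE and LITERAL are both t: insert replacements verbatim.
      (setq text (replace-regexp-in-string (car rule) (cdr rule) text t t)))
    (string-trim text)))

(my/strip-html
 (concat "Agenda:<ul><li>Intro</li><li>Q&amp;A</li></ul><br>"
         "<a href=\"https://meet.google.com/abc\">https://meet.google.com/abc</a>"))
;; => "Agenda:\n- Intro\n- Q&A\nhttps://meet.google.com/abc"
```

Decoding `&amp;` after the other entities avoids double-decoding text such as `&amp;lt;`, and the tag pattern is the narrower one from the review, so Org timestamps in a description pass through unchanged.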