Description
The developer list archives go back to 1998:
https://listarchives.boost.org/Archives/boost/
However, the mm3 import process reads the .mbox files, which only go back to 2004.
The mm3 import process doesn't use the mm2 HTML archives; it is based on the .mbox file format. The two are separate sources of the same data.
How could messages from 1998 be imported?
Design and run an intermediate script that ingests "archives" and constructs a new mbox file.
Then import the mbox file.
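A minimal sketch of the mbox-construction step, using Python's standard `mailbox` and `email` modules. It assumes an earlier parsing stage (not shown) has already extracted each archived message into a dict; the `build_mbox` function and the record keys `from`, `subject`, `date`, `message_id` and `body` are hypothetical names chosen for illustration, not part of any existing tool. The body is written as quoted-printable UTF-8.

```python
import mailbox
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime


def build_mbox(records, mbox_path):
    """Append one message per record to an mbox file (created if missing).

    `records` is an iterable of dicts with the hypothetical keys
    'from', 'subject', 'date' (timezone-aware datetime), 'message_id', 'body'.
    """
    box = mailbox.mbox(mbox_path)
    try:
        for rec in records:
            msg = EmailMessage()
            msg["From"] = rec["from"]
            msg["Subject"] = rec["subject"]
            msg["Date"] = format_datetime(rec["date"])
            msg["Message-ID"] = rec["message_id"]
            # Store the body as quoted-printable UTF-8, so that
            # "Joaquín" is written to disk as "Joaqu=C3=ADn".
            msg.set_content(rec["body"], charset="utf-8", cte="quoted-printable")
            box.add(msg)
        box.flush()
    finally:
        box.close()
```

The resulting file could then be handed to the mm3 import step. Threading headers such as In-Reply-To are left out here and would need to be carried over if the archives expose them.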
Pay special attention to UTF-8 characters and non-English letters, and include additional processing that converts these letters back to the necessary format, so that they render correctly once the mbox is imported. What we observed is that the historical *.mbox files were OK, but the "archives" had become garbled.
Examining the source *.mbox files from wowbagger, and also the generated "Download" mbox files from the new hyperkitty, both seem to output UTF-8 characters in quoted-printable form, like this:
Congratulations to Joaqu=C3=ADn
However, the earlier mm2 html archives (not mbox) contain:
2025/05/259631.php:<p>Congratulations to Joaquín on an outstanding contribution, and my genuine
Notice the difference.
That means that when the script parses the name as it appears in the HTML archives (Joaquín), it should most likely write Joaqu=C3=ADn to the new mbox file, that is, the UTF-8 bytes in quoted-printable encoding. It's also conceivable that the name could be written as a literal Joaquín (raw UTF-8), however that would need to be carefully tested because it doesn't match the format of the mm2 mbox or mm3 (hyperkitty) mbox files. What is known to work is Joaqu=C3=ADn.
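The conversion described above can be sketched with Python's standard `quopri` module, which produces the quoted-printable form seen in the mbox files. The helper names `to_quoted_printable` and `from_quoted_printable` are hypothetical. The `html.unescape` call is an assumption: it only matters if the text extracted from the HTML archives still contains character entities.

```python
import html
import quopri


def to_quoted_printable(text: str) -> str:
    """Encode text as quoted-printable UTF-8, as found in the mbox files."""
    # Assumption: text taken from the HTML archives may contain entities.
    text = html.unescape(text)
    return quopri.encodestring(text.encode("utf-8")).decode("ascii")


def from_quoted_printable(encoded: str) -> str:
    """Decode quoted-printable UTF-8 back to text, for round-trip checks."""
    return quopri.decodestring(encoded.encode("ascii")).decode("utf-8")


print(to_quoted_printable("Congratulations to Joaquín"))
# prints: Congratulations to Joaqu=C3=ADn
```

A round-trip check with `from_quoted_printable` on a sample of converted messages would be one way to confirm that nothing was garbled before importing.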