
import earliest mailman archives #10

@sdarwin


The developer list archives go back to 1998:

https://listarchives.boost.org/Archives/boost/

However, the mm3 import process uses the .mbox files, which only go back to 2004.

The mm3 import process does not read the mm2 HTML archives; it only consumes files in .mbox format. The two are separate sources of the same data.

How could messages from 1998 be imported?

Design and run an intermediate script that ingests the "archives" and constructs a new mbox file.
Then import the mbox file.
Pay special attention to non-ASCII (UTF-8) characters such as accented letters, and add processing that converts them back to the required encoding so that they render correctly once the mbox is imported. What we observed is that the historical *.mbox files were fine, but the "archives" had become garbled.
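As a rough illustration of the "construct a new mbox file" step, here is a minimal Python sketch using the standard-library `mailbox` and `email` modules. The function name `append_message` and its parameters are hypothetical; it assumes the sender, subject, date, message ID and body have already been scraped from an archive page, and that only one process writes to the file (no locking is done). It requests quoted-printable as the body encoding so the output resembles the existing mbox files.

```python
import mailbox
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime


def append_message(mbox_path, sender, subject, date, message_id, body):
    """Append one reconstructed message to an mbox file (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Subject"] = subject
    msg["Date"] = format_datetime(date)
    msg["Message-ID"] = message_id
    # Ask for UTF-8 with quoted-printable transfer encoding, so that
    # "Joaquín" is written to disk as "Joaqu=C3=ADn".
    msg.set_content(body, charset="utf-8", cte="quoted-printable")

    box = mailbox.mbox(mbox_path)  # creates the file if it does not exist
    try:
        box.add(msg)
        box.flush()
    finally:
        box.close()
```

A call might look like `append_message("boost-1998.mbox", "Example Author <author@example.com>", "Re: example", datetime(1998, 11, 3, tzinfo=timezone.utc), "<1@example.com>", "Congratulations to Joaquín\n")`, where every value is a placeholder. Whether the mm3 importer accepts messages built this way would still need to be tested.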
Examining the source *.mbox files from wowbagger, and also the generated "Download" mbox files from the new hyperkitty, both appear to encode UTF-8 characters as quoted-printable, like this:

Congratulations to Joaqu=C3=ADn

However, the earlier mm2 html archives (not mbox) contain:

2025/05/259631.php:<p>Congratulations to Joaqu&#195;&#173;n on an outstanding contribution, and my genuine

Notice the difference.

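In the HTML archives, the single character í has been written as two numeric entities, &#195;&#173;, which are the decimal values of its two UTF-8 bytes (0xC3 0xAD). That means when parsing Joaqu&#195;&#173;n, it should most likely be converted to Joaqu=C3=ADn in the new mbox file. It's also conceivable that Joaqu&#195;&#173;n could be converted to Joaquín directly; however, that would need to be carefully tested because it doesn't match the format of the mm2 mbox or the mm3 (hyperkitty) mbox. What is known to work is Joaqu=C3=ADn.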
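The conversion described above can be sketched in Python as follows. This is only an assumption about how the repair might be done, not a tested part of the import process: the helper names `repair_entities` and `to_quoted_printable` are hypothetical. It treats each run of numeric entities with values up to 255 as raw bytes and keeps the result only if those bytes are valid UTF-8; anything else is left untouched.

```python
import quopri
import re

_ENTITY_RUN = re.compile(r"(?:&#\d+;)+")
_ENTITY = re.compile(r"&#(\d+);")


def repair_entities(text):
    """Turn byte-wise entities such as '&#195;&#173;' back into real characters."""
    def fix(match):
        codes = [int(n) for n in _ENTITY.findall(match.group(0))]
        if any(code > 255 for code in codes):
            return match.group(0)  # not a byte sequence, leave as-is
        try:
            # Each entity holds one byte of the original UTF-8 encoding.
            return bytes(codes).decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # e.g. a genuine Latin-1 entity like &#233;
    return _ENTITY_RUN.sub(fix, text)


def to_quoted_printable(text):
    """Encode text as UTF-8 quoted-printable, the form seen in the mbox files."""
    return quopri.encodestring(text.encode("utf-8")).decode("ascii")


repaired = repair_entities("Joaqu&#195;&#173;n")
print(repaired)                       # Joaquín
print(to_quoted_printable(repaired))  # Joaqu=C3=ADn
```

If the message is built with Python's `email` package and a quoted-printable transfer encoding, the second step happens automatically when the message is written out, so only the repair step would be needed in the script itself.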
