hanoii commented Sep 21, 2025

I was attempting to submit these changes as individual PRs, but that has become a lot of work and I need this for a site I am archiving. So for now I am creating a single PR with everything I am working on, so you can take a peek, see the direction I am taking and send over any feedback.

These are the things I've been working on:

Major things:

  • Major regex rework: I found the existing regexes too weak to support the three alternatives that most of the parsing has to handle: double-quoted, single-quoted and unquoted values. So:
    • I refactored this into a single function that runs the regex separately for each of the three cases. This lets me apply case-specific rules, such as allowing spaces in quoted values but not in unquoted ones.
    • I built this on a pseudo-template regex language, in which a single regex template can be used for all three cases.
    • This also allowed me to add some "macros" to the language, so regex partials can be easily reused.
  • Move normlizeUrl and use it as well when doing updateHtmlPathsToRelative (f83ce16). I couldn't fully grasp the implications of this, but it seems to be working.
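
To make the template idea above concrete, here is a minimal sketch of how one regex template could be expanded into the three quoting variants. It is written in Python for brevity and is not the code in this PR: the placeholder syntax ({{Q}}, {{VALUE}}, {{WS}}) and the names MACROS, QUOTE_VARIANTS, expand_template and find_values are all made up for illustration.

```python
import re

# Hypothetical "macros": reusable regex partials referenced as {{NAME}}.
MACROS = {
    "WS": r"\s*",
}

# One entry per quoting style: (quote character, pattern for the value).
QUOTE_VARIANTS = [
    ('"', r'[^"]*'),              # double-quoted: spaces allowed
    ("'", r"[^']*"),              # single-quoted: spaces allowed
    ("", r"""[^\s"'<>]+"""),      # unquoted: no spaces, no quotes
]


def expand_template(template):
    """Expand one template into three compiled regexes, one per quoting style."""
    for name, partial in MACROS.items():
        template = template.replace("{{" + name + "}}", partial)
    patterns = []
    for quote, value in QUOTE_VARIANTS:
        pattern = template.replace("{{Q}}", quote)
        pattern = pattern.replace("{{VALUE}}", "(" + value + ")")
        patterns.append(re.compile(pattern, re.IGNORECASE))
    return patterns


def find_values(template, html):
    """Run each quoting variant separately and collect the captured values."""
    results = []
    for pattern in expand_template(template):
        results.extend(m.group(1) for m in pattern.finditer(html))
    return results


# A single template that covers href="...", href='...' and href=...
HREF_TEMPLATE = r"\bhref{{WS}}={{WS}}{{Q}}{{VALUE}}{{Q}}"
```

For example, find_values(HREF_TEMPLATE, html) would pick up a double-quoted "my file.html" with its space intact, while an unquoted value is cut at the first whitespace. Results are grouped by variant rather than by position in the document.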

Minor things:

  • With --replace-query-string="/.*/ -> " the files were saved with a double dot (..css); now, if the resulting query string is empty, the extension is just .css.
  • Added --offline-export-lowercase, which converts everything to lowercase.
  • Added some tests, but I haven't looked at test coverage thoroughly.
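
As a rough illustration of the first two items, the sketch below assumes the exported file name is built by placing the processed query string between the base name and the extension. That naming scheme and the function name offline_file_name are assumptions for the example, not the crawler's actual code (and the sketch is in Python).

```python
import posixpath


def offline_file_name(path, query, lowercase=False):
    """Build a local file name from a URL path and its processed query string.

    Hypothetical naming scheme: "<base>.<query><ext>".
    """
    base, ext = posixpath.splitext(path)
    if query:
        name = f"{base}.{query}{ext}"
    else:
        # Skip the separator too, otherwise an empty query yields "style..css".
        name = f"{base}{ext}"
    # Assumed behaviour of the lowercase option: lowercase the whole name.
    return name.lower() if lowercase else name
```

With an empty query, offline_file_name("/css/Style.css", "") keeps the single dot, and passing lowercase=True additionally lowercases the result.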

Fixes #79

hanoii changed the title from "wip: regex improvements (supporting spaces), convert to lowercase support, proper empty query string support" to "wip: major regex improvements (supporting spaces), convert to lowercase support, proper empty query string support" on Sep 22, 2025
hanoii commented Sep 23, 2025

I am not sure I can continue reshaping all the regexes, as there are a lot of them. However, I really feel this is a good approach, so if you like it, maybe you can accept it and convert the rest as time permits.

Also, I wouldn't want to spend much more time on this if you don't feel like accepting it.

hanoii commented Sep 23, 2025

I tried improving the regex at https://github.com/hanoii/siteone-crawler/blob/ad2a2d95950c57f6c00650b10bd46dfe59dadeb7/src/Crawler/ContentProcessor/HtmlProcessor.php#L147

but I then reverted that change.

I understand what it's looking for now:

<meta http-equiv="refresh" content="0; url=someurl.html">

This one is odder as the following are all valid:

<meta http-equiv="refresh" content="0; url=someurl.html">
<meta http-equiv="refresh" content="0; url='someurl with spaces.html'">
<meta http-equiv="refresh" content="0; url=&quot;someurl with spaces.html&quot;">
<meta http-equiv="refresh" content='0; url="someurl with spaces.html";'>
<meta http-equiv="refresh" content='0; url=&quot;someurl with spaces.html&quot;'>

These would be missed or wrongly matched by that regex. I am not sure whether you are targeting only the redirects you create yourself or whether this could also come from an actual crawl of some page. I guess both.
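
One possible way to cover all five variants is to parse in two steps: first pull out the content attribute (double- or single-quoted), decode HTML entities, and only then look for the url= part with its optional quotes. The following is just a Python sketch of that idea, not the regex in HtmlProcessor.php. The names META_REFRESH, URL_PART and meta_refresh_url are made up, and it assumes http-equiv appears before content in the tag.

```python
import html
import re

# Step 1: find a meta refresh tag and capture its content attribute.
# Assumption: http-equiv comes before content, as in the examples above.
META_REFRESH = re.compile(
    r"""<meta\b[^>]*?\bhttp-equiv\s*=\s*["']?refresh["']?[^>]*?\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')""",
    re.IGNORECASE,
)

# Step 2: inside the decoded content, the URL may be double-quoted,
# single-quoted or unquoted (the unquoted form stops at space or ';').
URL_PART = re.compile(
    r"""\burl\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s;"']+))""",
    re.IGNORECASE,
)


def meta_refresh_url(tag):
    """Return the redirect target of a meta refresh tag, or None."""
    m = META_REFRESH.search(tag)
    if not m:
        return None
    raw = m.group(1) if m.group(1) is not None else m.group(2)
    content = html.unescape(raw)  # turns &quot; into a real quote
    u = URL_PART.search(content)
    if not u:
        return None
    # Exactly one of the three alternatives matched.
    return next(g for g in u.groups() if g is not None)
```

Decoding the entities before the second step means the &quot; variants need no special handling, since they become ordinary double-quoted URLs.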

Development

Successfully merging this pull request may close these issues.

bug: html where filenames have spaces (not %20) are not properly parsed