Conversation

@cnsgithub

No description provided.

@Chaiavi
Contributor

Chaiavi commented Jan 19, 2020

Why do we need the MD5 checksum of a page's content?

What can it be used for?

@dgoiko

dgoiko commented Jan 26, 2020

> Why do we need the MD5 checksum of a page's content?
>
> What can it be used for?

If a crawler's visit algorithm performs expensive operations on pages and then stores only the extracted information, it may be useful to have a shared checksum store where it can check whether an identical page has already been processed and skip it in the future, for instance. The problem is that any page with a non-JS clock, a visit counter, a tiny dynamic PHP banner, or additional whitespace printed for God knows what reason would break the equality check.
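A minimal sketch of such a checksum store, using only the JDK. The class `ChecksumStore` and its methods `md5Hex` and `markIfNew` are hypothetical names for illustration and are not part of the crawler's API; the in-memory set stands in for whatever shared storage a real crawler would use.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: remembers the MD5 checksums of pages already processed. */
class ChecksumStore {
    // Thread-safe set, since crawler threads may check pages concurrently.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns the MD5 checksum of the content as a lowercase hex string. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to support MD5.
            throw new IllegalStateException(e);
        }
    }

    /** Records the page; returns true only if its checksum was not seen before. */
    boolean markIfNew(String content) {
        return seen.add(md5Hex(content));
    }
}
```

A visit method could call `markIfNew(html)` and skip the expensive processing when it returns `false`. Note that two pages differing only by a trailing space produce different checksums, which is exactly the weakness described above.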

I'm currently using a more HTML-driven solution: I check a specific tag containing a field that I know can be taken as a "primary key" for the website. However, I have to parse the content into a jsoup Document first, so it is probably more expensive, and it requires me to know the exact layout of the crawled pages (which I do, because I'm crawling information, not just HTML documents).

3 participants