Skip to content

Allow negative priorities#61

Open
EgbertW wants to merge 1 commit intoyasserg:masterfrom
EgbertW:negative-priorities
Open

Allow negative priorities#61
EgbertW wants to merge 1 commit intoyasserg:masterfrom
EgbertW:negative-priorities

Conversation

@EgbertW
Copy link
Copy Markdown
Contributor

@EgbertW EgbertW commented May 20, 2015

Note: I recently discovered that you moved from google code to Github. I've been using and modifying crawlerj4 for a while now and now I'm able to submit pull requests to allow you to merge them. I rebased all my (relevant) patches on the new master, so they should be easy to merge. I hope you consider some or all of them useful.

Description of this patch:
Fix ordering of urls in workqueues when priorities < 0 are used. Because the ordering is done strictly binary, negative values will come last, because their binary representation starts with the MSB at 1. In order to fix this, we'll have to add the minimum byte value to become 0. This means that the maximum number will become out of range in Byte-value, but the integer value is nicely converted down to the actual binary representation that is useful here.

You can set priority when adding a seed by using the new variant of addSeed with a third parameter that is the priority.

@EgbertW EgbertW force-pushed the negative-priorities branch 2 times, most recently from 670b024 to 4fca2e1 Compare May 26, 2015 09:02
Because the ordering is done strictly binary, negative values will come
last, because their binary representation starts with the MSB at 1. In
order to fix this, we'll have to add the minimum byte value to become 0.
This means that the maximum number will become out of range in
Byte-value, but the integer value is nicely converted down to the actual
binary representation that is useful here.

You can set priority when adding a seed by using the new variant of addSeed with a
third parameter that is the priority.
@EgbertW EgbertW force-pushed the negative-priorities branch from 4fca2e1 to 8fe8655 Compare July 20, 2015 08:07
@cmacdonald
Copy link
Copy Markdown

I just rediscovered this bug. It would be good if it can be committed. I do have a unit test as well.

@yasserg
Copy link
Copy Markdown
Owner

yasserg commented Sep 28, 2016

Why should we have negative priorities? I added a commit to disallow that: #160 will merge that one instead unless there are valid objections.

@EgbertW
Copy link
Copy Markdown
Contributor Author

EgbertW commented Sep 28, 2016

I have a crawler that is looking for specific content on webpages, and it also maintains a maximum number of crawled pages per website. In order to have the highest probability to crawl the most relevant pages, negative priorities are essential.

Most pages get a neutral priority of 0. Some pages are estimated to be more relevant based on factors such as the tag it was linked from, or the URL. They get a positive, increased priority. Some pages are estimated to be less relevant based on these same factors. These get a negative, decreased priority.

The result is that I can be pretty sure that the increased priority pages are always crawled, the neutral pages come after and only if aftere these pages the limit has not been reached, the decreased priority pages are crawled as well.

An alternative solution would be to change the default, neutral priority from 0 to something like 50, so that you could use priorities of 49 and 51 for the same effect.

@cmacdonald
Copy link
Copy Markdown

cmacdonald commented Sep 28, 2016

+1 for @madegg's detailed explanation.
I had been going to say that the Javadoc says: "URLs with lower priority numbers will be crawled earlier."

As the default priority is 0, so it would be beneficial to allow higher priority URLs by having values lower than the default.

I can confirm that the approach in this patch works for permitting negative priorities.

@s17t
Copy link
Copy Markdown
Contributor

s17t commented Dec 13, 2016

@yasserg I would go to merge @madegg's patch if no objections arise.

@rzo1
Copy link
Copy Markdown
Contributor

rzo1 commented Dec 15, 2016

+1 for this too

@yasserg
Copy link
Copy Markdown
Owner

yasserg commented Dec 15, 2016

I prefer to change default priority from 0 to 50. Then negative values are not needed. We can just use values lower than 50.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants