Allow negative priorities #61
Force-pushed from 670b024 to 4fca2e1
This fixes the ordering of URLs in the work queues when priorities below 0 are used. Because the ordering is done on the raw binary representation, negative values come last: their binary representation has the most significant bit (the sign bit) set to 1. To fix this, we add the absolute value of the minimum byte value, so that the minimum priority maps to 0. The maximum priority then falls outside the byte range, but the integer arithmetic wraps it down to exactly the binary representation that is useful here. You can set the priority when adding a seed by using the new variant of addSeed that takes the priority as a third parameter.
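The byte-offset trick described above can be sketched as follows. This is an illustrative standalone snippet, not the actual patch code, and it assumes the priority fits in a single signed byte:

```java
public class PriorityKeyDemo {
    // Map a signed byte priority to an unsigned key in [0, 255] by
    // subtracting Byte.MIN_VALUE (i.e. adding 128). After this shift,
    // plain byte-wise (unsigned lexicographic) comparison of the keys
    // matches the signed numeric order of the priorities:
    // -128 -> 0, 0 -> 128, 127 -> 255.
    static int toKey(byte priority) {
        return (priority - Byte.MIN_VALUE) & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(toKey((byte) -128)); // 0: smallest key
        System.out.println(toKey((byte) 0));    // 128: neutral priority in the middle
        System.out.println(toKey((byte) 127));  // 255: largest key
    }
}
```

Without the offset, the raw two's-complement bytes of -1 (0xFF) would compare as larger than those of 127 (0x7F), which is exactly the misordering the patch fixes.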
Force-pushed from 4fca2e1 to 8fe8655
I just rediscovered this bug. It would be good if it could be committed. I have a unit test for it as well.
Why should we have negative priorities? I added a commit to disallow that: #160. I will merge that one instead unless there are valid objections.
I have a crawler that looks for specific content on webpages and also enforces a maximum number of crawled pages per website. To have the highest probability of crawling the most relevant pages, negative priorities are essential. Most pages get a neutral priority of 0. Some pages are estimated to be more relevant based on factors such as the tag they were linked from, or the URL; these get a positive, increased priority. Other pages are estimated to be less relevant based on the same factors; these get a negative, decreased priority. As a result, I can be fairly sure that the increased-priority pages are always crawled, the neutral pages come next, and only if the limit has not been reached after those are the decreased-priority pages crawled as well. An alternative solution would be to change the default, neutral priority from 0 to something like 50, so that you could use priorities of 49 and 51 for the same effect.
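The scheme described above could be sketched as follows. The heuristic, the class, and the method name here are hypothetical, purely to illustrate how neutral, increased, and decreased priorities might be assigned per link:

```java
public class PriorityHeuristic {
    // Hypothetical heuristic: estimate a priority from the URL and the
    // tag the link was found in, mirroring the commenter's scheme of
    // neutral (0), increased (positive), and decreased (negative) values.
    static int estimatePriority(String url, String linkTag) {
        if (url.contains("/article/") || linkTag.equals("h1")) {
            return 1;   // likely relevant: increased priority
        }
        if (url.contains("/login") || linkTag.equals("footer")) {
            return -1;  // likely irrelevant: decreased priority
        }
        return 0;       // neutral: the default priority
    }

    public static void main(String[] args) {
        System.out.println(estimatePriority("https://example.com/article/42", "a")); // 1
        System.out.println(estimatePriority("https://example.com/login", "a"));      // -1
        System.out.println(estimatePriority("https://example.com/about", "a"));      // 0
    }
}
```

The returned value would then be passed as the priority argument of the new addSeed variant mentioned in the patch description.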
+1 for @madegg's detailed explanation. Since the default priority is 0, it would be beneficial to be able to use values lower than the default for less important URLs. I can confirm that the approach in this patch works for permitting negative priorities.
+1 for this too
I prefer to change the default priority from 0 to 50. Then negative values are not needed; we can just use values lower than 50.
Note: I recently discovered that you moved from Google Code to GitHub. I've been using and modifying crawler4j for a while, and now I'm able to submit pull requests so you can merge them. I rebased all my (relevant) patches on the new master, so they should be easy to merge. I hope you consider some or all of them useful.