Implement recursion #1603
base: master
Conversation
seems like the formatting rules from my cargo install differ from ci's? i get nothing when running cargo fmt locally...
@@ -181,24 +193,56 @@ where
    Ok(())
}

/// Reads from the request channel and updates the progress bar status
async fn progress_bar_task(
/// Reads from the response channel, updates the progress bar status and (if recursing) sends new requests.
note: recursing -> recurring / recursive
Great progress on implementing recursion! The core implementation using request/response channels with the
I hope that I covered all open points for now. Let me know if I missed anything. 😃
Probably. We don't have any specific rustfmt settings. Can you disable your global config to see if that changes things? `RUSTFMT_CONFIG_PATH=/dev/null cargo fmt`
okay, so i figured out why it hung, and it was, as you guessed, a backpressure problem: I was waiting on the subsequent URI sends before finishing to process a response, so the response-receiver task would not move on to process the response, resulting in a lock-up. I fixed this by putting the requests channel "refill" in a tokio::spawn. I don't know if spawning that many tasks is a good idea in Tokio, but I guess one of the points of using an async runtime is that it manages these things for us?

unfortunately the problem has not completely gone away: for some reason it locks up on the last URL on en.wikipedia.org with --max-depth=0 (which is functionally the same as not activating recursion), but works without --recursive...

there's also the issue of how we should handle a "cached" response in recursive mode: right now it's considered a cached response as if it came from .lycheecache, but this isn't really true. I'm tempted to make a separate

Because of the parallel nature of the request-to-response task, it seems to me that sending the same request twice to the channel is hard to prevent. i tried adding guards basically everywhere (when getting the list of subsequent uris to add to the requests channel, before processing a request, ...) and i still seem to get duplicates. it might be a skill issue though. or maybe we should use an Arc<Mutex> instead of just an Arc for the cache? i'll have to try that later.

Finally, I'm a bit on the fence on this, but for now I added a .insert method to the Stats struct to prevent adding duplicate entries for the same URL, as a stopgap measure for the problem mentioned above.

I'll get around to working on this again sometime this week (maybe), and I think i'll start by writing tests instead of checking manually, because that's starting to become cumbersome, and this way i'll also make progress on the tests themselves.

By the way, I discovered tokio-console and tried to use it to debug the backpressure bug. while it was not extremely helpful, it did help me see whether I still had some tasks doing stuff or if there was just nothing going on anymore. So right now I have some dirty git changes related to tokio-console setup (another dep and a call to a hook/setup function in main()). Do you want me to commit these (of course, by enabling the tracing for the dev profile only)?
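For reference, a minimal sketch of the spawn-based "refill" described above, using placeholder types and channel names rather than lychee's actual ones:

```rust
use tokio::sync::mpsc;

// Placeholder response type for illustration only; lychee's real types differ.
struct CheckedResponse {
    uri: String,
    subsequent_uris: Vec<String>,
}

async fn handle_responses(
    mut response_rx: mpsc::Receiver<CheckedResponse>,
    request_tx: mpsc::Sender<String>,
) {
    while let Some(response) = response_rx.recv().await {
        // 1. Finish processing the response first (stats, progress bar, cache, ...).
        println!("checked {}", response.uri);

        // 2. Refill the requests channel from a spawned task, so a full (bounded)
        //    channel can never block this loop: the awaited `send` happens inside
        //    the spawned task, which breaks the cycle where the response receiver
        //    waits on a channel that only gets drained further down the pipeline.
        let tx = request_tx.clone();
        let uris = response.subsequent_uris;
        tokio::spawn(async move {
            for uri in uris {
                // Ignore send errors: the request receiver may already be gone.
                let _ = tx.send(uri).await;
            }
        });
    }
}
```

Spawning one short-lived task per response is generally cheap in Tokio, since tasks are lightweight; the key point is that the bounded `send` no longer happens on the receive loop itself.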
match response.text().await {
    Err(_) => (status, vec![]),
    Ok(response_text) => {
        let links: Vec<_> = Collector::new(None, Some(Base::Remote(base_url)))
todo: use the collector that's constructed around main, since some flags are passed to it, while we are outright ignoring a bunch of flags here
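To illustrate the suggestion with placeholder types (not lychee's real `Collector` API): the collector configured from the CLI flags near `main()` would be shared with the response-handling code instead of building a new one per response.

```rust
use std::sync::Arc;

// Placeholder for a link collector that was configured from the CLI flags
// near main(); lychee's real Collector type and options differ.
struct ConfiguredCollector {
    include_verbatim: bool, // example of a flag an ad-hoc `Collector::new(...)` would drop
}

impl ConfiguredCollector {
    fn extract_links(&self, _html: &str) -> Vec<String> {
        // ... extract links while honouring the configured flags ...
        let _ = self.include_verbatim;
        Vec::new()
    }
}

// The response-handling code receives the shared, already-configured collector
// instead of constructing a fresh one for every response body.
fn subsequent_uris(collector: &Arc<ConfiguredCollector>, body: &str) -> Vec<String> {
    collector.extract_links(body)
}
```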
Thanks for the detailed update @ewen-lbh! Let me address your points:
I enjoy reading the progress reports. 👍
@gwennlbh, any open questions at the moment? Let me know if I can help you. 😃
Sorry, had end-of-semester rush with a lot of projects and exams haha, i'll get back to it in a few days
Oh, no rush. Hope all went well. Good luck and take your time. Just didn't want this nice work to go stale. 😄
Almost there. 😄 Nice updates.
hiya, i wanna start working on it again but i really gotta write tests before doing anything else (especially since i need to get back into it, and testing the thing with example sites isn't gonna cut it). any guidance on writing tests? like where do i put them, whether you have any testing infrastructure like helper functions, etc.
i also removed the console-subscriber dependency as you requested
844ce5f to 044f623 (compare)
ok so lints are finally passing, imma need to make current tests pass again before adding any lol. i added 3
Did you find your way around the tests? Basically integration tests go into
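For orientation, a minimal sketch of what such an integration test might look like, assuming `assert_cmd` and `predicates` are available as dev-dependencies and that the binary is named `lychee`; the repository's actual test layout and helpers may differ.

```rust
// tests/recursion.rs -- hypothetical file name for an integration test.
use assert_cmd::Command;
use predicates::str::contains;

#[test]
fn recursive_flag_is_listed_in_help() {
    // Run the CLI built by this crate and check that the new flag shows up.
    Command::cargo_bin("lychee")
        .expect("binary should be built by `cargo test`")
        .arg("--help")
        .assert()
        .success()
        .stdout(contains("--recursive"));
}
```

Tests against real recursion behaviour would point the binary at a local fixture (e.g. a small HTML tree on disk or a local test server) rather than a live site.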
heya, an update from me: i recently switched back to windows due to unrelated things, and i can't seem to build the openssl crate (tried vcpkg, restarting etc, won't work). but even then, even though i was kinda denying it, it's pretty clear that i've lost motivation to keep working on this '^^ (even with the generous bounty that would help with basic living expenses lol). feel free to re-use my work if it's easier that way, but i have to be realistic and not hold back the project by keeping on stalling this PR, so... yeah, i'm sorry T_T
if no one can take this, i might come back to it when my brain decides to suddenly become interested in it again, but since i can't really provide a timeline for that, i figured it was better to tell you
Closes #78
Current usage:
Example on a small site: (literally a friend's portfolio site because it's what i pulled off the top of my head haha)
The idea of the implementation is the following:

- the list of URIs discovered in a response (called `subsequent_uris` in most of the code) is fed back into the requests channel
- an `Arc<AtomicUsize>` is used to keep track of the number of remaining requests to check (i guess this is similar to the Semaphore approach?), starting at `len(URIs)`; a rough sketch of this counter pattern follows after this list

What's WIP:

- should we keep using a `Vec<Uri>` for the subsequent uris? this gets passed around a lot, it's probably inefficient
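A minimal sketch of the remaining-request counter mentioned above; the names and the example URI are placeholders, not lychee's actual code.

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

fn main() {
    // The counter starts at the number of initial URIs to check.
    let initial_uris = vec!["https://example.com/".to_string()];
    let remaining = Arc::new(AtomicUsize::new(initial_uris.len()));

    // When a processed response yields new URIs (recursion), add them *before*
    // decrementing for the response that produced them; otherwise the counter
    // could hit zero while work is still queued.
    let subsequent_uris: Vec<String> = Vec::new(); // links extracted from the response
    remaining.fetch_add(subsequent_uris.len(), Ordering::SeqCst);

    // Once a response has been fully processed, decrement the counter.
    let left = remaining.fetch_sub(1, Ordering::SeqCst) - 1;
    if left == 0 {
        // Nothing queued and nothing in flight: close the requests channel and finish the run.
        println!("checked all {} initial URIs and everything discovered", initial_uris.len());
    }
}
```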