-
-
Notifications
You must be signed in to change notification settings - Fork 159
Add support for Text fragment feature (#1545) #1600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
I missed to run the clippy across the test modules - the related lint failure issues are now fixed and ready for review! |
lychee-lib/src/utils/url.rs
Outdated
); | ||
match url { | ||
Ok(url) => { | ||
eprintln!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be an assertion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified the tests to have asserts (instead of the manual checks that was earlier)
lychee-lib/src/checker/website.rs
Outdated
let mut status = Status::new(&response, self.accepted.clone()); | ||
if self.validate_text_fragments && has_fragment_directive { | ||
if let Ok(res) = response.text().await { | ||
info!("checking fragment directive..."); | ||
if let Some(fd) = req_url.fragment_directive() { | ||
info!("directive: {:?}", fd.text_directives); | ||
match fd.check(&res) { | ||
Ok(stat) => { | ||
status = stat; | ||
} | ||
Err(e) => { | ||
return e.into(); | ||
} | ||
} | ||
} | ||
} | ||
} | ||
status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can move that into a function/method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some tests for that part would also be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code is modified to move this into a separate function - as well, bulk of the logic is abstracted at the textfrag crate itself so that the websitechecker will deal with only error responses from the text fragment checker function.
lychee-lib/src/client.rs
Outdated
assert!(res.status().is_success()); | ||
|
||
// start with suffix | ||
println!("\ntesting start with suffix..."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you'll probably remove the println!
s right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - the println's are removed now
|
||
use crate::types::TextDirective; | ||
|
||
const BLOCK_ELEMENTS: &[&str] = &[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a long list. Does that mean we'd have to maintain the HTML keywords here? Maybe we can avoid that as it would be an uphill battle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am yet to explore alternative approach to this and am open for suggestion - for now, this list is retained (reserving this change for a near future commit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that this functionality is isolated in its own module. But it's a looot of code. 😅 Not sure what to do here, but at least the ratio of code/tests could be improved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could move it into a separate crate or use an upstream crate for that? I think it would be a nice library to maintain individually as more applications could profit from it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
text fragment is now moved into its own crate, textfrag, in the lychee main tree itself - as I am new to the ecosystem, I'll need help in moving this into a separate crate - please suggest!
let mut all_directives_found = false; | ||
let directive = td.directive.borrow(); | ||
|
||
'directive_loop: while !all_directives_found { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The labels make it quite hard to read. Have you considered any alternatives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - the code logic was cumbersome and I've rewritten this now - please share your comments!
pub(crate) const FRAGMENT_DIRECTIVE_DELIMITER: &str = ":~:"; | ||
pub(crate) const TEXT_DIRECTIVE_DELIMITER: &str = "text="; | ||
|
||
pub(crate) const TEXT_DIRECTIVE_REGEX: &str = r"(?s)^text=(?:\s*(?P<prefix>[^,&-]*)-\s*[,$]?\s*)?(?:\s*(?P<start>[^-&,]*)\s*)(?:\s*,\s*(?P<end>[^,&-]*)\s*)?(?:\s*,\s*-(?P<suffix>[^,&-]*)\s*)?$"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does that regex come from? Did you write it yourself? If there's an "official" regex for those text fragments, we could perhaps add a link to the reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this regex was built by myself - I started searching for any equivalent but couldn't land on one when I started this implementation - I am open to revisit this.
In fact, i want to replace regex with a simple parser for this requirement - the specification is not particular about the order of the directives, whereas the regex assumes the order imperatively - with a parser, we might be able to get away from the ordering constraints.
lychee-lib/src/utils/url.rs
Outdated
@@ -23,10 +26,61 @@ pub(crate) fn find_links(input: &str) -> impl Iterator<Item = linkify::Link> { | |||
LINK_FINDER.links(input) | |||
} | |||
|
|||
/// Fragment Directive feature trait | |||
/// we will use the extension trait pattern to extend the Url to support Text Fragment feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you - this has now moved into textfrag crate
lychee-lib/src/utils/url.rs
Outdated
@@ -23,10 +26,61 @@ pub(crate) fn find_links(input: &str) -> impl Iterator<Item = linkify::Link> { | |||
LINK_FINDER.links(input) | |||
} | |||
|
|||
/// Fragment Directive feature trait | |||
/// we will use the extension trait pattern to extend the Url to support Text Fragment feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// we will use the extension trait pattern to extend the Url to support Text Fragment feature | |
/// We will use the extension trait pattern to extend [`url::Url`] to support the text fragment feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorporated this comment in to the textfrag::utils::url file
lychee-lib/src/utils/url.rs
Outdated
/// Fragment Directive feature trait | ||
/// we will use the extension trait pattern to extend the Url to support Text Fragment feature | ||
pub(crate) trait UrlExt { | ||
/// Checks if the url has a fragment and if the fragment is has the fragment directive delimiter embedded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Checks if the url has a fragment and if the fragment is has the fragment directive delimiter embedded | |
/// Checks if the url has a fragment and if the fragment has the fragment directive delimiter embedded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comments section (with slight rewording) in the textfrag crate
lychee-lib/src/utils/url.rs
Outdated
} | ||
|
||
impl UrlExt for Url { | ||
/// Returns whether the URL has fragment directive or not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Returns whether the URL has fragment directive or not | |
/// Checks whether the URL has fragment directive or not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have addressed this is in the textfrag::utils::url[:16]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments. I like the overall structure. Good work so far!
@mre |
@thiru, any updates? Let me know in case you need any help. 😃 |
Thanks, sounds good! |
@mre my apologies for the delay - while refactoring, into a separate crate, I encountered few corner case issues and it took more time than planned - running the tests now and hoping to re-raise the pull-request by tomorrow! |
I am committing the changes into my fork and will initiate a pull-request shortly! |
lib's website checker continues to have the logic to validate text fragments clean-up of the tests were done review feedback incorporated - addressed structural, logic, tests and document comments
@mre request your help in addressing the CI / publish-check failure - please recommend if I've to make any changes on my end - thank you! |
lychee-lib/src/checker/website.rs
Outdated
@@ -85,10 +94,51 @@ impl WebsiteChecker { | |||
status | |||
} | |||
|
|||
fn check_text_fragments(site_data: &str, url: &Url, mut status: Status) -> Status { | |||
let res = check_text_fragments(site_data, url); | |||
if res.is_err() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the if condition here. You're matching on res.err()
below, and in the case where res.err() is not set, it will be None
. This case is already covered in your _res
match arm since _res
is just a placeholder, which also includes the value being None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I've updated the code
/// | ||
/// # Errors | ||
/// - `TextDirectiveNotFound`, if text directive match fails | ||
// fn check_fragment_directive(&self, buf: &str) -> Result<TextFragmentStatus, TextFragmentError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// fn check_fragment_directive(&self, buf: &str) -> Result<TextFragmentStatus, TextFragmentError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done!
buf: &str, | ||
) -> Result<FragmentDirectiveStatus, FragmentDirectiveError> { | ||
let mut map = HashMap::new(); | ||
let fd_checker = FragmentDirectiveTokenizer::new(self.text_directives()); // self.text_directives().clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let fd_checker = FragmentDirectiveTokenizer::new(self.text_directives()); // self.text_directives().clone()); | |
let fd_checker = FragmentDirectiveTokenizer::new(self.text_directives()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done!
for td in &tok.sink.get_text_directives() { | ||
let directive = td.raw_directive().to_string(); | ||
log::debug!("text directive: {:?}", directive); | ||
println!("text directive: {:?}", directive); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
println!("text directive: {:?}", directive); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed unnecessary println!'s
|
||
let _status = status.to_string(); | ||
log::debug!("search status: {:?}", status); | ||
println!("search status: {:?}", status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
println!("search status: {:?}", status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed unnecessary println!'s
|
||
let res_str = td.get_result_str(); | ||
log::debug!("search result: {:?}", res_str); | ||
println!("search result: {:?}", res_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
println!("search result: {:?}", res_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed unnecessary println!'s
// assert_eq!(results.len(), 1); | ||
// assert_eq!(results[FRAGMENT], Ok(TextDirectiveStatus::Completed)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// assert_eq!(results.len(), 1); | |
// assert_eq!(results[FRAGMENT], Ok(TextDirectiveStatus::Completed)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done!
// assert_eq!(res.len(), 1); | ||
// assert_eq!(res[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// assert_eq!(res.len(), 1); | |
// assert_eq!(res[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done!
// assert_eq!(results.len(), 1); | ||
// assert_eq!(results[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// assert_eq!(results.len(), 1); | |
// assert_eq!(results[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done!
// assert_eq!(results.len(), 1); | ||
// assert_eq!(results[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// assert_eq!(results.len(), 1); | |
// assert_eq!(results[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done!
// assert_eq!(results.len(), 1); | ||
// assert_eq!(results[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// assert_eq!(results.len(), 1); | |
// assert_eq!(results[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done!
// assert!(results.len() == 1); | ||
// assert_eq!(results[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// assert!(results.len() == 1); | |
// assert_eq!(results[FRAGMENT], TextDirectiveStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done!
|
||
/// [Internal] the use of regular expression does not comply with the specification | ||
/// To be used for testing purposes only | ||
fn _check(&self, input: &str) -> Result<FragmentDirectiveStatus, TextFragmentError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using underscores for internal methods is rather uncommon. It's clear enought that there's no pub
in front.
fn _check(&self, input: &str) -> Result<FragmentDirectiveStatus, TextFragmentError> { | |
fn check(&self, input: &str) -> Result<FragmentDirectiveStatus, TextFragmentError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the function name and also to suppress the warning, i've included the #[allow(dead_code)] directive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed the function per feedback
textfrag/src/types/mod.rs
Outdated
mod error; | ||
mod status; | ||
mod url; | ||
// mod frag_directive; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// mod frag_directive; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done!
textfrag/src/types/mod.rs
Outdated
pub use error::*; | ||
pub use status::*; | ||
pub use url::*; | ||
// pub use frag_directive::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// pub use frag_directive::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean-up done
textfrag/src/types/status.rs
Outdated
// FragmentDirectiveStatus::PartialOk(m) => write!(f, "Partial Ok {:?}", m), | ||
// FragmentDirectiveStatus::Error(e) => write!(f, "Error: {:?}", e), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take care of now - I've made sure that all the unwanted code is now cleaned entirely
textfrag/src/types/status.rs
Outdated
@@ -0,0 +1,78 @@ | |||
/// Defines the status of the Text Fragment search and extraction/search operation status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Defines the status of the Text Fragment search and extraction/search operation status | |
//! Defines the status of the Text Fragment search and extraction/search operation status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored code takes are of this now!
textfrag/src/types/url.rs
Outdated
|
||
use crate::types::{FragmentDirective, FRAGMENT_DIRECTIVE_DELIMITER}; | ||
|
||
/// Fragment Directive feature trait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Fragment Directive feature trait | |
/// Fragment Directive extension trait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the comment
textfrag/src/utils/mod.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we don't need that file and can move the code closer to where it's used? (Assuming it's only used in one place.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the folder as part of the refactoring exercise
textfrag/src/extract/fragdirtok.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This module is so large, I'd assume a relatively large doc-block as well.
In there, I'd answer a few questions:
- What does the module do?
- How is it supposed to be used (including an example)?
- What were the design tradeoffs?
- What are the possible error conditions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
working on it - will commit this documentation for review!
textfrag/src/extract/fragdirtok.rs
Outdated
|
||
use crate::types::{TextDirective, TextDirectiveKind, TextDirectiveStatus}; | ||
|
||
const BLOCK_ELEMENTS: &[&str] = &[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not too happy that we have to maintain this list.
Looked around for alternatives, and I found https://github.com/servo/html5ever/blob/main/html5ever/src/tree_builder/tag_sets.rs.
Not sure if it can be used, but I wanted to mention it.
If it's too much hassle to integrate, we can keep the current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately, the html5ever does not expose these macros for us to consume and so couldn't use it - i don't want to keep a copy of this file inside lychee repository and so ruling it out. And i did search around to check on alternative approaches and found either we need to have a headless browser to manage/identify the element type (as block) or stay with this current approach.
My recommendation, for now, is to keep this code - in future, if the html5ever makes these macros as public, we can potentially be freed up of maintaining this list.
I've added a few more comments. As for the error, it's currently failing because there is no
I'd vote for option 3, which is the easiest right now and allows us to keep maintaining the code inside lychee. It's still reasonably well encapsuled in its own module and we can always make it a separate crate later once the code is mature enough. The downside is longer compile-times because it would be in the same codegen unit as the rest of the library code. Detailed instructions for option 3To merge the
This approach has several benefits:
Let me know what you think. |
@mre I understand and agree with your recommendation - I am addressing this and, as well, the other review comments and republish for review - thanks for your patience! |
@thiru-appitap, I saw that you did some work. Can you close the resolved conversations already? It becomes a bit hard to keep track of the open TODOs. 😉 |
- moved (back) the textfrag as a module into the lychee-lib crate - added documents and ran doctests to verify its working - added a cli test to validate the text fragment functionality
@mre was juggling between couple of priorities, along with local travels and so couldn't resolve & commit the changes earlier itself - now on I should be able to get back with much faster turnaround on the feedback |
This branch kindly asks for a rebase. Running Would be nice to see this arrive. |
I have rebased the branch to the latest but am facing lint failures - working on to fix it and recommit by this weekend. |
Text Fragment feature implementation pull request. The feature follows the published URL Fragment Text Directives specification (https://wicg.github.io/scroll-to-text-fragment/).
If the fragment directive is not found, a
TextDirectiveNotFound
error will be returned.Below changes are completed:
fancy-regex
include-text-fragments
is added to support the featureUrlExt
trait is implemented to enhance Url's to support Fragment Directive#:~:text=linked%20URL,-'s%20format&text=Deprecated-,attributes,attribute
)