Commit 747e845

feat: skip fragment checking for unsupported MIME types
The remote URL/website checker currently passes all URLs with fragments to the fragment checker as HTML documents, even if the response has a different or unsupported MIME type. This can cause false fragment check results for Markdown documents, failures for other MIME types (especially binaries), and unnecessary traffic for large downloads, which are always downloaded completely if the fragment checker is invoked.

This commit checks the Content-Type header of the response:
- Only if it is `text/html`, the response is passed to the fragment checker as HTML type.
- Only if it is `text/markdown`, or `text/plain` and the URL path ends with `.md`, it is passed to the fragment checker as Markdown type.
- In all other cases, the fragment checker is skipped and the HTTP status is returned.

To invoke the fragment checker with a variable document type, a new `FileType` argument is added to the `check_html_fragment()` function.

The fragment checker test and fixture are adjusted to match the expected result: checking a binary file via remote URL with fragment is now expected to succeed, since its Content-Type header does not invoke the fragment checker anymore.

Signed-off-by: MichaIng <micha@dietpi.com>
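The three rules in the commit message can be sketched as a single pure function. This is only an illustration under stated assumptions: `DocType` and `fragment_doc_type` are hypothetical names that do not exist in lychee, and `DocType` merely stands in for the two `FileType` variants the commit uses.

```rust
/// Hypothetical stand-in for the two `FileType` variants used by the commit.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum DocType {
    Html,
    Markdown,
}

/// Decide whether, and as which document type, a response body should be
/// handed to the fragment checker. `None` means: skip fragment checking
/// and return the plain HTTP status.
/// (Hypothetical helper; it mirrors the prefix checks described above.)
fn fragment_doc_type(content_type: Option<&str>, url_path: &str) -> Option<DocType> {
    // A missing Content-Type header skips fragment checking.
    let ct = content_type?;
    if ct.starts_with("text/html") {
        Some(DocType::Html)
    } else if ct.starts_with("text/markdown")
        || (ct.starts_with("text/plain") && url_path.ends_with(".md"))
    {
        Some(DocType::Markdown)
    } else {
        None
    }
}
```

With this sketch, a binary response such as `fragment_doc_type(Some("application/octet-stream"), "/zero.bin")` yields `None`, which corresponds to the remote binary fixture link now succeeding on its HTTP status alone.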
1 parent 140f701 commit 747e845

3 files changed

Lines changed: 36 additions & 16 deletions

fixtures/fragments/file1.md

Lines changed: 3 additions & 10 deletions
@@ -83,17 +83,10 @@ Even with fragment checking enabled, the following links must hence succeed:
 [Link to remote binary file without fragment](https://raw.githubusercontent.com/lycheeverse/lychee/master/fixtures/fragments/zero.bin)
 [Link to remote binary file with empty fragment](https://raw.githubusercontent.com/lycheeverse/lychee/master/fixtures/fragments/zero.bin#)
 
-## Local file with fragment
+## With fragment
 
-For local files URIs with fragment, the fragment checker is invoked and fails to read the content,
-but the file checker emits a warning only. The following link hence must succeed as well:
+Fragment checking is skipped if the Content-Type header is not "text/html", "text/markdown", or "text/plain" with ".md" URL path ending.
+Even though the URL contains a fragment, the following checks must hence succeed:
 
 [Link to local binary file with fragment](zero.bin#fragment)
-
-## Remote URL with fragment
-
-Right now, there is not MIME/content type based exclusion for fragment checks in the website checker.
-Also, other than the file checker, the website checker throws an error if reading the response body fails.
-The following link hence must fail:
-
 [Link to remote binary file with fragment](https://raw.githubusercontent.com/lycheeverse/lychee/master/fixtures/fragments/zero.bin#fragment)

lychee-bin/tests/cli.rs

Lines changed: 2 additions & 2 deletions
@@ -1889,9 +1889,9 @@ mod cli {
                 "https://raw.githubusercontent.com/lycheeverse/lychee/master/fixtures/fragments/zero.bin#fragment",
             ))
             .stdout(contains("34 Total"))
-            .stdout(contains("28 OK"))
+            .stdout(contains("29 OK"))
             // Failures because of missing fragments or failed binary body scan
-            .stdout(contains("6 Errors"));
+            .stdout(contains("5 Errors"));
     }
 
     #[test]

lychee-lib/src/checker/website.rs

Lines changed: 31 additions & 4 deletions
@@ -1,6 +1,7 @@
 use crate::{
     BasicAuthCredentials, ErrorKind, Status, Uri,
     chain::{Chain, ChainResult, ClientRequestChains, Handler, RequestChain},
+    FileType,
     quirks::Quirks,
     retry::RetryExt,
     types::uri::github::GithubUri,
@@ -9,7 +10,7 @@ use crate::{
 use async_trait::async_trait;
 use http::{Method, StatusCode};
 use octocrab::Octocrab;
-use reqwest::{Request, Response};
+use reqwest::{header::CONTENT_TYPE, Request, Response};
 use std::{collections::HashSet, time::Duration};
 
 #[derive(Debug, Clone)]
@@ -108,7 +109,28 @@ impl WebsiteChecker {
                     && method == Method::GET
                     && response.url().fragment().is_some_and(|x| !x.is_empty())
                 {
-                    self.check_html_fragment(status, response).await
+                    if response
+                        .headers()
+                        .get(CONTENT_TYPE)
+                        .is_some_and(|x| x.to_str().is_ok_and(|s| s.starts_with("text/html")))
+                    {
+                        self.check_html_fragment(status, response, FileType::Html)
+                            .await
+                    } else if response
+                        .headers()
+                        .get(CONTENT_TYPE)
+                        .is_some_and(|x| x.to_str().is_ok_and(|s| s.starts_with("text/markdown")))
+                        || (response
+                            .headers()
+                            .get(CONTENT_TYPE)
+                            .is_some_and(|x| x.to_str().is_ok_and(|s| s.starts_with("text/plain")))
+                            && response.url().path().ends_with(".md"))
+                    {
+                        self.check_html_fragment(status, response, FileType::Markdown)
+                            .await
+                    } else {
+                        status
+                    }
                 } else {
                     status
                 }
@@ -117,7 +139,12 @@ impl WebsiteChecker {
         }
     }
 
-    async fn check_html_fragment(&self, status: Status, response: Response) -> Status {
+    async fn check_html_fragment(
+        &self,
+        status: Status,
+        response: Response,
+        file_type: FileType,
+    ) -> Status {
         let url = response.url().clone();
         match response.text().await {
             Ok(text) => {
@@ -126,7 +153,7 @@ impl WebsiteChecker {
                     .check(
                         FragmentInput {
                             content: text,
-                            file_type: crate::FileType::Html,
+                            file_type,
                         },
                         &url,
                     )
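
The Content-Type checks in the `website.rs` change compare the header value with `starts_with`, which tolerates parameters such as `; charset=utf-8` but also accepts longer type names sharing the same prefix, and is case-sensitive although media types are case-insensitive. A possible refinement, not part of this commit, is to compare only the media type itself. The sketch below uses only the standard library; `media_type_essence` and `is_html` are hypothetical helper names.

```rust
/// Reduce a Content-Type header value to its bare media type:
/// drop any parameters after the first ';', trim surrounding whitespace,
/// and lowercase the result.
/// (Hypothetical helper, not part of lychee.)
fn media_type_essence(content_type: &str) -> String {
    content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase()
}

/// Exact comparison against `text/html` instead of a prefix match.
fn is_html(content_type: &str) -> bool {
    media_type_essence(content_type) == "text/html"
}
```

Under this sketch, `is_html("Text/HTML; charset=UTF-8")` is true, while a made-up value like `text/html-sandboxed` is rejected even though it would pass a `starts_with("text/html")` check.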
