Description
Describe the Bug
When calling the Firecrawl API's map endpoint, extra urls are sometimes included in the results. This seems to happen after calling the map endpoint for a child url that does not exist and then calling it again for the parent url.
To Reproduce
Steps to reproduce the issue:
- Call the /map endpoint for a site such as
curl --location 'https://api.firecrawl.dev/v1/map' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ***' \
--data '{
"url": "https://faculty.cs.byu.edu/~rodham/cs240"
}'
The results do not include the url https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid
- Call the /map endpoint with a child url that does not exist and resolves to a 404
curl --location 'https://api.firecrawl.dev/v1/map' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ***' \
--data '{
"url": "https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid"
}'
This returns a map result containing only the provided url, since it is not a valid path and leads to a forbidden page.
- Call the /map endpoint for the parent url again.
curl --location 'https://api.firecrawl.dev/v1/map' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ***' \
--data '{
"url": "https://faculty.cs.byu.edu/~rodham/cs240"
}'
Now the results do include the url https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid
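The three requests above can be scripted so the first and third responses are compared automatically. This is only a rough sketch: it assumes the /v1/map response is JSON with a top-level "links" array of url strings, and the helper names (`map_links`, `newly_added`, `reproduce`) are made up for illustration rather than taken from any Firecrawl SDK.

```python
import json
import urllib.request

API_URL = "https://api.firecrawl.dev/v1/map"
PARENT = "https://faculty.cs.byu.edu/~rodham/cs240"
INVALID_CHILD = PARENT + "/this-path-is-invalid"


def map_links(url, api_key):
    """POST to /v1/map and return the links it reports.

    Assumption: the response body is JSON with a top-level "links" list.
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps({"url": url}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body.get("links", [])


def newly_added(before, after):
    """Return links in `after` that are not in `before`, ignoring trailing slashes."""
    seen = {link.rstrip("/") for link in before}
    return sorted({link.rstrip("/") for link in after} - seen)


def reproduce(api_key):
    """Run the three map calls from the steps above and diff the 1st and 3rd results."""
    first = map_links(PARENT, api_key)      # step 1: parent url
    map_links(INVALID_CHILD, api_key)       # step 2: non-existent child url
    third = map_links(PARENT, api_key)      # step 3: parent url again
    return newly_added(first, third)
```

If the behavior described here is present, `reproduce(api_key)` would be expected to return a list that includes the invalid child url. Other links may also differ between calls, so the diff is a starting point for inspection rather than a strict pass/fail check.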
Expected Behavior
Non-existent paths like this one should not be included in the map results.
Screenshots
N/A
Environment (please complete the following information):
Using the Firecrawl API here, so no specific environment info.
Logs
N/A
Additional Context
N/A
Activity
shosseini811 commented on Apr 19, 2025
Hey @sawyerclemmons
The issue is that the page at base_url = “https://faculty.cs.byu.edu/~rodham/cs240/” is designed differently on the backend.
For example, new_url = “https://faculty.cs.byu.edu/~rodham/cs240/schedule.html” appears to have many links, but when you click on each lecture, you’ll see that they actually belong to base_url.
That’s why Firecrawl can’t find any map for new_url — the content isn’t actually available at that location.
sawyerclemmons commented on Apr 21, 2025
Yes, that makes sense. I am still confused about why the map result seems to change after multiple requests, though. It seems like it may be a caching issue. In my example from the bug description, I would not expect https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid to ever show up in the map results for https://faculty.cs.byu.edu/~rodham/cs240. However, that path shows up in the 3rd request from my example, even though it is not included in the results of the 1st request. So it seems that after I make the 2nd map request, the non-existent page is cached and then included in the next map request made for the root url.