Description
Describe the Bug
When calling the map endpoint of the Firecrawl API, the results sometimes include extra urls that do not exist on the site. This seems to happen when the map endpoint is first called for a child url that does not exist and then called again later for the parent url.
To Reproduce
Steps to reproduce the issue:
- Call the /map endpoint for a site, such as:
curl --location 'https://api.firecrawl.dev/v1/map' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ***' \
--data '{
"url": "https://faculty.cs.byu.edu/~rodham/cs240"
}'
The results do not include the url https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid
- Call the /map endpoint with a child url that does not exist and resolves to a 404:
curl --location 'https://api.firecrawl.dev/v1/map' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ***' \
--data '{
"url": "https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid"
}'
This returns a map result containing only the provided url, since it is not a valid path and leads to a forbidden page.
- Call the /map endpoint for the parent url again:
curl --location 'https://api.firecrawl.dev/v1/map' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ***' \
--data '{
"url": "https://faculty.cs.byu.edu/~rodham/cs240"
}'
Now the results do include the url https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid
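The three steps above can be sketched as a single script. This is only a rough sketch: the FIRECRAWL_API_KEY variable and the helper names map_url, contains_url and reproduce are made up for illustration, and it assumes the response body contains each url as a plain quoted JSON string.

```shell
#!/usr/bin/env bash
# Sketch of the three reproduction steps as one script.
# Assumptions (not from the report): FIRECRAWL_API_KEY holds the API key,
# and the response body lists each url as a plain quoted JSON string,
# so a fixed-string grep is enough to detect one.

API='https://api.firecrawl.dev/v1/map'
PARENT='https://faculty.cs.byu.edu/~rodham/cs240'
CHILD="$PARENT/this-path-is-invalid"

# Hypothetical helper: POST url $1 to /map and print the raw response body.
map_url() {
  curl --silent --location "$API" \
    --header 'Content-Type: application/json' \
    --header "Authorization: Bearer $FIRECRAWL_API_KEY" \
    --data "{\"url\": \"$1\"}"
}

# Hypothetical helper: succeed if response body $1 contains url $2
# as a complete quoted string (so a longer url with the same prefix does not match).
contains_url() {
  printf '%s' "$1" | grep -qF -- "\"$2\""
}

# Run the three steps and report whether the invalid child url
# shows up in the parent's map only after it was mapped directly.
reproduce() {
  local before after
  before=$(map_url "$PARENT")      # step 1: map the parent
  map_url "$CHILD" > /dev/null     # step 2: map the non-existent child
  after=$(map_url "$PARENT")       # step 3: map the parent again
  if ! contains_url "$before" "$CHILD" && contains_url "$after" "$CHILD"; then
    echo 'reproduced: invalid child url appeared in the parent map'
  else
    echo 'not reproduced'
  fi
}
```

Sourcing the file and then running reproduce makes three real API calls. If the invalid path has already been mapped once, the first call may already include it, so a fresh made-up path is needed for each run.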
Expected Behavior
Non-existent paths should not be included in the map results.
Screenshots
N/A
Environment (please complete the following information):
Using the Firecrawl API here, so no specific environment info.
Logs
N/A
Additional Context
N/A