[Bug] Map results include extra links #1426

Open
@sawyerclemmons

Description

Describe the Bug
When calling the map endpoint of the Firecrawl API, extra URLs are sometimes included in the results. This seems to happen after calling the map endpoint for a child URL that does not exist and then calling it again for the parent URL.

To Reproduce
Steps to reproduce the issue:

  1. Call the /map endpoint for a site, for example:
curl --location 'https://api.firecrawl.dev/v1/map' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ***' \
--data '{
    "url": "https://faculty.cs.byu.edu/~rodham/cs240"
}'

The results do not include the URL https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid

  2. Call the /map endpoint with a child URL that does not exist and resolves to a 404:
curl --location 'https://api.firecrawl.dev/v1/map' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ***' \
--data '{
    "url": "https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid"
}'

This returns a map result containing only the provided URL, since it is not a valid path and leads to a forbidden page.

  3. Call the /map endpoint for the parent URL again:
curl --location 'https://api.firecrawl.dev/v1/map' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ***' \
--data '{
    "url": "https://faculty.cs.byu.edu/~rodham/cs240"
}'

Now the results do include the URL https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid
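
To make the difference between the first and third responses easy to see, the two link lists can be diffed. The sketch below is not part of the original report: `new_links` is a hypothetical helper name, and it assumes each response has already been saved as a plain text file with one URL per line.

```shell
# Hypothetical helper: print URLs that appear in the second file but not in
# the first. Each input file is expected to hold one URL per line.
new_links() {
  nl_before=$(mktemp)
  nl_after=$(mktemp)
  # comm needs sorted input; -u also drops duplicate URLs.
  sort -u "$1" > "$nl_before"
  sort -u "$2" > "$nl_after"
  # -1 hides lines only in the first file, -3 hides lines common to both,
  # so only lines unique to the second file are printed.
  comm -13 "$nl_before" "$nl_after"
  rm -f "$nl_before" "$nl_after"
}

# Possible usage. Assumptions: jq is installed, and the map response is JSON
# with a top-level "links" array.
#   curl ... (step 1) | jq -r '.links[]' > first.txt
#   curl ... (step 3) | jq -r '.links[]' > third.txt
#   new_links first.txt third.txt
```

If the behavior described above reproduces, the output would be expected to contain the this-path-is-invalid URL.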

Expected Behavior
Non-existent paths should not be included in the map results.
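
Until the map results behave as expected, one possible client-side workaround is to drop links that do not resolve. This is only a sketch under stated assumptions, not a Firecrawl feature: `is_live_status` and `filter_live_links` are hypothetical helper names, and the check relies on the target server answering HEAD requests with a meaningful status code, which some servers do not.

```shell
# Hypothetical helper: succeed only for 2xx and 3xx HTTP status codes.
is_live_status() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: read URLs from stdin (one per line) and print only
# those whose HEAD request returns a 2xx or 3xx status.
filter_live_links() {
  while IFS= read -r fl_url; do
    [ -n "$fl_url" ] || continue
    # On connection failure or timeout curl reports the status as 000.
    fl_status=$(curl --silent --output /dev/null --head \
      --write-out '%{http_code}' --max-time 10 "$fl_url")
    if is_live_status "$fl_status"; then
      printf '%s\n' "$fl_url"
    fi
  done
}

# Possible usage, with links.txt holding one URL per line:
#   filter_live_links < links.txt > live_links.txt
```

This costs one extra request per link, so it is better suited to small maps than to large sites.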

Screenshots
N/A

Environment (please complete the following information):
Using the Firecrawl API here, so no specific environment info.

Logs
N/A

Additional Context
N/A

Activity

shosseini811 commented on Apr 19, 2025

Hey @sawyerclemmons
The issue is that the page at base_url = "https://faculty.cs.byu.edu/~rodham/cs240/" is designed differently on the backend.
For example, new_url = "https://faculty.cs.byu.edu/~rodham/cs240/schedule.html" appears to have many links, but when you click each lecture, you'll see that the links actually belong to base_url.
That's why Firecrawl can't find any map for new_url: the content isn't actually available at that location.

sawyerclemmons commented on Apr 21, 2025

Author

Yes, that makes sense. I am still confused about why the map result changes after multiple requests, though. It looks like it may be a caching issue. In the example from the bug description, I would not expect https://faculty.cs.byu.edu/~rodham/cs240/this-path-is-invalid to ever show up in the map results for https://faculty.cs.byu.edu/~rodham/cs240. However, that path shows up in the 3rd request, even though it is not included in the results of the 1st request. So it seems that after I make the 2nd map request, the non-existent page is cached and then included in the next map request made for the root URL.

Labels

bug (Something isn't working)

      [Bug] Map results include extra links · Issue #1426 · mendableai/firecrawl