Content of some links cannot be crawled #65

Open
@simmonn

Description

Hi, I ran into a problem. After running the scraper, I found that the content of some links is not crawled: the logs show 0 records for them. I have tried several approaches, but these links still cannot be crawled.

Here is a snapshot of the logs:
image

Steps to reproduce

Here is part of my config:

{
  "index_name": "docs",
  "sitemap_urls": [
    "https://mydomain/sitemap.xml"
  ],
  "start_urls": [
    {
      "url": "https://mydomain/guides",
      "tags": [
        "guides"
      ],
      "selectors_key": "guides"
    }
  ],
  "stop_urls": [],
  "selectors": {
    "default": {
      "lvl0": {
        "selector": "",
        "global": true,
        "default_value": "文档"
      },
      "lvl1": "article h1",
      "lvl2": "article h2",
      "lvl3": "article h3",
      "lvl4": "article h4",
      "lvl5": "article h5, article th, article td:first-child",
      "lvl6": "article h6",
      "text": "article p, article li, article td"
    },
    "guides": {
      "lvl0": {
        "selector": "",
        "global": true,
        "default_value": "开发指南"
      },
      "lvl1": "article h1",
      "lvl2": "article h2",
      "lvl3": "article h3",
      "lvl4": "article h4",
      "lvl5": "article h5, article th, article td:first-child",
      "lvl6": "article h6",
      "text": "article p, article li, article td"
    }
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag",
      "tags"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "nb_hits": 2227
}
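Every selector in the config above is scoped to an `article` element, so one thing worth checking is whether the affected pages contain such an element in the HTML the scraper receives. The following is a minimal diagnostic sketch, not part of the scraper: `ArticleTagCounter` and `count_article_tags` are hypothetical names, the check is cruder than real CSS matching (it counts every `td` and ignores `:first-child`), and a missing `article` element is only one assumed cause of 0 records.

```python
from html.parser import HTMLParser


class ArticleTagCounter(HTMLParser):
    """Count heading and text tags that appear inside an <article> element."""

    # Tags referenced by the lvl1..lvl6 and text selectors in the config.
    WATCHED = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "td", "th"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # > 0 while the parser is inside <article>
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif self.article_depth > 0 and tag in self.WATCHED:
            self.counts[tag] = self.counts.get(tag, 0) + 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1


def count_article_tags(html):
    """Return {tag: count} for watched tags nested inside <article>."""
    parser = ArticleTagCounter()
    parser.feed(html)
    return parser.counts


if __name__ == "__main__":
    sample = "<article><h1>Title</h1><p>one</p><p>two</p></article>"
    print(count_article_tags(sample))
```

An empty result for the raw HTML of an affected page would suggest that the selectors have nothing to match there, for example because the content sits outside `article` or is rendered client-side.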

Expected Behavior

I expect the content of every link covered by the configuration to be crawled and indexed into Typesense.

Actual Behavior

The content of the affected links cannot be found through search.

image

Metadata

Typesense Version: possibly 0.24; I am not sure how to check the version

OS: x86_64 GNU/Linux
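On the version question: a Typesense server reports its version through the `GET /debug` endpoint, authenticated with the `X-TYPESENSE-API-KEY` header. A minimal sketch of such a lookup follows; the helper names are hypothetical, and the base URL (`http://localhost:8108`) and API key are placeholder assumptions to replace with real values.

```python
import json
import urllib.request


def build_debug_request(base_url, api_key):
    """Build a GET request for the Typesense /debug endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/debug",
        headers={"X-TYPESENSE-API-KEY": api_key},
    )


def parse_version(body):
    """Extract the "version" field from a /debug JSON response body."""
    return json.loads(body).get("version")


def fetch_version(base_url, api_key):
    """Query the server and return its version string (needs network access)."""
    request = build_debug_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_version(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder values; both are assumptions, not taken from this issue.
    print(fetch_version("http://localhost:8108", "YOUR_API_KEY"))
```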
