More corrupted values in technology detection data #29

Open
@max-ostapenko

Description

Based on the corrupted data, here is a query that lists the technologies with corrupted categories, along with page counts and sample pages:

-- Known categories from the Wappalyzer apps table
WITH wappalyzer AS (
  SELECT
    category
  FROM wappalyzer.apps,
    UNNEST(categories) AS category
)

-- Detected categories with no match among the known ones
SELECT
  technology,
  category,
  COUNT(DISTINCT page) AS cnt_pages,
  ARRAY_AGG(DISTINCT page LIMIT 3) AS sample_pages
FROM crawl.pages,
  UNNEST(technologies) AS technology,
  UNNEST(technology.categories) AS category
LEFT JOIN wappalyzer
USING (category)
WHERE date = '2024-11-01'
  AND wappalyzer.category IS NULL
GROUP BY 1, 2
ORDER BY category ASC

The detection seems to work fine. It looks like page context is messing with some built-in objects again.
Maybe we could avoid using any values that could be impacted by it.

A few cases:

One observation: in most of these cases, only the values within detected_technologies hold correct data (the keys are corrupted as well).
Maybe we should switch to it for the BigQuery data?
For example:

technologies = [
    {
        "technology": technology["name"],
        "categories": [category["name"] for category in technology["categories"]],
        "info": [technology["version"]]
    }
    for technology in detected_technologies.values()
]
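To make the proposal concrete, here is a minimal runnable sketch of the same comprehension applied to sample data. The shape of `detected_technologies` (a dict whose values carry `name`, `categories`, and `version`), the sample entries, and the helper name `build_technologies` are assumptions for illustration only, not the actual crawler structures.

```python
def build_technologies(detected_technologies):
    """Build the rows from the dict's values only, ignoring its keys."""
    return [
        {
            "technology": technology["name"],
            "categories": [category["name"] for category in technology["categories"]],
            "info": [technology["version"]],
        }
        for technology in detected_technologies.values()
    ]


# Hypothetical input: the keys are deliberately garbage to mimic the
# corruption described above, while the values are intact.
detected_technologies = {
    "corrupted-key-1": {
        "name": "jQuery",
        "categories": [{"name": "JavaScript libraries"}],
        "version": "3.7.1",
    },
    "corrupted-key-2": {
        "name": "Nginx",
        "categories": [{"name": "Web servers"}, {"name": "Reverse proxies"}],
        "version": "",
    },
}

technologies = build_technologies(detected_technologies)
```

Because only `.values()` is read, the corrupted keys never reach the output in this sketch. Whether the real values are always intact would still need to be checked against the crawl data.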
