Based on the corrupted data, here is the list of pages with corrupted categories:
WITH wappalyzer AS (
  SELECT
    category
  FROM wappalyzer.apps,
    UNNEST(categories) AS category
)
SELECT
  technology,
  category,
  COUNT(DISTINCT page) AS cnt_pages,
  ARRAY_AGG(DISTINCT page LIMIT 3) AS sample_pages
FROM crawl.pages,
  UNNEST(technologies) AS technology,
  UNNEST(technology.categories) AS category
LEFT JOIN wappalyzer
USING (category)
WHERE date = '2024-11-01'
  AND wappalyzer.category IS NULL
GROUP BY 1, 2
ORDER BY category ASC
The detection seems to work fine. It looks like page context is messing with some built-in objects again.
Maybe we could avoid using any values that could be impacted by it.
A few cases:
- https://newcar.one2car.com/search (capitalised)
- https://ascf.amorepacific.co.kr/ (whitespace removed)
- https://advancement.shu.edu/get-involved/events-calendar.html (replaced with HTML)
- https://www.gmi.go.kr/ (lowercase with dashes)
- https://iot.lostnfound.com/en/functions/ (replaced with undefined)
- etc.
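To triage cases like these, a small helper could sort an observed category value into "ok", a case or spacing variant of a known name, or unrecognised (HTML, undefined, and so on). This is only a sketch: `KNOWN_CATEGORIES` holds sample values standing in for the real Wappalyzer category list, and `classify_category` is a hypothetical name, not part of the existing pipeline.

```python
import re

# Sample values only; the real list would come from the Wappalyzer category definitions.
KNOWN_CATEGORIES = ["JavaScript frameworks", "Page builders", "Analytics"]


def _normalise(name):
    """Lowercase the name and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Lookup from normalised form back to the canonical category name.
_BY_NORMALISED = {_normalise(name): name for name in KNOWN_CATEGORIES}


def classify_category(value):
    """Return (status, canonical_name) for an observed category value."""
    if value in KNOWN_CATEGORIES:
        return ("ok", value)
    if isinstance(value, str):
        canonical = _BY_NORMALISED.get(_normalise(value))
        if canonical is not None:
            # Capitalised, whitespace removed, lowercase with dashes, etc.
            return ("variant", canonical)
    # HTML fragments, "undefined", None and anything else that cannot be mapped back.
    return ("unrecognised", None)
```

Values that come back as "variant" could in principle be repaired after the fact, while "unrecognised" ones cannot, which is an argument for fixing the source of the data rather than cleaning it up afterwards.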
One observation: in most of these cases, only the values within detected_technologies hold correct data (the keys are impacted too).
Maybe we should switch to those values for the BigQuery data?
For example:
technologies = [
    {
        "technology": technology["name"],
        "categories": [category["name"] for category in technology["categories"]],
        "info": [technology["version"]]
    }
    for technology in detected_technologies.values()
]
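A runnable version of the same idea, for illustration. It assumes each value in detected_technologies has the shape implied by the snippet above (a "name", a list of "categories" with their own "name", and a "version"). The `build_technologies` name, the sample data, and the empty-string default for a missing version are assumptions made for this sketch.

```python
def build_technologies(detected_technologies):
    """Build technology rows from the values only, ignoring the (possibly corrupted) keys."""
    return [
        {
            "technology": technology["name"],
            "categories": [category["name"] for category in technology["categories"]],
            # Assumption: fall back to an empty string when no version is present.
            "info": [technology.get("version", "")],
        }
        for technology in detected_technologies.values()
    ]


# Hypothetical sample: both keys are corrupted, the values are intact.
sample = {
    "REACT": {
        "name": "React",
        "categories": [{"name": "JavaScript frameworks"}],
        "version": "18.2.0",
    },
    "undefined": {
        "name": "Elementor",
        "categories": [{"name": "Page builders"}],
        "version": "",
    },
}

rows = build_technologies(sample)
```

Because the rows are built purely from the values, the corrupted keys ("REACT", "undefined") never reach the output.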