[CAI-749] Local parser #2007
anemone008
wants to merge
38
commits into
main
from
CAI-749-parser-url-crawler
+1,525
−6
Changes from 2 commits
dbcf818 Add script to perform parsing. Store output locally. Sanitize urls fo… (anemone008)
4354cd5 Add changeset (anemone008)
a9540a3 Fix fetch. Set headless true (anemone008)
de11fe0 Replace timeout with AbortController + setTimeout (anemone008)
3daa8c9 Add filter on ariaExpanded to avoid collapse of already expanded content (anemone008)
a1ea8f0 Update apps/parser/src/modules/output.ts (anemone008)
4018873 Drop the hash from the visit key to avoid redundant navigation/scrapi… (anemone008)
3feea1b Merge branch 'CAI-749-parser-url-crawler' of https://github.com/pagop… (anemone008)
26236f0 Update README.md (anemone008)
9afad68 Add test on filename sanitization (anemone008)
d0b0cc6 Add short hash when filename is longer than 255 chars for URL filenam… (anemone008)
31c4bad Update apps/parser/tests/parser.error-handling.test.ts (anemone008)
fca22ae Rename VECTOR_INDEX_NAME to PARSER_VECTOR_INDEX_NAME (anemone008)
748684e Refactor deriveSubPath to use base URL for path derivation (anemone008)
b403294 Rename vector_index variable, remove alternative output dir, set sing… (anemone008)
ccdc78a Merge remote-tracking branch 'origin/main' into CAI-749-parser-url-cr… (anemone008)
eff7c31 Update README and config for improved environment variable handling a… (anemone008)
e53dfa7 Update apps/parser/README.md (anemone008)
f1ac135 Refactor URL handling and filename sanitization into helpers (anemone008)
1be8fc9 Move assertReachable function to module network.ts (anemone008)
d2a476a Rename buildDefaultOutputDirectory to generateOutputDirectoryPath (anemone008)
bb35a84 Merge remote-tracking branch 'origin/main' into CAI-749-parser-url-cr… (anemone008)
f94b3a6 Add error logging for failed anchor href parsing in parsePages function (anemone008)
58ec25e Add warning logs for iframe src parsing and anchor extraction failures (anemone008)
e291067 Improve error handling in UrlWithoutAnchors (anemone008)
9de4eb4 Move FILENAME_LENGTH_THRESHOLD constant at the start of the file (anemone008)
462c5e6 Update README for type-check and compile instructions (anemone008)
c8b1cab Refactor code for consistency and readability; update imports, format… (anemone008)
0c19cbb Add progress log in recursive parsePages function (anemone008)
286325d Enhance URL handling and configuration: (anemone008)
985363d Add TODO in isWithinScope (anemone008)
eb5e964 Rename parsePages to exploreAndParsePages. Expand sanitizeUrlAsFilena… (anemone008)
416ef96 Add warnings for missing or invalid URL replacements; rename function… (anemone008)
5d24926 Refactor parser module: clean main function, rename crawler to parser… (anemone008)
cfd4ffe Formatting (anemone008)
923aa16 Merge remote-tracking branch 'origin/main' into CAI-749-parser-url-cr… (anemone008)
755fb6b Save root page metadata as index.json (anemone008)
2a1b9a0 Refactor parser configuration: update maxDepth to allow null for unli… (anemone008)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| --- | ||
| "parser": major | ||
| --- | ||
|
|
||
| Add a parsing script to the parser app. Store parsed information locally. Sanitize URLs into a filesystem-compatible format for use as file names. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| # Default environment variables for the parser CLI | ||
| # Root URL to start parsing from | ||
| URL="https://example.com" | ||
|
|
||
| # Maximum recursion depth (integer) | ||
| DEPTH=2 | ||
|
|
||
| # Name of the vector index bucket/folder where parsed artifacts are stored | ||
| VECTOR_INDEX_NAME="vector-index-name" | ||
|
|
||
| # Optional absolute/relative directory override. Leave empty to use | ||
| # <VECTOR_INDEX_NAME>/parsing/<sanitized(baseUrl)> automatically. | ||
| OUTDIR="" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| ## Parser utilities | ||
|
|
||
| This package contains the following TypeScript CLI utility: | ||
|
|
||
| - `parser` — recursively visits a website, extracts structured page metadata, and saves each page under `<VECTOR_INDEX_NAME>/parsing/<sanitized(baseUrl)>/`. | ||
|
|
||
|
|
||
| ### Getting started | ||
|
|
||
| ```bash | ||
| npm install | ||
| npm run build | ||
| ``` | ||
|
|
||
| ### Parse a website | ||
|
|
||
| ```bash | ||
| URL=https://example.com DEPTH=2 npm run parse | ||
| ``` | ||
|
|
||
| Environment variables: | ||
|
|
||
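| - `URL` (optional): root page for the parse. When unset or blank, the parser falls back to a built-in default URL. | ||
| - `DEPTH` (optional, default `2`): maximum recursion depth. Non-numeric values fall back to the default; negative values are clamped to `0`. | ||
| - `VECTOR_INDEX_NAME` (optional): base directory name; parsed data is stored under `<VECTOR_INDEX_NAME>/parsing/<sanitized(baseUrl)>/`. When it is unset and `OUTDIR` is empty, output goes to `output/<sanitized(baseUrl)>/`. | ||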
| - `OUTDIR` (optional): fully override the destination directory. | ||
|
|
||
| `<sanitized(baseUrl)>` and `<sanitized(path)>` refer to filesystem-safe versions of the URL components (illegal characters replaced with `_`), ensuring predictable, human-readable filenames. | ||
|
|
||
| Each visited page is stored as `<VECTOR_INDEX_NAME>/parsing/<sanitized(baseUrl)>/<sanitized(path)>.json` (or under `OUTDIR` if specified) with normalized metadata, making it easy to diff between runs. | ||
|
|
||
|
|
||
| ### Tests | ||
|
|
||
| ```bash | ||
| npm test | ||
| ``` | ||
|
|
||
| `npm test` runs `npm run build` before invoking Jest (`jest -i`), so a compilation error fails the run before any test executes. | ||
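The `<sanitized(...)>` placeholders in the README can be illustrated with a minimal sketch. This is not the PR's `sanitizeFilename` helper: the function names, the set of characters treated as illegal, and the 8-character SHA-256 suffix are assumptions made for illustration. The commit list only states that a short hash is added when a filename is longer than 255 characters.

```typescript
import { createHash } from 'node:crypto';

// Assumed values for illustration; the PR's actual constants are not shown in this diff.
const MAX_FILENAME_LENGTH = 255;
const HASH_LENGTH = 8;
// Characters treated as unsafe in file names (an assumed set, not the PR's).
const ILLEGAL_FILENAME_CHARS = /[<>:"\/\\|?*\x00-\x1f]/g;

// Hypothetical helper: host + path of a URL with illegal characters replaced.
function sanitizeUrlAsName(rawUrl: string, replacement = '_'): string {
  const url = new URL(rawUrl);
  // Keep host and path only; scheme, query, and fragment are dropped.
  const hostAndPath = `${url.host}${url.pathname}`.replace(/\/+$/, '');
  return hostAndPath.replace(ILLEGAL_FILENAME_CHARS, replacement);
}

// Hypothetical helper: cap the name length and append a short digest of the
// full name so that two long names sharing a prefix do not collide.
function shortenWithHash(name: string): string {
  if (name.length <= MAX_FILENAME_LENGTH) return name;
  const digest = createHash('sha256').update(name).digest('hex').slice(0, HASH_LENGTH);
  const prefix = name.slice(0, MAX_FILENAME_LENGTH - HASH_LENGTH - 1);
  return `${prefix}_${digest}`;
}
```

Under these assumptions, `sanitizeUrlAsName('https://example.com/docs/getting-started?x=1#top')` yields `example.com_docs_getting-started`. The length check counts characters, which matches a byte limit only for ASCII names.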
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| import type { Config } from 'jest'; | ||
|
|
||
| const config: Config = { | ||
| rootDir: __dirname, | ||
| testRegex: 'tests/.*\\.test\\.ts$', | ||
| transform: { | ||
| '^.+\\.ts$': ['ts-jest', { tsconfig: 'tsconfig.json' }], | ||
| }, | ||
| testEnvironment: 'node', | ||
| clearMocks: true, | ||
| verbose: false, | ||
| }; | ||
|
|
||
| export default config; |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,10 +1,28 @@ | ||
| { | ||
| "name": "parser", | ||
| "version": "0.1.0", | ||
| "version": "1.0.0", | ||
| "private": true, | ||
| "scripts": {}, | ||
| "scripts": { | ||
| "clean": "shx rm -rf dist", | ||
| "compile": "tsc --project tsconfig.json", | ||
| "build": "npm run clean && tsc --project tsconfig.build.json", | ||
| "parse": "npm run build && node dist/parser.js", | ||
| "test": "npm run build && jest -i" | ||
| }, | ||
| "dependencies": { | ||
| "puppeteer": "^24.37.1" | ||
| "node-fetch": "^3.3.2", | ||
| "puppeteer": "^24.37.1", | ||
| "puppeteer-extra": "^3.3.6", | ||
| "puppeteer-extra-plugin-stealth": "^2.11.2", | ||
| "xml2js": "^0.6.2" | ||
| }, | ||
| "devDependencies": { | ||
| "@types/jest": "^29.5.1", | ||
| "@types/node": "18.16.*", | ||
| "@types/xml2js": "^0.4.11", | ||
| "jest": "^29.5.0", | ||
| "shx": "^0.3.4", | ||
| "ts-jest": "^29.1.1", | ||
| "typescript": "5.1.6" | ||
| } | ||
| } | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| import path from 'node:path'; | ||
| import { stripUrlDecorations } from '../utils/url'; | ||
| import { sanitizeFilename } from '../utils/sanitizeFilename'; | ||
|
|
||
| export type EnvConfig = { | ||
| readonly baseUrl: string; | ||
| readonly sanitizedBaseUrl: string; | ||
| readonly outputDirectory: string; | ||
| readonly maxDepth: number; | ||
| }; | ||
|
|
||
| const DEFAULT_BASE_URL = 'https://news.polymer-project.org/'; | ||
| const DEFAULT_DEPTH = 2; | ||
|
|
||
| export function resolveEnv(): EnvConfig { | ||
| const baseUrl = process.env.URL?.trim() || DEFAULT_BASE_URL; | ||
| const sanitizedBaseUrl = stripUrlDecorations(baseUrl); | ||
| const parsedDepth = Number.parseInt(process.env.DEPTH ?? `${DEFAULT_DEPTH}`, 10); | ||
| const maxDepth = Number.isNaN(parsedDepth) ? DEFAULT_DEPTH : Math.max(parsedDepth, 0); | ||
| const vectorIndexName = process.env.VECTOR_INDEX_NAME?.trim(); | ||
| const derivedOutput = buildDefaultOutputDirectory(vectorIndexName, sanitizedBaseUrl); | ||
| const outputDirectory = process.env.OUTDIR?.trim().length | ||
| ? process.env.OUTDIR | ||
| : derivedOutput; | ||
| return { baseUrl, sanitizedBaseUrl, outputDirectory, maxDepth }; | ||
| } | ||
|
|
||
| function buildDefaultOutputDirectory( | ||
| vectorIndexName: string | undefined, | ||
| sanitizedBaseUrl: string | ||
| ): string { | ||
| const safeBaseSegment = sanitizeFilename(sanitizedBaseUrl, { replacement: '_' }); | ||
| if (!vectorIndexName) { | ||
| return `output/${safeBaseSegment}`; | ||
| } | ||
| return path.join(vectorIndexName, 'parsing', safeBaseSegment); | ||
| } | ||
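The resolution rules in this file can be restated as two small pure functions, which makes the fallbacks easier to test in isolation. This is a sketch under assumptions: `resolveDepth` and `resolveOutputDirectory` are hypothetical names, and `path.join` is used for the `output/` fallback where the diff uses a template string.

```typescript
import * as path from 'node:path';

const DEFAULT_DEPTH = 2;

// Same rule as the diff: invalid input falls back to the default,
// negative values clamp to 0.
function resolveDepth(raw: string | undefined): number {
  const parsed = Number.parseInt(raw ?? `${DEFAULT_DEPTH}`, 10);
  return Number.isNaN(parsed) ? DEFAULT_DEPTH : Math.max(parsed, 0);
}

// Same precedence as the diff: OUTDIR first, then
// <vectorIndexName>/parsing/<segment>, then output/<segment>.
function resolveOutputDirectory(
  outDir: string | undefined,
  vectorIndexName: string | undefined,
  safeBaseSegment: string
): string {
  const explicit = outDir?.trim();
  if (explicit) return explicit;
  const indexName = vectorIndexName?.trim();
  return indexName
    ? path.join(indexName, 'parsing', safeBaseSegment)
    : path.join('output', safeBaseSegment);
}
```

Note that `Number.parseInt` accepts a numeric prefix, so `resolveDepth('3abc')` returns `3` rather than the default. A stricter parser would need an explicit format check.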
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,142 @@ | ||
| import { Browser } from 'puppeteer'; | ||
| import { ParseNode, ParseMetadata } from './types'; | ||
| import { normalizeUrl } from '../utils/url'; | ||
| import { expandInteractiveSections } from './domActions'; | ||
|
|
||
| export async function parsePages( | ||
| browser: Browser, | ||
| node: ParseNode, | ||
| depth: number, | ||
| maxDepth: number, | ||
| parsedPages: Map<string, ParseMetadata>, | ||
| parsePageFn: (browser: Browser, url: string) => Promise<ParseMetadata | null>, | ||
| baseOrigin: string, | ||
| baseScope: string, | ||
| baseHostToken: string | ||
| ): Promise<void> { | ||
| const visitKey = buildVisitKey(node.url); | ||
| if (parsedPages.has(visitKey) || depth > maxDepth) { | ||
| return; | ||
| } | ||
|
|
||
| const normalizedUrl = normalizeUrl(node.url); | ||
| if (!isWithinScope(normalizedUrl, baseScope, baseHostToken)) { | ||
| return; | ||
| } | ||
|
|
||
| const metadata = await parsePageFn(browser, node.url); | ||
| if (!metadata) return; | ||
|
|
||
| parsedPages.set(visitKey, metadata); | ||
| node.title = metadata.title; | ||
| node.bodyText = metadata.bodyText; | ||
| node.lang = metadata.lang; | ||
| node.keywords = metadata.keywords; | ||
| node.datePublished = metadata.datePublished; | ||
| node.lastModified = metadata.lastModified; | ||
|
|
||
| let page; | ||
| let anchors: string[] = []; | ||
| try { | ||
| page = await browser.newPage(); | ||
| await page.goto(node.url, { waitUntil: 'networkidle2', timeout: 45000 }); | ||
| await expandInteractiveSections(page); | ||
| anchors = await page.evaluate((allowedToken: string) => { | ||
| const anchors = Array.from(document.querySelectorAll('a[href]')); | ||
| const iframeSources = Array.from(document.querySelectorAll('iframe[src]')); | ||
| const unique = new Set<string>(); | ||
| for (const anchor of anchors) { | ||
| const href = (anchor as HTMLAnchorElement).href; | ||
| if (!href || !href.startsWith('http')) continue; | ||
| try { | ||
| const target = new URL(href, window.location.href); | ||
| const normalizedHref = target.href.toLowerCase(); | ||
| if (allowedToken && !normalizedHref.includes(allowedToken)) continue; | ||
| if (target.href === window.location.href) continue; | ||
| unique.add(target.href); | ||
| } catch (_) {} | ||
| } | ||
|
|
||
| for (const frame of iframeSources) { | ||
| const src = (frame as HTMLIFrameElement).src; | ||
| if (!src || !src.startsWith('http')) { | ||
| continue; | ||
| } | ||
| try { | ||
| const target = new URL(src, window.location.href); | ||
| const normalizedSrc = target.href.toLowerCase(); | ||
| if (allowedToken && !normalizedSrc.includes(allowedToken)) continue; | ||
| unique.add(target.href); | ||
| } catch (_) {} | ||
| } | ||
| return Array.from(unique); | ||
| }, baseHostToken) as string[]; | ||
| } catch (error) { | ||
| // Ignore anchor extraction errors | ||
| } finally { | ||
| if (page) await page.close(); | ||
| } | ||
|
|
||
|
|
||
| const scheduled = new Set<string>(); | ||
| const nextChildren: ParseNode[] = []; | ||
| for (const href of anchors) { | ||
| const normalized = normalizeUrl(href); | ||
| const visitCandidate = buildVisitKey(href); | ||
| if (parsedPages.has(visitCandidate) || scheduled.has(visitCandidate)) continue; | ||
| const lowerNormalized = normalized.toLowerCase(); | ||
| if (baseHostToken && !lowerNormalized.includes(baseHostToken)) { | ||
| continue; | ||
| } | ||
| if (!isWithinScope(normalized, baseScope, baseHostToken)) { | ||
| continue; | ||
| } | ||
| scheduled.add(visitCandidate); | ||
| nextChildren.push({ url: href }); | ||
| } | ||
| node.children = nextChildren; | ||
|
|
||
| if (!node.children || depth >= maxDepth) return; | ||
| for (const child of node.children) { | ||
| await parsePages( | ||
| browser, | ||
| child, | ||
| depth + 1, | ||
| maxDepth, | ||
| parsedPages, | ||
| parsePageFn, | ||
| baseOrigin, | ||
| baseScope, | ||
| baseHostToken | ||
| ); | ||
| } | ||
| } | ||
|
|
||
| function isWithinScope(url: string, scope: string, hostToken: string): boolean { | ||
| if (hostToken && url.toLowerCase().includes(hostToken)) { | ||
| return true; | ||
| } | ||
| if (!scope) { | ||
| return true; | ||
| } | ||
| const lowerUrl = url.toLowerCase(); | ||
| const lowerScope = scope.toLowerCase(); | ||
| if (lowerUrl === lowerScope) { | ||
| return true; | ||
| } | ||
| if (!lowerUrl.startsWith(lowerScope)) { | ||
| return false; | ||
| } | ||
| const nextChar = lowerUrl.charAt(lowerScope.length); | ||
| return nextChar === '/' || nextChar === '?' || nextChar === '#'; | ||
| } | ||
|
|
||
| export function buildVisitKey(rawUrl: string): string { | ||
| try { | ||
| const url = new URL(rawUrl); | ||
| const normalizedBase = normalizeUrl(url.toString()); | ||
| return url.hash ? `${normalizedBase}${url.hash}` : normalizedBase; | ||
| } catch (_error) { | ||
| return rawUrl; | ||
| } | ||
| } | ||
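Two details of this file are worth a sketch. First, `buildVisitKey` as shown appends the URL hash, so `/page#a` and `/page#b` count as different pages; a later commit in this PR is titled "Drop the hash from the visit key". Second, `isWithinScope` accepts any URL that contains the host token as a substring, which also matches URLs such as `https://evil.test/?q=example.com`. The helpers below are hypothetical and are not the PR's code. They assume that comparing parsed URL components is acceptable, and they omit the PR's `normalizeUrl` step.

```typescript
// Hypothetical helper: returns null instead of throwing on malformed input.
function tryParseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null;
  }
}

// Visit key that ignores the fragment, so /page#a and /page#b are visited once.
function visitKeyWithoutHash(rawUrl: string): string {
  const url = tryParseUrl(rawUrl);
  if (!url) return rawUrl;
  url.hash = '';
  return url.toString();
}

// Scope check on parsed components instead of substring matching:
// same host, and the path equals the scope path or lies beneath it.
function isUnderScope(candidate: string, scope: string): boolean {
  const target = tryParseUrl(candidate);
  const base = tryParseUrl(scope);
  if (!target || !base) return false;
  if (target.host !== base.host) return false;
  const basePath = base.pathname.replace(/\/+$/, '');
  return target.pathname === basePath || target.pathname.startsWith(`${basePath}/`);
}
```

With this variant, `https://example.com/docs-old` is outside the scope `https://example.com/docs`, matching the boundary-character check in the original `isWithinScope`, while a foreign host is rejected even if the token appears in its query string.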
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| import type { Page } from 'puppeteer'; | ||
|
|
||
| const TOGGLE_SELECTORS = [ | ||
| '[data-toggle]', | ||
| '[data-testid="accordion-toggle"]', | ||
| '[aria-expanded]', | ||
| '.accordion button', | ||
| '.accordion-toggle', | ||
| '.accordion-trigger', | ||
| '.faq-item button', | ||
| '.collapse-toggle', | ||
| '.MuiButtonBase-root[aria-expanded]', | ||
| ]; | ||
|
|
||
| export async function expandInteractiveSections(page: Page): Promise<void> { | ||
| await page.evaluate((selectors) => { | ||
| document.querySelectorAll('details').forEach((element) => { | ||
| (element as HTMLDetailsElement).open = true; | ||
| }); | ||
|
|
||
| selectors.forEach((selector) => { | ||
| document.querySelectorAll(selector).forEach((node) => { | ||
| const target = node as HTMLElement; | ||
| if (!target || target.getAttribute('data-expanded') === 'true') { | ||
| return; | ||
| } | ||
|
|
||
| const ariaExpanded = target.getAttribute('aria-expanded'); | ||
| const isToggleButton = | ||
| target.tagName === 'BUTTON' || target.getAttribute('role') === 'button'; | ||
| const isCollapsed = | ||
| ariaExpanded === 'false' || target.classList.contains('collapsed'); | ||
|
|
||
| if (isToggleButton || isCollapsed || selector === '[data-toggle]') { | ||
| target.click(); | ||
| target.setAttribute('data-expanded', 'true'); | ||
| } | ||
| }); | ||
| }); | ||
| }, TOGGLE_SELECTORS); | ||
|
|
||
| await new Promise((resolve) => setTimeout(resolve, 250)); | ||
| } | ||
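As written, the click condition fires for any `BUTTON` element, including one whose `aria-expanded` is already `"true"`, so a click may collapse content that was open. A later commit in this PR is titled "Add filter on ariaExpanded to avoid collapse of already expanded content". One way to express such a filter is a pure predicate that can be tested without a browser. The sketch below is an assumption about how that could look, not the PR's implementation; `ToggleState` and `shouldClickToggle` are hypothetical names, and the `[data-toggle]` and plain-button branches are deliberately left out.

```typescript
// Hypothetical snapshot of the attributes the decision depends on.
type ToggleState = {
  ariaExpanded: string | null; // value of the aria-expanded attribute, if any
  hasCollapsedClass: boolean;  // element carries the "collapsed" class
  alreadyHandled: boolean;     // data-expanded="true" was set by a previous pass
};

// Click only when the element is known to be collapsed.
function shouldClickToggle(state: ToggleState): boolean {
  if (state.alreadyHandled) return false;
  // Already open: a click would collapse it again.
  if (state.ariaExpanded === 'true') return false;
  return state.ariaExpanded === 'false' || state.hasCollapsedClass;
}
```

Inside `page.evaluate`, the state object would be built from `getAttribute('aria-expanded')`, `classList.contains('collapsed')`, and `getAttribute('data-expanded')`, which are the same reads the diff already performs.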
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| export function handleError(error: unknown) { | ||
| console.error('Parser terminated with an error:', error); | ||
| process.exitCode = 1; | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| import { mkdirSync } from 'node:fs'; | ||
| import { writeFile } from 'node:fs/promises'; | ||
|
|
||
| export function ensureDirectory(dir: string) { | ||
| mkdirSync(dir, { recursive: true }); | ||
| } | ||
|
|
||
| export async function saveMetadata(dir: string, filename: string, metadata: object) { | ||
| await writeFile(`${dir}/${filename}`, JSON.stringify(metadata, null, 2)); | ||
| } | ||
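`saveMetadata` builds the target path with a template string, which yields a doubled separator when `dir` ends with `/` and hard-codes the POSIX separator. A minimal alternative using `path.join` is sketched below. `metadataPath` and `saveMetadataSketch` are hypothetical names; creating the directory inside the save function is an assumption, since the diff handles that separately in `ensureDirectory`.

```typescript
import * as path from 'node:path';
import { mkdir, writeFile } from 'node:fs/promises';

// Hypothetical helper: join directory and file name with the platform separator.
function metadataPath(dir: string, filename: string): string {
  return path.join(dir, filename);
}

// Hypothetical variant of saveMetadata: creates the directory if needed,
// writes pretty-printed JSON, and returns the path that was written.
async function saveMetadataSketch(
  dir: string,
  filename: string,
  metadata: object
): Promise<string> {
  await mkdir(dir, { recursive: true });
  const target = metadataPath(dir, filename);
  await writeFile(target, JSON.stringify(metadata, null, 2), 'utf8');
  return target;
}
```

`path.join('out/', 'index.json')` normalizes the result to a single separator, whereas the template string would produce `out//index.json`.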
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| export type ParseMetadata = { | ||
| readonly url: string; | ||
| readonly title: string; | ||
| readonly bodyText: string; | ||
| readonly lang: string | null; | ||
| readonly keywords: string | null; | ||
| readonly datePublished: string | null; | ||
| readonly lastModified: string | null; | ||
| }; | ||
|
|
||
| export type ParseNode = { | ||
| readonly url: string; | ||
| title?: string; | ||
| bodyText?: string; | ||
| lang?: string | null; | ||
| keywords?: string | null; | ||
| datePublished?: string | null; | ||
| lastModified?: string | null; | ||
| children?: ParseNode[]; | ||
| }; |