Changes from 2 commits
38 commits
dbcf818
Add script to perform parsing. Store output locally. Sanitize urls fo…
anemone008 Feb 9, 2026
4354cd5
Add changeset
anemone008 Feb 9, 2026
a9540a3
Fix fetch. Set headless true
anemone008 Feb 9, 2026
de11fe0
Replace timeout with AbortController + setTimeout
anemone008 Feb 10, 2026
3daa8c9
Add filter on ariaExpanded to avoid collapse of already expanded content
anemone008 Feb 10, 2026
a1ea8f0
Update apps/parser/src/modules/output.ts
anemone008 Feb 10, 2026
4018873
Drop the hash from the visit key to avoid redundant navigation/scrapi…
anemone008 Feb 10, 2026
3feea1b
Merge branch 'CAI-749-parser-url-crawler' of https://github.com/pagop…
anemone008 Feb 10, 2026
26236f0
Update README.md
anemone008 Feb 10, 2026
9afad68
Add test on filename sanitization
anemone008 Feb 10, 2026
d0b0cc6
Add short hash when filename is longer than 255 chars for URL filenam…
anemone008 Feb 10, 2026
31c4bad
Update apps/parser/tests/parser.error-handling.test.ts
anemone008 Feb 10, 2026
fca22ae
Rename VECTOR_INDEX_NAME to PARSER_VECTOR_INDEX_NAME
anemone008 Feb 10, 2026
748684e
Refactor deriveSubPath to use base URL for path derivation
anemone008 Feb 10, 2026
b403294
Rename vector_index variable, remove alternative output dir, set sing…
anemone008 Feb 11, 2026
ccdc78a
Merge remote-tracking branch 'origin/main' into CAI-749-parser-url-cr…
anemone008 Feb 11, 2026
eff7c31
Update README and config for improved environment variable handling a…
anemone008 Feb 11, 2026
e53dfa7
Update apps/parser/README.md
anemone008 Feb 11, 2026
f1ac135
Refactor URL handling and filename sanitization into helpers
anemone008 Feb 11, 2026
1be8fc9
Move assertReachable function to module network.ts
anemone008 Feb 11, 2026
d2a476a
Rename buildDefaultOutputDirectory to generateOutputDirectoryPath
anemone008 Feb 11, 2026
bb35a84
Merge remote-tracking branch 'origin/main' into CAI-749-parser-url-cr…
anemone008 Feb 11, 2026
f94b3a6
Add error logging for failed anchor href parsing in parsePages function
anemone008 Feb 11, 2026
58ec25e
Add warning logs for iframe src parsing and anchor extraction failures
anemone008 Feb 11, 2026
e291067
Improve error handling in UrlWithoutAnchors
anemone008 Feb 11, 2026
9de4eb4
Move FILENAME_LENGTH_THRESHOLD constant at the start of the file
anemone008 Feb 12, 2026
462c5e6
Update README for type-check and compile instructions
anemone008 Feb 12, 2026
c8b1cab
Refactor code for consistency and readability; update imports, format…
anemone008 Feb 12, 2026
0c19cbb
Add progress log in recursive parsePages function
anemone008 Feb 13, 2026
286325d
Enhance URL handling and configuration:
anemone008 Feb 13, 2026
985363d
Add TODO in isWithinScope
anemone008 Feb 13, 2026
eb5e964
Rename parsePages to exploreAndParsePages. Expand sanitizeUrlAsFilena…
anemone008 Feb 13, 2026
416ef96
Add warnings for missing or invalid URL replacements; rename function…
anemone008 Feb 13, 2026
5d24926
Refactor parser module: clean main function, rename crawler to parser…
anemone008 Feb 13, 2026
cfd4ffe
Formatting
anemone008 Feb 13, 2026
923aa16
Merge remote-tracking branch 'origin/main' into CAI-749-parser-url-cr…
anemone008 Feb 13, 2026
755fb6b
Save root page metadata as index.json
anemone008 Feb 13, 2026
2a1b9a0
Refactor parser configuration: update maxDepth to allow null for unli…
anemone008 Feb 13, 2026
5 changes: 5 additions & 0 deletions .changeset/wide-hairs-fail.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
"parser": major
---

Add a parsing script to the parser app. Store the parsed information locally. Sanitize URLs into a filesystem-compatible format so they can be used as file names.
13 changes: 13 additions & 0 deletions apps/parser/.env.default
@@ -0,0 +1,13 @@
# Default environment variables for the parser CLI
# Root URL to start parsing from
URL="https://example.com"

# Maximum recursion depth (integer)
DEPTH=2

# Name of the vector index bucket/folder where parsed artifacts are stored
VECTOR_INDEX_NAME="vector-index-name"

# Optional absolute/relative directory override. Leave empty to use
# <VECTOR_INDEX_NAME>/parsing/<sanitized(baseUrl)> automatically.
OUTDIR=""
39 changes: 39 additions & 0 deletions apps/parser/README.md
@@ -0,0 +1,39 @@
## Parser utilities

This package contains the following TypeScript CLI utility:

- `parser` — recursively visits a website, extracts structured page metadata, and saves each page under `<VECTOR_INDEX_NAME>/parsing/<sanitized(baseUrl)>/`.


### Getting started

```bash
npm install
npm run build
```

### Parse a website

```bash
URL=https://example.com DEPTH=2 npm run parse
```

Environment variables:

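- `URL` (recommended): root page for the parse. When unset or blank, `resolveEnv` falls back to a hard-coded default URL.
- `DEPTH` (optional, default `2`): maximum recursion depth. Non-numeric values fall back to the default and negative values are clamped to `0`.
- `VECTOR_INDEX_NAME` (recommended unless `OUTDIR` is provided): base directory name where parsed data is stored as `<VECTOR_INDEX_NAME>/parsing/<sanitized(baseUrl)>/`. When neither variable is set, output goes to `output/<sanitized(baseUrl)>/`.
- `OUTDIR` (optional): fully overrides the destination directory.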

`<sanitized(baseUrl)>` and `<sanitized(path)>` refer to filesystem-safe versions of the URL components (illegal characters replaced with `_`), ensuring predictable, human-readable filenames.

Each visited page is stored as `<VECTOR_INDEX_NAME>/parsing/<sanitized(baseUrl)>/<sanitized(path)>.json` (or under `OUTDIR` if specified) with normalized metadata, making it easy to diff between runs.


### Tests

```bash
npm test
```

Tests compile the project before executing Jest to ensure the CLI behaves exactly like the production build.
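
The README describes `<sanitized(path)>` only in words. A minimal sketch of that idea follows, assuming that "illegal characters" means the set most filesystems reject. The helper name `sanitizeUrlAsFilename`, the character class, and the decision to drop the scheme and fragment are illustrative assumptions, not the code in this PR.

```typescript
// Hypothetical helper (not the PR's implementation): maps a URL to a
// filesystem-safe name by replacing illegal characters with `_`.
// Assumed illegal set: < > : " / \ | ? * and ASCII control characters.
const ILLEGAL_FILENAME_CHARS = /[<>:"/\\|?*\u0000-\u001f]/g;

function sanitizeUrlAsFilename(rawUrl: string, replacement = '_'): string {
  const url = new URL(rawUrl);
  // Keep host, path, and query; drop the scheme and the fragment.
  const withoutScheme = `${url.host}${url.pathname}${url.search}`;
  // Trim trailing slashes so ".../docs/" and ".../docs" map to the same name.
  const trimmed = withoutScheme.replace(/\/+$/, '');
  return trimmed.replace(ILLEGAL_FILENAME_CHARS, replacement);
}
```

Under these assumptions, `https://example.com/docs/page?id=1#top` would be stored as `example.com_docs_page_id=1.json`.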
14 changes: 14 additions & 0 deletions apps/parser/jest.config.ts
@@ -0,0 +1,14 @@
import type { Config } from 'jest';

const config: Config = {
rootDir: __dirname,
testRegex: 'tests/.*\\.test\\.ts$',
transform: {
'^.+\\.ts$': ['ts-jest', { tsconfig: 'tsconfig.json' }],
},
testEnvironment: 'node',
clearMocks: true,
verbose: false,
};

export default config;
26 changes: 22 additions & 4 deletions apps/parser/package.json
@@ -1,10 +1,28 @@
{
"name": "parser",
"version": "0.1.0",
"version": "1.0.0",
"private": true,
"scripts": {},
"scripts": {
"clean": "shx rm -rf dist",
"compile": "tsc --project tsconfig.json",
"build": "npm run clean && tsc --project tsconfig.build.json",
"parse": "npm run build && node dist/parser.js",
"test": "npm run build && jest -i"
},
"dependencies": {
"puppeteer": "^24.37.1"
"node-fetch": "^3.3.2",
"puppeteer": "^24.37.1",
"puppeteer-extra": "^3.3.6",
"puppeteer-extra-plugin-stealth": "^2.11.2",
"xml2js": "^0.6.2"
},
"devDependencies": {
"@types/jest": "^29.5.1",
"@types/node": "18.16.*",
"@types/xml2js": "^0.4.11",
"jest": "^29.5.0",
"shx": "^0.3.4",
"ts-jest": "^29.1.1",
"typescript": "5.1.6"
}
}

37 changes: 37 additions & 0 deletions apps/parser/src/modules/config.ts
@@ -0,0 +1,37 @@
import path from 'node:path';
import { stripUrlDecorations } from '../utils/url';
import { sanitizeFilename } from '../utils/sanitizeFilename';

export type EnvConfig = {
readonly baseUrl: string;
readonly sanitizedBaseUrl: string;
readonly outputDirectory: string;
readonly maxDepth: number;
};

const DEFAULT_BASE_URL = 'https://news.polymer-project.org/';
const DEFAULT_DEPTH = 2;

export function resolveEnv(): EnvConfig {
const baseUrl = process.env.URL?.trim() || DEFAULT_BASE_URL;
const sanitizedBaseUrl = stripUrlDecorations(baseUrl);
const parsedDepth = Number.parseInt(process.env.DEPTH ?? `${DEFAULT_DEPTH}`, 10);
const maxDepth = Number.isNaN(parsedDepth) ? DEFAULT_DEPTH : Math.max(parsedDepth, 0);
const vectorIndexName = process.env.VECTOR_INDEX_NAME?.trim();
const derivedOutput = buildDefaultOutputDirectory(vectorIndexName, sanitizedBaseUrl);
const outputDirectory = process.env.OUTDIR?.trim().length
? process.env.OUTDIR.trim()
: derivedOutput;
return { baseUrl, sanitizedBaseUrl, outputDirectory, maxDepth };
}

function buildDefaultOutputDirectory(
vectorIndexName: string | undefined,
sanitizedBaseUrl: string
): string {
const safeBaseSegment = sanitizeFilename(sanitizedBaseUrl, { replacement: '_' });
if (!vectorIndexName) {
return `output/${safeBaseSegment}`;
}
return path.join(vectorIndexName, 'parsing', safeBaseSegment);
}
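
The environment handling in `resolveEnv` can be made easier to test by reading from a plain record instead of `process.env`. The sketch below is one possible shape, not the PR's code: `ParserEnv`, `readTrimmed`, and `resolveDepth` are hypothetical names, and the fallback rules simply mirror what `resolveEnv` does for `DEPTH` (default `2`, non-numeric falls back, negatives clamp to `0`).

```typescript
// Hypothetical, test-friendly variant of the DEPTH handling in resolveEnv.
type ParserEnv = Record<string, string | undefined>;

const DEFAULT_DEPTH = 2;

// Returns the trimmed value, or undefined when the variable is unset or blank.
function readTrimmed(env: ParserEnv, key: string): string | undefined {
  const value = env[key]?.trim();
  return value ? value : undefined;
}

// Parses DEPTH as a base-10 integer, falling back to the default on bad input
// and clamping negative values to 0.
function resolveDepth(env: ParserEnv): number {
  const raw = readTrimmed(env, 'DEPTH');
  if (raw === undefined) return DEFAULT_DEPTH;
  const parsed = Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? DEFAULT_DEPTH : Math.max(parsed, 0);
}
```

Calling `resolveDepth(process.env)` would keep the CLI behavior, while tests can pass a literal object such as `{ DEPTH: '3' }`.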
142 changes: 142 additions & 0 deletions apps/parser/src/modules/crawler.ts
@@ -0,0 +1,142 @@
import { Browser } from 'puppeteer';
import { ParseNode, ParseMetadata } from './types';
import { normalizeUrl } from '../utils/url';
import { expandInteractiveSections } from './domActions';

export async function parsePages(
browser: Browser,
node: ParseNode,
depth: number,
maxDepth: number,
parsedPages: Map<string, ParseMetadata>,
parsePageFn: (browser: Browser, url: string) => Promise<ParseMetadata | null>,
baseOrigin: string,
baseScope: string,
baseHostToken: string
): Promise<void> {
const visitKey = buildVisitKey(node.url);
if (parsedPages.has(visitKey) || depth > maxDepth) {
return;
}

const normalizedUrl = normalizeUrl(node.url);
if (!isWithinScope(normalizedUrl, baseScope, baseHostToken)) {
return;
}

const metadata = await parsePageFn(browser, node.url);
if (!metadata) return;

parsedPages.set(visitKey, metadata);
node.title = metadata.title;
node.bodyText = metadata.bodyText;
node.lang = metadata.lang;
node.keywords = metadata.keywords;
node.datePublished = metadata.datePublished;
node.lastModified = metadata.lastModified;

let page;
let anchors: string[] = [];
try {
page = await browser.newPage();
await page.goto(node.url, { waitUntil: 'networkidle2', timeout: 45000 });
await expandInteractiveSections(page);
anchors = await page.evaluate((allowedToken: string) => {
const anchors = Array.from(document.querySelectorAll('a[href]'));
const iframeSources = Array.from(document.querySelectorAll('iframe[src]'));
const unique = new Set<string>();
for (const anchor of anchors) {
const href = (anchor as HTMLAnchorElement).href;
if (!href || !href.startsWith('http')) continue;
try {
const target = new URL(href, window.location.href);
const normalizedHref = target.href.toLowerCase();
if (allowedToken && !normalizedHref.includes(allowedToken)) continue;
if (target.href === window.location.href) continue;
unique.add(target.href);
} catch (_) {}
}

for (const frame of iframeSources) {
const src = (frame as HTMLIFrameElement).src;
if (!src || !src.startsWith('http')) {
continue;
}
try {
const target = new URL(src, window.location.href);
const normalizedSrc = target.href.toLowerCase();
if (allowedToken && !normalizedSrc.includes(allowedToken)) continue;
unique.add(target.href);
} catch (_) {}
}
return Array.from(unique);
}, baseHostToken) as string[];
} catch (error) {
// Ignore anchor extraction errors
} finally {
if (page) await page.close();
}


const scheduled = new Set<string>();
const nextChildren: ParseNode[] = [];
for (const href of anchors) {
const normalized = normalizeUrl(href);
const visitCandidate = buildVisitKey(href);
if (parsedPages.has(visitCandidate) || scheduled.has(visitCandidate)) continue;
const lowerNormalized = normalized.toLowerCase();
if (baseHostToken && !lowerNormalized.includes(baseHostToken)) {
continue;
}
if (!isWithinScope(normalized, baseScope, baseHostToken)) {
continue;
}
scheduled.add(visitCandidate);
nextChildren.push({ url: href });
}
node.children = nextChildren;

if (!node.children || depth >= maxDepth) return;
for (const child of node.children) {
await parsePages(
browser,
child,
depth + 1,
maxDepth,
parsedPages,
parsePageFn,
baseOrigin,
baseScope,
baseHostToken
);
}
}

function isWithinScope(url: string, scope: string, hostToken: string): boolean {
if (hostToken && url.toLowerCase().includes(hostToken)) {
return true;
}
if (!scope) {
return true;
}
const lowerUrl = url.toLowerCase();
const lowerScope = scope.toLowerCase();
if (lowerUrl === lowerScope) {
return true;
}
if (!lowerUrl.startsWith(lowerScope)) {
return false;
}
const nextChar = lowerUrl.charAt(lowerScope.length);
return nextChar === '/' || nextChar === '?' || nextChar === '#';
}

export function buildVisitKey(rawUrl: string): string {
try {
const url = new URL(rawUrl);
const normalizedBase = normalizeUrl(url.toString());
return url.hash ? `${normalizedBase}${url.hash}` : normalizedBase;
} catch (_error) {
return rawUrl;
Review comment (Collaborator):

I suggest adding a warning here.

}
}
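
One way to act on the review comment above, and on the later commit that drops the hash from the visit key, is sketched here. It is an illustration only: `buildVisitKeyWithoutHash` and its injectable `warn` parameter are hypothetical, and the sketch leaves out the `normalizeUrl` step because that helper is not shown in this diff.

```typescript
// Hypothetical variant of buildVisitKey: ignores the fragment so that
// "/faq#a" and "/faq#b" count as one visit, and warns on unparseable input.
function buildVisitKeyWithoutHash(
  rawUrl: string,
  warn: (message: string) => void = console.warn
): string {
  try {
    const url = new URL(rawUrl);
    url.hash = ''; // clearing the hash also removes the trailing '#'
    return url.toString();
  } catch {
    // Surface the failure instead of silently returning the raw string.
    warn(`Could not parse URL "${rawUrl}"; using the raw string as visit key.`);
    return rawUrl;
  }
}
```

Passing `warn` as a parameter keeps the function easy to test without capturing console output.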
43 changes: 43 additions & 0 deletions apps/parser/src/modules/domActions.ts
@@ -0,0 +1,43 @@
import type { Page } from 'puppeteer';

const TOGGLE_SELECTORS = [
'[data-toggle]'
,'[data-testid="accordion-toggle"]'
,'[aria-expanded]'
,'.accordion button'
,'.accordion-toggle'
,'.accordion-trigger'
,'.faq-item button'
,'.collapse-toggle'
,'.MuiButtonBase-root[aria-expanded]'
];

export async function expandInteractiveSections(page: Page): Promise<void> {
await page.evaluate((selectors) => {
document.querySelectorAll('details').forEach((element) => {
(element as HTMLDetailsElement).open = true;
});

selectors.forEach((selector) => {
document.querySelectorAll(selector).forEach((node) => {
const target = node as HTMLElement;
if (!target || target.getAttribute('data-expanded') === 'true') {
return;
}

const ariaExpanded = target.getAttribute('aria-expanded');
const isToggleButton =
target.tagName === 'BUTTON' || target.getAttribute('role') === 'button';
const isCollapsed =
ariaExpanded === 'false' || target.classList.contains('collapsed');

if (isToggleButton || isCollapsed || selector === '[data-toggle]') {
target.click();
target.setAttribute('data-expanded', 'true');
}
});
});
}, TOGGLE_SELECTORS);

await new Promise((resolve) => setTimeout(resolve, 250));
}
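
As written, `expandInteractiveSections` clicks every button-like match, including toggles whose `aria-expanded` is already `"true"`, which would collapse open content. A later commit in this PR mentions a filter on `ariaExpanded`. The sketch below shows one way such a filter could be expressed as a pure decision function. `ToggleCandidate` and `shouldClickToggle` are hypothetical names, and this is not necessarily how the PR implements it.

```typescript
// Hypothetical snapshot of the attributes the click decision depends on.
type ToggleCandidate = {
  ariaExpanded: string | null; // value of the aria-expanded attribute
  hasCollapsedClass: boolean; // element has the "collapsed" class
  isButtonLike: boolean; // <button> or role="button"
  matchedDataToggle: boolean; // matched the '[data-toggle]' selector
  alreadyHandled: boolean; // data-expanded="true" set by a previous pass
};

// Same conditions as the original, plus an early exit for elements that
// are already expanded or were already clicked.
function shouldClickToggle(c: ToggleCandidate): boolean {
  if (c.alreadyHandled || c.ariaExpanded === 'true') return false;
  const isCollapsed = c.ariaExpanded === 'false' || c.hasCollapsedClass;
  return c.isButtonLike || isCollapsed || c.matchedDataToggle;
}
```

Inside `page.evaluate`, the candidate would be built from the DOM node before the click, which keeps the decision logic testable without a browser.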
4 changes: 4 additions & 0 deletions apps/parser/src/modules/errors.ts
@@ -0,0 +1,4 @@
export function handleError(error: unknown) {
console.error('Parser terminated with an error:', error);
process.exitCode = 1;
}
10 changes: 10 additions & 0 deletions apps/parser/src/modules/output.ts
@@ -0,0 +1,10 @@
import { mkdirSync } from 'node:fs';
import { writeFile } from 'node:fs/promises';

export function ensureDirectory(dir: string) {
mkdirSync(dir, { recursive: true });
}

export async function saveMetadata(dir: string, filename: string, metadata: object) {
await writeFile(`${dir}/${filename}`, JSON.stringify(metadata, null, 2));
}
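
`saveMetadata` builds the target path with string interpolation and relies on the caller to have created the directory. A possible alternative, sketched under the assumption that creating the directory on every write is acceptable, uses `join` from `node:path` and the async `mkdir`. The names `metadataPath` and `saveMetadataSafely` are hypothetical and do not appear in this PR.

```typescript
import { mkdir, writeFile } from 'node:fs/promises';
import { join } from 'node:path';

// Hypothetical helper: builds the file path with the platform separator
// and without doubled slashes when `dir` already ends with one.
function metadataPath(dir: string, filename: string): string {
  return join(dir, filename);
}

// Hypothetical variant of saveMetadata: ensures the directory exists,
// writes pretty-printed JSON, and returns the path that was written.
async function saveMetadataSafely(
  dir: string,
  filename: string,
  metadata: object
): Promise<string> {
  await mkdir(dir, { recursive: true });
  const target = metadataPath(dir, filename);
  await writeFile(target, `${JSON.stringify(metadata, null, 2)}\n`, 'utf8');
  return target;
}
```

Returning the written path makes it straightforward to log progress per page.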
20 changes: 20 additions & 0 deletions apps/parser/src/modules/types.ts
@@ -0,0 +1,20 @@
export type ParseMetadata = {
readonly url: string;
readonly title: string;
readonly bodyText: string;
readonly lang: string | null;
readonly keywords: string | null;
readonly datePublished: string | null;
readonly lastModified: string | null;
};

export type ParseNode = {
readonly url: string;
title?: string;
bodyText?: string;
lang?: string | null;
keywords?: string | null;
datePublished?: string | null;
lastModified?: string | null;
children?: ParseNode[];
};