[CAI-749] Local parser #2007
Open
anemone008 wants to merge 38 commits into main from CAI-749-parser-url-crawler
Changes from 28 commits
38 commits
dbcf818
Add script to perform parsing. Store output locally. Sanitize urls fo…
anemone008 4354cd5
Add changeset
anemone008 a9540a3
Fix fetch. Set headless true
anemone008 de11fe0
Replace timeout with AbortController + setTimeout
anemone008 3daa8c9
Add filter on ariaExpanded to avoid collapse of already expanded content
anemone008 a1ea8f0
Update apps/parser/src/modules/output.ts
anemone008 4018873
Drop the hash from the visit key to avoid redundant navigation/scrapi…
anemone008 3feea1b
Merge branch 'CAI-749-parser-url-crawler' of https://github.com/pagop…
anemone008 26236f0
Update README.md
anemone008 9afad68
Add test on filename sanitization
anemone008 d0b0cc6
Add short hash when filename is longer than 255 chars for URL filenam…
anemone008 31c4bad
Update apps/parser/tests/parser.error-handling.test.ts
anemone008 fca22ae
Rename VECTOR_INDEX_NAME to PARSER_VECTOR_INDEX_NAME
anemone008 748684e
Refactor deriveSubPath to use base URL for path derivation
anemone008 b403294
Rename vector_index variable, remove alternative output dir, set sing…
anemone008 ccdc78a
Merge remote-tracking branch 'origin/main' into CAI-749-parser-url-cr…
anemone008 eff7c31
Update README and config for improved environment variable handling a…
anemone008 e53dfa7
Update apps/parser/README.md
anemone008 f1ac135
Refactor URL handling and filename sanitization into helpers
anemone008 1be8fc9
Move assertReachable function to module network.ts
anemone008 d2a476a
Rename buildDefaultOutputDirectory to generateOutputDirectoryPath
anemone008 bb35a84
Merge remote-tracking branch 'origin/main' into CAI-749-parser-url-cr…
anemone008 f94b3a6
Add error logging for failed anchor href parsing in parsePages function
anemone008 58ec25e
Add warning logs for iframe src parsing and anchor extraction failures
anemone008 e291067
Improve error handling in UrlWithoutAnchors
anemone008 9de4eb4
Move FILENAME_LENGTH_THRESHOLD constant at the start of the file
anemone008 462c5e6
Update README for type-check and compile instructions
anemone008 c8b1cab
Refactor code for consistency and readability; update imports, format…
anemone008 0c19cbb
Add progress log in recursive parsePages function
anemone008 286325d
Enhance URL handling and configuration:
anemone008 985363d
Add TODO in isWithinScope
anemone008 eb5e964
Rename parsePages to exploreAndParsePages. Expand sanitizeUrlAsFilena…
anemone008 416ef96
Add warnings for missing or invalid URL replacements; rename function…
anemone008 5d24926
Refactor parser module: clean main function, rename crawler to parser…
anemone008 cfd4ffe
Formatting
anemone008 923aa16
Merge remote-tracking branch 'origin/main' into CAI-749-parser-url-cr…
anemone008 755fb6b
Save root page metadata as index.json
anemone008 2a1b9a0
Refactor parser configuration: update maxDepth to allow null for unli…
anemone008
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| --- | ||
| "parser": major | ||
| --- | ||
|
|
||
| Add a script to the parser app that performs parsing. Store parsed information locally. Sanitize URLs into a filesystem-compatible format for use as file names. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| # Default environment variables for the parser CLI | ||
| # Root URL to start parsing from | ||
| URL="https://example.com" | ||
|
|
||
| # Maximum recursion depth (integer) | ||
| DEPTH=2 | ||
|
|
||
| # Name of the vector index bucket/folder where parsed artifacts are stored | ||
| CHB_INDEX_ID="parser-vector-index-name" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
|
|
||
| # Parser Utilities | ||
|
|
||
| This package provides a TypeScript CLI tool for recursively crawling a website, extracting structured metadata from each page, and saving the results in a predictable directory structure. | ||
|
|
||
| ## Features | ||
|
|
||
| - **Recursive website parsing**: Visits all reachable pages up to a configurable depth. | ||
| - **Structured output**: Saves each page's metadata as a JSON file. | ||
| - **Configurable via environment variables or .env file**. | ||
|
|
||
| --- | ||
|
|
||
| ## Getting Started | ||
|
|
||
| 1. **Install dependencies:** | ||
| ```bash | ||
| npm install | ||
| ``` | ||
| 2. **Type-check & compile** | ||
| ```bash | ||
| npm run compile | ||
| ``` | ||
| 3. **Build the project:** | ||
| ```bash | ||
| npm run build | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Usage | ||
|
|
||
| ### 1. Configure Environment Variables | ||
|
|
||
| You can provide configuration in two ways: | ||
|
|
||
| #### a) Using a `.env` file (recommended) | ||
|
|
||
| Create a `.env` file in the `apps/parser` directory with the following content: | ||
|
|
||
| ``` | ||
| URL=https://example.com | ||
| CHB_INDEX_ID=name_of_your_choice | ||
| # DEPTH=2 # Optional, defaults to 2 | ||
| ``` | ||
|
|
||
| #### b) Using command line variables | ||
|
|
||
| ```bash | ||
| URL=https://example.com DEPTH=2 CHB_INDEX_ID=name_of_your_choice npm run parse | ||
| ``` | ||
|
|
||
| ### 2. Run the Parser | ||
|
|
||
| ```bash | ||
| npm run parse | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Environment Variables | ||
|
|
||
| - **`URL`** (required): The root page to start parsing from. | ||
| - **`CHB_INDEX_ID`** (required): The base directory for storing parsed data. Output will be saved as `<CHB_INDEX_ID>/parsing/<sanitized(baseUrl)>/`. | ||
| - **`DEPTH`** (optional, default: `2`): Maximum recursion depth for crawling links. | ||
|
|
||
| **Note:** The parser will first look for these variables in the environment. If not found, it will load them from `.env` in the `apps/parser` directory. | ||
|
|
||
| --- | ||
|
|
||
| ## Output Structure | ||
|
|
||
| Each visited page is saved as a JSON file: | ||
|
|
||
| ``` | ||
| <CHB_INDEX_ID>/parsing/<sanitized(baseUrl)>/<sanitized(path)>.json | ||
| ``` | ||
|
|
||
| - `<sanitized(baseUrl)>` and `<sanitized(path)>` are filesystem-safe versions of the URL components (illegal characters replaced with `-`). | ||
| - This structure ensures output is predictable, easy to diff, and human-readable. | ||
|
|
||
| --- | ||
|
|
||
| ## Testing | ||
|
|
||
| Run tests with: | ||
|
|
||
| ```bash | ||
| npm run test | ||
| ``` | ||
|
|
||
| Tests will compile the project and then execute Jest to ensure the CLI behaves as expected. |
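As a rough illustration of the output layout the README describes, the sketch below composes a file path from the three pieces. It follows the README text only: `sanitizeSegment` and `outputFileFor` are hypothetical names, the sanitizer reproduces just the illegal-character replacement, and plain `/` joining stands in for `path.join`. The module that actually writes files is not part of this excerpt, and the config module in this PR passes `_` as the replacement for the base directory, so real paths may differ.

```typescript
// Simplified stand-in for the PR's sanitizeUrlAsFilename: only the
// illegal-character replacement is reproduced (assumption for illustration).
function sanitizeSegment(input: string, replacement = "-"): string {
  return input.replace(/[\/\?<>\\:\*\|"]/g, replacement) || replacement;
}

// Hypothetical helper: builds <CHB_INDEX_ID>/parsing/<sanitized(baseUrl)>/<sanitized(path)>.json
function outputFileFor(
  indexId: string,
  baseUrl: string,
  pagePath: string,
): string {
  return [
    indexId,
    "parsing",
    sanitizeSegment(baseUrl),
    `${sanitizeSegment(pagePath)}.json`,
  ].join("/");
}

const examplePath = outputFileFor(
  "my-index",
  "https://example.com",
  "/docs/install",
);
console.log(examplePath);
// my-index/parsing/https---example.com/-docs-install.json
```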
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| import type { Config } from "jest"; | ||
|
|
||
| const config: Config = { | ||
| rootDir: __dirname, | ||
| testRegex: "tests/.*\\.test\\.ts$", | ||
| transform: { | ||
| "^.+\\.ts$": ["ts-jest", { tsconfig: "tsconfig.json" }], | ||
| }, | ||
| testEnvironment: "node", | ||
| clearMocks: true, | ||
| verbose: false, | ||
| }; | ||
|
|
||
| export default config; |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,10 +1,28 @@ | ||
| { | ||
| "name": "parser", | ||
| "version": "0.1.0", | ||
| "version": "1.0.0", | ||
| "private": true, | ||
| "scripts": {}, | ||
| "scripts": { | ||
| "clean": "shx rm -rf dist", | ||
| "compile": "tsc --project tsconfig.json", | ||
| "build": "npm run clean && tsc --project tsconfig.build.json", | ||
| "parse": "npm run build && node dist/parser.js", | ||
| "test": "npm run build && jest -i" | ||
| }, | ||
| "dependencies": { | ||
| "puppeteer": "^24.37.1" | ||
| "node-fetch": "^3.3.2", | ||
| "puppeteer": "^24.37.1", | ||
| "puppeteer-extra": "^3.3.6", | ||
| "puppeteer-extra-plugin-stealth": "^2.11.2", | ||
| "xml2js": "^0.6.2" | ||
| }, | ||
| "devDependencies": { | ||
| "@types/jest": "^29.5.1", | ||
| "@types/node": "18.16.*", | ||
| "@types/xml2js": "^0.4.11", | ||
| "jest": "^29.5.0", | ||
| "shx": "^0.3.4", | ||
| "ts-jest": "^29.1.1", | ||
| "typescript": "5.1.6" | ||
| } | ||
| } | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| export function toIsoOrNull(value: string | null): string | null { | ||
| if (!value) { | ||
| return null; | ||
| } | ||
| const date = new Date(value); | ||
| return Number.isNaN(date.getTime()) ? null : date.toISOString(); | ||
| } |
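For reference, here is how `toIsoOrNull` behaves on a few inputs. The function body is copied from the diff above so the snippet runs on its own; the input strings are made-up examples, not values taken from the parser.

```typescript
// Copied from the diff above so the example is self-contained.
function toIsoOrNull(value: string | null): string | null {
  if (!value) {
    return null;
  }
  const date = new Date(value);
  return Number.isNaN(date.getTime()) ? null : date.toISOString();
}

// Hypothetical inputs, e.g. raw date strings scraped from a page.
const validDate = toIsoOrNull("2024-01-15T10:30:00Z");
const invalidDate = toIsoOrNull("not a date");
const missingDate = toIsoOrNull(null);
const emptyDate = toIsoOrNull("");

console.log(validDate); // 2024-01-15T10:30:00.000Z
console.log(invalidDate, missingDate, emptyDate); // null null null
```

Empty strings and unparseable values both collapse to `null`, so callers only need to handle one "no date" case.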
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,84 @@ | ||
| import { SanitizeOptions } from "../modules/types"; | ||
|
|
||
| const ILLEGAL_RE = /[\/\?<>\\:\*\|"]/g; | ||
| const CONTROL_RE = /[\x00-\x1f\x80-\x9f]/g; | ||
| const RESERVED_RE = /^\.+$/; | ||
| const WINDOWS_RESERVED_RE = /^(con|prn|aux|nul|com[0-9]|lpt[0-9])$/i; | ||
| const WINDOWS_TRAILING_RE = /[\. ]+$/; | ||
| const DEFAULT_REPLACEMENT = "-"; | ||
|
|
||
| export function sanitizeUrlAsFilename( | ||
| input: string, | ||
| options?: SanitizeOptions, | ||
| ): string { | ||
| if (!input) { | ||
| return DEFAULT_REPLACEMENT; | ||
| } | ||
| const replacement = validReplacementOrDefault( | ||
| options?.replacement ?? DEFAULT_REPLACEMENT, | ||
| ); | ||
| let sanitized = input | ||
| .replace(ILLEGAL_RE, replacement) | ||
| .replace(CONTROL_RE, replacement) | ||
| .replace(RESERVED_RE, replacement) | ||
| .replace(WINDOWS_RESERVED_RE, replacement) | ||
| .replace(WINDOWS_TRAILING_RE, replacement) | ||
| .trim(); | ||
| if (sanitized.length === 0) { | ||
| return replacement; | ||
| } | ||
| return sanitized.slice(0, 255); | ||
| } | ||
|
|
||
| function validReplacementOrDefault(candidate: string): string { | ||
| if (!candidate) { | ||
| return DEFAULT_REPLACEMENT; | ||
MarBert marked this conversation as resolved.
|
||
| } | ||
| if ( | ||
| /[\/\?<>\\:\*\|"]/u.test(candidate) || | ||
| /[\x00-\x1f\x80-\x9f]/u.test(candidate) | ||
| ) { | ||
| return DEFAULT_REPLACEMENT; | ||
| } | ||
| return candidate; | ||
| } | ||
|
|
||
| export const UrlWithoutAnchors = (rawUrl: string): string => { | ||
|
||
| try { | ||
| const parsed = new URL(rawUrl); | ||
| parsed.hash = ""; | ||
| if (parsed.pathname.length > 1 && parsed.pathname.endsWith("/")) { | ||
| parsed.pathname = parsed.pathname.slice(0, -1); | ||
| } | ||
| const serialized = parsed.toString(); | ||
| if (parsed.pathname === "/" && !parsed.search) { | ||
| return serialized.endsWith("/") ? serialized.slice(0, -1) : serialized; | ||
| } | ||
| return serialized; | ||
| } catch (error) { | ||
| console.warn(`Failed to parse URL: ${rawUrl}`, error); | ||
| return rawUrl; | ||
| } | ||
| }; | ||
|
|
||
| export function deriveSubPath( | ||
| targetUrl: string, | ||
| baseUrl: string, | ||
| sanitizedBaseUrl: string, | ||
| ): string { | ||
| const base = new URL(baseUrl); | ||
| const target = new URL(targetUrl); | ||
| let relPath = target.pathname; | ||
| if (base.pathname !== "/" && relPath.startsWith(base.pathname)) { | ||
| relPath = relPath.slice(base.pathname.length); | ||
| if (!relPath.startsWith("/")) relPath = "/" + relPath; | ||
| } | ||
| if ( | ||
| UrlWithoutAnchors(targetUrl) === sanitizedBaseUrl || | ||
| relPath === "/" || | ||
| relPath === "" | ||
| ) { | ||
| return "/"; | ||
| } | ||
| return `${relPath}${target.search}${target.hash}` || "/"; | ||
| } | ||
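To make the effect of `UrlWithoutAnchors` concrete, the sketch below runs the same normalization over several spellings of two pages and collects the results in a `Set`. The logic is copied from the diff above under the hypothetical name `normalizeVisitKey`; using the result as a visit key is an assumption based on the commit message about dropping the hash from the visit key, since the crawling module itself is not shown here.

```typescript
// Same logic as UrlWithoutAnchors in the diff above, copied under a
// hypothetical name so this snippet runs on its own (warning log omitted).
function normalizeVisitKey(rawUrl: string): string {
  try {
    const parsed = new URL(rawUrl);
    parsed.hash = ""; // drop the #fragment
    if (parsed.pathname.length > 1 && parsed.pathname.endsWith("/")) {
      parsed.pathname = parsed.pathname.slice(0, -1); // drop trailing slash
    }
    const serialized = parsed.toString();
    if (parsed.pathname === "/" && !parsed.search) {
      // Root page without query: strip the final "/" as well.
      return serialized.endsWith("/") ? serialized.slice(0, -1) : serialized;
    }
    return serialized;
  } catch (_error) {
    return rawUrl; // unparseable input is returned unchanged
  }
}

// Made-up links, as they might be discovered on a page.
const discovered = [
  "https://example.com/docs/",
  "https://example.com/docs#install",
  "https://example.com/docs/#usage",
  "https://example.com/",
  "https://example.com/#top",
];

const visitKeys = new Set(discovered.map(normalizeVisitKey));
console.log(Array.from(visitKeys));
// [ 'https://example.com/docs', 'https://example.com' ]
```

Five discovered links reduce to two keys, which is the property that avoids navigating to the same page once per anchor.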
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| import path from "node:path"; | ||
| import { EnvConfig } from "./types"; | ||
| import { | ||
| UrlWithoutAnchors, | ||
| sanitizeUrlAsFilename, | ||
| } from "../helpers/url-handling"; | ||
| import * as dotenv from "dotenv"; | ||
|
|
||
| const DEFAULT_DEPTH = 2; | ||
|
|
||
| export function resolveEnv(): EnvConfig { | ||
| let baseUrl = process.env.URL?.trim(); | ||
| let depth = process.env.DEPTH?.trim(); | ||
| let vectorIndexName = process.env.CHB_INDEX_ID?.trim(); | ||
| if (!baseUrl || !depth || !vectorIndexName) { | ||
| const parserHome = path.resolve(__dirname, "../../"); | ||
| dotenv.config({ path: path.join(parserHome, ".env") }); | ||
| baseUrl = baseUrl || process.env.URL?.trim(); | ||
| depth = depth || process.env.DEPTH?.trim(); | ||
| vectorIndexName = vectorIndexName || process.env.CHB_INDEX_ID?.trim(); | ||
MarBert marked this conversation as resolved.
|
||
| } | ||
| if (!baseUrl) { | ||
| throw new Error( | ||
| "Missing required URL. Set URL in environment or .env file.", | ||
| ); | ||
| } | ||
| const sanitizedBaseUrl = UrlWithoutAnchors(baseUrl); | ||
| const parsedDepth = Number.parseInt(depth ?? `${DEFAULT_DEPTH}`, 10); | ||
| const maxDepth = Number.isNaN(parsedDepth) | ||
| ? DEFAULT_DEPTH | ||
| : Math.max(parsedDepth, 0); | ||
| const outputDirectory = generateOutputDirectoryPath( | ||
| vectorIndexName, | ||
| sanitizedBaseUrl, | ||
| ); | ||
| return { baseUrl, sanitizedBaseUrl, outputDirectory, maxDepth }; | ||
| } | ||
|
|
||
| function generateOutputDirectoryPath( | ||
| vectorIndexName: string | undefined, | ||
| sanitizedBaseUrl: string, | ||
| ): string { | ||
| const safeBaseSegment = sanitizeUrlAsFilename(sanitizedBaseUrl, { | ||
| replacement: "_", | ||
| }); | ||
| if (!vectorIndexName) { | ||
| return `output/${safeBaseSegment}`; | ||
| } | ||
| return path.join(vectorIndexName, "parsing", safeBaseSegment); | ||
| } | ||
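The depth handling in `resolveEnv` has a few edge cases that are easy to miss when reading the diff. The sketch below isolates that one expression in a hypothetical helper, `resolveMaxDepth`, and shows what different `DEPTH` strings resolve to; it mirrors the code above but is not part of the PR.

```typescript
const DEFAULT_DEPTH = 2;

// Hypothetical helper isolating the depth logic from resolveEnv above.
function resolveMaxDepth(raw: string | undefined): number {
  const parsedDepth = Number.parseInt(raw ?? `${DEFAULT_DEPTH}`, 10);
  // Non-numeric input falls back to the default; negatives clamp to 0.
  return Number.isNaN(parsedDepth) ? DEFAULT_DEPTH : Math.max(parsedDepth, 0);
}

console.log(resolveMaxDepth(undefined)); // 2 (DEPTH not set)
console.log(resolveMaxDepth("5")); // 5
console.log(resolveMaxDepth("-3")); // 0 (clamped)
console.log(resolveMaxDepth("abc")); // 2 (not a number)
console.log(resolveMaxDepth("3.9")); // 3 (parseInt truncates)
```

Note that an invalid value such as `abc` silently becomes the default rather than raising an error, which may or may not be the intended behavior.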
I suggest adding a warning log here, so that we can tell when something went wrong and DEFAULT_REPLACEMENT is being used.
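One possible shape for the change requested in this comment is sketched below. It is an illustration only, not the code that ended up in the PR: the injectable `warn` parameter is an assumption added so the behavior can be checked without touching the console, and the message wording is made up.

```typescript
const DEFAULT_REPLACEMENT = "-";
// Non-global copies of the patterns used in the diff, safe to reuse with test().
const ILLEGAL_CHARS = /[\/\?<>\\:\*\|"]/;
const CONTROL_CHARS = /[\x00-\x1f\x80-\x9f]/;

// Sketch: same checks as validReplacementOrDefault, plus a warning
// whenever the default is substituted. `warn` defaults to console.warn.
function validReplacementOrDefault(
  candidate: string,
  warn: (message: string) => void = console.warn,
): string {
  if (!candidate) {
    warn(`Empty replacement; using "${DEFAULT_REPLACEMENT}" instead.`);
    return DEFAULT_REPLACEMENT;
  }
  if (ILLEGAL_CHARS.test(candidate) || CONTROL_CHARS.test(candidate)) {
    warn(
      `Replacement ${JSON.stringify(candidate)} contains illegal characters; ` +
        `using "${DEFAULT_REPLACEMENT}" instead.`,
    );
    return DEFAULT_REPLACEMENT;
  }
  return candidate;
}

// Usage: collect warnings in an array instead of printing them.
const warnings: string[] = [];
const collect = (message: string): void => {
  warnings.push(message);
};
const accepted = validReplacementOrDefault("_", collect);
const rejected = validReplacementOrDefault("/", collect);
const emptyFallback = validReplacementOrDefault("", collect);
console.log(accepted, rejected, emptyFallback, warnings.length);
```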