Skip to content

saeeddehghan450/extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Ultimate DevTools Extractor

A comprehensive browser automation and data extraction tool built with TypeScript and Playwright. Extract and analyze complete browser state including network traffic, console logs, DOM elements, scripts, styles, performance metrics, and much more.

Features

πŸ”Œ 13 Modular Collectors

Each data collection module can be independently enabled/disabled:

  1. Network - HTTP requests/responses with headers and bodies
  2. Console - Logs, warnings, errors, and browser dialogs
  3. Elements - DOM element information and CDP listeners
  4. Scripts - JavaScript files and inline scripts
  5. CSS - Stylesheets and computed styles
  6. Application - Cookies, local storage, session storage, IndexedDB
  7. Security - SSL/TLS and security headers
  8. Performance - Performance traces and metrics
  9. Accessibility - Accessibility tree
  10. Media - Audio/video element statistics
  11. WebGL - Canvas and WebGL details
  12. Memory - Heap snapshots
  13. Service Worker - Service worker registration info

πŸ“ Module-Based Output

Each collector's data is automatically organized into its own folder:

output/
β”œβ”€β”€ network/
β”‚   β”œβ”€β”€ network-1763561850014.json
β”‚   └── click-1763561850014/
β”œβ”€β”€ console/
β”‚   └── console-1763561850014.json
β”œβ”€β”€ elements/
β”‚   └── elements-1763561850014.json
β”œβ”€β”€ scripts/
β”œβ”€β”€ css/
β”œβ”€β”€ application/
β”œβ”€β”€ security/
β”œβ”€β”€ performance/
β”œβ”€β”€ accessibility/
β”œβ”€β”€ media/
β”œβ”€β”€ webgl/
β”œβ”€β”€ memory/
β”œβ”€β”€ serviceWorker/
└── analysis-1763561850014.json (combined analysis)

βš™οΈ Configuration System

Easily control which collectors are active via config.json:

{
  "targetUrl": "https://www.i2ocr.com/pdf-ocr-persian",
  "headless": false,
  "slowMo": 80,
  "collectors": {
    "network": { "enabled": true, "description": "..." },
    "console": { "enabled": true, "description": "..." },
    ...
  }
}

🎯 Real-time Click Tracking

  • Automatically injects click listeners into pages
  • Captures click event metadata instantly
  • Saves partial snapshots immediately (non-blocking)
  • Runs full data collection in background
  • Prevents blocking of page behavior (e.g., captcha)

Installation

npm install

Usage

Start Data Collection

npm start

This will:

  1. Load configuration from config.json
  2. Launch a Chromium browser
  3. Navigate to the target URL
  4. Listen for clicks and collect data as you interact
  5. Export all collected data when you close the browser

Manage Collectors

Use the collector CLI to enable/disable modules:

# List all collectors and their status
npm run collectors list

# Enable a specific collector
npm run collectors enable network
npm run collectors enable console

# Disable a specific collector
npm run collectors disable memory
npm run collectors disable webgl

# Toggle a collector on/off
npm run collectors toggle performance

# Check status of a single collector
npm run collectors status console

# Enable all collectors
npm run collectors enable-all

# Disable all collectors
npm run collectors disable-all

Example Workflows

Minimal Configuration - Only collect network and console:

npm run collectors disable-all
npm run collectors enable network
npm run collectors enable console
npm run collectors enable elements
npm start

Performance Analysis - Focus on performance metrics:

npm run collectors disable-all
npm run collectors enable performance
npm run collectors enable media
npm run collectors enable webgl
npm start

Full Analysis - Collect everything:

npm run collectors enable-all
npm start

Output Structure

Final Analysis

  • analysis-<timestamp>.json - Combined analysis containing all enabled collectors' data
  • <collector-name>/<collector-name>-<timestamp>.json - Individual collector data files

Click Data

  • click-<timestamp>.partial.json - Quick snapshot saved immediately on click
  • click-<timestamp>.json - Full detailed analysis (saved in background)
  • click-<timestamp>.html - HTML snapshot of the page
  • click-<timestamp>.png - Screenshot
  • <collector-name>/click-<timestamp>/ - Collector-specific click data

Configuration

Edit config.json to customize:

{
  "targetUrl": "https://your-target-url.com",
  "headless": false,           // Set to true for headless mode
  "slowMo": 80,                // Slow down browser actions (ms)
  "collectors": { ... }        // Enable/disable specific collectors
}

Project Structure

β”œβ”€β”€ config.json                 # Collector configuration
β”œβ”€β”€ package.json               # Dependencies and scripts
β”œβ”€β”€ tsconfig.json             # TypeScript configuration
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ main.ts               # Main entry point
β”‚   β”œβ”€β”€ types.ts              # TypeScript type definitions
β”‚   β”œβ”€β”€ bin/
β”‚   β”‚   └── collector-cli.ts   # CLI for managing collectors
β”‚   β”œβ”€β”€ collectors/            # Data collection modules
β”‚   β”‚   β”œβ”€β”€ networkCollector.ts
β”‚   β”‚   β”œβ”€β”€ consoleCollector.ts
β”‚   β”‚   β”œβ”€β”€ elementsCollector.ts
β”‚   β”‚   β”œβ”€β”€ scriptsCollector.ts
β”‚   β”‚   β”œβ”€β”€ cssCollector.ts
β”‚   β”‚   β”œβ”€β”€ applicationCollector.ts
β”‚   β”‚   β”œβ”€β”€ securityCollector.ts
β”‚   β”‚   β”œβ”€β”€ performanceCollector.ts
β”‚   β”‚   β”œβ”€β”€ accessibilityCollector.ts
β”‚   β”‚   β”œβ”€β”€ mediaCollector.ts
β”‚   β”‚   β”œβ”€β”€ webglCollector.ts
β”‚   β”‚   β”œβ”€β”€ memoryCollector.ts
β”‚   β”‚   └── serviceWorkerCollector.ts
β”‚   └── utils/
β”‚       β”œβ”€β”€ configManager.ts   # Configuration management
β”‚       β”œβ”€β”€ cdpClient.ts       # Chrome DevTools Protocol
β”‚       β”œβ”€β”€ fileUtils.ts       # File utilities
β”‚       └── logger.ts          # Logging utilities
└── output/                    # Generated data (organized by module)

API Reference

configManager.ts

// Load configuration
loadConfig(): Config

// Check if collector is enabled
isCollectorEnabled(collectorName: string): boolean

// Get all enabled collectors
getEnabledCollectors(): string[]

// Enable/disable collectors
enableCollector(collectorName: string): void
disableCollector(collectorName: string): void

// Create module directory
createModuleDir(outputDir: string, moduleName: string): string

// Ensure all module directories exist
ensureModuleDirs(outputDir: string): void

Advanced Usage

Custom Target URLs

Set via environment variable:

TARGET_URL=https://example.com npm start

Programmatic Usage

import { isCollectorEnabled, enableCollector } from './src/utils/configManager';

// Check if enabled
if (isCollectorEnabled('network')) {
  // Handle network collection
}

// Enable dynamically
enableCollector('performance');

Performance Considerations

  • Network Collector: Stores response bodies on disk (can use significant space)
  • Memory Collector: Heap snapshots can be large; disable if disk space is limited
  • Performance Traces: Can be 10-50 MB depending on page activity
  • slowMo Setting: Increase for slower systems; decrease for faster collection

Troubleshooting

Configuration not loading

  • Ensure config.json exists in project root
  • Check JSON syntax is valid

Collectors not enabled

  • Use npm run collectors list to verify status
  • Check that collector name matches exactly (case-sensitive)

Out of memory

  • Disable memory-intensive collectors (memory, webgl)
  • Enable only required collectors

Browser not launching

  • Ensure Playwright dependencies are installed: npm install
  • Check Chromium is not already running

License

MIT

Author

Saeed Dehghan

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors