0xme32/endpoint-extractor

Relative API Extractor

Inspired by jobertabma/relative-url-extractor

Overview

endpoint_extractor is a modular, plugin-based tool that extracts API endpoints and path-like strings from arbitrary source code, configuration files, and binary-like assembly formats.
It is implemented in Python 3, with a streaming core and an extensible plugin system, for readability, maintainability, and extensibility.

Compared to naive regex approaches, this tool provides:

  • Streaming processing for very large files.
  • Unicode and escape decoding.
  • Smarter Base64 decoding (entropy, readability, and structural checks).
  • Language-aware parsing (Python AST, Java, Smali, Assembly).
  • Config parsing for JSON, YAML, and .env files.
  • Modular plugins for easy extension.

This program scans arbitrary source code or text files to identify and extract unique API endpoints or path-like strings. It detects strings enclosed in single (') or double (") quotes that begin with a forward slash (/), and it handles both plain endpoint strings and obfuscated forms that use escape sequences such as hexadecimal (\xNN) and Unicode (\uNNNN).

The output is a list of unique endpoints. Optionally, the tool can also display the full line where each endpoint is found.
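
The behavior described above can be illustrated with a minimal sketch. It is not the project's actual code: the names `decode_escapes` and `extract_endpoints` are hypothetical, and the character class is borrowed from the regular expression quoted in the design notes.

```python
import re

# Quoted string that starts with "/"; the closing quote must match the opening one.
ENDPOINT_RE = re.compile(r"""(["'])(/[A-Za-z0-9?/&=#.!:_-]*)\1""")
# \xNN and \uNNNN escape sequences as they appear literally in source text.
ESCAPE_RE = re.compile(r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})")


def decode_escapes(text: str) -> str:
    """Replace literal \\xNN and \\uNNNN sequences with the characters they encode."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def extract_endpoints(lines, show_line=False):
    """Yield each unique endpoint once, optionally paired with its source line."""
    seen = set()
    for line in lines:
        for match in ENDPOINT_RE.finditer(decode_escapes(line)):
            endpoint = match.group(2)
            if endpoint not in seen:
                seen.add(endpoint)
                yield (endpoint, line.rstrip("\n")) if show_line else endpoint
```

Passing `sys.stdin` as `lines` would mirror the stdin-based usage shown later in this README; `show_line=True` stands in for the optional full-line display.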



Features

Core Engine

  • Streaming mode: Processes input line-by-line, without loading entire files into memory.
  • Deduplication: Tracks discovered endpoints to avoid duplicates.
  • Configurable output: Optionally show the original line that produced each endpoint.
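
As a rough illustration of how streaming and deduplication can fit together, the sketch below processes one line at a time and remembers only the endpoints already emitted. The names `stream_endpoints` and `quoted_paths`, and the idea of a plugin as a plain function, are assumptions for this example rather than the actual interface of core/engine.py.

```python
import re
from typing import Callable, Iterable, Iterator, List, Tuple

# A plugin is modeled here as a function mapping one line to candidate endpoints.
Plugin = Callable[[str], Iterable[str]]


def stream_endpoints(
    lines: Iterable[str], plugins: List[Plugin]
) -> Iterator[Tuple[str, int, str]]:
    """Lazily yield (endpoint, line_number, line) the first time an endpoint is seen."""
    seen = set()
    for number, line in enumerate(lines, start=1):  # only one line is held at a time
        for plugin in plugins:
            for endpoint in plugin(line):
                if endpoint not in seen:
                    seen.add(endpoint)
                    yield endpoint, number, line.rstrip("\n")


def quoted_paths(line: str) -> List[str]:
    """Toy plugin: double-quoted strings that start with a slash."""
    return re.findall(r'"(/[^"\s]*)"', line)
```

Because `stream_endpoints` is a generator, a caller can iterate over a file object or `sys.stdin` directly and print each result as it arrives; the line number and line text are available for the optional line display.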

Escape and Obfuscation Handling

  • Decodes escaped strings (\xNN, \uNNNN, etc.).
  • Detects Base64-like strings and decodes them into endpoints.
  • Uses entropy and printable character ratio checks to minimize false positives.
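
One possible shape for the Base64 checks is sketched below. The function names and the thresholds (`min_printable`, `max_entropy`, the minimum length of 8) are illustrative assumptions, not values taken from core/utils.py.

```python
import base64
import binascii
import math
import re
import string
from collections import Counter
from typing import Optional

BASE64_RE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")
PRINTABLE = set(string.printable)


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character (0.0 for an empty string)."""
    if not text:
        return 0.0
    counts = Counter(text)
    return -sum(n / len(text) * math.log2(n / len(text)) for n in counts.values())


def decode_base64_endpoint(
    candidate: str, min_printable: float = 0.95, max_entropy: float = 5.0
) -> Optional[str]:
    """Return the decoded endpoint, or None if the candidate fails any check."""
    # Structural check: Base64 alphabet and a length that is a multiple of four.
    if len(candidate) < 8 or len(candidate) % 4 or not BASE64_RE.match(candidate):
        return None
    try:
        text = base64.b64decode(candidate, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    # Readability check: nearly all decoded characters must be printable.
    if not text or sum(ch in PRINTABLE for ch in text) / len(text) < min_printable:
        return None
    # Entropy check: readable paths tend to score lower than random data.
    if shannon_entropy(text) > max_entropy:
        return None
    # Endpoint-likeness check: the decoded text must look like a path.
    return text if text.startswith("/") else None
```

Running the checks from cheapest to most expensive means most ordinary string literals are rejected before any decoding is attempted.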

Supported Formats and Languages

  • General string literals: Detects quoted strings in any language.
  • JSON/config: Extracts string values that look like endpoints.
  • Python: Resolves concatenations, .format(), and % substitutions.
  • Java: Extracts quoted string literals.
  • Smali (Android bytecode): Extracts const-string instructions.
  • Assembly: Extracts .asciz inline string directives.
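
For the Smali and assembly cases, extraction could look roughly like the following. The regular expressions and the name `extract_literals` are assumptions made for this sketch; the actual plugins may match more or fewer forms.

```python
import re
from typing import List

# Smali: const-string v0, "/path" (const-string/jumbo is accepted as well).
SMALI_RE = re.compile(r'const-string(?:/jumbo)?\s+[vp]\d+\s*,\s*"((?:[^"\\]|\\.)*)"')
# Assembly: .asciz "/path"
ASCIZ_RE = re.compile(r'\.asciz\s+"((?:[^"\\]|\\.)*)"')


def extract_literals(line: str) -> List[str]:
    """Return slash-prefixed string operands of const-string and .asciz."""
    found = SMALI_RE.findall(line) + ASCIZ_RE.findall(line)
    return [value for value in found if value.startswith("/")]
```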

Supported Plugins

  • Simple string literals: Extracts quoted strings and checks for Base64.

  • JSON plugin: Extracts JSON values that look like endpoints.

  • Config plugin: Extracts from .env, YAML, and multi-line JSON.

  • Python plugin:

    • Resolves concatenations (a + b).
    • Supports .format() and % substitutions.
    • Evaluates f-strings.
    • Maintains a symbol table to resolve variables across lines.
  • Java plugin: Extracts string literals (grammar-based parsing can be enabled with tree-sitter).

  • Smali plugin: Extracts const-string assignments.

  • Assembly plugin: Extracts .asciz inline strings.

Original C Version

The features below describe the original C implementation, before the project was rewritten in Python:

  • Reads input from standard input (stdin).

  • Replaces non-ASCII characters with an underscore (_) for safe processing.

  • Decodes:

    • Hexadecimal escapes (e.g., "\x2f\x61\x70\x69" → /api).
    • Unicode escapes (e.g., "\u002fusers" → /users).
  • Regex-based extraction of string literals containing endpoints.

  • Ensures uniqueness of extracted endpoints.

  • Optional display of the complete line containing the endpoint using a command-line flag.

Extensible Plugins

  • Each language or format is handled by a plugin.
  • Plugins can be registered dynamically, making it easy to add new language handlers.


Workflow

Project layout of the Python version:

endpoint_extractor/
├── README.md
├── main.py               # Entry point
├── core/
│   ├── engine.py         # Core engine (streaming, dedupe, emit)
│   └── utils.py          # Escape decoding, Base64 heuristics, entropy
├── plugins/
│   ├── simple_literal.py # Quoted string & Base64
│   ├── json_plugin.py    # JSON/config
│   ├── python_plugin.py  # Python parser
│   ├── java_plugin.py    # Java
│   ├── smali_plugin.py   # Smali
│   └── asm_plugin.py     # Assembly
└── examples/
    └── sample_input.txt

The original C version processes input in these steps:

  1. Input reading: The entire input is read from standard input and stored in memory.
  2. Sanitization: Any non-ASCII characters are replaced with _.
  3. Escape decoding:
    • \xNN sequences are decoded into their ASCII equivalents.
    • \uNNNN sequences are decoded if within the ASCII range; otherwise they are replaced with _.
  4. Regex scanning: Each line of the input is checked against a POSIX regular expression that identifies quoted strings starting with /.
  5. Uniqueness filtering: Endpoints already printed are skipped.
  6. Output:
    • The endpoint itself is always printed.
    • If the program is invoked with --show-line, the entire source line is also displayed under a separator.

Installation

Clone the repository:

git clone https://github.com/0xme32/endpoint-extractor.git
cd endpoint-extractor

Run with Python 3.9+ (no external dependencies required):

python3 main.py < input_file


Usage

Compilation (original C version)

gcc -o main main.c -Wall -O2 -std=c11

Basic execution


cat source_file.js | ./main

With line output

cat source_file.py | ./main --show-line

Example

Input

const base = "/api";
const users = base + "/users";
const obf = "L2FwaS92MS9sb2dpbg=="; // Base64 encoded: /api/v1/login
{"service":"my","url":"/internal/health"}
path = "/v1/{}/info".format("status")
const-string v0, "/smali/endpoint"
.asciz "/asm/path"

Output (with line context)

/api
------------------------------------------------
1: const base = "/api";

/users
------------------------------------------------
2: const users = base + "/users";

/api/v1/login
------------------------------------------------
3: const obf = "L2FwaS92MS9sb2dpbg=="; // Base64 encoded: /api/v1/login

/internal/health
------------------------------------------------
4: {"service":"my","url":"/internal/health"}

/v1/status/info
------------------------------------------------
5: path = "/v1/{}/info".format("status")

/smali/endpoint
------------------------------------------------
6: const-string v0, "/smali/endpoint"

/asm/path
------------------------------------------------
7: .asciz "/asm/path"

Example for the original C version

Input

char *a = "/api/v1/data";
char *b = "\x2f\x61\x70\x69\x2f\x75\x73\x65\x72\x73"; // "/api/users"
char *c = "\u002fitems";

Output (without --show-line)

/api/v1/data
/api/users
/items

Output (with --show-line)

/api/v1/data
------------------------------------------------
char *a = "/api/v1/data";
/api/users
------------------------------------------------
char *b = "\x2f\x61\x70\x69\x2f\x75\x73\x65\x72\x73";
/items
------------------------------------------------
char *c = "\u002fitems";

Design Notes

  • Config plugin supports multiple formats:

    • .env (KEY=VALUE)
    • JSON (multi-line)
    • YAML (if pyyaml is available).
  • Core engine: Manages streaming, deduplication, and output formatting.
  • Plugins: Encapsulate language-specific heuristics, making the system modular.
  • Base64 heuristics:
    • Validates input format.
    • Decodes safely.
    • Requires decoded string to have a high printable ratio.
    • Requires decoded string to contain endpoint-like substrings.
  • Python plugin: Uses ast module for safe parsing and partial evaluation of string expressions.
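
To illustrate what safe partial evaluation with the ast module can mean, here is a small sketch that folds string constants, `+` concatenation, and positional `.format()` calls while tracking simple assignments in a symbol table. It never calls `eval` or `exec`. The function names `fold` and `resolve_assignments` are hypothetical, and f-strings and `%` substitutions are left out for brevity.

```python
import ast
from typing import Dict, Optional


def fold(node: ast.AST, symbols: Dict[str, str]) -> Optional[str]:
    """Return the string a node evaluates to, or None if it cannot be resolved."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.Name):
        return symbols.get(node.id)  # previously resolved variable, if any
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        left, right = fold(node.left, symbols), fold(node.right, symbols)
        return left + right if left is not None and right is not None else None
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "format"
        and not node.keywords
    ):
        template = fold(node.func.value, symbols)
        args = [fold(arg, symbols) for arg in node.args]
        if template is not None and all(arg is not None for arg in args):
            try:
                return template.format(*args)
            except (AttributeError, IndexError, KeyError, TypeError, ValueError):
                return None
    return None


def resolve_assignments(source: str) -> Dict[str, str]:
    """Build a symbol table of names assigned resolvable string expressions."""
    symbols: Dict[str, str] = {}
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign) and len(stmt.targets) == 1:
            target = stmt.targets[0]
            value = fold(stmt.value, symbols)
            if isinstance(target, ast.Name) and value is not None:
                symbols[target.id] = value
    return symbols
```

Anything the folder cannot resolve (function calls, unknown names, non-string values) yields None and is simply skipped, which keeps the approach conservative.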

Design notes for the original C version:

  • Regex choice: The POSIX regular expression ([^\n]*?([\"'])(/[A-Za-z0-9\?\/&=#\.\!:_-]*)(\2).*) was selected to balance readability and functionality.

  • Memory management: Endpoints are dynamically allocated and tracked in an array, then freed before program termination.

  • Scalability: The program uses fixed-size buffers (MAX_CONTENT, MAX_MATCHES), which can be increased as needed.

  • Limitations:

    • Does not resolve variable concatenation (e.g., "/api" + "/users").
    • Only basic Unicode escapes are supported; characters outside ASCII are replaced with _.
    • Base64 or custom obfuscation methods are not decoded in this version.


License

MIT License.


Author

Made by 0xme
