Inspired by jobertabma/relative-url-extractor
endpoint_extractor is a modular, plugin-based tool that extracts API endpoints and path-like strings from arbitrary source code, configuration files, and binary-like formats such as assembly.
It is implemented in Python 3, with a streaming core and an extensible plugin system, for readability, maintainability, and extensibility.
Compared to naive regex approaches, this tool provides:
- Streaming processing for very large files.
- Unicode and escape decoding.
- Smarter Base64 decoding (entropy, readability, and structural checks).
- Language-aware parsing (Python AST, Java, Smali, Assembly).
- Config parsing for JSON, YAML, and `.env` files.
- Modular plugins for easy extension.
This program scans arbitrary source code or text files to identify and extract unique API endpoints or path-like strings. It is designed to detect strings enclosed in single (') or double (") quotes that begin with a forward slash (/).
The tool supports both direct endpoint strings and obfuscated forms that use escape sequences such as hexadecimal (\xNN) and Unicode (\uNNNN) representations.
The output is a list of unique endpoints. Optionally, the tool can also display the full line where each endpoint is found.
- Streaming mode: Processes input line-by-line, without loading entire files into memory.
- Deduplication: Tracks discovered endpoints to avoid duplicates.
- Configurable output: Optionally show the original line that produced each endpoint.
- Decodes escaped strings (`\xNN`, `\uNNNN`, etc.).
- Detects Base64-like strings and decodes them into endpoints.
- Uses entropy and printable character ratio checks to minimize false positives.
- General string literals: Detects quoted strings in any language.
- JSON/config: Extracts string values that look like endpoints.
- Python: Resolves concatenations, `.format()`, and `%` substitutions.
- Java: Extracts quoted string literals.
- Smali (Android bytecode): Extracts `const-string` instructions.
- Assembly: Extracts `.asciz` inline string directives.
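
The streaming and deduplication behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual engine: the `QUOTED_PATH` pattern and the `extract_endpoints` function are hypothetical names, and the pattern only covers plain quoted strings that start with `/`.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical pattern: a single- or double-quoted string starting with "/".
QUOTED_PATH = re.compile(r"""(["'])(/[^"'\s]*)\1""")


def extract_endpoints(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, endpoint, source_line) for each endpoint not seen before."""
    seen = set()  # endpoints already emitted, used for deduplication
    for number, line in enumerate(lines, start=1):
        # Only the current line is held in memory, so large inputs stay cheap.
        for match in QUOTED_PATH.finditer(line):
            endpoint = match.group(2)
            if endpoint not in seen:
                seen.add(endpoint)
                yield number, endpoint, line.rstrip("\n")
```

Passing an open file or `sys.stdin` as `lines` keeps the processing line-by-line; the source line in each tuple is what an option for showing the original line would print.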
- Simple string literals: Extracts quoted strings and checks for Base64.
- JSON plugin: Extracts JSON values that look like endpoints.
- Config plugin: Extracts from `.env`, YAML, and multi-line JSON.
- Python plugin:
  - Resolves concatenations (`a + b`).
  - Supports `.format()` and `%` substitutions.
  - Evaluates f-strings.
  - Maintains a symbol table to resolve variables across lines.
- Java plugin: Extracts string literals (grammar-based parsing can be enabled with tree-sitter).
- Smali plugin: Extracts `const-string` assignments.
- Assembly plugin: Extracts `.asciz` inline strings.
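
As an illustration of how the Python plugin's concatenation handling and symbol table might work, here is a small sketch built on the standard `ast` module. The function names `resolve` and `collect_strings` are assumptions made for this example, and the sketch covers only string constants, variable names, and `+`; `.format()`, `%`, and f-strings are left out.

```python
import ast
from typing import Dict, Optional


def resolve(node: ast.AST, symbols: Dict[str, str]) -> Optional[str]:
    """Return the string value of an expression, or None if it cannot be resolved."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.Name):
        return symbols.get(node.id)  # look up a previously assigned variable
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        left = resolve(node.left, symbols)
        right = resolve(node.right, symbols)
        if left is not None and right is not None:
            return left + right
    return None


def collect_strings(source: str) -> Dict[str, str]:
    """Build a symbol table of simple `name = <string expression>` assignments."""
    symbols: Dict[str, str] = {}
    for stmt in ast.parse(source).body:
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            value = resolve(stmt.value, symbols)
            if value is not None:
                symbols[stmt.targets[0].id] = value
    return symbols
```

Because the code is only parsed and never executed, untrusted input cannot run arbitrary statements through this path.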
- Reads input from standard input (stdin).
- Replaces non-ASCII characters with an underscore (`_`) for safe processing.
- Decodes:
  - Hexadecimal escapes (e.g., `"\x2f\x61\x70\x69"` → `/api`).
  - Unicode escapes (e.g., `"\u002fusers"` → `/users`).
- Regex-based extraction of string literals containing endpoints.
- Ensures uniqueness of extracted endpoints.
- Optional display of the complete line containing the endpoint using a command-line flag.
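
The escape decoding described above could look roughly like this in Python. It is a sketch under assumptions: `ESCAPE` and `decode_escapes` are illustrative names, and code points outside ASCII are replaced with `_` to mirror the behaviour stated in the list.

```python
import re

# Matches \xNN (two hex digits) or \uNNNN (four hex digits) written literally in the text.
ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})")


def decode_escapes(text: str) -> str:
    """Replace \\xNN and \\uNNNN sequences with their ASCII character, or "_" if non-ASCII."""
    def replace(match: "re.Match[str]") -> str:
        code = int(match.group(1) or match.group(2), 16)
        return chr(code) if code < 128 else "_"

    return ESCAPE.sub(replace, text)
```

Running the decoder before the quoted-string scan lets the same extraction pattern handle both plain and obfuscated literals.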
- Each language or format is handled by a plugin.
- Plugins can be registered dynamically, making it easy to add new language handlers.
endpoint_extractor/
├── README.md
├── main.py # Entry point
├── core/
│ ├── engine.py # Core engine (streaming, dedupe, emit)
│ └── utils.py # Escape decoding, Base64 heuristics, entropy
├── plugins/
│ ├── simple_literal.py # Quoted string & Base64
│ ├── json_plugin.py # JSON/config
│ ├── python_plugin.py # Python parser
│ ├── java_plugin.py # Java
│ ├── smali_plugin.py # Smali
│ └── asm_plugin.py # Assembly
└── examples/
└── sample_input.txt
- Input reading: The entire input is read from standard input and stored in memory.
- Sanitization: Any non-ASCII characters are replaced with `_`.
- Escape decoding:
  - `\xNN` sequences are decoded into their ASCII equivalents.
  - `\uNNNN` sequences are decoded if within the ASCII range; otherwise replaced with `_`.
- Regex scanning: Each line of the input is checked against a POSIX regular expression that identifies quoted strings starting with `/`.
- Uniqueness filtering: Endpoints already printed are skipped.
- Output:
  - The endpoint itself is always printed.
  - If the program is invoked with `--show-line`, the entire source line is also displayed under a separator.

Clone the repository:

    git clone https://github.com/0xme32/endpoint-extractor.git
    cd endpoint-extractor

Run with Python 3.9+ (no external dependencies required):

    python3 main.py < input_file

    gcc -o main main.c -Wall -O2 -std=c11

    cat source_file.js | ./main
    cat source_file.py | ./main --show-line

const base = "/api";
const users = base + "/users";
const obf = "L2FwaS92MS9sb2dpbg=="; // Base64 encoded: /api/v1/login
{"service":"my","url":"/internal/health"}
path = "/v1/{}/info".format("status")
const-string v0, "/smali/endpoint"
.asciz "/asm/path"
```c
char *a = "/api/v1/data";
char *b = "\x2f\x61\x70\x69\x2f\x75\x73\x65\x72\x73"; // "/api/users"
char *c = "\u002fitems";
```
/api
------------------------------------------------
1: const base = "/api";
/users
------------------------------------------------
2: const users = base + "/users";
/api/v1/login
------------------------------------------------
3: const obf = "L2FwaS92MS9sb2dpbg=="; // Base64 encoded: /api/v1/login
/internal/health
------------------------------------------------
4: {"service":"my","url":"/internal/health"}
/v1/status/info
------------------------------------------------
5: path = "/v1/{}/info".format("status")
/smali/endpoint
------------------------------------------------
6: const-string v0, "/smali/endpoint"
/asm/path
------------------------------------------------
7: .asciz "/asm/path"

/api/v1/data
/api/users
/items

/api/v1/data
------------------------------------------------
char *a = "/api/v1/data";
/api/users
------------------------------------------------
char *b = "\x2f\x61\x70\x69\x2f\x75\x73\x65\x72\x73";
/items
------------------------------------------------
char *c = "\u002fitems";

- Core manages deduplication and streaming.
- Plugins handle syntax-specific parsing.
- Python plugin uses the `ast` module to safely evaluate partial string expressions.
- Config plugin supports multiple formats:
  - `.env` → `KEY=VALUE`
  - JSON (multi-line)
  - YAML (if `pyyaml` available).
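
For the `.env` format, a `KEY=VALUE` scan might be as simple as the sketch below. The function name `env_endpoints` is an assumption for this example, and treating a value as an endpoint only when it starts with `/` is a simplification of whatever check the config plugin actually applies.

```python
from typing import Iterable, Iterator, Tuple


def env_endpoints(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (key, value) pairs from KEY=VALUE lines whose value looks like a path."""
    for line in lines:
        stripped = line.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key, _, value = stripped.partition("=")
        value = value.strip().strip("\"'")  # drop optional surrounding quotes
        if value.startswith("/"):
            yield key.strip(), value
```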
- Core engine: Manages streaming, deduplication, and output formatting.
- Plugins: Encapsulate language-specific heuristics, making the system modular.
- Base64 heuristics:
- Validates input format.
- Decodes safely.
- Requires decoded string to have a high printable ratio.
- Requires decoded string to contain endpoint-like substrings.
- Python plugin: Uses the `ast` module for safe parsing and partial evaluation of string expressions.
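
The four Base64 checks listed above can be sketched as a single function. This is an illustration only: `decode_base64_endpoint`, the minimum length of 8, and the 0.95 printable threshold are assumptions chosen for the example, and the entropy check mentioned elsewhere in this README is omitted for brevity.

```python
import base64
import binascii
import re
import string
from typing import Optional

B64_SHAPE = re.compile(r"[A-Za-z0-9+/]+={0,2}")
PRINTABLE = set(string.printable) - set("\x0b\x0c")


def decode_base64_endpoint(candidate: str, min_printable: float = 0.95) -> Optional[str]:
    """Return the decoded endpoint, or None if the candidate fails any heuristic."""
    # 1. Validate the input format (alphabet, padding, length multiple of 4).
    if len(candidate) < 8 or len(candidate) % 4 != 0 or not B64_SHAPE.fullmatch(candidate):
        return None
    # 2. Decode safely: malformed input or non-UTF-8 bytes are rejected, not raised.
    try:
        text = base64.b64decode(candidate, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    if not text:
        return None
    # 3. Require a high ratio of printable characters.
    ratio = sum(ch in PRINTABLE for ch in text) / len(text)
    if ratio < min_printable:
        return None
    # 4. Require an endpoint-like result (here: a leading "/").
    return text if text.startswith("/") else None
```

Checking the cheap structural conditions first means most ordinary string literals are rejected before any decoding is attempted.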
MIT License.
- Regex choice: The POSIX regular expression `([^\n]*?([\"'])(/[A-Za-z0-9\?\/&=#\.\!:_-]*)(\2).*)` was selected to balance readability and functionality.
- Memory management: Endpoints are dynamically allocated and tracked in an array, then freed before program termination.
- Scalability: The program uses fixed-size buffers (`MAX_CONTENT`, `MAX_MATCHES`), which can be increased as needed.
- Limitations:
  - Does not resolve variable concatenation (e.g., `"/api" + "/users"`).
  - Only basic Unicode escapes are supported; characters outside ASCII are replaced with `_`.
  - Base64 or custom obfuscation methods are not decoded in this version.
Made by 0xme