JAVDB AutoSpider API Usage Guide

This document explains how to use JAVDB AutoSpider's parsing capabilities through two interfaces: the Python module API and the REST API.


Installation

pip install -r requirements.txt

Core dependencies: beautifulsoup4, lxml, javdb_rust_core (optional). The REST API additionally requires: fastapi, uvicorn.

Rust first, Python fallback: When javdb_rust_core (a Rust extension compiled via PyO3 + maturin) is available, the system automatically uses the Rust parser implementation for a 5-10x performance boost. When javdb_rust_core is unavailable, it falls back to the pure Python implementation based on beautifulsoup4 / lxml. The switch is transparent to callers: no API calls need to change.
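
The fallback described above follows the usual optional-import pattern. The sketch below is an illustration only, not the project's actual loader: select_backend and the "rust" / "python" labels are hypothetical names, and the only detail taken from this guide is that the extension is importable as javdb_rust_core when installed.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def select_backend(module_name: str = "javdb_rust_core") -> Tuple[str, Optional[ModuleType]]:
    """Return ("rust", module) if the extension imports, else ("python", None)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Extension not installed or not built for this platform:
        # callers keep using the pure Python parser.
        return "python", None
    return "rust", module


backend, _module = select_backend()
print(f"Parser backend in use: {backend}")
```

Because the choice is made once at import time, calling code never needs to branch on the backend.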


1. Python Module API

All parsing functions accept an HTML string and return structured dataclass objects. The parsers perform no business-level filtering (no phase1/phase2 distinction, no subtitle/date tag filtering) and return all raw data present on the page.

Parser functions and dataclasses live in javdb.parsing and javdb.parsing.models. Import them from there.

1.1 Parsing Index Pages

Parses any page containing a movie list (works for the home page, category pages, ranking pages, etc.).

from javdb.parsing import parse_index_page

# Read HTML content
with open('page.html', 'r', encoding='utf-8') as f:
    html = f.read()

result = parse_index_page(html, page_num=1)

# Basic information
print(result.has_movie_list)  # True / False
print(result.page_title)      # Page title
print(len(result.movies))     # Number of movies

# Iterate over each movie
for movie in result.movies:
    print(f"Code: {movie.video_code}")
    print(f"Title: {movie.title}")
    print(f"Rating: {movie.rate}")
    print(f"Review count: {movie.comment_count}")
    print(f"Release date: {movie.release_date}")
    print(f"Tags: {movie.tags}")           # ['含中字磁鏈', '今日新種']
    print(f"Cover: {movie.cover_url}")
    print(f"Link: {movie.href}")
    print(f"Ranking: {movie.ranking}")     # Only populated on ranking pages
    print()

# Convert to dict (convenient for JSON serialization)
data = result.to_dict()

Return type: IndexPageResult

| Field | Type | Description |
| --- | --- | --- |
| has_movie_list | bool | Whether the page contains a movie list |
| movies | List[MovieIndexEntry] | All movie entries |
| page_title | str | Page <title> text |

MovieIndexEntry fields:

| Field | Type | Description |
| --- | --- | --- |
| href | str | Movie detail page link |
| video_code | str | Video code (e.g., "STAR-123") |
| title | str | Movie title |
| rate | str | Rating (e.g., "4.47") |
| comment_count | str | Number of reviews (e.g., "595") |
| release_date | str | Release date (e.g., "2026-02-11") |
| tags | List[str] | Page tags (e.g., ["含中字磁鏈", "今日新種"]) |
| cover_url | str | Cover image URL |
| page | int | Page number |
| ranking | Optional[int] | Ranking (only populated on ranking pages) |
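
Because the parsers return every entry unfiltered, any filtering is left to the caller. The sketch below shows one way to do that on plain dicts, assuming result.to_dict()["movies"] mirrors the JSON shown in Section 2.3. The helper names parse_rate and filter_movies, the threshold, and the sample values are hypothetical.

```python
from typing import Dict, List, Optional


def parse_rate(rate: str) -> Optional[float]:
    """Convert a rating string such as "4.47" to float; None if not numeric."""
    try:
        return float(rate)
    except (TypeError, ValueError):
        return None


def filter_movies(movies: List[Dict], required_tag: str, min_rate: float) -> List[Dict]:
    """Keep entries that carry required_tag and have a rating >= min_rate."""
    kept = []
    for movie in movies:
        rate = parse_rate(movie.get("rate", ""))
        has_tag = required_tag in movie.get("tags", [])
        if has_tag and rate is not None and rate >= min_rate:
            kept.append(movie)
    return kept


# Made-up entries in the assumed shape of result.to_dict()["movies"].
sample = [
    {"video_code": "ABC-123", "rate": "4.47", "tags": ["含中字磁鏈", "今日新種"]},
    {"video_code": "ABC-124", "rate": "3.10", "tags": ["含中字磁鏈"]},
    {"video_code": "ABC-125", "rate": "", "tags": []},
]
for movie in filter_movies(sample, required_tag="含中字磁鏈", min_rate=4.0):
    print(movie["video_code"])
```

Since rate is a string field, entries with an empty or non-numeric rating are skipped rather than treated as zero.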

1.2 Parsing Movie Detail Pages

Extracts full metadata from a movie detail page.

from javdb.parsing import parse_detail_page

with open('detail.html', 'r', encoding='utf-8') as f:
    html = f.read()

detail = parse_detail_page(html)

# ---- Basic information ----
print(f"Title: {detail.title}")
print(f"Code: {detail.video_code}")
print(f"Code prefix link: {detail.code_prefix_link}")  # e.g., /video_codes/VDD
print(f"Duration: {detail.duration}")
print(f"Release date: {detail.release_date}")

# ---- Related entities (MovieLink: name + href) ----
if detail.maker:
    print(f"Maker: {detail.maker.name} ({detail.maker.href})")
if detail.publisher:
    print(f"Publisher: {detail.publisher.name}")
if detail.series:
    print(f"Series: {detail.series.name}")
for d in detail.directors:
    print(f"Director: {d.name}")
for a in detail.actors:
    # ActorCredit: name, href, gender ('female' / 'male' / '')
    print(f"Actor: {a.name} ({a.href}) [{a.gender}]")
for t in detail.tags:
    print(f"Genre tag: {t.name}")

# ---- Ratings ----
print(f"Rating: {detail.rate}")
print(f"Review count: {detail.comment_count}")
print(f"Short reviews: {detail.review_count}")
print(f"Want to watch: {detail.want_count} people")
print(f"Watched: {detail.watched_count} people")

# ---- Media resources ----
print(f"Poster: {detail.poster_url}")
print(f"Fanart: {detail.fanart_urls}")      # List of full-size image URLs
print(f"Trailer: {detail.trailer_url}")

# ---- Magnet links ----
for m in detail.magnets:
    print(f"Magnet: {m.name} | Size: {m.size} | Tags: {m.tags} | Date: {m.timestamp}")
    print(f"  Link: {m.href}")

# ---- Legacy interface compatibility / Lead and supporting actors ----
actor_name = detail.get_first_actor_name()        # First (lead) actor name
actor_gender = detail.get_first_actor_gender()    # Lead actor gender
supporting_json = detail.get_supporting_actors_json()  # Supporting actors JSON (for DB storage)
d = detail.to_dict()  # Includes lead_actor and supporting_actors convenience fields
magnets_list = detail.get_magnets_as_legacy()     # List[dict] format

Return type: MovieDetail

| Field | Type | Description |
| --- | --- | --- |
| title | str | Movie title |
| video_code | str | Video code |
| code_prefix_link | str | Code prefix page link (e.g., /video_codes/VDD) |
| duration | str | Duration |
| release_date | str | Release date |
| publisher | Optional[MovieLink] | Publisher |
| maker | Optional[MovieLink] | Maker |
| series | Optional[MovieLink] | Series |
| directors | List[MovieLink] | List of directors |
| tags | List[MovieLink] | List of genre tags |
| rate | str | Rating |
| comment_count | str | Number of reviews |
| poster_url | str | Poster URL |
| fanart_urls | List[str] | List of fanart URLs |
| trailer_url | Optional[str] | Trailer URL |
| actors | List[ActorCredit] | List of actors (order matches page; includes gender) |
| lead_actor | Optional[dict] | Lead actor {name, href, gender} in to_dict() output |
| supporting_actors | List[dict] | Remaining actors in to_dict() output |
| magnets | List[MagnetInfo] | List of magnet links |
| review_count | int | Number of short reviews |
| want_count | int | Number of "want to watch" |
| watched_count | int | Number of "watched" |
| parse_success | bool | Whether the magnet links section was found |
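
Magnet sizes are strings such as "4.94GB", so comparing magnets requires converting them to numbers first. The following is a minimal sketch working on dicts in the shape of the magnets array from to_dict(). It assumes the size format is "<number><unit>" with units KB, MB, GB or TB and binary multiples; only "4.94GB" actually appears in this guide. size_to_bytes and pick_largest are hypothetical helpers, and all entries except the first are made up.

```python
import re
from typing import Dict, List, Optional

# Assumption: binary multiples; the guide does not say which convention applies.
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB)\s*$", re.IGNORECASE)


def size_to_bytes(size: str) -> Optional[int]:
    """Parse a size string like "4.94GB"; return None if it does not match."""
    match = _SIZE_RE.match(size or "")
    if match is None:
        return None
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def pick_largest(magnets: List[Dict], preferred_tag: Optional[str] = None) -> Optional[Dict]:
    """Return the largest magnet, restricted to preferred_tag when any entry has it."""
    sized = [m for m in magnets if size_to_bytes(m.get("size", "")) is not None]
    tagged = [m for m in sized if preferred_tag in m.get("tags", [])]
    pool = tagged or sized
    if not pool:
        return None
    return max(pool, key=lambda m: size_to_bytes(m["size"]))


magnets = [
    {"name": "VDD-201.torrent", "size": "4.94GB", "tags": ["字幕", "HD"]},
    {"name": "example-small.torrent", "size": "980MB", "tags": ["HD"]},
    {"name": "example-large.torrent", "size": "9.1GB", "tags": []},
]
best = pick_largest(magnets, preferred_tag="字幕")
print(best["name"])
```

Entries whose size cannot be parsed are ignored instead of raising, which keeps the selection robust if the page uses a format this sketch does not anticipate.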

1.3 Parsing Category Pages

Parses category pages for makers, publishers, series, directors, code prefixes, actors, etc., extracting additional category information.

from javdb.parsing import parse_category_page

result = parse_category_page(html, page_num=1)

print(f"Category type: {result.category_type}")   # e.g., 'makers', 'directors'
print(f"Category name: {result.category_name}")   # e.g., 'PRESTIGE'
print(f"Movie count: {len(result.movies)}")

# The movies field is identical to IndexPageResult
for movie in result.movies:
    print(f"  {movie.video_code} - {movie.title}")

Return type: CategoryPageResult (extends IndexPageResult)

| Additional field | Type | Description |
| --- | --- | --- |
| category_type | str | Category type (makers, publishers, series, directors, video_codes, actors) |
| category_name | str | Category display name |

1.4 Parsing Ranking Pages

Parses Top250, daily/weekly/monthly ranking pages, etc.

from javdb.parsing import parse_top_page

result = parse_top_page(html, page_num=1)

print(f"Ranking type: {result.top_type}")   # 'top250', 'top_movies', 'top_playback'
print(f"Period: {result.period}")           # '2025', 'daily', 'weekly', 'monthly'

for movie in result.movies:
    print(f"  #{movie.ranking} {movie.video_code} - Rating: {movie.rate}")

Return type: TopPageResult (extends IndexPageResult)

| Additional field | Type | Description |
| --- | --- | --- |
| top_type | str | Ranking type |
| period | Optional[str] | Time period |
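
Since ranking is Optional[int] and only populated on ranking pages, code that handles mixed movie lists should account for missing values. A small sketch on plain dicts follows; top_n is a hypothetical helper and the sample codes are made up.

```python
from typing import Dict, List


def top_n(movies: List[Dict], n: int) -> List[Dict]:
    """Return the n best-ranked entries, ignoring entries without a ranking."""
    ranked = [m for m in movies if m.get("ranking") is not None]
    return sorted(ranked, key=lambda m: m["ranking"])[:n]


movies = [
    {"video_code": "AAA-001", "ranking": 2},
    {"video_code": "AAA-002", "ranking": None},  # e.g., parsed from a non-ranking page
    {"video_code": "AAA-003", "ranking": 1},
]
for movie in top_n(movies, 2):
    print(f"#{movie['ranking']} {movie['video_code']}")
```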

1.5 Parsing Tag Filter Pages

Parses the /tags page, extracting the complete tag filter panel (all categories, all tag options, ID-to-name mappings) along with the movie list.

from javdb.parsing import parse_tag_page

result = parse_tag_page(html, page_num=1)

# ---- Movie list (same as IndexPageResult) ----
print(f"Movie count: {len(result.movies)}")

# ---- Current filter state ----
print(f"Current selections: {result.current_selections}")
# Output: {'1': '23', '5': '24', '6': '29', '7': '28', '11': '2026'}

# ---- View all categories ----
for cat in result.categories:
    print(f"\nCategory c{cat.category_id}: {cat.name}")
    for opt in cat.options:
        status = " [selected]" if opt.selected else ""
        id_info = f" (ID: {opt.tag_id})" if opt.tag_id else " (ID unknown)"
        print(f"  - {opt.name}{id_info}{status}")

# ---- Look up by category ID ----
cat4 = result.get_category_by_id('4')        # Body Type (體型)
print(f"Body Type category has {len(cat4.options)} tags")

# ---- Look up by category name (the API uses the Chinese name as identifier) ----
cat = result.get_category_by_name('行爲')     # Behavior
print(f"Behavior category ID: c{cat.category_id}")

# ---- Get ID-to-name mapping (names are returned in Chinese as the API stores them) ----
id_map = cat4.get_id_to_name_map()
print(id_map['15'])    # → '熟女'  (Mature Woman)
print(id_map['17'])    # → '巨乳'  (Big Breasts)

# ---- Get name-to-ID mapping (reverse lookup; key is the Chinese tag name) ----
name_map = cat4.get_name_to_id_map()
print(name_map['熟女'])  # → '15'

# ---- Get global mapping across all categories ----
full_map = result.get_full_id_to_name_map()
print(full_map[('4', '15')])   # → '熟女'     (Body Type category, ID 15)
print(full_map[('1', '23')])   # → '淫亂真實'  (Theme category, ID 23)
print(full_map[('7', '28')])   # → '單體作品'  (Type category, ID 28)

# ---- View selected tags in a category ----
cat7 = result.get_category_by_id('7')
for sel in cat7.get_selected():
    print(f"Selected: {sel.name} (ID: {sel.tag_id})")

Return type: TagPageResult (extends IndexPageResult)

| Additional field | Type | Description |
| --- | --- | --- |
| categories | List[TagCategory] | All filter categories |
| current_selections | dict | Current selection state {category_id: "tag_ids"} |

TagCategory fields:

| Field | Type | Description |
| --- | --- | --- |
| category_id | str | Category ID (corresponds to URL parameter c{N}) |
| name | str | Category name in Chinese as returned by the API (e.g., "主題" / Theme, "體型" / Body Type) |
| options | List[TagOption] | All tag options under this category |

TagCategory convenience methods:

| Method | Returns | Description |
| --- | --- | --- |
| get_id_to_name_map() | dict | {tag_id: name} mapping |
| get_name_to_id_map() | dict | {name: tag_id} reverse mapping |
| get_selected() | List[TagOption] | Currently selected tags |
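
Because each category_id corresponds to a c{N} URL parameter, a current_selections dict can be turned back into a query string. The sketch below assumes the filter page lives at /tags and takes one c{N} parameter per category with the value passed through as-is; that URL layout is inferred from the c{N} convention and is not confirmed by this guide. build_tags_path is a hypothetical helper.

```python
from typing import Dict
from urllib.parse import urlencode


def build_tags_path(selections: Dict[str, str]) -> str:
    """Build a /tags path from {category_id: tag_ids} (assumed URL layout)."""
    # Sort by numeric category ID so the output is deterministic.
    ordered = sorted(selections.items(), key=lambda kv: int(kv[0]))
    params = [(f"c{category_id}", value) for category_id, value in ordered]
    # urlencode percent-encodes reserved characters in values.
    return "/tags?" + urlencode(params) if params else "/tags"


print(build_tags_path({"1": "23", "7": "28", "11": "2026"}))
# /tags?c1=23&c7=28&c11=2026
```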

Tag Category ID Reference Table:

The category names and tag values below are the actual identifiers used by the JavDB API (in Traditional Chinese). The English in parentheses is an explanatory translation only. When querying the API or matching against parser output, use the Chinese strings verbatim.

| URL Parameter | Category Name (CN / EN) | Example Tags (value with ID) |
| --- | --- | --- |
| c10 | 基本 (Basic) | 可播放 / Playable (6), 含磁鏈 / With Magnet (1), 含字幕 / With Subtitles (2) |
| c11 | 年份 (Year) | 2026, 2025, 2024... |
| c1 | 主題 (Theme) | 淫亂真實 / Promiscuous Realistic (23), 出軌 / Infidelity (51), 強姦 / Forced (52) |
| c2 | 角色 (Role) | 高中女生 / High School Girl (1), 美少女 / Beautiful Girl (5), 已婚婦女 / Married Woman |
| c3 | 服裝 (Costume) | 眼鏡 / Glasses (3), 角色扮演 / Cosplay (43), 制服 / Uniform |
| c4 | 體型 (Body Type) | 熟女 / Mature Woman (15), 巨乳 / Big Breasts (17), 蘿莉塔 / Lolita |
| c5 | 行爲 (Behavior) | 乳交 / Paizuri (14), 中出 / Creampie (18), 多P / Multiple Partners (24) |
| c6 | 玩法 (Play Style) | 捆綁 / Bondage (29), 凌辱 / Humiliation, SM |
| c7 | 類別 (Type) | 單體作品 / Solo Work (28), VR (212), 4K (347) |
| c9 | 時長 (Duration) | lt-45, 45-90, 90-120, gt-120 |

Note: The numbers in parentheses after a tag are tag IDs. When only a few categories are selected on the page, most tag IDs can be extracted from the HTML. When multiple categories are selected, some tags may have an empty tag_id. To build the complete mapping, parse a page with fewer selections (e.g., only one category selected).
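
One way to act on this note is to parse several pages with different selections and merge whatever IDs each page exposes. The sketch below works on plain dicts in the shape of the categories array from Section 2.7 and skips options whose tag_id is empty. merge_tag_maps is a hypothetical helper; the resulting keys follow the (category_id, tag_id) form used by get_full_id_to_name_map().

```python
from typing import Dict, Iterable, List, Tuple


def merge_tag_maps(pages: Iterable[List[Dict]]) -> Dict[Tuple[str, str], str]:
    """Merge (category_id, tag_id) -> name across several pages' category lists."""
    merged: Dict[Tuple[str, str], str] = {}
    for categories in pages:
        for category in categories:
            for option in category.get("options", []):
                tag_id = option.get("tag_id")
                if not tag_id:
                    continue  # ID could not be extracted on this page
                # First page that supplies an ID wins; later pages do not overwrite.
                merged.setdefault((category["category_id"], tag_id), option["name"])
    return merged


# Two pages' worth of categories; on page_a the ID for 熟女 is missing.
page_a = [{"category_id": "4", "name": "體型", "options": [
    {"name": "熟女", "tag_id": "", "selected": False},
    {"name": "巨乳", "tag_id": "17", "selected": False},
]}]
page_b = [{"category_id": "4", "name": "體型", "options": [
    {"name": "熟女", "tag_id": "15", "selected": False},
]}]
full_map = merge_tag_maps([page_a, page_b])
print(full_map[("4", "15")])
```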


1.6 Auto-Detecting Page Type

Not sure what type of page the HTML is? Let the parser auto-detect it.

from javdb.parsing import detect_page_type

page_type = detect_page_type(html)
# Returns: 'index', 'detail', 'top250', 'top_movies', 'makers',
#          'publishers', 'series', 'directors', 'video_codes',
#          'actors', 'tags', or 'unknown'
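
The detected type can then drive parser selection. The mapping below is an assumption inferred from Sections 1.1 to 1.5 (category types go to parse_category_page, ranking types to parse_top_page); it is not taken from the project's source. parse_by_type is a hypothetical helper, and a stand-in namespace replaces javdb.parsing so the sketch runs without the package installed.

```python
from types import SimpleNamespace
from typing import Any, Optional

# Assumed mapping from detect_page_type() results to parser function names.
PARSER_FOR_PAGE_TYPE = {
    "index": "parse_index_page",
    "detail": "parse_detail_page",
    "top250": "parse_top_page",
    "top_movies": "parse_top_page",
    "makers": "parse_category_page",
    "publishers": "parse_category_page",
    "series": "parse_category_page",
    "directors": "parse_category_page",
    "video_codes": "parse_category_page",
    "actors": "parse_category_page",
    "tags": "parse_tag_page",
}


def parse_by_type(parsers: Any, page_type: str, html: str, page_num: int = 1) -> Optional[Any]:
    """Call the parser matching page_type; return None for 'unknown' or unmapped types."""
    name = PARSER_FOR_PAGE_TYPE.get(page_type)
    if name is None:
        return None
    parser = getattr(parsers, name)
    if name == "parse_detail_page":
        return parser(html)  # the detail parser takes no page number
    return parser(html, page_num=page_num)


# Stand-in for the real module, for demonstration only.
fake_parsers = SimpleNamespace(
    parse_detail_page=lambda html: "detail result",
    parse_top_page=lambda html, page_num=1: f"top result (page {page_num})",
)
print(parse_by_type(fake_parsers, "top250", "<html></html>", page_num=2))
```

With the real package, pass the javdb.parsing module as parsers and the return value of detect_page_type(html) as page_type.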

2. REST API

The REST API is a thin wrapper around the Python module API, built with the FastAPI framework. All parsing endpoints accept an HTML string and return JSON.

2.1 Starting the Server

# Development mode (auto-reload)
uvicorn apps.api.server:app --reload --port 8100

# Production mode
uvicorn apps.api.server:app --host 0.0.0.0 --port 8100 --workers 4

After starting, visit http://localhost:8100/docs to view the auto-generated Swagger documentation.

2.2 Health Check

curl http://localhost:8100/api/health
{"status": "ok"}

2.3 Parsing Index Pages

curl -X POST http://localhost:8100/api/parse/index \
  -H "Content-Type: application/json" \
  -d '{"html": "<html>...</html>", "page_num": 1}'

Response example:

{
  "has_movie_list": true,
  "page_title": "JavDB",
  "movies": [
    {
      "href": "/v/ABC-123",
      "video_code": "ABC-123",
      "title": "Movie title...",
      "rate": "4.47",
      "comment_count": "595",
      "release_date": "2026-02-11",
      "tags": ["含中字磁鏈", "今日新種"],
      "cover_url": "https://..../cover.jpg",
      "page": 1,
      "ranking": null
    }
  ]
}
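
The same request can be issued from Python using only the standard library. The sketch below builds the POST request without sending it, so it runs without a server; build_parse_request is a hypothetical helper, and the base URL matches the port used in Section 2.1.

```python
import json
import urllib.request
from typing import Optional


def build_parse_request(
    base_url: str, path: str, html: str, page_num: Optional[int] = None
) -> urllib.request.Request:
    """Build a JSON POST request for one of the parse endpoints."""
    payload = {"html": html}
    if page_num is not None:
        payload["page_num"] = page_num
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_parse_request(
    "http://localhost:8100", "/api/parse/index", "<html>...</html>", page_num=1
)
print(request.get_method(), request.full_url)
```

To send it against a running server, pass the request to urllib.request.urlopen(request) and decode the response body with json.load.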

2.4 Parsing Movie Detail Pages

curl -X POST http://localhost:8100/api/parse/detail \
  -H "Content-Type: application/json" \
  -d '{"html": "<html>...</html>"}'

Response example:

{
  "title": "脅迫スイートルーム ...",
  "video_code": "VDD-201",
  "code_prefix_link": "/video_codes/VDD",
  "duration": "130分鍾",
  "release_date": "2026-02-06",
  "maker": {"name": "ドリームチケット", "href": "/makers/wm?f=download"},
  "publisher": null,
  "series": {"name": "脅迫スイートルーム", "href": "/series/KdqA"},
  "directors": [{"name": "沢庵", "href": "/directors/pz9"}],
  "tags": [
    {"name": "美乳", "href": "/tags?c4=..."},
    {"name": "女教師", "href": "/tags?c2=..."}
  ],
  "rate": "3.95",
  "comment_count": "191",
  "poster_url": "https://.../cover.jpg",
  "fanart_urls": ["https://.../sample1.jpg", "https://.../sample2.jpg"],
  "trailer_url": "https://.../preview.mp4",
  "actors": [
    {"name": "真北祈", "href": "/actors/450wJ", "gender": "female"},
    {"name": "マッスル澤野", "href": "...", "gender": "male"}
  ],
  "lead_actor": {"name": "真北祈", "href": "/actors/450wJ", "gender": "female"},
  "supporting_actors": [{"name": "マッスル澤野", "href": "...", "gender": "male"}],
  "magnets": [
    {
      "href": "magnet:?xt=urn:btih:...",
      "name": "VDD-201.torrent",
      "tags": ["字幕", "HD"],
      "size": "4.94GB",
      "timestamp": "2026-02-10"
    }
  ],
  "review_count": 4,
  "want_count": 1030,
  "watched_count": 191,
  "parse_success": true
}
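
Section 1.2 describes the first actor as the lead and the remaining actors as supporting. A client that only keeps the actors array can reproduce that split as sketched below. split_actors is a hypothetical helper based on that description, not on the project's source.

```python
from typing import Dict, List, Optional, Tuple


def split_actors(actors: List[Dict]) -> Tuple[Optional[Dict], List[Dict]]:
    """Return (lead, supporting): the first actor and the remaining actors."""
    if not actors:
        return None, []
    return actors[0], list(actors[1:])


actors = [
    {"name": "真北祈", "gender": "female"},
    {"name": "マッスル澤野", "gender": "male"},
]
lead, supporting = split_actors(actors)
print(lead["name"], [a["name"] for a in supporting])
```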

2.5 Parsing Category Pages

curl -X POST http://localhost:8100/api/parse/category \
  -H "Content-Type: application/json" \
  -d '{"html": "<html>...</html>", "page_num": 1}'

Response: Same structure as index pages, with additional category_type and category_name fields.

2.6 Parsing Ranking Pages

curl -X POST http://localhost:8100/api/parse/top \
  -H "Content-Type: application/json" \
  -d '{"html": "<html>...</html>", "page_num": 1}'

Response: Same structure as index pages, with additional top_type and period fields. The ranking field on each movie is populated.

2.7 Parsing Tag Filter Pages

curl -X POST http://localhost:8100/api/parse/tags \
  -H "Content-Type: application/json" \
  -d '{"html": "<html>...</html>", "page_num": 1}'

Response example (key portions):

{
  "has_movie_list": true,
  "movies": [...],
  "current_selections": {"1": "23", "7": "28", "11": "2026"},
  "categories": [
    {
      "category_id": "4",
      "name": "體型",
      "options": [
        {"name": "熟女", "tag_id": "15", "selected": false},
        {"name": "巨乳", "tag_id": "17", "selected": false},
        {"name": "蘿莉塔", "tag_id": "19", "selected": false}
      ]
    },
    {
      "category_id": "7",
      "name": "類別",
      "options": [
        {"name": "單體作品", "tag_id": "28", "selected": true},
        {"name": "VR", "tag_id": "212", "selected": false},
        {"name": "4K", "tag_id": "347", "selected": false}
      ]
    }
  ]
}

2.8 Detecting Page Type

curl -X POST http://localhost:8100/api/detect-page-type \
  -H "Content-Type: application/json" \
  -d '{"html": "<html>...</html>"}'
{"page_type": "detail"}

3. Data Model Reference

All models are Python dataclass objects and support the .to_dict() method for conversion to dicts.

Model Inheritance Hierarchy

IndexPageResult
├── CategoryPageResult   (+ category_type, category_name)
├── TopPageResult        (+ top_type, period)
└── TagPageResult        (+ categories, current_selections)
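
To make the hierarchy concrete, here is a simplified sketch of how a dataclass subclass adds fields to the base result and how to_dict() could be built on dataclasses.asdict. The class names carry a Sketch suffix because these are stand-ins, not the actual definitions in javdb.parsing.models; the default values are also assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class IndexPageResultSketch:
    has_movie_list: bool = False
    movies: List[dict] = field(default_factory=list)
    page_title: str = ""

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, lists and dicts.
        return asdict(self)


@dataclass
class CategoryPageResultSketch(IndexPageResultSketch):
    # Extra fields appear after the inherited ones.
    category_type: str = ""
    category_name: str = ""


result = CategoryPageResultSketch(
    has_movie_list=True, category_type="makers", category_name="PRESTIGE"
)
print(result.to_dict())
```

Because the subclass is still an instance of the base class, code written against the index page result also accepts category, ranking and tag page results.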

Common Models

| Model | Fields | Description |
| --- | --- | --- |
| MovieLink | name, href | Generic link (actors, directors, makers, etc.) |
| MagnetInfo | href, name, tags, size, timestamp | Magnet link |
| MovieIndexEntry | href, video_code, title, rate, comment_count, release_date, tags, cover_url, page, ranking | Movie entry on list pages |
| MovieDetail | See Section 1.2 | Full detail page information |
| TagOption | name, tag_id, selected | Tag filter option |
| TagCategory | category_id, name, options | Tag filter category |

REST API Endpoint Summary

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/health | Health check |
| POST | /api/parse/index | Parse index page |
| POST | /api/parse/detail | Parse detail page |
| POST | /api/parse/category | Parse category page |
| POST | /api/parse/top | Parse ranking page |
| POST | /api/parse/tags | Parse tag filter page |
| POST | /api/detect-page-type | Detect page type |

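All POST endpoints accept a JSON request body in the following format. As the examples in Sections 2.4 and 2.8 show, page_num can be omitted for the detail and detect-page-type endpoints: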

{
  "html": "Full HTML string",
  "page_num": 1
}