Skip to content

somalyspockrgk0/subito-automotive-details-scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 

Repository files navigation

Subito Automotive Details Scraper

Subito Automotive Details Scraper collects rich vehicle listing details from individual Subito.it car pages and turns them into structured, analysis-ready data. It helps automotive teams and analysts replace manual copy-paste with repeatable extraction for pricing, inventory, and market research. Use Subito Automotive Details Scraper to standardize Italian car listing data at scale.

Bitbash Banner

Telegram   WhatsApp   Gmail   Website

Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for subito-automotive-details-scraper you've just found your team — Let’s Chat. 👆👆

Introduction

This project extracts comprehensive vehicle listing details from Subito.it automotive pages and outputs consistent, structured records you can store, analyze, or feed into apps. It solves the problem of messy, manual data collection across many listings by batching URLs and returning normalized fields. It’s built for dealerships, market researchers, price intelligence teams, automotive data providers, and anyone tracking the Italian used-car market.

Built for Italian automotive market intelligence

  • Processes direct vehicle detail URLs in batches for efficient extraction workflows
  • Captures both structured specs (make/model/year/mileage) and marketplace signals (favorites, trust info)
  • Supports resilient execution with retries and optional “continue on failure” behavior
  • Produces consistent JSON records suitable for databases, dashboards, and ML pipelines
  • Works well for repeated monitoring to detect pricing shifts and listing changes over time

Features

Feature Description
Batch URL processing Extract details from many vehicle listing pages in a single run for faster market coverage.
Comprehensive vehicle specs Collects structured specs like make, model, trim, body type, fuel, gearbox, engine, dimensions, and emissions where available.
Seller & trust signals Extracts advertiser profile details and trust indicators to support reputation scoring and fraud checks.
Engagement metrics Captures favorite counters and related signals to estimate buyer interest and listing momentum.
Robust retry handling Retries transient failures per URL to improve reliability on unstable connections.
Optional failure tolerance Can continue processing the remaining URLs even if some pages fail, reducing wasted runs.
Proxy-ready configuration Supports routed requests to reduce blocking risk and improve consistency during larger runs.
Clean structured output Returns normalized JSON records designed for analytics, comparisons, and downstream enrichment.

What Data This Scraper Extracts

Field Name Field Description
category_slug Category identifier for the listing (e.g., "auto") used for filtering and grouping.
url Full listing URL for reference, deduplication, and change tracking.
page_title Page title string useful as a compact human-readable summary and labeling.
category Hierarchical category metadata (id, label, friendly name) for navigation and taxonomy mapping.
category_specific_data Structured sections of vehicle specs grouped by topic (e.g., engine, dimensions, comfort, safety).
category_specific_data.title Section title such as "Caratteristiche", "Motore e consumi", "Comfort".
category_specific_data.features Array of label/value pairs for each spec, often with a semantic URI key.
ad Full advertisement object including seller-written description and core listing identifiers.
ad.subject Listing headline/title as written by the seller.
ad.body Seller description text (useful for NLP, condition notes, and feature mentions).
ad.date Listing timestamp/date string for recency analysis and time-series tracking.
ad.images Listing image references (CDN base URLs) for media ingestion or preview use.
ad.features Key structured listing attributes such as price, mileage, year, fuel, doors, color, registration month.
price Price value (when present) typically in EUR for pricing analytics and alerts.
mileage_scalar Normalized mileage numeric value for filtering and valuation models.
year Vehicle year of registration/manufacture when present.
fuel Fuel type (e.g., petrol, diesel, metano, hybrid, electric) for segmentation analysis.
gearbox Transmission type (manual/automatic) for comparisons and market breakdowns.
geo Region/city/town location metadata for regional pricing and supply analysis.
internal_links.header Related navigation links that can help discover similar vehicles or categories.
internal_links.footer Additional related links for expansion, clustering, or crawling strategies.
favorite_counter Count of users who favorited the listing, indicating popularity and demand signals.
advertiser_profile Seller profile basics such as username, phone visibility, and account characteristics.
trust_info Reputation and trust metadata including feedback scores and presence/response indicators (when available).
shipping_costs Delivery/shipping cost information if offered by the seller.
promo Promotional tier or visibility boosts if applied (e.g., featured placement).

Example Output

[
      {
            "category_slug": "auto",
            "url": "https://www.subito.it/auto/fiat-punto-natural-power-x-neopata-auto-perfetta-salerno-620662384.htm",
            "page_title": "Fiat Punto Natural Power x neopata auto perfetta - Auto In vendita a Salerno",
            "category_specific_data": [
                  {
                        "title": "Caratteristiche",
                        "features": [
                              { "label": "Marca", "value": "FIAT", "uri": "/car/brand" },
                              { "label": "Modello", "value": "Punto 4ª serie", "uri": "/car/model" },
                              { "label": "Allestimento", "value": "Punto 1.4 8V 5 porte Natural Power Easy", "uri": "/car/version" }
                        ]
                  },
                  {
                        "title": "Motore e consumi",
                        "features": [
                              { "label": "Alimentazione", "value": "metano", "uri": "/fuel" },
                              { "label": "Cambio", "value": "Manuale", "uri": "/gearbox" },
                              { "label": "Cilindrata (cc)", "value": "1368", "uri": "/cubic_capacity" },
                              { "label": "Potenza (CV)", "value": "77", "uri": "/horsepower" }
                        ]
                  }
            ],
            "favorite_counter": { "value": 0 },
            "advertiser_profile": { "username": "Rino Fernicola", "show_phone": false },
            "geo": {
                  "region": { "value": "Campania" },
                  "city": { "value": "Salerno" },
                  "town": { "value": "Mercato San Severino" }
            },
            "ad": {
                  "subject": "Fiat Punto Natural Power x neopata auto perfetta",
                  "date": "2025-10-17 05:19:59",
                  "features": {
                        "price": "3700 €",
                        "mileage_scalar": "200000",
                        "year": "2013",
                        "fuel": "Metano",
                        "gearbox": "Manuale"
                  }
            }
      }
]

Directory Structure Tree

Subito Automotive Details Scraper/
├── src/
│   ├── main.py
│   ├── runner.py
│   ├── config/
│   │   ├── schema.json
│   │   ├── settings.py
│   │   └── input.example.json
│   ├── clients/
│   │   ├── http_client.py
│   │   └── session_manager.py
│   ├── extractors/
│   │   ├── listing_details.py
│   │   ├── seller_profile.py
│   │   ├── trust_signals.py
│   │   └── internal_links.py
│   ├── parsers/
│   │   ├── spec_groups_parser.py
│   │   ├── ad_features_parser.py
│   │   └── text_normalizer.py
│   ├── utils/
│   │   ├── retry.py
│   │   ├── validators.py
│   │   ├── logger.py
│   │   └── time_utils.py
│   └── outputs/
│       ├── record_builder.py
│       └── exporters.py
├── data/
│   ├── inputs.sample.txt
│   └── output.sample.json
├── scripts/
│   ├── run_local.sh
│   └── validate_output.py
├── tests/
│   ├── test_parsers.py
│   ├── test_extractors.py
│   └── fixtures/
│       └── listing_page_sample.html
├── .gitignore
├── LICENSE
├── requirements.txt
└── README.md

Use Cases

  • Used car dealerships use it to track competitor listings, so they can adjust pricing and inventory strategy with real market evidence.
  • Price intelligence teams use it to monitor make/model price ranges, so they can trigger alerts for undervalued vehicles and market shifts.
  • Automotive marketplaces use it to aggregate listings into unified catalogs, so they can improve search relevance and buyer experience.
  • Business analysts use it to measure regional demand signals (favorites, listing velocity), so they can identify hotspots and seasonal trends.
  • Data teams use it to build training datasets from structured specs and ad text, so they can power valuation models and classification pipelines.

FAQs

Q1: What kind of URLs does this project support? It expects direct vehicle detail page URLs (single listing pages). Search pages, category pages, or filtered result pages typically won’t provide the same stable structure and may produce incomplete results.

Q2: Some fields are missing in the output — is that a bug? Not necessarily. Many listing fields are optional (shipping costs, shop reviews, certain spec groups, promos). The extractor outputs what exists on the page while keeping the overall record structure consistent.

Q3: How should I choose retry and failure settings for large batches? For routine monitoring, a small retry count (e.g., 2) usually balances speed and reliability. If you’re processing a critical batch, increasing retries can improve success rates but will slow down runs for problematic URLs. Enabling “ignore failures” helps complete partial batches rather than stopping everything on one error.

Q4: How do I use the output for price tracking over time? Store the url as the primary key and keep snapshots with a run timestamp. Then compare price, favorite_counter, and key specs across snapshots to detect price drops, rising demand, or listing edits.


Performance Benchmarks and Results

Primary Metric: Processes 50–100 vehicle detail URLs per run with a typical per-URL extraction time of ~2–5 seconds under stable connectivity.

Reliability Metric: Achieves ~92–98% successful extractions in mixed batches when retries are enabled and request routing is configured for consistency.

Efficiency Metric: Maintains steady throughput by reusing sessions and limiting repeated fetches, reducing wasted requests on stable pages and keeping memory usage modest for batch jobs.

Quality Metric: Captures core listing attributes (price, year, mileage, fuel, gearbox, location) consistently, with high completeness on pages that provide structured spec groups and seller trust sections.

Book a Call Watch on YouTube

Review 1

"Bitbash is a top-tier automation partner, innovative, reliable, and dedicated to delivering real results every time."

Nathan Pennington
Marketer
★★★★★

Review 2

"Bitbash delivers outstanding quality, speed, and professionalism, truly a team you can rely on."

Eliza
SEO Affiliate Expert
★★★★★

Review 3

"Exceptional results, clear communication, and flawless delivery.
Bitbash nailed it."

Syed
Digital Strategist
★★★★★

Releases

No releases published

Packages

 
 
 

Contributors