Jacquline-Parser-Solution #319

Open
wants to merge 7 commits into
base: master

@JacqulineMbogo JacqulineMbogo commented Apr 15, 2025

Jacquline Scraper Solution

A lightweight web scraper built with Kotlin, Spring Boot, and Selenium to extract result details from an HTML page.


✨ Why This Stack?

  • Kotlin + Spring Boot: my preferred stack, mostly because it is the one I'm most familiar with.
  • Selenium: Handles JavaScript-rendered content, unlike traditional scrapers.

To select elements, I used CSS selectors, which allow flexible element targeting.

Example:

val name = element.findElement(By.cssSelector("div.pgNMRc")).text

To wait for page rendering, I used:

val wait = WebDriverWait(driver, Duration.ofSeconds(10))
wait.until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("div.Cz5hV > div")))
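The two snippets above could be combined roughly as follows. This is only a sketch: the function name `extractNames` is hypothetical, and it assumes that `div.Cz5hV > div` matches one container per result and that `div.pgNMRc` holds the name inside each container, which the selectors above suggest but do not confirm.

```kotlin
import org.openqa.selenium.By
import org.openqa.selenium.WebDriver
import org.openqa.selenium.support.ui.ExpectedConditions
import org.openqa.selenium.support.ui.WebDriverWait
import java.time.Duration

// Hypothetical helper: waits for the result containers, then reads each name.
fun extractNames(driver: WebDriver, timeout: Duration = Duration.ofSeconds(10)): List<String> {
    // Assumption: each direct child of div.Cz5hV is one result container.
    val containers = By.cssSelector("div.Cz5hV > div")

    // Block until at least one container is present in the DOM, or time out.
    WebDriverWait(driver, timeout)
        .until(ExpectedConditions.presenceOfElementLocated(containers))

    // findElements returns an empty list instead of throwing, so containers
    // without a name element are skipped rather than failing the whole scrape.
    return driver.findElements(containers).mapNotNull { container ->
        container.findElements(By.cssSelector("div.pgNMRc")).firstOrNull()?.text
    }
}
```

Using `findElements` for the inner lookup is a design choice in this sketch: `findElement` throws `NoSuchElementException` when a container has no matching child.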

Headless Mode

I opted for headless mode in Selenium because:

  • It suits automated testing environments, since no GUI is required.

  • It makes Dockerization smooth and lightweight.

  • It typically runs faster and consumes fewer resources than full-browser mode.

  • It fits CI/CD pipelines and other display-less environments (like cloud servers).

You can find the setup in the ChromeOptions block:

val options = ChromeOptions()
options.addArguments("--headless")
options.addArguments("--no-sandbox")
options.addArguments("--disable-dev-shm-usage")

How to Run

Prerequisites

  • Java 17+
  • Optional: Docker
  • IDE (I used IntelliJ)
  • Google Chrome + ChromeDriver
    ChromeDriver location: /usr/local/bin/chromedriver
    This can be changed in the main class:

    fun main(args: Array<String>) {
        System.setProperty("webdriver.chrome.driver", "/usr/local/bin/chromedriver")
        runApplication<VanGoughApplication>(*args)
    }
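If editing the main class for every machine is inconvenient, one possible variation is to read the ChromeDriver location from an environment variable and fall back to the path above. This is a sketch, not what the project does: the variable name `CHROMEDRIVER_PATH` and the helper `resolveChromeDriverPath` are assumptions introduced for illustration.

```kotlin
// Default taken from the prerequisites above.
const val DEFAULT_CHROMEDRIVER_PATH = "/usr/local/bin/chromedriver"

// Hypothetical helper: CHROMEDRIVER_PATH is an assumed variable name.
// The environment is passed in as a map so the function is easy to test.
fun resolveChromeDriverPath(env: Map<String, String> = System.getenv()): String =
    env["CHROMEDRIVER_PATH"]?.takeIf { it.isNotBlank() } ?: DEFAULT_CHROMEDRIVER_PATH
```

The main class could then call `System.setProperty("webdriver.chrome.driver", resolveChromeDriverPath())` before `runApplication`.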
Run Locally

  1. Change into the project directory

    cd Kotlin-Project
  2. Build the project

    ./gradlew build
  3. Run it

    ./gradlew bootRun

    This exposes an API on port 8080.
    The port can be changed in the properties file, Kotlin-Project/src/main/resources/application.properties:

server.port=8080

  • VIDEO DEMO - PROJECT SETUP
project.setup1.mp4

Run with Docker (WIP)


Using the API

Send a GET request to:

http://localhost:8080/api?path=../files/test.html

You can also test with curl:

curl "http://localhost:8080/api?path=../files/test.html"

The path is the relative path to a file inside the ../files/ directory.
If you don't pass a path, it defaults to ../files/van-gogh-paintings.html:

getResults(@RequestParam(defaultValue = "../files/van-gogh-paintings.html")
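The same request can also be made from Kotlin with the JDK's built-in HTTP client. The sketch below is illustrative only: `buildApiUri` and `fetchResults` are hypothetical names, and treating any non-200 status as an error is an assumption, since the API's error responses are not described here.

```kotlin
import java.net.URI
import java.net.URLEncoder
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.nio.charset.StandardCharsets

// Builds the request URI, percent-encoding the path so that characters such
// as '/' or spaces are safe inside a query parameter.
fun buildApiUri(path: String, baseUrl: String = "http://localhost:8080/api"): URI {
    val encoded = URLEncoder.encode(path, StandardCharsets.UTF_8)
    return URI.create("$baseUrl?path=$encoded")
}

// Sends the GET request and returns the raw JSON body as a string.
fun fetchResults(path: String): String {
    val request = HttpRequest.newBuilder(buildApiUri(path)).GET().build()
    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
    // Assumption: a successful scrape is answered with HTTP 200.
    check(response.statusCode() == 200) { "Unexpected status ${response.statusCode()}" }
    return response.body()
}
```

With the application running locally, `fetchResults("../files/test.html")` should return the same JSON as the curl command above.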

VIDEO DEMO - API USAGE

API.mp4

✅ Example Response

[
  {
    "name": "Sample Name",
    "extensions": ["Ext1", "Ext2"],
    "link": "http://example.com/details",
    "image": "http://example.com/image.jpg"
  }
]
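A response of this shape could be modeled with a Kotlin data class along these lines. The field names come from the example above; the class name `ScrapedResult`, the default values, and the nullable `link` and `image` are assumptions made for this sketch, not the project's actual model.

```kotlin
// Hypothetical model for one entry of the response array.
data class ScrapedResult(
    val name: String,
    // Assumption: a result without extensions is represented by an empty list.
    val extensions: List<String> = emptyList(),
    // Assumption: link and image may be absent on some results.
    val link: String? = null,
    val image: String? = null,
)

// The example response entry expressed as an instance of the model.
val sampleResult = ScrapedResult(
    name = "Sample Name",
    extensions = listOf("Ext1", "Ext2"),
    link = "http://example.com/details",
    image = "http://example.com/image.jpg",
)
```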

🧪 Running Tests

The tests assume that supported pages have the same structure as the provided example page.
I therefore created a sample page that mimics this structure, test.html, and the tests run against it:

./gradlew test

Includes tests for:

  • Successful HTML parsing
  • File-not-found handling
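The file-not-found case listed above can be illustrated with a small, framework-free sketch. The helper `readHtml` is hypothetical, and how the project turns a missing file into an HTTP response is not shown here.

```kotlin
import java.io.FileNotFoundException
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper: reads an HTML file, failing with a clear error
// when the path does not point to an existing regular file.
fun readHtml(path: Path): String {
    if (!Files.isRegularFile(path)) {
        throw FileNotFoundException("HTML file not found: $path")
    }
    return Files.readString(path)
}
```

A test can then cover both branches: reading a temporary file that exists, and checking that a missing path raises `FileNotFoundException`.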

VIDEO DEMO - TESTS

tests.mp4

📁 Project Structure

├── Kotlin-Project
├── src
│   ├── main
│   │   ├── kotlin/com/example/cotrollers
│   │   ├── kotlin/com/example/models
│   │   └── VanGoughApplication.kt  -- main class
│   │
│   └── test
│       └── kotlin/com/example/van_gough
│           └── VanGoughApplicationTests.kt  -- tests
├── Dockerfile
├── build.gradle.kts
└── README.md

@JacqulineMbogo JacqulineMbogo marked this pull request as draft April 15, 2025 21:55
@JacqulineMbogo JacqulineMbogo marked this pull request as ready for review April 16, 2025 12:08