Muhammad Owais' solution to code challenge #317
This PR is my solution to the artwork parsing challenge using Python.
Explanation
For this challenge, I built a parser that extracts artwork data from an HTML document without making any network requests.
CSS vs XPath Selector
I used Python's `parsel` library with CSS selectors to identify the relevant sections of the HTML containing artwork information. In my experience, whether you use CSS or XPath selectors, any change in the website's structure will break the parser, as both methods rely on the HTML layout. I prefer CSS selectors because they are easier to read and understand. CSS also makes it simpler for future developers to update class names if the website's structure changes, keeping the parser flexible and easier to maintain.
Lazy-loaded Images
While extracting most of the fields was straightforward, handling the image data was a bit of a challenge due to lazy-loaded images. To address this, I created a function called `find_img_base64_encoded_str` that takes an `image_id` from the `img` tag's `id` attribute and locates the relevant script tag to extract the base64-encoded image string using a regex. This approach avoids relying on JavaScript-rendered HTML, which eliminates the need for a web driver when parsing or scraping.
Possible Fix for Breaking Parsers in Production
In a production environment, I have found that adding data sanitization logic to the scraping pipeline makes the challenge of changing HTML structures manageable. It ensures that the scraped data matches the expected format and flags any discrepancies right away. Additionally, setting up monitoring or alerts to notify developers when the parser breaks due to changes in the website's structure makes it possible to update the CSS selectors very quickly.
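One possible shape for that sanitization step — a minimal sketch with illustrative rules, not the actual pipeline code:

```python
import base64
import binascii

# Field names match those used elsewhere in this PR; the validation
# rules themselves are illustrative assumptions.
REQUIRED_FIELDS = ("name", "link", "image", "extensions")

def validate_artwork(record):
    """Return a list of discrepancies found in one scraped record."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing or empty field: {field}")
    image = record.get("image") or ""
    if image.startswith("data:"):
        try:
            # validate=True makes b64decode reject non-alphabet characters
            base64.b64decode(image.split(",", 1)[-1], validate=True)
        except binascii.Error:
            errors.append("image is not valid base64")
    return errors
```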
Testing
To ensure the parser works correctly, I wrote a series of tests using `pytest`. These tests run against multiple sample HTML files and check that the parser correctly extracts the artwork data. Specifically, the tests verify that the parser:

- extracts all required fields (`name`, `extensions`, `link`, and `image`) for each artwork
- produces an `image` field containing valid base64-encoded data for at least four artworks

These tests ensure the robustness of the parser and confirm that it extracts the necessary data in the correct format.
Notes
I opted for Python over Ruby for this challenge because of my familiarity with it. While I'm not sure about the exact performance differences between the two, Python has been a great tool for web scraping for quite some time. That said, I'm currently learning Ruby and will likely submit an equivalent version of the code in Ruby once I'm more comfortable with the syntax.