Muhammad Owais solution to code challenge #317

@moya1295 moya1295 commented Apr 8, 2025

This PR is my solution to the artwork parsing challenge using Python.

Explanation

For this challenge, I built a parser that extracts artwork data from a static HTML document without making any network requests.

CSS vs XPath Selector
I used Python's parsel library with CSS selectors to identify the sections of the HTML that contain artwork information. In my experience, whether you use CSS or XPath selectors, any change in the website's structure will break the parser, since both methods rely on the HTML layout. I prefer CSS selectors because they are easier to read and understand, and they make it simpler for future developers to update class names when the website structure changes, keeping the parser flexible and easy to maintain.
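
As a rough illustration of the approach (the selector strings and field names below are assumptions for the sketch, not the exact ones used in this PR):

```python
# Minimal sketch of CSS-selector extraction with parsel.
# "a.klitem" and the attribute names are illustrative assumptions about the page.
from parsel import Selector


def parse_artworks(html: str) -> list[dict]:
    sel = Selector(text=html)
    artworks = []
    for item in sel.css("a.klitem"):  # one anchor per artwork card (assumed)
        artworks.append({
            "name": item.css("::attr(aria-label)").get(),
            "extensions": item.css("div.ellip::text").getall(),
            "link": item.css("::attr(href)").get(),
            "image": item.css("img::attr(src)").get(),  # may be a lazy-load placeholder
        })
    return artworks
```

If the markup changes, only the selector strings need updating, which is the maintainability point above.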

Lazy-loaded Images
While extracting most of the fields was straightforward, handling the image data posed a small challenge because of lazy-loaded images. To address this, I created a function called find_img_base64_encoded_str that takes an image_id from an img tag's id attribute, locates the relevant script tag, and extracts the base64-encoded image string with a regex. This approach avoids relying on JavaScript-rendered HTML, which eliminates the need for a web driver during parsing or scraping.
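
A minimal sketch of what such a helper might look like (the regex and the `\x3d` unescaping are assumptions about how the script tags embed the data, not code copied from this PR):

```python
# Sketch of the idea behind find_img_base64_encoded_str: locate the inline
# <script> that mentions the image id and pull out the data URI with a regex.
import re

from parsel import Selector


def find_img_base64_encoded_str(sel: Selector, image_id: str) -> str | None:
    for script in sel.css("script::text").getall():
        if image_id not in script:
            continue  # this script block does not reference the image we want
        match = re.search(r"data:image/[^'\"]+;base64,[^'\"]+", script)
        if match:
            # The '=' padding is often escaped as \x3d inside the script (assumption).
            return match.group(0).replace("\\x3d", "=")
    return None
```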

Possible Fix for Breaking Parsers in Production
In a production environment, I have found that adding data-sanitization logic to the scraping pipeline makes the challenge of a changing HTML structure manageable. It ensures that the scraped data matches the expected format, so any discrepancies are flagged right away. Additionally, setting up monitoring or alerts to notify developers when the parser breaks due to changes in the website’s structure makes it possible to update the CSS selectors very quickly.
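
For example, a lightweight validation step could look roughly like this (the field names and rules are assumptions used for illustration):

```python
# Hypothetical sanity checks for scraped records; in production the problems
# would feed a monitoring/alerting system rather than being returned or printed.
REQUIRED_FIELDS = ("name", "link", "image")


def validate_artwork(artwork: dict) -> list[str]:
    """Return a list of problems found in a single scraped record."""
    problems = [f"missing or empty field: {field}"
                for field in REQUIRED_FIELDS if not artwork.get(field)]
    link = artwork.get("link") or ""
    if link and not link.startswith(("http://", "https://", "/")):
        problems.append(f"unexpected link format: {link!r}")
    return problems


def validate_batch(artworks: list[dict]) -> list[str]:
    return [f"artwork[{i}]: {problem}"
            for i, artwork in enumerate(artworks)
            for problem in validate_artwork(artwork)]
```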

Testing
To ensure the parser works correctly, I wrote a series of tests using pytest. These tests run against multiple sample HTML files and check that the parser correctly extracts the artwork data. Specifically, the tests verify that the parser:

  • Does not return None
  • Returns a list of artworks
  • Extracts at least one artwork
  • Correctly identifies required fields (name, extensions, link, and image) for each artwork
  • Ensures that the image field contains valid base64-encoded data for at least four artworks

These tests ensure the robustness of the parser and confirm that it extracts the necessary data in the correct format.
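
A condensed sketch of tests along these lines (the module name, entry point, and sample file path are assumptions for illustration):

```python
import base64

import pytest

from main import get_artworks  # hypothetical module and function name


@pytest.fixture
def artworks():
    # Sample file path is assumed; the real suite runs over multiple HTML files.
    with open("files/van-gogh-paintings.html", encoding="utf-8") as f:
        return get_artworks(f.read())


def test_returns_non_empty_list(artworks):
    assert artworks is not None
    assert isinstance(artworks, list)
    assert len(artworks) >= 1


def test_required_fields_present(artworks):
    for artwork in artworks:
        for field in ("name", "extensions", "link", "image"):
            assert field in artwork


def test_at_least_four_base64_images(artworks):
    data_uris = [a["image"] for a in artworks
                 if (a.get("image") or "").startswith("data:image")]
    assert len(data_uris) >= 4
    for uri in data_uris[:4]:
        # Raises binascii.Error if the payload is not valid base64.
        base64.b64decode(uri.split(",", 1)[1], validate=True)
```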

Notes
I opted for Python instead of Ruby for this challenge due to my familiarity with Python. While I'm not sure about the exact performance differences between the two, Python has been a great tool for web scraping for quite some time. That said, I’m currently learning Ruby and will likely submit an equivalent version of the code in Ruby once I'm more comfortable with the syntax.
