Muhammad Owais' solution to code challenge #317
This PR is my solution to the artwork parsing challenge using Python.
Explanation
For this challenge, I built a parser that extracts artwork data from an HTML document without making any network requests.
CSS vs XPath Selector
I used Python's `parsel` library with CSS selectors to identify the relevant sections of the HTML containing artwork information. In my experience, whether you use CSS or XPath selectors, any change in the website's structure will break the parser, as both methods rely on the HTML layout. I prefer CSS selectors because they are easier to read and understand. CSS also makes it simpler for future developers to update class names if the website's structure changes, keeping the parser flexible and easier to maintain.
Lazy-loaded Images
While extracting most of the fields was straightforward, handling the image data was a bit of a challenge due to lazy-loaded images. To address this, I created a function called `find_img_base64_encoded_str` that takes an `image_id` from the `img` tag's `id` attribute and locates the relevant script tag to extract the base64-encoded image string using a regex. This approach avoids relying on JavaScript-rendered HTML, which eliminates the need for a web driver when parsing or scraping.
Possible Fix for Breaking Parsers in Production
In a production environment, I have found that adding data sanitization logic to the scraping pipeline makes the challenge of changing HTML structures manageable. It ensures that the scraped data matches the expected format and flags any discrepancies right away. Additionally, setting up monitoring or alerts to notify developers when the parser breaks due to changes in the website's structure makes it possible to update the CSS selectors very quickly.
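One possible shape for that sanitization step — a minimal sketch with illustrative rules, not the actual pipeline code:

```python
import base64
import binascii

# Field names match those used elsewhere in this PR; the validation
# rules themselves are illustrative assumptions.
REQUIRED_FIELDS = ("name", "link", "image", "extensions")

def validate_artwork(record):
    """Return a list of discrepancies found in one scraped record."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing or empty field: {field}")
    image = record.get("image") or ""
    if image.startswith("data:"):
        try:
            # validate=True makes b64decode reject non-alphabet characters
            base64.b64decode(image.split(",", 1)[-1], validate=True)
        except binascii.Error:
            errors.append("image is not valid base64")
    return errors
```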
Testing
To ensure the parser works correctly, I wrote a series of tests using `pytest`. These tests run against multiple sample HTML files and check that the parser correctly extracts the artwork data. Specifically, the tests verify that the parser:

- extracts all required fields (`name`, `extensions`, `link`, and `image`) for each artwork
- produces an `image` field containing valid base64-encoded data for at least four artworks

These tests ensure the robustness of the parser and confirm that it extracts the necessary data in the correct format.
Notes
I opted for Python over Ruby for this challenge because of my familiarity with it. While I'm not sure about the exact performance differences between the two, Python has been a great tool for web scraping for quite some time. That said, I'm currently learning Ruby and will likely submit an equivalent version of the code in Ruby once I'm more comfortable with the syntax.