-
Notifications
You must be signed in to change notification settings - Fork 292
Add artworks parser #316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add artworks parser #316
Conversation
ARTWORK_CSS_CLASS = '.iELo6' | ||
EXTENSION_CSS_CLASS = '.cxzHyb' | ||
NAME_CSS_CLASS = '.pgNMRc' | ||
IMG_CSS_CLASS = '.taFZJe' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[C] I have these hoisted up here so that they are easy to identify and change.
end | ||
|
||
def decode_hex(str) | ||
str.gsub(/\\x([0-9A-Fa-f]{2})/) { [Regexp.last_match(1)].pack('H2') } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[C] There may be a language/built-in library feature that can do this more elegantly but I was unable to find it.
describe GoogleArtworksParser do | ||
subject { described_class.parse } | ||
|
||
fixtures.each do |fixture_name, fixture| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[C] I leaned on the relevant tests here.
I'm not too sure about generating tests via a loop. Rather than second-guess myself, I put the PR up. We can discuss whether or not this is standard practice in Ruby.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. We are setting up our tests this way.
c07f258
to
a9a6969
Compare
a9a6969
to
dcdaf5f
Compare
This is my submission for the coding challenge.
Approach
I have used your
nokolexbor
parser and CSS classes to identify the artworks section within the HTML. As some images appear to be lazy loaded, I first search for them within the document (using script tags containing lazy loaded images based on regex) and generate a hash of{id => img}
. I then parse the artworks out of the HTML and use the hash contents as a priority before falling back to thedata-src
attribute of each img. The rest of the required information seemed readily available in the HTML.I took this approach instead of using a headless browser to load the images (i.e. run the embedded JavaScript) because I feel it is more efficient and it avoids adding another external dependency to the project. We would need a benchmark for the first assumption if this was production code (instead of a coding challenge).
The main drawback I see to this approach is that it may be brittle. It has been a long time since I have done any scraping. I am not sure how often CSS classes will change on these page types and/or how often the underlying JavaScript gets updated. Changes to either will break the parser. I would definitely want a test directly against Google running often to confirm that everything is still working.
Things that are missing/next steps:
Additional info
This is the first time (ever) that I've written Ruby, I used
rbenv
andrubocop
for the project. I took on board most of the "suggestions" fromrubocop
due to my lack of experience with the language. I contemplated addingsorbet
but ultimately decided against it because I could not see any trace innokolexbor
.I have left some comments inline.
Thanks for taking the time to review.