Add design doc #54

## Scraping workflow phase 1 design

**General requirements**
* Workflow can support multiple sources / multiple scraping strategies
* Workflow can be scaled easily
* Workflow is observable

**What currently exists**
* From [this Design Doc](https://www.notion.so/de8ece87473b4f37a0ea037d607d910a?v=1bec41f87a6b42acb639d348d2282195&p=6d6b1797931e465d9cf6f5d4641c911e), Angostura can be used to publish and persist data to Google BigQuery.
* Sambil has been packaging their code as a library, c4v-py (this repo), which can be imported and used elsewhere.

**Proposed design**

_Goal_: Leverage the existing Angostura infrastructure as much as possible.



1. A custom-written data uploader processes NGO-submitted data and emits the events to the Angostura proxy gate.
   At first, if the input is CSV, this can be a simple CSV uploader script (see the sketch after this list).

2. Angostura does its processing and persists the data to a BigQuery table.

3. A task defined in Airflow queries for yet-to-be-enriched data and proceeds to scrape.
   It can then persist the output to the table that the classification workflow(s) will start from.

   More on this job below.

* Any further tasks, such as cleaning and classification, can be done further along the pipeline.

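For concreteness, here is a minimal sketch of the step-1 uploader idea. The `publish_event` helper is hypothetical: the real interface of the Angostura proxy gate is defined by Angostura, not by this doc.

```python
# Hypothetical sketch of the simple CSV uploader from step 1.
# publish_event() stands in for whatever call the Angostura proxy gate
# actually exposes; it is an assumption, not a real Angostura API.
import csv

def publish_event(record: dict) -> None:
    """Forward one NGO-submitted record to the Angostura proxy gate."""
    ...  # to be replaced with the real proxy gate call

def upload_csv(path: str) -> None:
    """Read an NGO-submitted CSV and emit one event per row."""
    with open(path, newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            publish_event(record)
```
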
**Scraping task**

* Here, we can leverage the python_transform job type that Angostura offers (which is backed by the BigQueryPythonOperator). From the Angostura documentation:
> The python_transform job takes a list of SQL files as arguments, a python function as a string and writes the result of calling that function with the arguments into a target BigQuery table [...] Each transformation function is a Python function that receives pandas data frames as arguments.

* This means we could write a Python module which:
  * Exposes one function, which takes a data frame as input and produces a data frame as output (see the sketch below).
  * The input consists of the rows of data that need scraping.
  * The output can be the same rows, with the scraped data now present in the columns where it belongs, so that they can be persisted.

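As an illustration only, the exposed function could look roughly like this. The column names `url` and `scraped_data`, and the `scrape_url` helper, are assumptions for the example, not part of the Angostura contract.

```python
# Sketch of the single function the module would expose to python_transform.
# Column names and the scrape_url helper are assumptions for illustration.
import pandas as pd

def scrape_url(url: str) -> str:
    """Placeholder for the per-source scraper dispatch described below."""
    ...

def enrich_with_scraped_data(rows: pd.DataFrame) -> pd.DataFrame:
    """Take the rows that still need scraping and return the same rows,
    with the scraped content filled into the scraped_data column."""
    enriched = rows.copy()
    enriched["scraped_data"] = enriched["url"].map(scrape_url)
    return enriched
```
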
_Then the job becomes_
1. Query for all the rows of data that haven't been scraped yet.
   * As a simple example, if there is a nullable field called `scraped_data` in the persisted data, then the query can simply be `select * where scraped_data is null`.
   * The query can also be modified to select only rows that come from a specific source and pass those to a specific scraping task, etc. That should make it flexible enough for anything we want to do in phase 2.
2. The python function exposed by the module is called with these rows passed in.
3. Scraping happens.
4. Scraping output is persisted (see the rough sketch below).

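For intuition only, the four steps above correspond roughly to the plain-Python flow below. The table names, column names, and the module import are hypothetical, and the `enrich_with_scraped_data` function is the one sketched earlier; in practice, Angostura's python_transform job performs the querying and the write-back for us.

```python
# Rough plain-Python equivalent of the Airflow task, for intuition only.
# Table/column names and the module path are hypothetical assumptions.
from google.cloud import bigquery

from our_scraping_module import enrich_with_scraped_data  # hypothetical module name

def run_scraping_job(
    source_table: str = "project.dataset.submissions",
    target_table: str = "project.dataset.submissions_scraped",
) -> None:
    client = bigquery.Client()
    # 1. Query for rows that have not been scraped yet.
    rows = client.query(
        f"SELECT * FROM `{source_table}` WHERE scraped_data IS NULL"
    ).to_dataframe()
    # 2-3. Hand those rows to the module's exposed function, which scrapes them.
    enriched = enrich_with_scraped_data(rows)
    # 4. Persist the scraping output to the table the classification workflow starts from.
    client.load_table_from_dataframe(enriched, target_table).result()
```
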
> **Collaborator:** @VKorelsky maybe we should have another step where the persisted data goes through a cleaning/filtering process into a new column or another table, perhaps. TBD. What would be some of the things we would not want to pass to the classifier?
>
> Some context: this post has nothing to do with our public services issues and ends up being noise to our classifier. Unfortunately, this issue was too frequent to be ignored.

**The python module**

This can be very simple for the MVP.
A sample design can simply be to:
* Define a scraper interface, which takes in a URL (plus anything else needed) and outputs scraped data.
* Implement a top-level method, which contains a map of URL domains to scraper implementations (a rough sketch follows this list).
* Output is written to a data frame.

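A rough sketch of that shape, with hypothetical names (`Scraper`, `ElPitazoScraper`, `SCRAPERS`); it is only meant to illustrate the interface-plus-domain-map idea, not a final API.

```python
# Hypothetical sketch of the scraper interface and the domain -> scraper map.
from abc import ABC, abstractmethod
from urllib.parse import urlparse

class Scraper(ABC):
    """Interface each source-specific scraper implements."""

    @abstractmethod
    def scrape(self, url: str) -> dict:
        """Return the scraped data for the given URL."""

class ElPitazoScraper(Scraper):
    def scrape(self, url: str) -> dict:
        ...  # source-specific extraction logic would live here

# Map of URL domains to scraper implementations.
SCRAPERS = {
    "elpitazo.net": ElPitazoScraper(),
}

def scrape(url: str) -> dict:
    """Top-level entry point: pick the scraper for the URL's domain."""
    domain = urlparse(url).netloc.removeprefix("www.")
    return SCRAPERS[domain].scrape(url)
```

Adding support for a new source would then mean writing one more scraper implementation and registering its domain in the map.
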
> **Collaborator:** To make things even simpler, since the output is just text content, I propose we write a [...]

> **Collaborator:** For example, to tackle the point in line 35, if Angostura requires data_frames, we (or the Angostura team) can write those adapters without much hassle, test them, and they can become plugins of the [...]. I've seen this approach being used many times by projects. If a certain package inside the community packages proves itself to be robust or core-ish enough, it becomes part of the [...].

> **Author:** Yeah, I like that approach 👍🏽 That way we know what to persist for each URL. I was thinking we could maybe have an internal object in our lib that is used in this top-level loop. wdyt?

> **Collaborator:** I would make it the other way, but I like your point on persistence. It's important to discuss that. Ideally, our "object" should be agnostic of the persistence mechanisms. In fact, it should not be an object at all. After all, it is just data. Objects imply behavior.
>
> As a core library, we have to think about our users. They might (and probably will!) be interested in saving the data as CSV, data frames, XML, JSON, or Excel files. We need flexibility. This is why our data should be a super simple data structure (in the strict computer science sense of the word). For example, a (hash) map, i.e. a dictionary in Python. The keys could be the URLs themselves, a combination of parameters such as the [...] and timestamp, or a custom hashed computation. In the case of the values, we would have the scraped data itself; to give an example, another map. This approach is already similar to how one would interact with a NoSQL database, but we are only using Plain Old Python Data Structures. The problem when one creates their own objects, whether or not they use a 3rd-party library inside, is that many libraries already know how to deal with Python data structures (e.g. [...]). Thus, to tackle the persistence problem, we must decouple the "business" logic (should I call it non-profit logic? hehe) from the cross-cutting concerns (i.e. persistence), and not couple ourselves to a framework, in this case [...]. Below I propose a sample architecture:
>
> ```python
> # follows the contract of URL -> DATA (whatever DATA might be, probably yet another data structure)
> scraped_data = {
>     "url1": { ... },
>     "url2": { ... },
> }
> ```
>
> Then, in our wrapper module, we have trivial transformational code:
>
> ```python
> # wrappers/pandas.py
> from pandas import DataFrame
>
> def scrapedDataAsPandasDataFrame(scraped_data):
>     return DataFrame.from_dict(scraped_data)
> ```
>
> Obviously, you cannot put just any map into a data frame, so our contract would have to take that into account. Then, if you NEED data frames for any reason to persist the data, you can just have something like...
>
> ```python
> # persistence/whatever.py
> class BigQueryRepository:
>     def persist(self, data):
>         df = scrapedDataAsPandasDataFrame(data)
>         persist(df)  # hand the frame off to the actual persistence call
> ```
>
> Besides separating concerns, the testability of this system is higher, since we would only have functions of signature [...].
>
> HAVING SAID ALL OF THAT, I still agree with @Edilmo in that we should FIRST make it work as fast as you can, and then we can start doing the relevant refactors. That is precisely what I told @LDiazN: get working code ASAP, and then we can start separating concerns and testing them appropriately. However, it's great that we can discuss all of this while Luis is working on the prototype! As the saying goes... when is the best time to change the code? Before you even wrote it! (If you are wondering where that saying came from, I think it was from Raymond H., but I don't recall atm.)

> **Collaborator:** Made a note of this. It is not complicated to add these conversions. I opened issue #66 to define the scraper interface.

> **Author:** I'm going to add the scraper interface design to the tasks board. Will pair on it with @LDiazN, possibly tomorrow when we chat. @m4r4c00ch0 I agree with your points. A simple map + wrappers will do just fine. I'll work on this with Luis tomorrow.

This is modularised, and lets us write specific scrapers for specific sources easily and independently of each other. We can start with only a few scrapers for specific, high-priority data sources.

**Picking the first source to write a scraper for**

From the Excel file [_TODO: add link_] that was provided by the partner NGO, we've worked out the following distribution of sources:

|  | ||||||
|
|
||||||
This suggests that we should start writing a scraper for [El Pitazo](https://elpitazo.net/) first.

> **Collaborator:** There was some work already started for El Pitazo on #48; I copied that into a Google Colab for easier review. Might be worth taking a look.

> **Collaborator:** While the code from the PRs/notebooks works, it is not a scraping module that can be imported or used by users of this library. My guess is that adding a more "formal" scraper to the library (or to another library, a dedicated scraping library) is a new task to consider. There are many benefits in keeping modules simple but decoupled. I think we should strive to separate the following topics into different repositories or modules (inside c4v-py, but each in a specific module). In this case, we focus on IE.
>
> **Information Extraction**: the information extraction library/module. As said earlier, it can be in its own repository [...]. @Edilmo said that all the work should be inside this library. I am all for it. So, to resolve this conversation, should we create a task for building a first scraping module, using El Pitazo as a first example? After Luis (the intern) practices using Scrapy, this can be a good first issue.

> **Collaborator:** @m4r4c00ch0 I agree. Just wanted to clarify: these notebooks I have been mentioning are just a proof of concept. I suggest we take what we can use from them (copy/paste with some modifications) and make it part of a formal module that belongs to the library.

> **Member:** @m4r4c00ch0 I was planning on adding the [...]. I will merge #48; only the notebook will be merged into [...].

> **Author:** Yep @m4r4c00ch0, that sounds good.

> **Collaborator:** Can you, instead, copy the snippet and paste it in a folder named [...]? If you create the PR, could you please resolve this thread, @dieko95?

> **Author:** Regarding the notebook stuff @m4r4c00ch0, I think it'd be good for you to maybe install Jupyter. I can see that the cookiecutter data science repo structure suggests keeping notebooks in the repo as well. This seems to imply that the process to review these will require checking out the branch locally and opening the notebook there to have a look at it. I agree that it's not a great review experience. We can also maybe consider tools like this one, which could actually help a lot with this: https://www.reviewnb.com/

Of course, we should confirm with the _servicios publicos_ team what they and the NGO see as the highest-priority sources.