Saving scraped data to S3 bucket #93

srdjov18 · 2022-07-07T17:31:34Z

srdjov18
Jul 7, 2022

Hello again!

What is the best way of saving the scraped data into one of my AWS S3 buckets? Or is it more complicated than modifying one of the config/settings files?

Answered by dcaribou

Jul 8, 2022

dcaribou · 2022-07-07T20:41:15Z

dcaribou
Jul 7, 2022
Maintainer

Hi @srdjov18,

If you've modified the config files, for example to add some new competitions, you can create a PR with your changes. Once the updates to the config are merged, they will run weekly as part of the data pipeline.
The data pipeline currently runs every Tuesday at 4AM and does 3 things: acquire the raw data > re-create the prepared datasets > publish them to Kaggle and data.world.
After every run, you'll be able to update your local files with the latest changes by running dvc pull.
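
If you would rather script that refresh step than type it by hand, a minimal Python sketch could look like the one below. It assumes the dvc CLI is installed and that the code runs from the root of your clone of the project; the helper names build_dvc_pull_command and pull_latest are hypothetical and not part of the project.

```python
import subprocess
from typing import List, Optional, Sequence


def build_dvc_pull_command(targets: Optional[Sequence[str]] = None) -> List[str]:
    """Return the argument list for `dvc pull`, optionally limited to some targets."""
    command = ["dvc", "pull"]
    if targets:
        # Targets are paths tracked by DVC; which ones exist depends on the repo.
        command.extend(targets)
    return command


def pull_latest(repo_dir: str = ".", targets: Optional[Sequence[str]] = None) -> None:
    """Run `dvc pull` inside repo_dir (assumed to be the project clone).

    Raises subprocess.CalledProcessError if dvc exits with an error.
    """
    subprocess.run(build_dvc_pull_command(targets), cwd=repo_dir, check=True)
```

Calling pull_latest() from the repository root should behave the same as running dvc pull in a terminal there.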

Not sure if this answers your question. Let me know otherwise.

1 reply

srdjov18 Jul 7, 2022
Author

Hey @dcaribou.. OK got it.

So just to clarify:

- the scraping is performed remotely, and not on my local machine
- dvc pull is what pushes data to my local files, after it is run remotely on a weekly cadence
- any updates/additions to the data pipeline need to be merged after a PR

Does that also mean since there are no active/new data within the competitions that are currently in the config files, that a dvc pull will result in no additional data being pushed to my local machine?

Sorry for all the questions. I'm super new to GitHub and all these data techniques, so I'm still trying to get the hang of it.

dcaribou · 2022-07-08T05:50:51Z

dcaribou
Jul 8, 2022
Maintainer

You basically have two options:

1. Run the project locally: You can modify the config and code as you wish and run the project on your machine.
   - 1_acquire.py will update the raw data in data/raw on your local machine
   - 2_prepare.py will re-create the prepared CSV files in data/prep
2. Contribute your changes to the main project: Create a PR and have your changes merged into the main project. The acquisition and preparation scripts run on a schedule every week and commit any new data to S3, data.world and Kaggle.
   - This is the option I explained in my previous comment.

In general, I'd recommend option 2, because then you can benefit from the automation that updates the data weekly and simply consume the updated data from data.world, Kaggle, or by running a dvc pull to get the latest data from S3 onto your local machine.

With the first option, the updated data will only live on your local machine. Of course, you can copy the data somewhere else, like an S3 bucket of your own or GDrive, or simply use it from your local machine. It's up to you.
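
To illustrate the "S3 bucket of your own" route, here is a rough Python sketch that copies a local folder such as data/prep into a bucket. It is not part of the project: it assumes the third-party boto3 package is installed, that your AWS credentials are already configured, and that the bucket exists. The function names, the bucket name and the key prefix are all placeholders.

```python
from pathlib import Path
from typing import List, Tuple


def plan_uploads(local_dir: str, key_prefix: str = "") -> List[Tuple[str, str]]:
    """Pair every file under local_dir with the S3 key it would be stored under."""
    root = Path(local_dir)
    prefix = key_prefix.strip("/")
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keep the folder layout, using forward slashes as S3 keys expect.
            relative = path.relative_to(root).as_posix()
            key = f"{prefix}/{relative}" if prefix else relative
            pairs.append((str(path), key))
    return pairs


def upload_directory(local_dir: str, bucket: str, key_prefix: str = "") -> int:
    """Upload every file under local_dir to the bucket; return the file count."""
    import boto3  # third-party; imported here so plan_uploads works without it

    s3 = boto3.client("s3")
    uploads = plan_uploads(local_dir, key_prefix)
    for filename, key in uploads:
        s3.upload_file(filename, bucket, key)
    return len(uploads)
```

For example, upload_directory("data/prep", "my-own-bucket", "prep") would copy the prepared files into a bucket called my-own-bucket (a made-up name) under the prep/ prefix. You would run it after 2_prepare.py has finished.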

> Sorry for all the questions. I'm super new to GitHub and all these data techniques so still trying to get hang of it.

I'm happy to answer all questions, no worries.

1 reply

srdjov18 Jul 8, 2022
Author

OK, makes sense now. Thanks so much for all your help!
