PriceHopper

Author: Harry He

Project Description

PriceHopper Demo Images

The PriceHopper app allows users to compare grocery store prices and shopping carts between different companies. The backend scrapes grocery prices off the internet into a database, allowing users to use the app to query for the best prices and value.

Development Tools Used

  • React Native: Mobile / web framework used to develop the frontend app
  • Spring Boot: Java framework used to create the REST API backend
  • Python web scraping libraries (requests, BeautifulSoup, Selenium, ...): used to scrape grocery prices off the web
  • PostgreSQL: Database used to store the scraped prices for querying
  • Docker: Development tool used to package and build parts of the project

Challenges Faced

Problem

The major challenge in creating this project was figuring out how to scrape the prices. The issue is that essentially every major company with a website uses some DDoS protection and captcha service such as Cloudflare.

To bypass this issue, an initial approach was to use public proxies and / or Tor network routing to simulate many different IPs each loading the site only a few times (roughly once every 5 seconds). In this scenario, it would appear as though a number of ordinary computers were connecting at once. However, there were some issues with both of these methods.

  1. Public proxies are unreliable
    Most public proxies listed on sites such as ProxyScrape do not work: you cannot even establish a network connection. Even when they do connect, most do not run JavaScript, which most modern sites depend on, so pages effectively never load. To work around this, you can filter for the proxies that actually work, which is done in the WebScraping/Source/PublicProxies folder (a rough sketch of this kind of check appears after this list). However, even the good proxy endpoints stop being hosted after a few minutes, since there is essentially zero benefit to running a public proxy (you mostly attract cyber crime).
  2. Tor exit nodes are sketchy
    Tor exit node IPs are all publicly listed, and with Tor being known for its anonymity, I found that some sites seem to serve captchas more often than usual when they receive a request from those IPs. As a result, it becomes difficult to scrape anything, because you essentially have to figure out a way to solve captchas.
  3. Grocery prices are localized
    The most important consideration that stopped me from using this initial approach was that most of the prices I got back were very inconsistent. It is impractical to set cookies on every public proxy to pin a store location, so websites fall back on the IP address to pick the closest store and its prices. As a result, the prices come from vastly different locations and are inaccurate for any one user's region.
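
For illustration, a minimal sketch of the proxy-filtering idea is shown below. This is not the repository's actual code in WebScraping/Source/PublicProxies; it assumes candidates is a plain list of "host:port" strings (for example, pulled from a site like ProxyScrape) and uses the requests library to probe each one.

    # Minimal sketch of public-proxy filtering (illustrative only, not the
    # repository's actual implementation in WebScraping/Source/PublicProxies).
    # Assumes `candidates` is a list of "host:port" strings.
    import requests

    TEST_URL = "https://example.com"  # any lightweight page works as a probe

    def filter_working_proxies(candidates, timeout=5):
        working = []
        for proxy in candidates:
            proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
            try:
                # A proxy is only useful if it can actually complete a request.
                resp = requests.get(TEST_URL, proxies=proxies, timeout=timeout)
                if resp.status_code == 200:
                    working.append(proxy)
            except requests.RequestException:
                # Most public proxies fail here: refused connections, timeouts, etc.
                continue
        return working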

Solution

To solve this issue, it was necessary that I use my own IP address. However, after sending too many requests that look automated, a company will inevitably ban your IP from their site. (In fact, I was banned from TNT Supermarket's website for a few days!) One of the best ways to make your requests look legitimate is to actually make them legitimate, so I settled on a solution that simulated real human actions using keyboard macros. Using a regular web browser, I simulated keystrokes to visit websites, and it worked: I could parse many pages without captchas, which let me run the scraper for long periods of time to get all the products I needed. The keyboard macro scrapers are contained in WebScraping/ManualParseAutomater.py and WebScraping/ManualParallelParseAutomater.py.
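
As a rough illustration of the keyboard-macro idea (not the project's actual automation in WebScraping/ManualParseAutomater.py), the sketch below drives an already-open, focused browser window with simulated keystrokes. It assumes macOS-style hotkeys, the pyautogui and pyperclip libraries, and a hypothetical URL; it copies the rendered page text rather than the raw HTML.

    # Minimal sketch of keyboard-macro scraping (illustrative only).
    # Assumes a regular browser window is open and focused, and macOS hotkeys.
    import time
    import pyautogui   # simulates keyboard input
    import pyperclip   # reads the system clipboard

    def fetch_page_text(url, load_delay=8):
        pyautogui.hotkey("command", "l")       # focus the address bar
        pyautogui.typewrite(url, interval=0.02)
        pyautogui.press("enter")
        time.sleep(load_delay)                 # wait for the page (and any JS) to load
        pyautogui.hotkey("command", "a")       # select the rendered page text
        pyautogui.hotkey("command", "c")       # copy it to the clipboard
        time.sleep(0.5)
        return pyperclip.paste()               # visible text, ready to parse for prices

    if __name__ == "__main__":
        text = fetch_page_text("https://example.com/grocery/produce")  # hypothetical URL
        print(text[:500])

Because the keystrokes go through a real browser, the traffic looks like an ordinary person browsing, which is exactly why it avoids the captcha walls described above.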

Note: Many sites worth scraping only need to be visited once, so these workarounds aren't normally necessary. Furthermore, this solution is really only needed for sites with strong DDoS protection. Sites like Wikipedia, where you may also want to retrieve data, are not nearly so defensive.

Installation and Running

  1. Clone this repository and enter the folder

    git clone https://github.com/he-is-harry/PriceHopper.git
    cd PriceHopper
    
  2. Ensure that you have Docker installed on your computer; if not, you can install it from the Docker website. We will need the official Postgres Docker image, which provides the Postgres database where all the products are stored. Note that your database will start out empty, so refer to the WebScraping/ScriptingNotes.md file for more information on retrieving the data yourself (or you can ask me for the data).

    docker pull postgres
    
  3. Start the database

    cd Database
    docker compose up
    
  4. To start the backend for development, there are a few more prerequisites: you will need Java 17+ and Maven. On macOS, these can be installed with Homebrew via brew install openjdk and brew install maven. From there, you can start the backend with the following commands.

    cd ../Backend
    mvn spring-boot:run
    
  5. To start the frontend, first install the necessary packages, then run the app using Expo, the framework used to develop it.

    cd ../Frontend
    npm install
    npx expo start
    
  6. If you have an iOS simulator, you can press i to open the app there. Alternatively, press w to open it in your web browser, or scan the QR code displayed in the terminal to run it on your phone (you may need to install the Expo Go app).

From there, you should be all set! If you don't have any data, refer to the WebScraping/ScriptingNotes.md file for more information on how to get it.
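
As a quick sanity check that the backend is up after these steps, a small script like the one below can be used. It assumes Spring Boot's default port of 8080 (adjust if the project configures a different one) and does not rely on any particular endpoint; any HTTP response, even a 404, means the server is running.

    # Quick sanity check that the backend is reachable (illustrative only).
    # Assumes the Spring Boot default port 8080.
    import requests

    try:
        resp = requests.get("http://localhost:8080/", timeout=5)
        print(f"Backend responded with HTTP {resp.status_code}")
    except requests.ConnectionError:
        print("Could not reach the backend on port 8080 - is it running?")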
