
Commit 6d3ff98

Merge pull request #18 from Vedant950/main
Bug fix and minor update
2 parents 9976dbf + 05067b6

8 files changed: +374 −25 lines

LICENSE

+1 −1
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2021 Vedant Tibrewal, Vedaant Singh.
+Copyright (c) 2022 Vedant Tibrewal, Vedaant Singh.
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md

+19 −14
@@ -6,9 +6,9 @@
 
 ## PyScrappy: powerful Python data scraping toolkit
 
-[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)
+[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)
 
-[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/)
+[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/)
 [![PyPI Latest Release](https://img.shields.io/pypi/v/PyScrappy.svg)](https://pypi.org/project/PyScrappy/)
 
 [![Package Status](https://img.shields.io/pypi/status/PyScrappy.svg)](https://pypi.org/project/PyScrappy/)
@@ -21,21 +21,22 @@
 
 [![](https://img.shields.io/badge/pyscrappy-official%20documentation-blue)](https://pyscrappy.netlify.app/)
 
-
 ## What is it?
 
 **PyScrappy** is a Python package that provides a fast, flexible, and exhaustive way to scrape data from various different sources. Being an
 easy and intuitive library. It aims to be the fundamental high-level building block for scraping **data** in Python. Additionally, it has the broader goal of becoming **the most powerful and flexible open source data scraping tool available**.
 
 ## Main Features
+
 Here are just a few of the things that PyScrappy does well:
 
-- Easy scraping of [**Data**](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b) available on the internet
-- Returns a [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for further analysis and research purposes.
-- Automatic [**Data Scraping**](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b): Other than a few user input parameters the whole process of scraping the data is automatic.
-- Powerful, flexible
+- Easy scraping of [**Data**](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b) available on the internet
+- Returns a [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for further analysis and research purposes.
+- Automatic [**Data Scraping**](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b): Other than a few user input parameters the whole process of scraping the data is automatic.
+- Powerful, flexible
 
 ## Where to get it
+
 The source code is currently hosted on GitHub at:
 https://github.com/mldsveda/PyScrappy
 
@@ -47,13 +48,14 @@ pip install PyScrappy
 ```
 
 ## Dependencies
-- [selenium - Selenium is a free (open-source) automated testing framework used to validate web applications across different browsers and platforms.](https://www.selenium.dev/)
-- [webdriver-manger - WebDriverManager is an API that allows users to automate the handling of driver executables like chromedriver.exe, geckodriver.exe etc required by Selenium WebDriver API. Now let us see, how can we set path for driver executables for different browsers like Chrome, Firefox etc.](https://github.com/bonigarcia/webdrivermanager)
-- [beautifulsoup4 - Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages.](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
-- [pandas - Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.](https://pandas.pydata.org/)
 
+- [selenium](https://www.selenium.dev/) - Selenium is a free (open-source) automated testing framework used to validate web applications across different browsers and platforms.
+- [webdriver-manager](https://github.com/bonigarcia/webdrivermanager) - WebDriverManager is an API that automates the handling of driver executables such as chromedriver.exe and geckodriver.exe, required by the Selenium WebDriver API.
+- [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages.
+- [pandas](https://pandas.pydata.org/) - Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
 
 ## License
+
 [MIT](https://github.com/mldsveda/PyScrappy/blob/main/LICENSE)
 
 ## Getting Help
@@ -62,16 +64,19 @@ For usage questions, the best place to go to is [StackOverflow](https://stackove
 Further, general questions and discussions can also take place on GitHub in this [repository](https://github.com/mldsveda/PyScrappy).
 
 ## Discussion and Development
+
 Most development discussions take place on GitHub in this [repository](https://github.com/mldsveda/PyScrappy).
 
 Also visit the official documentation of [PyScrappy](https://pyscrappy.netlify.app/) for more information.
 
 ## Contributing to PyScrappy
+
 All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
 
-If you are simply looking to start working with the PyScrappy codebase, navigate to the [GitHub "issues" tab](https://github.com/mldsveda/PyScrappy/issues) and start looking through interesting issues.
+If you are simply looking to start working with the PyScrappy codebase, navigate to the GitHub ["issues"](https://github.com/mldsveda/PyScrappy/issues) tab and start looking through interesting issues.
 
 ## End Notes
-*Learn More about this package on [Medium](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b).*
 
-### ***This package is solely made for educational and research purposes.***
+_Learn More about this package on [Medium](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b)._
+
+### **_This package is solely made for educational and research purposes._**
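The README stops short of a quick-start snippet. A minimal sketch consistent with the docstrings added in this commit; the no-argument `ECommerceScrapper()` constructor is an assumption drawn from the `obj.` convention in those docstrings, not something this diff confirms:

```python
# Hypothetical quick start; assumes ECommerceScrapper() takes no arguments.
import PyScrappy as ps

obj = ps.ECommerceScrapper()
df = obj.flipkart_scrapper("Product Name", 3)  # returns a pandas DataFrame
print(df.head())
```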

setup.py

+3 −3
@@ -6,21 +6,21 @@
 
 setuptools.setup(
     name="PyScrappy",
-    version="0.0.9",
+    version="0.1.0",
     author="Vedant Tibrewal, Vedaant Singh",
     author_email="[email protected]",
     description="Powerful web scraping tool.",
     long_description=long_description,
     long_description_content_type="text/markdown",
     url="https://github.com/mldsveda/PyScrappy",
-    keywords=['PyScrappy', 'Scraping', 'E-Commerce', 'Wikipedia', 'Image Scrapper', 'YouTube', 'Scrapy', 'Twitter', 'Social Media', 'Web Scraping', 'News', 'Stocks', 'Songs', 'Food', 'Instagram'],
+    keywords=['PyScrappy', 'Scraping', 'E-Commerce', 'Wikipedia', 'Image Scrapper', 'YouTube', 'Scrapy', 'Twitter', 'Social Media', 'Web Scraping', 'News', 'Stocks', 'Songs', 'Food', 'Instagram', 'Movies'],
     classifiers=[
         "Programming Language :: Python :: 3",
         "License :: OSI Approved :: MIT License",
         "Operating System :: OS Independent",
     ],
     python_requires=">=3.6",
-    py_modules=["PyScrappy", "alibaba", "flipkart", "image", "instagram", "news", "snapdeal", "soundcloud", "stock", "swiggy", "twitter", "wikipedia", "youtube", "zomato"],
+    py_modules=["PyScrappy", "alibaba", "amazon", "flipkart", "image", "imdb", "instagram", "news", "snapdeal", "soundcloud", "spotify", "stock", "swiggy", "twitter", "wikipedia", "youtube", "zomato"],
     package_dir={"": "src"},
     install_requires=[
         'selenium',
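One consequence of combining `package_dir={"": "src"}` with `py_modules` is that every listed module installs at the top level of site-packages rather than under a `PyScrappy` package. An illustrative check after upgrading (the `importlib.metadata` call assumes Python 3.8+; module names come from the list above):

```python
# Illustrative: confirm the 0.1.0 release and the flat module layout.
from importlib.metadata import version

print(version("PyScrappy"))  # expected: 0.1.0

import PyScrappy  # main entry point
import amazon     # new top-level module in this release
import imdb       # new top-level module in this release
```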

src/PyScrappy.py

+144 −7
@@ -5,8 +5,9 @@ class ECommerceScrapper():
 
     ECommerece Scrapper: Helps in scrapping data from E-Comm websites
     1. Alibaba
-    2. Flipkart
-    3. Snapdeal
+    2. Amazon
+    3. Flipkart
+    4. Snapdeal
 
     Type: class
 
@@ -54,6 +55,40 @@ def alibaba_scrapper(self, product_name, n_pages):
         return alibaba.scrappi(product_name, n_pages)
 
 
+    ############## Amazon Scrapper ##############
+    def amazon_scrapper(self, product_name, n_pages):
+
+        """
+
+        Amazon Scrapper: Helps in scraping Amazon data ('Description', 'Rating', 'Votes', 'Offer Price', 'Actual Price').
+        return type: DataFrame
+
+        Parameters
+        ------------
+        product_name: Enter the name of the desired product
+                      Type: str
+
+        n_pages: Enter the number of pages that you want to scrape
+                 Type: int
+
+        Note
+        ------
+        Both arguments are required.
+        If n_pages == 0: a prompt will ask you to enter a valid page number and the scrapper will re-run.
+
+        Example
+        ---------
+        >>> obj.amazon_scrapper('product', 3)
+        out: Description    Rating    Votes    Offer Price    Actual Price
+             product a      3.5       440      140            240
+             product b      4.5       240      340            440
+
+        """
+
+        import amazon
+        return amazon.scrappi(product_name, n_pages)
+
+
     ############## Flipkart Scrapper ##############
     def flipkart_scrapper(self, product_name, n_pages):
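A short, hedged call sketch for the new method; `obj` follows the docstring's own convention, and the no-argument constructor is assumed rather than confirmed by this diff:

```python
import PyScrappy as ps

obj = ps.ECommerceScrapper()           # assumed no-arg constructor
df = obj.amazon_scrapper("laptop", 2)  # scrapes two result pages
print(df[["Description", "Rating", "Votes"]].head())
obj.amazon_scrapper("laptop", 0)       # n_pages == 0 triggers the re-prompt
```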
@@ -79,8 +114,8 @@ def flipkart_scrapper(self, product_name, n_pages):
         ---------
         >>> obj.flipkart_scrapper("Product Name", 3)
         out: Name    Price    Original Price    Description    Rating
-             abc     ₹340     ₹440              Product        4.2
-             aec     ₹140     ₹240              Product        4.7
+             abc     ₹340     ₹440              Product        4.2
+             aec     ₹140     ₹240              Product        4.7
 
         """
 
@@ -113,8 +148,8 @@ def snapdeal_scrapper(self, product_name, n_pages):
         ---------
         >>> obj.snapdeal_scrapper('product', 3)
         out: Name    Price    Original Price    Number of Ratings
-             abc     ₹340     ₹440              40
-             aec     ₹140     ₹240              34
+             abc     ₹340     ₹440              40
+             aec     ₹140     ₹240              34
 
         """
 
@@ -216,7 +251,7 @@ def image_scrapper(data_name, n_images=10, img_format='jpg', folder_name='images
 
     """
 
-    Image Scrapper: Helps in scrapping images from "Google", "Yahoo", "Bing".
+    Image Scrapper: Helps in scrapping images from "Google", "Yahoo", "Bing".
     Downloads it to the desired folder.
 
     Parameters
@@ -257,6 +292,75 @@ def image_scrapper(data_name, n_images=10, img_format='jpg', folder_name='images
 
 ########################################################################################################################
 
+############## IMDB Scrapper ##############
+def imdb_scrapper(genre, n_pages):
+
+    """
+
+    IMDB Scrapper: Helps in scraping movies from IMDB.
+    return type: DataFrame
+
+    Parameters
+    ------------
+    genre: Enter the genre of the movie
+           Type: str
+
+    n_pages: Enter the number of pages that it will scrape at a single run.
+             Type: int
+
+    Note
+    ------
+    Both the parameters are compulsory.
+
+    Example
+    ---------
+    >>> imdb_scrapper('action', 4)
+    out: Title    Year    Certificate    Runtime    Genre     Rating    Description    Stars    Directors    Votes
+         asd     2022    UA             49min      action    3.9       about the..    asd      dfgv         23
+         scr     2022    15+            89min      action    4.9       about the..    add      dfgv         23
+    """
+
+    import imdb
+    return imdb.scrappi(genre, n_pages)
+
+########################################################################################################################
+
+############## LinkedIn Scrapper ##############
+def linkedin_scrapper(job_title, n_pages):
+
+    """
+
+    LinkedIn Scrapper: Helps in scraping job-related data from LinkedIn (Job Title, Company Name, Location, Salary, Benefits, Date).
+    return type: DataFrame
+
+    Parameters
+    ------------
+    job_title: Enter the job title or type.
+               Type: str
+
+    n_pages: Enter the number of pages that it will scrape at a single run.
+             Type: int
+
+    Note
+    ------
+    Both the parameters are compulsory.
+
+    Example
+    ---------
+    >>> linkedin_scrapper('python', 1)
+    out: Job Title    Company Name    Location    Salary    Benefits              Date
+         abc          PyScrappy       US          2300      Actively Hiring +1    1 day ago
+         abc          PyScrappy       US          2300      Actively Hiring +1    1 day ago
+         ...
+
+    """
+
+    import linkedin
+    return linkedin.scrappi(job_title, n_pages)
+
+########################################################################################################################
+
 ############## News Scrapper ##############
 def news_scrapper(n_pages, genre = str()):
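Unlike the E-commerce methods, these two scrappers are module-level functions (no `self`), so they are called straight off the module. One caveat worth flagging: `linkedin` is absent from the `py_modules` list in setup.py above, so the lazy `import linkedin` may fail in an installed copy. A hedged sketch, with return columns taken from the docstrings:

```python
import PyScrappy as ps

movies = ps.imdb_scrapper("action", 2)    # Title, Year, Certificate, Runtime, ...
jobs = ps.linkedin_scrapper("python", 1)  # Job Title, Company Name, Location, ...
print(movies.shape, jobs.shape)
```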
@@ -530,6 +634,39 @@ def soundcloud_scrapper(self, track_name, n_pages):
         import soundcloud
         return soundcloud.soundcloud_tracks(track_name, n_pages)
 
+
+    ############## Spotify Scrapper ##############
+    def spotify_scrapper(self, track_name, n_pages):
+
+        """
+
+        Spotify Scrapper: Helps in scraping data from Spotify ('Id', 'Title', 'Singers', 'Album', 'Duration').
+        return type: DataFrame
+
+        Parameters
+        ------------
+        track_name: Enter the name of the desired track/song/music/artist/podcast
+                    Type: str
+
+        n_pages: The number of pages that it will scrape at a single run
+                 Type: int
+
+        Note
+        ------
+        Make sure to enter a valid name.
+
+        Example
+        ---------
+        >>> obj.spotify_scrapper('pop', 3)
+        out: Id    Title    Singers    Album    Duration
+             1     abc      abc        abc      2:30
+             2     def      def        def      2:30
+
+        """
+
+        import spotify
+        return spotify.scrappi(track_name, n_pages)
+
 ########################################################################################################################
 
 ############## stock Scrapper ##############
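A pattern worth noting across this file: every scrapper defers its backing import (`import amazon`, `import imdb`, `import spotify`, ...) to call time, so importing the package stays cheap and a broken optional dependency only surfaces when the matching scrapper is actually used. A self-contained illustration of the same technique:

```python
class LazyImportDemo:
    """Illustration only: mirrors PyScrappy's per-method lazy imports."""

    def stats(self, values):
        # pandas is imported only when stats() first runs, just as each
        # scrapper above runs `import amazon` etc. inside the method body.
        import pandas as pd
        return pd.Series(values).describe()

print(LazyImportDemo().stats([1, 2, 3]))
```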

src/amazon.py

+60 −0
@@ -0,0 +1,60 @@
+# Amazon search-results scrapper, written against the Selenium 3.x element API.
+import pandas as pd
+from time import sleep
+from webdriver_manager.chrome import ChromeDriverManager
+from selenium import webdriver
+
+def func(cards):
+    # Pull description, rating, votes and prices out of each result card.
+    data = []
+    for card in cards:
+        # Amazon varies the card layout, so probe the known container paths in turn.
+        try: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div[3]")
+        except:
+            try: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div[2]")
+            except:
+                try: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div/div[3]")
+                except: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div/div[2]")
+        try: description = info.find_element_by_xpath("./div[1]/h2").text
+        except: description = None
+        try: rating = info.find_element_by_xpath("./div[2]/div/span").get_attribute("aria-label")
+        except: rating = None
+        try: votes = info.find_elements_by_xpath("./div[2]/div/span")[1].text
+        except: votes = None
+        try: offer_price = info.find_element_by_class_name("a-price").text.replace("\n", ".")
+        except: offer_price = None
+        # The struck-through price is the list price; fall back to the offer price.
+        try: actual_price = info.find_element_by_class_name("a-price").find_element_by_xpath("..//span[@data-a-strike='true']").text
+        except: actual_price = offer_price
+
+        data.append([description, rating, votes, offer_price, actual_price])
+
+    return data
+
+def scrappi(product_name, n_pages):
+    # Launch headless Chrome; print_first_line exists only in older webdriver-manager releases.
+    chrome_options = webdriver.ChromeOptions()
+    chrome_options.add_argument('--headless')
+    driver = webdriver.Chrome(ChromeDriverManager(print_first_line=False).install(), options=chrome_options)
+
+    url = "https://www.amazon.com/s?k=" + product_name
+    driver.get(url)
+    sleep(4)
+
+    # Reload until the search results actually appear, re-querying the cards
+    # on every pass so the loop can terminate.
+    cards = driver.find_elements_by_xpath('//div[@data-component-type="s-search-result"]')
+    while len(cards) == 0:
+        driver.get(url)
+        sleep(4)
+        cards = driver.find_elements_by_xpath('//div[@data-component-type="s-search-result"]')
+
+    # Clamp the requested page count to what the pagination strip offers.
+    max_pages = int(driver.find_element_by_xpath(".//span[@class='s-pagination-strip']/span[last()]").text)
+    while n_pages > max_pages or n_pages == 0:
+        print(f"Please enter a valid number of pages between 1 and {max_pages}:")
+        n_pages = int(input())
+
+    data = []
+
+    # Scrape each requested page, then advance with the "Next" control.
+    while n_pages > 0:
+        n_pages -= 1
+        data.extend(func(driver.find_elements_by_xpath('//div[@data-component-type="s-search-result"]')))
+        driver.find_element_by_class_name("s-pagination-next").click()
+        sleep(4)
+
+    driver.close()
+    return pd.DataFrame(data, columns=["Description", "Rating", "Votes", "Offer Price", "Actual Price"])
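Note that `find_element_by_*` / `find_elements_by_*` belong to the Selenium 3 element API, which was removed in Selenium 4, and `print_first_line` was dropped from newer webdriver-manager releases, so this module effectively requires `selenium<4` plus an older webdriver-manager. For comparison, a hedged sketch of the same search-result lookup under Selenium 4.6+ (which bundles its own driver manager); this is not part of the commit:

```python
# Illustrative Selenium 4 equivalent; requires selenium>=4.6.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)  # Selenium Manager resolves the driver

driver.get("https://www.amazon.com/s?k=laptop")
cards = driver.find_elements(By.XPATH, '//div[@data-component-type="s-search-result"]')
print(len(cards))
driver.quit()
```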
