# %% [markdown]
# # LinkedIn jobs sentiment visualization and web scraping - `Python`, `JavaScript`
# ## Description
# This `Python` script scrapes up to 100 of the most recent LinkedIn Job Postings for any Job Title and creates a sentiment visualization in the form of a **word cloud**.
# ## Setting up
# First, we import all the necessary libraries. We also specify the **Job Title** we wish to visualize and the number of **Job Postings** to scrape, then build a LinkedIn search link from the Job Title.
#
# Note that by default the jobs' location is set to **Canada**. It can be changed to any other location by pasting a LinkedIn URL that points at the desired Job Postings in that location (see the commented example in the code cell below).
#
# We also set up `Chrome Driver` locally by specifying its path, and finally open the link in a Chrome browser with `selenium`.
# %%
# Importing libraries and specifying URL and Chrome driver path
import time
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from wordcloud import WordCloud
import advertools as adv
import matplotlib.pyplot as plt
# specifying URL and number of job postings
postings_name = 'Data Scientist'
position_num = 10 # numbers 1 to 100
# building linkedin link
posit = '%20'.join(postings_name.split())
url = f'https://www.linkedin.com/jobs/search?keywords={posit}&location=Canada&geoId=101174742&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0'
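# NOTE: to target a different location, paste a jobs-search URL copied from
# the browser instead; the commented line below is only an illustrative
# placeholder (its parameters are not verified values):
# url = 'https://www.linkedin.com/jobs/search?keywords=Data%20Scientist&location=United%20Kingdom'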
# CHROME DRIVER INSTALLATION
# the default chromedriver PATH on macOS is /usr/local/bin
# alternatively, specify any path like:
# path = '/Users/username/file/path/chromedriver'
# and point to it in the brackets:
# driver = webdriver.Chrome(path)
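# on Selenium 4+ the executable path is passed via a Service object instead;
# a minimal sketch (the path below is a placeholder):
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'))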
# opening url in Chrome browser
driver = webdriver.Chrome()
driver.get(url)
# %% [markdown]
# Now the link is opened in the **Chrome browser**. However, most job postings are **hidden** and cannot be scraped yet, so we have to scroll the webpage down before scraping.
# ## Loading the webpage by scrolling down with `JavaScript`
# Since `Python` has no built-in way to scroll a page, we use `JavaScript` both to scroll the webpage down and to check the **body height**, which tells us when to stop the loop.
# %%
# scrolling down the webpage with javascript
# assigning webpage's body height in pixels
previous_height = driver.execute_script('return document.body.scrollHeight')
# scrolling to the end of the body tag continuously until the body height stops increasing
while True:
    # scrolling to the bottom of the body (y coordinate)
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    # pausing for 1 sec to let new postings load
    time.sleep(1)
    # assigning the webpage's increased body height in pixels
    new_height = driver.execute_script('return document.body.scrollHeight')
    # breaking the loop once the body stops growing
    if new_height == previous_height:
        break
    # updating previous height for the next loop
    previous_height = new_height
# %% [markdown]
# After 100 job postings there is a "See more jobs" button. Since 100 jobs are enough for our purposes we don't press the button (a hedged sketch of clicking it follows the next cell).
# ## Scraping job postings' links
# Once we have all the loaded job postings we can **scrape the links** with `selenium`.
# %%
# extracting hrefs
# specifying the class where the hrefs are located
lnks = driver.find_elements(By.CLASS_NAME, 'base-card__full-link')
# looping through the elements and extracting hrefs into a list
links_list = []
for lnk in lnks[:position_num]:
    link_str = lnk.get_attribute('href')
    links_list.append(link_str)
driver.quit()
# previewing list's contents
for y in range(3):
    print(links_list[y])
    print()
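# %% [markdown]
# As an aside: if more than 100 postings were ever needed, the "See more jobs" button could be clicked with `selenium` before quitting the driver. The helper below is only a hypothetical sketch and is not called anywhere in this script; in particular, locating the button by its visible text is an assumption and may not match LinkedIn's actual markup.
# %%
# hypothetical helper (not used above): click "See more jobs" to load more postings
def click_see_more(drv, pause=1):
    # locate the button by its visible text (an assumed selector) and click it,
    # then pause briefly so the extra postings can load
    button = drv.find_element(By.XPATH, "//button[contains(., 'See more jobs')]")
    button.click()
    time.sleep(pause)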
# %% [markdown]
# ## Parsing and scraping each job posting
# Now we can use `requests.get()` to fetch each link and the `BeautifulSoup` module to scrape the text of its **Job Description**. All the text is appended to one string.
# %%
# scraping URLs' contents and combining into one string
# creating text string
time.sleep(1)
words_str = ''
# try/except so a failed request or a missing tag does not crash the script
try:
    # looping through the links
    for link in links_list:
        req = requests.get(link)
        print(req)
        req = req.text
        # converting to BeautifulSoup
        soup = BeautifulSoup(req, 'lxml')
        # extracting text based on class
        markup = soup.find('div', class_="show-more-less-html__markup").text
        # appending to the string and converting to lowercase
        words_str = f'{words_str} {markup}'.lower()
        # pausing for 1 sec to avoid error 429 (Too Many Requests)
        time.sleep(1)
except Exception as e:
    # keep whatever was scraped so far and report why the loop stopped
    print(e)
# previewing the string's contents
print(words_str[:500])
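# %% [markdown]
# The 1-second pause above is a simple way to avoid LinkedIn's rate limiting. A slightly more robust (hypothetical) alternative is to retry a request with a growing delay whenever the server answers **429 Too Many Requests**, as sketched below; the retry count and delays are arbitrary choices, not documented LinkedIn limits, and the helper is not used elsewhere in this script.
# %%
# hypothetical helper (not used above): fetch a URL, backing off on HTTP 429
def fetch_with_backoff(link, retries=3, delay=1):
    for attempt in range(retries):
        resp = requests.get(link)
        if resp.status_code != 429:
            return resp
        # wait a little longer after every 429 before retrying
        time.sleep(delay * (attempt + 1))
    return resp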
# %% [markdown]
# ## Plotting the word cloud
# Since Canadian Job Postings can be in **English or French**, we use **stop words** from the `advertools` module, combining the two languages into one set.
#
# We then create a `WordCloud` object, plot it with `matplotlib`, and save it as a `.png` file.
# %%
# plotting word cloud
# joining the sets of stopwords: English and French
sw_en_fr = adv.stopwords['english'].union(adv.stopwords['french'])
# generating word cloud with WordCloud module
wordcloud = WordCloud(width=800, height=800,
                      background_color='black',
                      stopwords=sw_en_fr,
                      min_font_size=10
                      ).generate(words_str)
# plotting the WordCloud image with matplotlib
plt.figure(figsize=(10, 10), facecolor='Black')
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
# assigning file path
p_name = '_'.join(postings_name.split())
f_path = f'{p_name}-{position_num}.png'
# saving png
plt.savefig(f_path)
# displaying the plot
plt.show()
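# %% [markdown]
# As a quick sanity check on the visualization, the relative frequencies behind the cloud can also be inspected directly. The snippet below relies on the `words_` attribute that `WordCloud` exposes after `generate()`; printing the top ten is an arbitrary choice.
# %%
# previewing the ten most frequent terms behind the word cloud
# WordCloud.words_ maps each word to its frequency, normalized to the top word
for word, freq in list(wordcloud.words_.items())[:10]:
    print(f'{word}: {freq:.2f}')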
# %%