This project is an analysis of tech job postings found in the Common Crawl database, a corpus of web pages collected as an ongoing archive of the internet. The data was gathered using a combination of tools, including Scala with Apache Spark and AWS services such as EMR, Athena, and S3. We explored the following questions:
1. Where do we see fewer tech ads proportional to population?
2. What companies are posting the most tech jobs?
3. What percent of tech job posters post no more than 3 job ads a month?
4. Is there a significant spike in tech job postings at the end of business quarters?
5. Are there general trends in tech job postings over the past year?
6. What percentage of entry-level jobs require experience?
7. What are the top three qualifications or certifications requested by employers?
We first used the Common Crawl web interface to get a quick sense of what the archive contains. We then downloaded Common Crawl indices via cdx_toolkit and/or plain HTTP requests, from either the command line or Python, and fetched the captured page contents by iterating over the CDX records as WARC objects (warcio), converting them to text with Beautiful Soup, html2text, markdown, and mistletoe. Other tools exist as well, such as the CDX Index Client for the command line and comcrawl for Python, but they seemed less flexible than the primary toolchain used in this project. A minimal fetch-and-extract sketch follows.
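The sketch below is a minimal illustration (not the project's exact pipeline) of pulling a few captures from the Common Crawl index with cdx_toolkit and reducing the archived HTML to plain text with Beautiful Soup; the URL pattern and limits are placeholder assumptions.

```python
import cdx_toolkit
from bs4 import BeautifulSoup

# 'cc' points cdx_toolkit at the Common Crawl index.
cdx = cdx_toolkit.CDXFetcher(source='cc')

# Hypothetical URL pattern; a real run would target pages known to host job ads.
for capture in cdx.iter('careers.example.com/*', limit=5, filter='status:200'):
    html = capture.content  # downloads the archived page body for this capture
    text = BeautifulSoup(html, 'html.parser').get_text(separator=' ', strip=True)
    print(capture['url'], text[:100])
```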
My team worked on answering Q7: What are the top three qualifications or certifications requested by employers?
- Scala 2.11.12
- Spark SQL
- AWS EMR
- AWS Athena
- AWS S3
- Hadoop
- S3cmd
- awscmd
- Python 3.8.3
- pandas
- numpy
- matplotlib.pyplot
- seaborn
- Jupyter Notebook
- PySpark (alternative)
- Docker (alternative)
- Scikit-learn
- TensorFlow or PyTorch (alternative)
- Git + GitHub
In this analysis, we worked with almost 100k documents, each containing a single job ad. Before analyzing the files, we needed to pre-process and clean the text data; as with most Natural Language Processing (NLP) tasks, that meant tokenizing the text, with regular expressions used for cleanup. TF-IDF stands for Term Frequency-Inverse Document Frequency. TF-IDF weighs each term in a document by how unique it is to that document, which lets us summarize the contents of a document with a few key words. To visualize the TF-IDF vectorization, we use a technique called t-SNE (t-distributed Stochastic Neighbor Embedding). Both graphs below show the same basic trend: a separation between the blue and red groups, where one group contains the selected keywords and the other does not. In other words, the highest-valued elements of each document vector correspond to words that are unique to that specific document (here, the selected keywords), or at least rarely used in the others. A short vectorization sketch follows the figures.
3D t-SNE | 2D t-SNE |
---|---|
![]() | ![]() |
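A minimal sketch of the TF-IDF vectorization and t-SNE projection, assuming `docs` is a list of cleaned job-ad strings (the variable names and parameters here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf = vectorizer.fit_transform(docs)        # sparse matrix of shape (n_docs, n_terms)

# t-SNE projects the high-dimensional TF-IDF vectors down to 2-D for plotting.
tsne = TSNE(n_components=2, perplexity=30, init='random', random_state=42)
coords = tsne.fit_transform(tfidf.toarray())  # (n_docs, 2) array of plot coordinates
```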
Machine Learning algorithms don't understand strings. However, they do understand math, which means they understand vectors and matrices. By vectorizing the text, we convert the entire text into a vector in which each element represents a different word. We first created a frequency distribution by tokenizing every job-ad text file, then asked how large the total vocabulary across all the job ads is, and inspected the top 1,000 most common words. Knowing the frequency with which each word is used is somewhat informative, but without the context of how many words are used in total, it doesn't tell us much. One way to adjust for this is normalized word frequency, which we compute by dividing each word's frequency by the total number of words. A minimal sketch of this step follows.
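A rough sketch of the frequency-distribution step with NLTK; `corpus_text` stands in for the concatenated job-ad text and is an assumed variable:

```python
import re
from nltk import FreqDist
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

tokens = [t.lower() for t in word_tokenize(corpus_text)
          if re.fullmatch(r'[a-z]+', t.lower())]  # keep alphabetic tokens only
freq = FreqDist(tokens)

print(len(freq))                # size of the total vocabulary
print(freq.most_common(10))     # a peek at the most common words

total = sum(freq.values())
normalized = {word: count / total for word, count in freq.most_common(1000)}
```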
Knowing individual word frequencies is somewhat informative, but in practice some of these tokens are actually parts of larger phrases that should be treated as a single unit. So next we created bigrams and looked at which combinations of words are most telling, as in the sketch below.
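One way to surface those phrases, sketched here with NLTK's collocation finder on the `tokens` list from the previous snippet:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)                   # drop bigrams that occur fewer than 5 times
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 20))  # 20 most telling word pairs by PMI
```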
We created our own word embeddings using the Word2Vec model, and later moved on to building neural networks that make use of Embedding layers.
- Train a Word2Vec model and transform words into vectors
- Obtain most similar words by using methods associated with word vectors
With a trained Word2Vec model in hand, we can explore the relationships between words in the corpus. One useful thing Word2Vec gives us is the most similar words to a given word; here, we queried the words most similar to 'qualification'. A minimal training-and-query sketch follows, and the tables below summarize the results for a range of query words.
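A minimal Word2Vec sketch using gensim's 4.x API; `tokenized_ads` (a list of token lists, one per job ad) and the hyperparameters are assumptions:

```python
from gensim.models import Word2Vec

model = Word2Vec(sentences=tokenized_ads, vector_size=100, window=5,
                 min_count=2, workers=4)

# Nearest neighbours of a word in the learned embedding space.
print(model.wv.most_similar('qualification', topn=7))
```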
qualification | certification | experience | skills |
---|---|---|---|
![]() | ![]() | ![]() | ![]() |
qualification | certification | experience | skills |
---|---|---|---|
btech (Bachelor of Technology) | cpp (Certified Payroll Professional) | five | interpersonal |
mca (Master of Computer Application) | bls (Basic Life Support) | supervisory | excellent |
gnm (General Nursing Midwifery) | az (Azure) | discipline | verbal |
mba (Master of Business Administration) | gauging | prior | organizational |
bca (Bachelor of Computer Application) | awareness | musthave | communication |
mcom (Master of Commerce) | ergonomic | ideally | analytical |
msc (Master of Science) | datacenter | years | multitasking |
technologies | abilities | industrial | degree |
---|---|---|---|
![]() | ![]() | ![]() | ![]() |
technologies | abilities | industrial | degree |
---|---|---|---|
database | competencies | mechanical | masters |
hadoop | interpersonal | electrical | bachelor |
cloud | timemanagement | structural | science |
server | presentation | engineering | postsecondary |
paas (platform as a service) | thinking | mining | phd |
architectures | math | discipline | postgraduate |
iaas (infrastructure as a service) | efficiency | production | bsc (Bachelor of Science) |
data | computing | programming | languages |
---|---|---|---|
![]() | ![]() | ![]() | ![]() |
data | computing | programming | languages |
---|---|---|---|
structures | components | python | python |
analytics | distributed | java | java |
analysis | linux | javascript | matlab |
minitab | algorithms | react | c |
fpga (Field Programmable Gate Array) | architectures | django | scala |
sas | erp (Enterprise) | git | r |
pipelines | fpga (Field Programmable Gate Array) | nodejs | framework |
qualifications | collaboration | machine | developer |
---|---|---|---|
![]() | ![]() | ![]() | ![]() |
qualifications | collaboration | machine | developer |
---|---|---|---|
minimum | competing | prescriptive | web |
representative | priorities | learning | wordpress |
counting | peers | neural | designer |
required | teamwork | algorithms | android |
age | diplomacy | computing | cooperating |
older | relationships | predictive | php |
preferred | effective | deep | mobile |
In the following, we use transfer learning, loading the weights of an open-source model that has already been trained for a very long time on a massive amount of data. Specifically, we work with the GloVe model from the Stanford NLP Group. There is little benefit to training such a model ourselves unless our text uses specialized vocabulary that is unlikely to be well represented in an open-source model. A loading sketch is given below.
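A sketch of that loading step: parse a pre-trained GloVe file and freeze its vectors inside a Keras Embedding layer. The file name and the `word_index` mapping (word-to-integer ids from a fitted tokenizer) are assumptions, not the project's exact artifacts.

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 100
glove = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:   # assumed local GloVe file
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

# Build a matrix whose row i holds the GloVe vector for the word with id i.
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, idx in word_index.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]

embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(word_index) + 1,
    output_dim=EMBED_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)  # keep the pre-trained weights frozen
```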
Machine Learning Models | Accuracy |
---|---|
Random Forest | 86.5% |
Support Vector Machine | 78.4% |
Logistic Regression | 78.7% |
Gradient Boosting | 84.2% |
Multi-layer Perceptron | 86-87% |
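For context, a hedged sketch of how such a comparison could be run with scikit-learn; the TF-IDF matrix `tfidf` and the label vector `y` (for instance, whether an ad mentions one of the selected qualification keywords) are assumptions, not the project's exact setup.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

models = {
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'Support Vector Machine': LinearSVC(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Gradient Boosting': GradientBoostingClassifier(),
}
for name, clf in models.items():
    scores = cross_val_score(clf, tfidf, y, cv=5, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f}')
```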
loss | accuracy |
---|---|
![]() | ![]() |
Before you begin, ensure you have met the following requirements:
- You have installed the latest version of anaconda3.
- You have a <Windows/Linux/Mac> machine.
- You have read the following list of links on how to download and manipulate Common Crawl data:
HTML, or HyperText Markup Language, is a markup language that describes the structure and semantic meaning of web pages. Web browsers, such as Mozilla Firefox, Internet Explorer, and Google Chrome, interpret the HTML code and use it to render output. Unlike Python, JavaScript, and other programming languages, markup languages like HTML don't have any logic behind them. Instead, they simply surround the content to convey structure and meaning.
HTML lets us mark up our content with semantic structure. It forms the skeleton of our web page. It would be great to be able to say, "Browser, when we see a `p` tag with `id` of `my-name`, make the first letter be huge!" Or, to get our readers' attention, we might say, "Browser, if you see any tag with a class of `warning`, surround it with a red box!" HTML authors believe that creating marked-up documents and styling marked-up documents are entirely separate tasks. They see a difference between writing content (the data within the HTML document) and specifying presentation, the rules for displaying the rendered elements.
Browsers combine the content (HTML) and presentation (CSS) layers to display web pages. CSS, or "Cascading Style Sheets," tells us how to write rules that define how browsers will present HTML. Rules in CSS won't look like HTML and they usually live in a file apart from our HTML file. CSS is the language for styling web pages. CSS instructions live apart from the HTML elements and have a different look and feel ("syntax"). CSS directives give web pages their specific look and feel. If you have ever been impressed by how a website can be displayed on a desktop browser while the same content looks great on a mobile device, you have CSS to thank for it!
Web pages can be represented by the objects that comprise their structure and content. This representation is known as the Document Object Model (DOM). The purpose of the DOM is to provide an interface for programs to change the structure, style, and content of web pages. The DOM represents the document as nodes and objects. Amongst other things, this allows programming languages to interactively change the page and HTML!
Beautiful Soup is a Python library designed for quick scraping projects. It allows us to select and navigate the tree-like structure of HTML documents, searching for particular tags, attributes or ids. It also allows us to then further traverse the HTML documents through relations like children or siblings. In other words, with Beautiful Soup, we could first select a specific div tag and then search through all of its nested tags. With the help of Beautiful Soup, we can:
- Navigate HTML documents using Beautiful Soup's children and sibling relations
- Select specific elements from HTML using Beautiful Soup
- Use regular expressions to extract items with a certain pattern within Beautiful Soup
- Determine the pagination scheme of a website and scrape multiple pages
- Identify and scrape images from a web page
- Save images from the web as well as display them in a Pandas DataFrame for easy perusal
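A small, self-contained Beautiful Soup sketch covering a few of those operations (the HTML string is made up for illustration):

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="jobs">
  <div class="posting"><h2>Data Engineer</h2><p>AWS, Spark</p></div>
  <div class="posting"><h2>ML Engineer</h2><p>Python, TensorFlow</p></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

jobs_div = soup.find('div', id='jobs')               # select a specific div
for posting in jobs_div.find_all('div', class_='posting'):
    title = posting.h2.get_text()                    # navigate into its children
    skills = posting.find('p').get_text()
    print(title, '->', skills)

# Regular expressions can be used to match tag names (or attribute values).
print([tag.name for tag in soup.find_all(re.compile(r'^h[1-6]$'))])
```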
APIs (short for Application Programming Interfaces) are an important aspect of the modern internet. APIs are what allow everything on the internet to play nicely together and work in concert. An API is a communication protocol between two software systems: it describes the mechanism through which one system requests some information in a predefined format and a remote system responds with an outcome that is sent back to the first system.
APIs are a way of allowing two applications to interact with each other, which is an incredibly common task in modern web-based programs. For instance, if you've ever connected your Facebook profile to another service such as Spotify or Instagram, that was done through APIs. Under the hood, the actual request and response are carried out as an HTTP request and response. The following diagram shows the HTTP Request/Response Cycle, and a minimal request sketch in Python follows:
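A generic request/response sketch with the `requests` library; the URL and parameters are placeholders rather than a real API used by this project.

```python
import requests

response = requests.get('https://api.example.com/jobs',          # hypothetical endpoint
                        params={'q': 'data engineer', 'page': 1},
                        timeout=10)

print(response.status_code)   # e.g. 200 on success
if response.ok:
    data = response.json()    # parse the JSON body of the response
    print(data)
```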
A Web application (Web app) is an application program that is stored on a remote server and delivered over the Internet through a browser interface. Web services are Web apps by definition and many, although not all, websites contain Web apps. Any website component that performs some function for the user qualifies as a Web app. Google’s search engine is a web app, yet its root concept is hardly different from a phone directory that enables you to search for names or numbers.
Most web apps actually use a browser interface for interaction, i.e. end users request access and request information/service from these applications through a modern web browser interface. There are hundreds of ways to build and configure a Web application but most of them follow the same basic structure: a web client, a web server, and a database.
The client is what the end user interacts with, and "client-side" code is responsible for most of what a user actually sees. When a web page is requested, the client side may be responsible for:
- Defining the structure of the Web page
- Setting the look and feel of the Web page
- Implementing a mechanism for responding to user interactions (clicking buttons, entering text, etc.)
Most of these tasks are handled with technologies like HTML, CSS, and JavaScript, which structure the information, style the page, and provide interactive objects for navigation and focus.
A web server in a Web application listens for requests coming in from clients. When you set up an HTTP (HyperText Transfer Protocol, the language of the internet) server, you set it up to listen on a port number. A port number is always associated with the IP address of a computer. You can think of ports as separate channels on a computer that can be used for different tasks: one port could be serving your browsing of www.facebook.com while another fetches your email. This is possible because each application (the Web browser and the email client) uses a different port number.
Once you've set up an HTTP server to listen on a specific port, the server waits for client requests arriving on that port. After authenticating the client, the server performs any actions stated by the request and sends any requested data via an HTTP response. A bare-bones example of a server listening on a port follows.
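A minimal illustration of "listen on a port, answer requests" using only Python's standard library (the port number is an arbitrary example):

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

PORT = 8000
server = HTTPServer(('localhost', PORT), SimpleHTTPRequestHandler)
print(f'Serving the current directory at http://localhost:{PORT} ...')
server.serve_forever()  # blocks, handling incoming HTTP requests one by one
```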
Databases are the foundations of Web architecture. An SQL/NoSQL or a similar type of database is a place to store information so that it can easily be accessed, managed, and updated. If you're building a social media site, for example, you might use a database to store information about your users, posts, comments, etc. When a visitor requests a page, the data inserted into the page comes from the site's database, allowing real-time user interactions with sites like Facebook or apps like Gmail.
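A toy illustration of that storage layer with the standard-library sqlite3 module; the schema and file name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect('site.db')
conn.execute('CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, user TEXT, body TEXT)')
conn.execute('INSERT INTO posts (user, body) VALUES (?, ?)', ('alice', 'Hello, world'))
conn.commit()

# When a page is requested, the server pulls rows like these and renders them into HTML.
for row in conn.execute('SELECT user, body FROM posts'):
    print(row)
conn.close()
```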
In the example image above, we can see this setup in action. A browser sends a request to a web server by calling its domain, e.g. www.google.com. Based on who the requester is, the server collects the necessary information from an SQL database. This information is wrapped in HTML code and sent back to the client. The web browser reads the structuring and styling information embedded within the HTML and displays the page to the user accordingly.
When developing a Web application, the request/response cycle is a useful guide to see how all the components of the app fit together. The request/response cycle traces how a user's request flows through the app. Understanding the request/response cycle is helpful to figure out which files to edit when developing an app (and where to look when things aren't working).
Prior to running `Job-Posting-Big-Data`, please go through the following list of recommended software and install whatever is needed:
- Linux: follow me
- macOS: follow me
- Windows: follow me
- Common Crawl index search toolkit (cdx_toolkit) here: `python -m pip install cdx_toolkit`
- CDX Toolkit in the Command Line
- Web Archive content (warcio) here: `pip install warcio`
- CDX Toolkit in Python
- CDX index API client here
- CDX server API here
- pywb here
- Getting the available collections
- Simple CDX Query
- Adding filters and options
- Handling zero results
- Dealing with Pagination
- Retrieving Content
- html2text: `python -m pip install html2text`
- mistletoe here: `python -m pip install mistletoe`
- comcrawl here: `python -m pip install comcrawl`
- Word2Vec white paper / gensim: `pip install -U gensim`
- TensorFlow Keras API: `pip install tensorflow` and `pip install keras`
To use Job-Posting-Big-Data, follow these steps:
<usage_example>
To contribute to Job-Posting-Big-Data, follow these steps:
- Fork this repository.
- Create a branch: `git checkout -b <branch_name>`
- Make your changes and commit them: `git commit -m '<commit_message>'`
- Push to the original branch: `git push origin Job-Posting-Big-Data/<location>`
- Create the pull request.
Alternatively see the GitHub documentation on creating a pull request.
If you want to contact me you can reach me at [email protected].