Scraping University Researchers

As a project for my Master's in Big Data Analysis, I carried out my first solo web scraping. The task consisted of extracting specific information about the members of the different research groups at the Universitat de les Illes Balears (UIB).

Aim:

Identification and extraction of the researchers' information:
- Name
- Gender
- Researcher Level
- University Relationship (Role)
- Title
- CV
- Research Group

Part 1

The researchers' information is available here. The All category displays all the research groups in an easier format, so the scraping starts at one of:

url_en: English version.
url_cat: Catalan version.
url_sp: Spanish version.

The first section then consists of two parts:

Getting into a specific research group's web page and finding its list of members.
Identifying when there are no more pages listing the departments, in order to stop the scraping.
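
The stopping condition can be sketched as follows. This is a minimal illustration and not the notebook's actual code: the function name iter_group_links, the fetch_links callback and the fake_site data are all hypothetical, and it assumes that a page past the last one simply yields no links.

```python
from typing import Callable, Iterator, List


def iter_group_links(
    fetch_links: Callable[[int], List[str]], max_pages: int = 1000
) -> Iterator[str]:
    """Yield links page by page until a page comes back empty.

    fetch_links is a hypothetical callback that downloads listing page
    number `page` and returns the links found on it.
    """
    for page in range(1, max_pages + 1):
        links = fetch_links(page)
        if not links:  # assumption: an empty page means we ran past the end
            break
        yield from links


# Stand-in for the real download-and-parse step (made-up data).
fake_site = {1: ["/group/a", "/group/b"], 2: ["/group/c"]}
group_links = list(iter_group_links(lambda page: fake_site.get(page, [])))
print(group_links)
```

The max_pages cap is only a safety net so that an unexpected page layout cannot cause an endless loop.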

Part 2

Inside the research team's main page, the researchers are divided into 3 levels:

Main Researcher. Usually just one high-level researcher.
Members. Many mid-to-high-level researchers.
Collaborators. Many researchers with distinct professional levels.

The second part consists of identifying the following information from each member:

Name
Gender
Category of researcher in the team
Role at the university (University Relationship)
Title
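
As a rough sketch of this step, the snippet below groups members under their level heading using only the standard library. The markup is an assumption made for illustration (levels in h3 tags, members in li tags, the title as the first word of each entry); the real UIB pages may be structured differently, and the names are placeholders.

```python
from html.parser import HTMLParser


class MemberListParser(HTMLParser):
    """Collect members and the level heading they appear under."""

    def __init__(self):
        super().__init__()
        self.level = None      # most recent level heading seen
        self.members = []      # one dict per researcher
        self._tag = None       # tag whose text we are currently reading

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "li"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        if self._tag == "h3":
            self.level = text
        else:
            # Assumption: the entry reads "<title> <full name>".
            title, _, name = text.partition(" ")
            self.members.append(
                {"name": name, "title": title, "research level": self.level}
            )


# Hypothetical markup with placeholder names.
sample_html = """
<h3>Members</h3>
<ul><li>Dra. Jane Doe</li></ul>
<h3>Col·laboradors</h3>
<ul><li>Sr. John Doe</li></ul>
"""
parser = MemberListParser()
parser.feed(sample_html)
print(parser.members)
```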

Part 3

The last piece of scraped data is a summary of the researcher's curriculum vitae. Some members don't have a personal university web page, so no CV is extracted in those cases. Moreover, not all researchers have their CV in all languages, so depending on the language some CVs will be missing too.
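
One way to handle the missing cases is to fall back to an empty string, as sketched below. The names extract_cv and fetch_cv are hypothetical, and the sketch assumes the fetching step returns None when the page has no CV in the chosen language.

```python
from typing import Callable, Optional


def extract_cv(
    profile_url: Optional[str], fetch_cv: Callable[[str], Optional[str]]
) -> str:
    """Return the CV summary, or "" when there is nothing to extract.

    fetch_cv is a hypothetical callback that downloads a personal page
    and returns its CV text, or None if that language has no CV.
    """
    if not profile_url:  # the member has no personal web page
        return ""
    cv = fetch_cv(profile_url)
    return cv.strip() if cv else ""


# Stand-in for the real pages (made-up data).
fake_pages = {"/person/1": "  Research on computer vision.  ", "/person/2": None}
cv_with_page = extract_cv("/person/1", fake_pages.get)
cv_no_translation = extract_cv("/person/2", fake_pages.get)
cv_no_page = extract_cv(None, fake_pages.get)
```

Using an empty string rather than a missing value keeps the cv column uniform across all researchers.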

Assembling

At the end of the notebook, the functions and procedures defined in the previous sections are merged together to complete the process of scraping all the researchers' data.

The scraping language can be changed by replacing the initial URL with the one for the desired language. This project uses the Catalan version of the web page in order to identify the researchers' gender; the Spanish version would allow this too.

For readers who do not speak Catalan or Spanish, note that the female form of the personal title is the male form with an a added at the end, e.g. Dr./Dra., Sr./Sra. Hence the researcher's gender can be easily identified.
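
The title rule can be expressed as a small lookup. This is a sketch rather than the notebook's own function: infer_gender is a hypothetical name, only the four titles mentioned above are covered, and the "F" label is assumed by analogy with the "M" shown in the result.

```python
from typing import Optional

# Only the titles mentioned in this README; others map to None.
FEMALE_TITLES = {"Dra.", "Sra."}
MALE_TITLES = {"Dr.", "Sr."}


def infer_gender(title: str) -> Optional[str]:
    """Map a Catalan or Spanish personal title to "F" or "M"."""
    title = title.strip()
    if title in FEMALE_TITLES:
        return "F"
    if title in MALE_TITLES:
        return "M"
    return None  # unknown or missing title: leave the gender undetermined


print(infer_gender("Sr."), infer_gender("Dra."))
```

An explicit set of known titles is safer than checking whether the title ends in an a, since an unexpected title would otherwise be silently mislabelled.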

Result

One researcher in pandas DataFrame format:

	name	gender	reasearch level	role	title	cv	research group
2645	Víctor Fernández Juárez	M	Col·laboradors	Tècnic especialista	Sr.		Unitat de Gràfics i Visió per Ordinador i IA (UGiVpOeIA)

Disclaimer:

All the data has been scraped from the UIB's R&D&I web page, which is public data, under the right of access to public information.
