Josep-at-work/Scraping-University-Researchers

Scraping University Researchers

As a project for my Master's in Big Data Analysis, I carried out my first solo web scraping. The task consisted of extracting specific information about the members of the different research groups at the Universitat de les Illes Balears (UIB).

Aim:

  • Identification and extraction of the researchers' information:
    • Name
    • Gender
    • Researcher Level
    • University Relationship (Role)
    • Title
    • CV
    • Research Group

Part 1

The researchers' information is available here. The All category displays all the research groups in an easier format, so the scraping will start at:

Then the first section consists of two parts:

  1. Getting into a specific research group's web page and finding the list of members.
  2. Identifying when there are no more pages listing the departments, in order to stop the scraping.
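
The two steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes the listing is paginated by page number, that links to research groups can be recognised by a substring in their href (the marker "/group/" is hypothetical), and that a page without such links marks the end of the listing. The fetch_page callable is also an assumption, standing in for whatever HTTP request the scraper makes.

```python
from html.parser import HTMLParser
from typing import Callable, Iterator, List


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags whose href contains a marker substring."""

    def __init__(self, marker: str) -> None:
        super().__init__()
        self.marker = marker
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.links.append(href)


def extract_group_links(html: str, marker: str = "/group/") -> List[str]:
    """Return the research-group links found in one listing page.

    The default marker is a hypothetical URL fragment, not the real UIB one.
    """
    parser = LinkCollector(marker)
    parser.feed(html)
    return parser.links


def iter_group_links(
    fetch_page: Callable[[int], str], max_pages: int = 1000
) -> Iterator[str]:
    """Walk the listing page by page and stop at the first page with no groups."""
    for page in range(1, max_pages + 1):
        links = extract_group_links(fetch_page(page))
        if not links:
            break  # no more research groups listed: stop the scraping
        yield from links
```

Passing the page fetcher in as a parameter keeps the stop condition testable without touching the network; max_pages is only a safety cap in case the site never returns an empty page.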

Part 2

Inside the research team's main page, the researchers are divided into 3 levels:

  • Main Researcher. Usually just one high-level researcher.
  • Members. Many mid-to-high-level researchers.
  • Collaborators. Many researchers with distinct professional levels.

The second part consists of identifying the following information from each member:

  • Name
  • Gender
  • Category of researcher in the team
  • Role at the university (University Relationship)
  • Title
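
One way to attach the team level to each member is to remember the most recent level heading while walking the page. The sketch below assumes a hypothetical page layout in which each level is introduced by an h3 heading and each researcher is an li item; the real UIB markup may differ, and the Member class and its field names are illustrative only.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class Member:
    raw_text: str  # text of the list item, e.g. title plus full name
    level: str     # heading under which the member was listed


class MemberParser(HTMLParser):
    """Pairs every <li> with the text of the closest preceding <h3>.

    The h3/li structure is an assumption made for this sketch.
    """

    def __init__(self) -> None:
        super().__init__()
        self.members: List[Member] = []
        self._current_level: Optional[str] = None
        self._open_tag: Optional[str] = None
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "li"):
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return
        # Collapse runs of whitespace left over from the HTML layout.
        text = " ".join("".join(self._buffer).split())
        if tag == "h3":
            self._current_level = text
        elif text and self._current_level is not None:
            self.members.append(Member(raw_text=text, level=self._current_level))
        self._open_tag = None
```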

Part 3

The last piece of scraped data is a summary of the researcher's curriculum vitae. Some members don't have a personal university web page, so no CV can be extracted in those cases. Moreover, not all researchers have their CV in every language, so depending on the language some CVs will be missing too.
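
Both missing-data cases described above can be handled by returning None instead of failing. The following is a minimal sketch under stated assumptions: get_cv_summary and its fetch_cv parameter are hypothetical names, and fetch_cv stands in for whatever function downloads and extracts the CV text from a personal page.

```python
from typing import Callable, Optional
from urllib.error import HTTPError, URLError


def get_cv_summary(
    profile_url: Optional[str], fetch_cv: Callable[[str], str]
) -> Optional[str]:
    """Return the CV summary, or None when it cannot be obtained.

    None covers three cases: the member has no personal page, the page
    cannot be fetched, or the page has no CV text in the chosen language.
    """
    if not profile_url:
        return None  # no personal university web page
    try:
        text = fetch_cv(profile_url)
    except (HTTPError, URLError):
        return None  # page missing or unreachable
    text = " ".join(text.split())  # normalise whitespace
    return text or None  # empty CV in this language
```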

Assembling

At the end of the notebook, the functions and procedures defined in the previous sections are merged together to complete the process of scraping all the researchers' data.
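
As a rough picture of that final step, the sketch below combines the per-group and per-member pieces into one list of records. It is not the notebook's code: scrape_all, get_members, get_cv and the key names in COLUMNS are hypothetical, chosen to mirror the fields listed under Aim.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical record keys mirroring the fields listed under Aim.
COLUMNS = ["name", "gender", "research_level", "role", "title", "cv", "research_group"]

Record = Dict[str, Optional[str]]


def scrape_all(
    groups: Iterable[str],
    get_members: Callable[[str], Iterable[Record]],
    get_cv: Callable[[Record], Optional[str]],
) -> List[Record]:
    """Build one record per researcher across all research groups."""
    rows: List[Record] = []
    for group in groups:
        for member in get_members(group):
            # Start from the member's own fields; absent ones become None.
            row: Record = {col: member.get(col) for col in COLUMNS}
            row["cv"] = get_cv(member)
            row["research_group"] = group
            rows.append(row)
    return rows
```

A list of dictionaries like this can be handed to pandas.DataFrame(rows, columns=COLUMNS) to obtain a table with one researcher per row.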

The scraping language can be changed by replacing the initial URL with the one for the desired language. This project used the Catalan version of the web page in order to identify the researchers' gender; the Spanish version would allow this too.

For readers who don't speak Catalan or Spanish, note that the female form of a personal title is the male form with an a added at the end, e.g. Dr./Dra., Sr./Sra. Hence the gender of the researcher can be easily identified.
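
The title rule can be written as a small lookup. This sketch covers only the four titles given as examples above; the function name infer_gender is hypothetical, "M" matches the value shown in the Result section, and "F" as the female code is an assumption.

```python
from typing import Optional

# Only the titles named in the text; "F" as the female code is an assumption.
TITLE_GENDER = {"Dr.": "M", "Dra.": "F", "Sr.": "M", "Sra.": "F"}


def infer_gender(title: str) -> Optional[str]:
    """Map a Catalan or Spanish personal title to a gender code.

    Returns None for titles outside the known examples rather than guessing.
    """
    return TITLE_GENDER.get(title.strip())
```

An explicit table is used instead of a generic "ends with a" check so that unknown titles are left unlabeled rather than silently classified.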

Result

One researcher in pandas DataFrame format:

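|      | name                    | gender | reasearch level | role                | title | cv | research group                                            |
|------|-------------------------|--------|-----------------|---------------------|-------|----|-----------------------------------------------------------|
| 2645 | Víctor Fernández Juárez | M      | Col·laboradors  | Tècnic especialista | Sr.   |    | Unitat de Gràfics i Visió per Ordinador i IA (UGiVpOeIA) |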

Disclaimer:

All the data has been scraped from the UIB's R&D&I web page, which is public data, under the right of access to public information.
