Author: Cong Zhu
The PubMed crawler is a multi-threaded web crawler designed to extract publication records from PubMed using user-defined keywords and a time frame, and to perform post hoc data management and analysis.
The program consists of three modules:
- PubMed scraping module: pub_retrieve_thread.
- SQL database generation module: sql_dump.
- Visualization module: visualization.
Please save the following .py files in the same folder: `pub_retrieve_thread.py`, `sql_dump.py`, `visualization.py`.
The modules use the following imports (bs4, pandas, numpy, and requests are third-party packages and must be installed):

```python
from urllib.request import urlopen, urlretrieve
import time
from bs4 import BeautifulSoup
import os
import pandas as pd
import numpy as np
import csv
import requests
import math
import datetime
```
The pub_retrieve_thread module collects each paper's title, authors, publication date, and abstract from PubMed (https://pubmed.ncbi.nlm.nih.gov/) according to user-defined keyword(s) and a time frame (YYYY/MM/DD-YYYY/MM/DD).
Search results are returned as two csv files that store abstract information (PMID, paper title, publication date, abstract) and author names (PMID and author full names), respectively. By default, the files are saved in the working directory of the module.
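For context, the module's retrieval step follows the standard requests + BeautifulSoup pattern, with result pages fetched in parallel threads. The sketch below illustrates that pattern only; it is not the module's actual code, and the URL parameters and CSS selector are assumptions about PubMed's current page layout:

```python
# Minimal sketch of the scraping pattern pub_retrieve_thread builds on.
# NOTE: illustration only; query parameters and the CSS selector are assumptions.
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def fetch_titles(keyword, page=1):
    """Fetch one page of PubMed search results and return the paper titles."""
    resp = requests.get(
        "https://pubmed.ncbi.nlm.nih.gov/",
        params={"term": keyword, "page": page},
        timeout=30,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Result titles appear as <a class="docsum-title"> links on the search page.
    return [a.get_text(strip=True) for a in soup.select("a.docsum-title")]

# Pages are independent, so they can be fetched concurrently with a thread pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(lambda p: fetch_titles("precision radiotherapy", p),
                          range(1, 5)))
for titles in pages:
    print(titles[:2])
```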
First, import the module:
```python
import pub_retrieve_thread as pr
```
Create a web crawler object and call the function:

```python
abstract_tab, author_tab = pr.pubmed_record().pub_tab_all_main()
```
The above call returns two pandas dataframes, abstract_tab and author_tab, which store abstract records and author names respectively. The console will prompt the user to enter keyword(s) and a time frame. Press Enter after finishing each entry; otherwise the kernel will keep waiting for input. Once finished, output like the following is displayed:
```text
Please enter a keyword: precision radiotherapy
Number of publications: 651
Number of pages: 66
651 papers were downloaded with 53.54137420654297 seconds
```
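Since both return values are ordinary pandas dataframes, they can be inspected or combined directly; for example, a left join of author names onto abstracts via the shared PMID key (the `PMID` column name is assumed from the description above):

```python
# Hypothetical post-processing example; the "PMID" column name is assumed.
merged = abstract_tab.merge(author_tab, on="PMID", how="left")
print(merged.head())
```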
The sql_dump module loads the saved csv files that contain search results and converts them into SQL databases automatically. It allows the user to extract full publication records by entering author name(s) only, without having to write SQL scripts. Extracted records are saved in csv format.
Use the module:
```python
import sql_dump as sdp
```
Load the .csv files and convert them into SQL databases:

```python
sdp.sql_dump('author_tab.csv', 'abstract_tab.csv')
```
The user will see the following messages once the conversion finishes:
```text
author_tab.db is created
abstract_tab.db is created
```
The above SQL databases are saved in the same directory as the input .csv files.
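For reference, this kind of csv-to-SQLite conversion typically boils down to pandas' `to_sql` over a `sqlite3` connection. The sketch below shows the general pattern, not sql_dump's actual implementation; the table names are assumptions:

```python
# Sketch of the csv -> SQLite conversion pattern; table names are assumed.
import sqlite3
import pandas as pd

def csv_to_db(csv_path, db_path, table_name):
    """Load a csv file and dump it into a table of a SQLite database."""
    df = pd.read_csv(csv_path)
    conn = sqlite3.connect(db_path)  # creates the .db file if absent
    df.to_sql(table_name, conn, if_exists="replace", index=False)
    conn.close()
    print(f"{db_path} is created")

csv_to_db("author_tab.csv", "author_tab.db", "author_tab")
csv_to_db("abstract_tab.csv", "abstract_tab.db", "abstract_tab")
```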
Extract publication records from the SQL databases by author names:

```python
pick_authors = sdp.pick_authors('Sarah Hazell','Katsuyuki Kiura','Mark Kidd','Lisa Bodei').pub_rec("Extract_records")
```
- `pick_authors`: retrieves records by author names; separate entries with commas.
- `pub_rec`: saves the extracted records as a .csv file, with the file name defined by the user.
The search results are saved in `Extract_records.csv`.
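For users who prefer raw SQL, the extraction above is conceptually a join of the two tables on PMID filtered by author name. A sketch of that equivalent query follows; the table and column names are assumptions, not sql_dump's actual schema:

```python
# Hypothetical equivalent of pick_authors(...).pub_rec(...) in raw SQL.
import sqlite3
import pandas as pd

names = ("Sarah Hazell", "Katsuyuki Kiura", "Mark Kidd", "Lisa Bodei")
conn = sqlite3.connect("abstract_tab.db")
conn.execute("ATTACH DATABASE 'author_tab.db' AS authors")

# Assumed schema: abstract_tab(PMID, title, date, abstract) and
# authors.author_tab(PMID, author). Filter with a parameterized IN clause.
placeholders = ",".join("?" * len(names))
query = f"""
    SELECT ab.*, au.author
    FROM abstract_tab AS ab
    JOIN authors.author_tab AS au ON ab.PMID = au.PMID
    WHERE au.author IN ({placeholders})
"""
records = pd.read_sql_query(query, conn, params=names)
records.to_csv("Extract_records.csv", index=False)
conn.close()
```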