odarroyo/ScopusLit

ScopusLit: An Interactive Bibliometric Analysis Tool for Scopus

ScopusLit is an open-source, single-file web application built with Python, Streamlit, and pybliometrics that enables researchers to conduct comprehensive bibliometric analyses of scientific literature indexed in the Scopus database. The tool provides an end-to-end workflow: from executing Scopus Advanced Search queries and persisting results locally, through generating over 20 types of bibliometric visualizations (including VOSviewer-style co-occurrence networks), to performing filter-aware natural language processing on abstracts and exporting publication-ready figures and datasets.

Status of this document: This README describes the final monolithic version of app.py before the planned modular refactor. The application is still intentionally delivered as a single Python file for portability, but its internal behavior now includes reviewer-facing methods notes, author-ID-aware analyses, filter-aware text analysis, safer exports, and richer topic-model interpretation outputs.


Table of Contents

  1. Motivation and Significance
  2. Comparison with Related Software
  3. Software Architecture
  4. Features Overview
  5. Installation
  6. Configuration
  7. Usage Guide
  8. Illustrative Example
  9. Functional Modules in Detail
  10. Data Model and Persistence
  11. Visualization Specifications
  12. Text Analysis Pipeline
  13. Search Consolidation
  14. Export Capabilities
  15. Error Handling and API Rate Limiting
  16. Dependencies and Technology Stack
  17. Software Metadata
  18. Impact
  19. Limitations and Future Work
  20. How to Cite
  21. License

1. Motivation and Significance

Bibliometric analysis is a cornerstone of systematic literature reviews, research trend identification, and science mapping. Researchers frequently rely on the Scopus database, one of the largest curated abstract and citation databases of peer-reviewed literature, encompassing over 27,000 titles from more than 7,000 publishers. However, conducting rigorous bibliometric studies typically requires combining multiple disconnected tools: Scopus's web interface for search, spreadsheet software for data cleaning, specialized bibliometric software (e.g., VOSviewer, Bibliometrix) for analysis, and graphic design tools for producing publication-quality figures.

ScopusLit addresses this fragmentation by providing a single, unified, browser-based application that handles the entire bibliometric workflow within one interface. The tool is designed for:

  • Researchers conducting systematic literature reviews who need rapid quantitative characterization of a body of literature.
  • Graduate students learning bibliometric methods who benefit from an interactive, visual environment.
  • Research groups that need to compare multiple search strategies or research topics side by side.
  • Authors preparing review articles who require publication-ready charts exported at 300 DPI.

Unlike desktop bibliometric tools that require manual data import/export steps, ScopusLit communicates directly with the Scopus API, enabling a seamless flow from query formulation to final analysis. Unlike purely programmatic approaches (e.g., writing custom Python scripts), ScopusLit provides an interactive graphical interface that requires no programming knowledge from the end user.


2. Comparison with Related Software

The following table compares ScopusLit with established bibliometric tools across key dimensions relevant to researchers:

| Feature | ScopusLit | VOSviewer | Bibliometrix (R) | Publish or Perish | CiteSpace |
| --- | --- | --- | --- | --- | --- |
| Interface | Web browser (Streamlit) | Desktop GUI (Java) | RStudio / Shiny web app | Desktop GUI | Desktop GUI (Java) |
| Data source integration | Direct Scopus API | Manual file import | Manual file import (multiple formats) | Google Scholar, Scopus, WoS | Manual file import |
| Search execution | Built-in (Scopus Advanced Search) | External | External | Built-in (multiple sources) | External |
| Result count estimation | Yes (before downloading) | No | No | Yes | No |
| Abstract retrieval | Built-in (per-document API) | No | No | No | No |
| Data persistence | Automatic JSON storage | Manual save/load | R workspace | CSV export | Project files |
| Publications per year | Bar + cumulative line | Limited | Yes | Yes | Yes |
| Journal analysis | Top N horizontal bar | Limited | Yes | Yes | Limited |
| Country/affiliation analysis | Bar + choropleth map | Network map | Yes (world map) | No | Limited |
| Author analysis | Top N bar + h-index fetch | Network map | Yes | Yes | Yes |
| Co-authorship analysis | Frequency tables + network graph | Network visualization | Network visualization | No | Network visualization |
| Keyword analysis | Bar chart + word cloud | Network map + overlay | Yes + word cloud | No | Yes |
| Citation analysis | Histogram + top cited + trends | Citation network | Yes | Yes + citation metrics | Citation burst detection |
| TF-IDF on abstracts | Yes (unigrams + bigrams) | No | No | No | No |
| Topic modeling (NMF/LDA) | Yes (3-10 topics, heatmap, topic summary, document-topic export) | No | No | No | No |
| Search consolidation | Union + comparison modes | No | Multiple file merge | No | No |
| Venn diagram (overlap) | Yes (2-3 searches) | No | No | No | No |
| Publication-ready export | 300 DPI PNG per chart | PNG/SVG | Multiple formats | No | PNG |
| Excel export | Multi-sheet (Results, Authors with IDs, Keywords, Summary) | CSV | Multiple formats | CSV | No |
| Co-citation / coupling | Not yet | Yes | Yes | No | Yes |
| Network visualization | Keyword, author, country co-occurrence (NetworkX + Plotly) | Primary strength | Yes (igraph) | No | Primary strength |
| Programming required | None | None | R programming | None | None |
| Cost | Free (requires Scopus API key) | Free | Free (R required) | Free | Free |
| Language | Python | Java | R | C# | Java |

Key differentiators of ScopusLit:

  1. Integrated API access: Among the tools compared here, ScopusLit is the only one that executes searches, downloads abstracts, and performs analysis within a single interface without manual file transfer.
  2. Text analysis on abstracts: TF-IDF and topic modeling on full abstract corpora are not available in VOSviewer, CiteSpace, or Publish or Perish.
  3. Search consolidation with Venn overlap: None of the other tools in the comparison can both merge multiple searches (union mode) and compare them side by side (comparison mode with Venn diagrams).
  4. Zero-code browser interface: Unlike Bibliometrix (which requires R), ScopusLit runs entirely through a web browser with no programming required.

3. Software Architecture

ScopusLit is implemented as a single Python file (app.py, roughly 3,000 lines of code) organized into 10 clearly delineated sections. This monolithic architecture was a deliberate initial design choice to maximize portability, simplify deployment, and minimize configuration overhead. The file is written in a procedural, function-based style, with grouped utility functions rather than classes. This is the final single-file version before the codebase is split into dedicated modules.

3.1 Section Organization

| Section | Name | Approx. Lines | Description |
| --- | --- | --- | --- |
| 1 | Imports and Configuration | 65 | All imports, pybliometrics.init(), constants, stopwords, Streamlit page config |
| 2 | Persistence Functions | 175 | JSON save/load for searches and consolidations |
| 3 | Scopus Search Functions | 70 | Query estimation, execution, DataFrame conversion |
| 4 | Abstract Download Functions | 55 | Retry wrapper, abstract downloading with incremental save |
| 5 | Bibliometric Analysis Functions | 330 | Pure data computations including author-ID-aware analysis and co-occurrence network analysis |
| 6 | Text Analysis Functions | 150 | TF-IDF, topic modeling, abstract word cloud generation, topic summary tables |
| 7 | Visualization Functions | 390 | Plotly and matplotlib chart creation including network graphs |
| 8 | Export Functions | 110 | CSV, multi-sheet Excel generation, filename sanitization |
| 9 | Streamlit Interface | 1,000 | All UI rendering, methods expanders, session state, page routing |
| 10 | Main Entry Point | 25 | main() function and __main__ guard |

3.2 Layered Design

The application follows a three-layer separation of concerns:

  1. Data Layer (Sections 2-4): Handles all interactions with external systems. This includes Scopus API communication via pybliometrics (Section 3 for searches, Section 4 for abstract retrieval) and local file system operations for JSON persistence (Section 2). All API calls are wrapped in retry logic with exponential backoff to respect Scopus rate limits.

  2. Analysis Layer (Sections 5-6): Contains pure computation functions that take pandas DataFrames or abstract dictionaries as input and return processed data structures (DataFrames, dictionaries, or lists). These functions have no side effects and no dependency on Streamlit, making them independently testable. Section 5 covers bibliometric computations (publication counts, author rankings, citation statistics), while Section 6 covers natural language processing (TF-IDF, topic modeling, word cloud generation).

  3. Presentation Layer (Sections 7-9): Handles all visualization (Section 7 creates Plotly and matplotlib figures), data export (Section 8 generates CSV and Excel byte streams), and user interface rendering (Section 9 manages Streamlit components, session state, and page routing).

3.3 Function Inventory

The application comprises 98 functions distributed across sections:

| Section | Functions | Key Examples |
| --- | --- | --- |
| 2 - Persistence | 10 | save_search, load_all_searches, build_consolidation_dataframe |
| 3 - Search | 3 | estimate_results, execute_search, results_to_dataframe |
| 4 - Abstracts | 2 | _api_call_with_retry, download_abstracts |
| 5 - Analysis | 24 | analyze_publications_per_year, analyze_top_authors, analyze_citations, build_cooccurrence_graph, analyze_document_types, analyze_bradford_zones, analyze_lotka_law, analyze_price_law, analyze_keyword_evolution, build_sankey_data |
| 6 - Text Analysis | 8 | generate_abstract_wordcloud, compute_tfidf_terms, compute_topic_model, build_topic_summary_table, build_doc_topic_table |
| 7 - Visualization | 22 | plot_publications_per_year, plot_horizontal_bar, plot_venn_diagram, plot_network_graph, plot_document_type_pie, plot_bradford_curve, plot_lotka_curve, plot_keyword_evolution, plot_sankey |
| 8 - Export | 5 | sanitize_filename, export_results_excel, export_abstracts_csv |
| 9 - Interface | 23 | render_sidebar, render_analysis_page, render_methods_expander, render_tab_networks, render_consolidation_page, render_filters_panel, render_tab_doc_types, render_tab_sankey |
| 10 - Main | 1 | main |
| Total | 98 | |

3.4 State Management

ScopusLit uses Streamlit's st.session_state mechanism to persist application state across user interactions (Streamlit re-executes the entire script on every widget interaction). The following state variables are managed:

| State Variable | Type | Default | Purpose |
| --- | --- | --- | --- |
| current_page | str | "New Search" | Active navigation page |
| loaded_search | dict or None | None | Full search data currently loaded for analysis |
| loaded_df | pd.DataFrame or None | None | Cached DataFrame derived from loaded search results |
| search_estimate | int or None | None | Most recent result count estimate |
| h_index_data | list[dict] or None | None | Cached h-index data for top authors |
| last_search_run | dict or None | None | Most recently executed search (for display on New Search page) |
| confirm_delete | str or None | None | UUID of search pending deletion confirmation |
| loaded_consolidation | dict or None | None | Full consolidation data currently loaded |
| loaded_consol_df | pd.DataFrame or None | None | Cached DataFrame for loaded consolidation |
| loaded_consol_abstracts | dict or None | None | Merged abstracts for loaded consolidation |
| topic_result | dict or None | None | Cached topic modeling results keyed by abstracts, method, and number of topics |

State lifecycle:

  • Loading a search populates loaded_search, loaded_df, and clears h_index_data.
  • Loading a consolidation populates loaded_consolidation, loaded_consol_df, and loaded_consol_abstracts.
  • Deleting a search or consolidation clears all associated state variables if the deleted item was currently loaded.
  • Topic modeling results persist in topic_result only while the current abstracts, method, and number of topics remain unchanged; changing filters or topic settings invalidates the cached result.
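
The invalidation rule for topic_result can be illustrated with a small sketch. This is not the code in app.py: the helper names topic_cache_key and get_topic_result, the hashing scheme, and the use of a plain dict in place of st.session_state are all assumptions made so the example runs without Streamlit.

```python
import hashlib
import json


def topic_cache_key(abstracts: dict, method: str, n_topics: int) -> str:
    """Build a deterministic key from everything that determines a topic model."""
    payload = json.dumps(
        {"abstracts": abstracts, "method": method, "n_topics": n_topics},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_topic_result(state: dict, abstracts: dict, method: str, n_topics: int, compute):
    """Return the cached result when the key matches; otherwise recompute and cache."""
    key = topic_cache_key(abstracts, method, n_topics)
    cached = state.get("topic_result")
    if cached is not None and cached.get("key") == key:
        return cached["result"]
    result = compute(abstracts, method, n_topics)
    state["topic_result"] = {"key": key, "result": result}
    return result


# A plain dict stands in for st.session_state (assumption for illustration).
state = {"topic_result": None}
calls = []


def fake_compute(abstracts, method, n_topics):
    """Hypothetical stand-in for the real topic model; records each invocation."""
    calls.append((method, n_topics))
    return {"n_docs": len(abstracts), "method": method, "n_topics": n_topics}


abstracts = {
    "2-s2.0-1": "seismic waveform phase picking",
    "2-s2.0-2": "ground motion prediction",
}
get_topic_result(state, abstracts, "NMF", 5, fake_compute)  # first call: computes
get_topic_result(state, abstracts, "NMF", 5, fake_compute)  # same inputs: cache hit
get_topic_result(state, abstracts, "LDA", 5, fake_compute)  # method changed: recomputes
```

Because the key covers the filtered abstracts as well as the settings, narrowing the global filters changes the key and forces a fresh model, which matches the lifecycle described above.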

4. Features Overview

ScopusLit implements 26 features organized into five functional phases:

Phase 1: Search and Base Storage

| # | Feature | Description |
| --- | --- | --- |
| 1 | New Search | Execute Scopus Advanced Search queries with result count estimation before downloading |
| 2 | Download Abstracts | Retrieve full abstracts via AbstractRetrieval API with incremental saving every 25 documents |
| 3 | Saved Searches Manager | List, load, rename, and delete persisted searches with two-step deletion confirmation |

Phase 2: Bibliometric Analysis (15 analysis types across 10 tabs + global filters)

A Global Filters Panel (collapsible expander) is displayed above all analysis tabs, offering: year range slider, document type multiselect, country multiselect, and minimum citation threshold. Filters are applied before all analyses.

| # | Feature | Tab | Key Metrics / Charts |
| --- | --- | --- | --- |
| 4 | Publications per Year | Timeline | Bar chart + cumulative trend line (dual y-axis). Summary: total count, year range, peak year. |
| 4b | Document Type Analysis | Document Types | Pie chart + bar chart of document type distribution (subtypeDescription: Article, Review, Conference Paper, etc.). Metrics: distinct types, dominant type percentage. |
| 5 | Publications per Journal + Bradford's Law | Sources | Top 15 journals horizontal bar chart. Bradford's Law of Scattering: journals divided into 3 productivity zones (core, middle, peripheral), log-scale scatterplot with zone demarcation lines. |
| 6 | Publications per Country/Affiliation | Geography | Top 15 countries + top 15 affiliations, horizontal bar charts. Choropleth world map if >= 5 countries. |
| 7 | Top Authors + Lotka's/Price's Law | Authors | Top 20 authors horizontal bar chart using Scopus author IDs when available. Lotka's Law: log-log scatter of author productivity distribution with power law fit (exponent + R²). Price's Law: checks whether √n elite authors produce ≥ 50% of publications. |
| 8 | Co-authorship Analysis | Co-authorship | Frequent co-author pairs table using author IDs when available + author count distribution bar chart. |
| 9 | h-index of Top Authors | Authors | On-demand "Fetch h-index" button for top 10 authors' h-indices via AuthorRetrieval API. Results cached in session. |
| 10 | Keyword Analysis + Evolution | Keywords | Two subtabs: Keyword Frequency (top 30 bar chart + word cloud) and Keyword Evolution (heatmap of top keywords over publication years with configurable N). |
| 11 | Citation Analysis | Citations | Summary metrics (total, mean, median, max). Citation distribution histogram. Top 20 most cited documents table. Average citations per year line chart. |
| 11b | Co-occurrence Networks | Networks | VOSviewer-style network analysis with 3 subtabs: Keyword co-occurrence (author keywords that co-appear in publications), Author co-authorship (collaboration network), Country collaboration (international co-publishing). Duplicate values inside each publication are deduplicated before edge counting. Each subtab features: min co-occurrence slider (1-20), max nodes slider (10-100), layout algorithm selector (Spring/Kamada-Kawai/Circular), community detection with colored clusters, edge width tiers by weight, node size by occurrence count, and network metrics (nodes, edges, communities, density). Built with NetworkX + Plotly. |
| 11c | Three-Field Plot (Sankey) | Three-Field Plot | Interactive Sankey diagram connecting any 3 user-selected fields (Authors, Keywords, Journals, Countries, Affiliations, Document Types) with configurable top-N per field (3-20). Built with Plotly go.Sankey. |
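
The edge-counting step behind the co-occurrence networks (deduplicate values within each publication, count pairs, then apply the minimum co-occurrence threshold) can be sketched with the standard library alone. The function name count_cooccurrences is hypothetical, and lowercasing values before comparison is an assumption. The application itself builds a NetworkX graph, which this sketch does not reproduce.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(records, min_weight=1):
    """Count node occurrences and pairwise co-occurrences across publications.

    records: one list of values (keywords, authors, or countries) per publication.
    min_weight: edges seen fewer times than this are dropped.
    """
    node_counts = Counter()
    edge_counts = Counter()
    for items in records:
        # Deduplicate within the publication so repeated values count once.
        unique = sorted({item.strip().lower() for item in items if item.strip()})
        node_counts.update(unique)
        # Sorting first makes each pair a canonical (a, b) tuple with a < b.
        edge_counts.update(combinations(unique, 2))
    edges = {pair: w for pair, w in edge_counts.items() if w >= min_weight}
    return node_counts, edges


keyword_lists = [
    ["Deep Learning", "earthquake", "deep learning"],  # duplicate inside one record
    ["deep learning", "earthquake", "seismic hazard"],
    ["earthquake", "seismic hazard"],
]
nodes, edges = count_cooccurrences(keyword_lists, min_weight=2)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

In this toy corpus the pair ("deep learning", "seismic hazard") appears only once, so a threshold of 2 removes it. This is the effect of the min co-occurrence slider.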

Phase 3: Abstract Text Analysis (3 analyses in nested sub-tabs, visible only when abstracts are downloaded)

| # | Feature | Sub-tab | Description |
| --- | --- | --- | --- |
| 12 | Abstract Word Cloud | Abstract Word Cloud | Word cloud from concatenated abstract texts, excluding English, Spanish, and academic filler stopwords. |
| 13 | TF-IDF Term Frequency | TF-IDF Terms | Top 30 terms by average TF-IDF score with unigram + bigram support. Horizontal bar chart. |
| 14 | Topic Modeling | Topic Modeling | NMF or LDA (user selects), configurable 3-10 topics via slider. Results include a topic summary table, topic-word weight heatmap, and document-topic assignment table with CSV exports. |

Phase 4: Search Consolidation

| # | Feature | Description |
| --- | --- | --- |
| 15 | Create Consolidation | Select 2+ searches via checkboxes, choose union or comparison mode, name and save. |
| 16 | Union Mode Analysis | Deduplicate by EID, apply full Phase 2 + Phase 3 analysis pipeline to merged corpus. Includes consolidated CSV/Excel export. |
| 17 | Comparison Mode Analysis | Three comparison tabs: overlaid publication timeline, grouped keyword bar chart, Venn diagram with overlap statistics table (supports 2-3 searches). |
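
Union mode's deduplication by EID can be pictured with a minimal sketch. The function name union_by_eid and the record layout (a "results" list of dicts with an "eid" key) are assumptions for illustration. The application performs the merge on pandas DataFrames, and which duplicate it keeps is not stated, so "first occurrence wins" is also an assumption.

```python
def union_by_eid(searches):
    """Merge the result lists of several searches, keeping one record per EID."""
    merged = {}
    for search in searches:
        for doc in search["results"]:
            # setdefault keeps the first record seen for each EID.
            merged.setdefault(doc["eid"], doc)
    return list(merged.values())


search_a = {
    "name": "Seismic ML",
    "results": [
        {"eid": "2-s2.0-1", "title": "Phase picking with CNNs", "source": "A"},
        {"eid": "2-s2.0-2", "title": "Ground motion models", "source": "A"},
    ],
}
search_b = {
    "name": "Volcanic ML",
    "results": [
        {"eid": "2-s2.0-2", "title": "Ground motion models", "source": "B"},
        {"eid": "2-s2.0-3", "title": "Eruption forecasting", "source": "B"},
    ],
}

union = union_by_eid([search_a, search_b])
print(len(union), [doc["eid"] for doc in union])
```

Relying on dict insertion order keeps the merged corpus in the order documents were first encountered, which makes the result reproducible for a fixed list of searches.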

Phase 5: Export

| # | Feature | Formats | Contents |
| --- | --- | --- | --- |
| 18 | Export Results | CSV; Excel (4 sheets: Results, Authors, Keywords, Summary) | All search fields + aggregate statistics; author export includes Scopus IDs when available |
| 19 | Export Abstracts | CSV; Excel | EID, DOI, Title, Authors, Year, Journal, Abstract, Keywords |

Additionally, every Plotly chart includes an "Export as PNG (300 DPI)" download button when static rendering is available, and every matplotlib figure (word clouds, Venn diagrams) includes an equivalent 300 DPI PNG download button.


5. Installation

5.1 Prerequisites

  • Python 3.10 or higher (tested with Python 3.14)
  • Scopus API access: Requires a valid Scopus API key. Institutional subscribers can also use an InstToken for off-campus access. API keys can be obtained from the Elsevier Developer Portal.

5.2 Setting Up the Environment

# Create and activate a virtual environment
python -m venv scopuslit_env
source scopuslit_env/bin/activate      # macOS / Linux
# scopuslit_env\Scripts\activate       # Windows

# Install all dependencies
pip install streamlit pybliometrics plotly pandas numpy wordcloud \
    matplotlib scikit-learn openpyxl kaleido matplotlib-venn networkx

5.3 Dependency List with Tested Versions

| Package | Tested Version | Purpose |
| --- | --- | --- |
| streamlit | 1.55.0 | Web application framework |
| pybliometrics | 4.4.1 | Scopus API client |
| plotly | 6.6.0 | Interactive charting library |
| pandas | 2.3.3 | Data manipulation and analysis |
| numpy | 2.4.3 | Numerical computing |
| wordcloud | 1.9.6 | Word cloud generation |
| matplotlib | 3.10.8 | Plotting backend for word clouds and Venn diagrams |
| scikit-learn | 1.8.0 | TF-IDF vectorization, NMF, LDA topic modeling |
| openpyxl | 3.1.5 | Excel file generation |
| kaleido | 1.2.0 | Plotly figure export to static PNG images |
| matplotlib-venn | 1.1.2 | Venn diagram visualization |
| networkx | 3.6.1 | Graph construction, layout algorithms, community detection for co-occurrence networks |

5.4 Verify Installation

python -c "import streamlit, pybliometrics, plotly, pandas, numpy, \
    wordcloud, matplotlib, sklearn, openpyxl, kaleido, matplotlib_venn, \
    networkx; print('All dependencies installed successfully.')"

6. Configuration

6.1 pybliometrics Configuration

pybliometrics requires a configuration file at ~/.config/pybliometrics.cfg (Linux/macOS) or %USERPROFILE%\.config\pybliometrics.cfg (Windows). On first run, pybliometrics will prompt for configuration interactively. Alternatively, create the file manually:

[Authentication]
APIKey = YOUR_API_KEY_HERE
InstToken = YOUR_INST_TOKEN_HERE

[Directories]
AbstractRetrieval = ~/.cache/pybliometrics/AbstractRetrieval
AuthorRetrieval = ~/.cache/pybliometrics/AuthorRetrieval
ScopusSearch = ~/.cache/pybliometrics/ScopusSearch

  • The APIKey is mandatory. Obtain one from the Elsevier Developer Portal.
  • The InstToken is optional but required for off-campus access when your institution provides one.
  • Directories define local cache paths. pybliometrics caches API responses to minimize redundant requests.
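
Before launching the app, it can help to confirm that the configuration file has the expected shape. The sketch below parses configuration text with Python's configparser and reports obvious problems. The function name check_pybliometrics_config and the specific checks are illustrative assumptions. It does not contact Elsevier, so it cannot tell whether a key is actually valid.

```python
import configparser

# Text mirroring the manual template shown above.
TEMPLATE = """
[Authentication]
APIKey = YOUR_API_KEY_HERE
InstToken =

[Directories]
AbstractRetrieval = ~/.cache/pybliometrics/AbstractRetrieval
AuthorRetrieval = ~/.cache/pybliometrics/AuthorRetrieval
ScopusSearch = ~/.cache/pybliometrics/ScopusSearch
"""


def check_pybliometrics_config(text: str) -> list:
    """Return a list of human-readable problems found in the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    problems = []
    # fallback="" also covers a missing [Authentication] section.
    api_key = parser.get("Authentication", "APIKey", fallback="").strip()
    if not api_key or api_key == "YOUR_API_KEY_HERE":
        problems.append("APIKey is missing or still a placeholder")
    if not parser.has_section("Directories"):
        problems.append("[Directories] section is missing")
    return problems


template_problems = check_pybliometrics_config(TEMPLATE)
filled_problems = check_pybliometrics_config(
    TEMPLATE.replace("YOUR_API_KEY_HERE", "abc123")  # dummy key for the demo
)
print(template_problems)
print(filled_problems)
```

To check a real installation, the same function could be given the contents of ~/.config/pybliometrics.cfg. An empty InstToken is not flagged because the token is optional.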

6.2 Network Requirements

Scopus API access requires either:

  • A connection from an institutional network (campus VPN/IP range) recognized by Scopus, or
  • A valid InstToken for authentication from any network.

ScopusLit calls pybliometrics.init() at startup, which reads the configuration file and initializes the API client.


7. Usage Guide

7.1 Launching the Application

cd /path/to/ScopusLit
streamlit run app.py

The application opens in the default web browser at http://localhost:8501.

7.2 Navigation Structure

The application uses a four-page navigation structure accessible via radio buttons in the left sidebar:

| Page | Purpose |
| --- | --- |
| New Search | Execute new Scopus searches and download abstracts |
| Saved Searches | Manage (load, rename, delete) previously saved searches |
| Analysis | View bibliometric and text analysis for a loaded search |
| Consolidation | Combine multiple searches and run comparative analysis |

The sidebar also displays:

  • A quick-access list of all saved searches with result counts, dates, abstract indicators (abs), and individual "Load" buttons. Clicking "Load" immediately opens the Analysis page with that search.
  • A list of all saved consolidations (if any exist) with mode labels, search counts, and "Load" buttons. Clicking "Load" immediately opens the Consolidation page with that consolidation's analysis.

7.3 Workflow: Running a New Search

  1. Navigate to the New Search page.
  2. Enter a Scopus Advanced Search query in the text area (e.g., TITLE-ABS-KEY(seismic AND "machine learning")).
  3. Optionally enter a descriptive name and comma-separated tags.
  4. Click Estimate Results to preview the number of matching documents without downloading them. This uses ScopusSearch(query, download=False).get_results_size() and consumes minimal API quota.
  5. Click Run Search to execute the full search. The application calls ScopusSearch(query, subscriber=True) and converts the resulting list of Document namedtuples into a serializable list of dictionaries.
  6. Results are automatically saved to a JSON file in the ./scopuslit_data/ directory.
  7. A summary is displayed: total documents, year range, and the first 10 titles.
  8. Optionally click Download Abstracts to retrieve full abstract texts via AbstractRetrieval(eid) for each document. This is a separate step because it requires one API call per document and can be slow for large result sets. Progress is saved incrementally every 25 abstracts.
  9. Click Load for Analysis to navigate to the Analysis page.
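
The incremental-save behavior in step 8 (checkpoint every 25 abstracts, skip EIDs that are already stored) can be sketched as follows. Nothing here calls the Scopus API: fetch and save are injected callables, and the function name download_with_checkpoints is hypothetical. In the application the fetch step corresponds to AbstractRetrieval(eid) and the save step to writing the search's JSON file.

```python
def download_with_checkpoints(eids, fetch, save, existing=None, every=25):
    """Fetch abstracts for eids, saving progress after every `every` new downloads.

    existing: abstracts already downloaded (EID -> text); these are skipped.
    """
    abstracts = dict(existing or {})
    since_save = 0
    for eid in eids:
        if eid in abstracts:
            continue  # resume support: never re-download a stored abstract
        abstracts[eid] = fetch(eid)
        since_save += 1
        if since_save >= every:
            save(abstracts)
            since_save = 0
    if since_save:
        save(abstracts)  # flush the final partial batch
    return abstracts


eids = [f"2-s2.0-{i}" for i in range(65)]
already = {eid: "cached abstract" for eid in eids[:10]}  # from an interrupted run

fetched = []
snapshots = []


def fake_fetch(eid):
    """Stand-in for the per-document API call."""
    fetched.append(eid)
    return f"abstract text for {eid}"


def fake_save(current):
    """Stand-in for persisting to disk; records how many abstracts were saved."""
    snapshots.append(len(current))


result = download_with_checkpoints(eids, fake_fetch, fake_save, existing=already)
print(len(fetched), snapshots)
```

With 10 abstracts carried over from an earlier run, only the remaining 55 are fetched, and progress is saved three times: after 25 and 50 new downloads, and once more for the last 5.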

7.4 Workflow: Analyzing a Search

  1. Load a search from the sidebar, the Saved Searches page, or after running a new search.
  2. The Analysis page displays:
    • Search metadata: name, query, date, result count.
    • Export buttons at the top: Results CSV, Results Excel, Abstracts CSV, Abstracts Excel (latter two visible only if abstracts have been downloaded).
    • 10 analysis tabs (11 if abstracts are available):
      • Timeline: Publications per year with summary metrics (total, range, peak year) and dual-axis bar+line chart.
      • Document Types: Pie and bar charts of Scopus document categories.
      • Sources: Top 15 journals as horizontal bar chart with data table and Bradford's Law.
      • Geography: Top 15 countries and affiliations as bar charts, plus choropleth world map.
      • Authors: Top 20 authors (identified by Scopus author ID when available) as bar chart, with "Fetch h-index for Top 10 Authors" button, Lotka's Law, and Price's Law.
      • Co-authorship: Most frequent co-author pairs table, authors per article distribution chart.
      • Networks: Keyword, author, and country co-occurrence networks.
      • Keywords: Top 30 keywords bar chart, keyword word cloud, and keyword evolution heatmap.
      • Citations: Summary metrics (total, mean, median, max), citation histogram, top 20 table, average per year line chart.
      • Three-Field Plot: Sankey diagram linking selected bibliometric fields.
      • Text Analysis (if abstracts available): Three nested sub-tabs for abstract word cloud, TF-IDF terms, and topic modeling.
  3. Each analysis tab includes a Methods and parameters expander describing field sources, assumptions, and algorithm settings.
  4. Each Plotly chart has an Export as PNG (300 DPI) button directly below it when Kaleido export is available.
  5. Data tables behind each chart are accessible via expandable sections.
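
The two author-productivity checks shown in the Authors tab can be sketched with the standard library. This is an illustrative reconstruction, not the code of analyze_lotka_law or analyze_price_law. The function names, the rounding of √n to the nearest integer, and the ordinary least-squares fit on log10 values are assumptions. The Lotka fit needs at least two distinct productivity levels.

```python
import math
from collections import Counter


def price_law_check(author_counts):
    """Check whether the sqrt(n) most productive authors hold >= 50% of authorships."""
    counts = sorted(author_counts.values(), reverse=True)
    n_elite = max(1, round(math.sqrt(len(counts))))
    share = sum(counts[:n_elite]) / sum(counts)
    return n_elite, share, share >= 0.5


def lotka_fit(author_counts):
    """Fit log10(authors with x papers) = a - c * log10(x); return (c, R squared)."""
    freq = Counter(author_counts.values())  # x papers -> number of authors
    levels = sorted(freq)
    xs = [math.log10(x) for x in levels]
    ys = [math.log10(freq[x]) for x in levels]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return -slope, 1 - ss_res / ss_tot


# Synthetic corpus following an exact inverse-square law: 100 / x**2 authors
# have x papers (100 with 1 paper, 25 with 2, 4 with 5, 1 with 10).
author_counts = {}
for papers, n_authors in [(1, 100), (2, 25), (5, 4), (10, 1)]:
    for i in range(n_authors):
        author_counts[f"author_{papers}_{i}"] = papers

exponent, r_squared = lotka_fit(author_counts)
n_elite, elite_share, price_holds = price_law_check(author_counts)
print(round(exponent, 3), round(r_squared, 3), n_elite, round(elite_share, 3), price_holds)
```

On this synthetic data the fitted exponent is 2, the classic Lotka value. The 11 most productive of the 130 authors hold 42 of 180 authorships, so Price's criterion is not met.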

7.5 Workflow: Managing Saved Searches

The Saved Searches page displays each search as a card with:

  • Name (editable inline with "Save Name" button), query preview, date, result count, tags, and abstract status.
  • Load for Analysis button to open the search in the Analysis page.
  • Download Abstracts button (if abstracts have not been downloaded yet).
  • Delete button with a two-step confirmation pattern: first click shows "Are you sure?" with "Yes, delete" and "Cancel" buttons.

7.6 Workflow: Consolidating Searches

  1. Navigate to the Consolidation page.
  2. Create a new consolidation: Select 2+ saved searches using checkboxes, choose a mode (union or comparison), enter a name, and click "Create Consolidation".
  3. Or load an existing consolidation from the list shown at the bottom of the page (each with "Load" and "Delete" buttons), or from the sidebar.
  4. For union mode, the full Phase 2 + Phase 3 analysis pipeline is displayed (same 10-11 tabs as individual search analysis, including filters), plus consolidated export buttons.
  5. For comparison mode, three specialized tabs are displayed:
    • Timeline Comparison: overlaid line charts.
    • Keywords Comparison: grouped bar chart.
    • Document Overlap: pairwise overlap statistics table + Venn diagram.

7.7 Workflow: Exporting Data

From the Analysis page, up to four export buttons are available:

| Button | Format | Contents |
| --- | --- | --- |
| Export Results (CSV) | .csv | All search result fields in a flat table |
| Export Results (Excel) | .xlsx | 4 sheets: Results, Authors, Keywords, Summary |
| Export Abstracts (CSV) | .csv | EID, DOI, Title, Authors, Year, Journal, Abstract, Keywords |
| Export Abstracts (Excel) | .xlsx | Same columns as CSV, in Excel format |

The Excel Results export includes a Summary sheet with aggregate statistics: search name, total results, year range, total/mean citations, unique journals, unique countries, and unique authors. The Authors sheet includes Scopus author IDs where available.

Additionally, every individual chart can be exported as a 300 DPI PNG via the download button below each figure when static rendering is available.


8. Illustrative Example

The following walkthrough demonstrates a typical bibliometric analysis session using ScopusLit.

8.1 Scenario

A researcher is preparing a review article on machine learning applications in seismology and needs to characterize the existing literature quantitatively.

8.2 Step-by-Step

Step 1: Define and run the search.

On the "New Search" page, the researcher enters:

  • Query: TITLE-ABS-KEY(seismic AND "machine learning")
  • Name: Seismic ML
  • Tags: seismology, ML, review

Clicking "Estimate Results" reveals approximately 2,400 documents. Satisfied with the scope, the researcher clicks "Run Search". After approximately 30 seconds, the search completes and is saved.

Step 2: Download abstracts.

The researcher clicks "Download Abstracts". With a progress bar indicating status, the application downloads full abstract texts for all 2,400 documents. This process takes approximately 20-40 minutes depending on API response times, with progress saved every 25 abstracts. If interrupted, the download can be resumed later without re-downloading completed abstracts.

Step 3: Analyze the results.

Clicking "Load for Analysis" opens the Analysis page with a global filters panel and 11 tabs:

  • The Timeline tab reveals a rapid growth trend starting around 2017, with peak publications in 2024.
  • The Sources tab shows Geophysics, Computers & Geosciences, and Geophysical Journal International as the top journals.
  • The Geography tab reveals the United States and China as the leading countries, with a choropleth map showing global distribution.
  • The Networks tab provides VOSviewer-style co-occurrence analysis. The keyword co-occurrence subtab reveals clusters of related terms: a "deep learning" cluster connected to "convolutional neural network" and "transfer learning", a "seismology" cluster linking "earthquake", "seismic hazard", and "ground motion". The author co-authorship network shows research groups and their collaboration patterns, while the country collaboration network highlights the US-China research axis with secondary European hubs.
  • The Keywords tab shows "deep learning", "convolutional neural network", and "earthquake" as the most frequent author keywords. The word cloud provides a visual summary.
  • The Text Analysis tab's TF-IDF analysis identifies discriminative terms like "seismic waveform", "phase picking", and "transfer learning" using only abstracts that remain after active filters. Topic modeling with NMF (5 topics) reveals distinct research themes: earthquake detection, signal processing, hazard assessment, subsurface imaging, and ground motion prediction. The topic module also provides a topic summary table and document-topic assignment export.

Step 4: Run a comparison.

To compare with a related field, the researcher creates a second search for TITLE-ABS-KEY(volcanic AND "machine learning"). Then, on the Consolidation page, both searches are selected in comparison mode. The overlaid timeline reveals that seismic ML research started earlier and has grown faster. The Venn diagram shows 47 shared documents, indicating a meaningful but limited overlap.

Step 5: Export for the manuscript.

The researcher exports:

  • Individual 300 DPI PNG charts for the publications-per-year figure, the keyword word cloud, and the topic heatmap.
  • A multi-sheet Excel file with the complete results, author list with Scopus IDs where available, keyword frequencies, and summary statistics.
  • An abstracts CSV for use in further text mining outside the application.

9. Functional Modules in Detail

9.1 Persistence Module (Section 2) - 10 functions

| Function | Signature | Description |
| --- | --- | --- |
| ensure_data_dir() | () -> None | Creates ./scopuslit_data/ if it does not exist. |
| save_search(search_data) | (dict) -> str | Writes a complete search dict to {uuid}.json. Returns file path. |
| load_search(search_id) | (str) -> dict or None | Reads a search by its UUID. Returns None if file not found. |
| load_all_searches() | () -> list[dict] | Reads metadata (no full results list) from all non-consolidation JSON files, sorted by date descending. |
| delete_search(search_id) | (str) -> bool | Removes a JSON file by UUID. |
| rename_search(search_id, new_name) | (str, str) -> bool | Loads file, updates name field, re-saves. |
| update_search_abstracts(search_id, abstracts) | (str, dict) -> bool | Partial update: replaces only the abstracts field. |
| save_consolidation(consolidation_data) | (dict) -> str | Writes a consolidation dict to {uuid}.json. |
| load_all_consolidations() | () -> list[dict] | Reads metadata from all consolidation JSON files. |
| build_consolidation_dataframe(consolidation) | (dict) -> tuple[DataFrame, dict] | Loads all referenced searches, merges results into a single DataFrame (deduplicating by EID for union mode), merges abstracts. Returns (df, abstracts_dict). |
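
A minimal sketch of the save/load round trip is shown below. It mirrors the documented behavior ({uuid}.json files, None for a missing file) but is not the code from app.py: the extra data_dir parameter is added so the example can run in a temporary directory, and the field names "id", "name", "query", and "results" are assumptions about the stored schema.

```python
import json
import tempfile
import uuid
from pathlib import Path


def save_search(search_data: dict, data_dir) -> str:
    """Write a search dict to <data_dir>/<id>.json and return the file path."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / f"{search_data['id']}.json"
    path.write_text(
        json.dumps(search_data, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return str(path)


def load_search(search_id: str, data_dir):
    """Read a search by ID; return None when the file does not exist."""
    path = Path(data_dir) / f"{search_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    record = {
        "id": str(uuid.uuid4()),
        "name": "Seismic ML",
        "query": 'TITLE-ABS-KEY(seismic AND "machine learning")',
        "results": [{"eid": "2-s2.0-1", "title": "Phase picking with CNNs"}],
    }
    saved_path = save_search(record, tmp)
    loaded = load_search(record["id"], tmp)
    missing = load_search("no-such-id", tmp)

print(loaded == record, missing)
```

One file per search keeps renames and deletions cheap, since each operation touches a single small JSON document.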

9.2 Scopus Search Module (Section 3) - 3 functions

| Function | Signature | Description |
| --- | --- | --- |
| estimate_results(query) | (str) -> int or None | Returns result count without downloading. Uses ScopusSearch(query, download=False).get_results_size(). |
| execute_search(query) | (str) -> tuple[list[dict], int] or None | Executes ScopusSearch(query, subscriber=True), converts Document namedtuples to dicts via ._asdict(), retries on Scopus429Error. |
| results_to_dataframe(results) | (list[dict]) -> DataFrame | Converts result dicts to DataFrame, adding year (int from coverDate), citedby_count (int), author_count (int). |
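
The conversion path from namedtuples to analysis-ready records can be illustrated without calling the API. In the sketch below, Document is a hypothetical stand-in with only a handful of fields (real ScopusSearch results carry many more), and plain dicts replace the pandas DataFrame the application builds. The function names are illustrative.

```python
from collections import namedtuple

# Hypothetical, reduced stand-in for the namedtuples returned by ScopusSearch.
Document = namedtuple(
    "Document", ["eid", "title", "coverDate", "citedby_count", "author_count"]
)


def documents_to_dicts(documents):
    """Convert namedtuples into JSON-serializable dicts via _asdict()."""
    return [doc._asdict() for doc in documents]


def add_derived_fields(results):
    """Add an integer year and coerce the count fields to integers."""
    rows = []
    for record in results:
        row = dict(record)
        cover = row.get("coverDate") or ""
        # coverDate is expected as YYYY-MM-DD; tolerate missing values.
        row["year"] = int(cover[:4]) if cover[:4].isdigit() else None
        row["citedby_count"] = int(row.get("citedby_count") or 0)
        row["author_count"] = int(row.get("author_count") or 0)
        rows.append(row)
    return rows


docs = [
    Document("2-s2.0-1", "Phase picking with CNNs", "2021-06-15", "42", "3"),
    Document("2-s2.0-2", "Ground motion models", None, None, "5"),
]
rows = add_derived_fields(documents_to_dicts(docs))
print([(r["year"], r["citedby_count"], r["author_count"]) for r in rows])
```

Coercing counts at load time means later analyses can sum and sort them without handling strings or missing values.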

9.3 Abstract Download Module (Section 4) - 2 functions

| Function | Signature | Description |
| --- | --- | --- |
| _api_call_with_retry(callable_fn) | (callable) -> Any | Retry wrapper. Catches Scopus429Error, waits with exponential backoff (2s, then 4s), retries up to 3 total attempts. |
| download_abstracts(search_id, eid_list, ...) | (str, list[str], dict or None, progress_bar) -> dict | Downloads abstracts via AbstractRetrieval(eid), falls back to .description, saves every 25 docs, skips already-downloaded EIDs. |
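
The retry policy described for _api_call_with_retry (up to 3 attempts, waiting 2 s and then 4 s) can be sketched as follows. RateLimitError is a hypothetical stand-in for pybliometrics' Scopus429Error so the example runs without the library, and the injectable sleep parameter exists only so the demo does not actually wait. What the application does after the final failed attempt is not stated in this README. This sketch re-raises, which is an assumption.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for pybliometrics' Scopus429Error."""


def call_with_retry(fn, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))  # 2 s, then 4 s


# Demo: a call that is rate-limited twice and then succeeds.
delays = []
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


outcome = call_with_retry(flaky_call, sleep=delays.append)
print(outcome, delays)
```

Passing delays.append as the sleep function records the waits instead of performing them, which makes the backoff schedule easy to inspect.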

9.4 Bibliometric Analysis Module (Section 5) - 16 functions

Function Input Output Description
_parse_delimited_field(series, delimiter) Series, str Series Splits delimited strings, strips whitespace, filters empties, explodes. Used for ;-separated (authors, countries) and |-separated (keywords) fields.
analyze_publications_per_year(df) DataFrame DataFrame (year, count, cumulative) Groups by year, counts, computes running cumulative sum.
analyze_publications_per_journal(df, top_n) DataFrame DataFrame (journal, count) Value counts on publicationName, top N. Default 15.
analyze_publications_per_country(df, top_n) DataFrame DataFrame (country, count) Parses ;-separated affiliation_country, top N. Default 15.
analyze_publications_per_affiliation(df, top_n) DataFrame DataFrame (affiliation, count) Parses ;-separated affilname, top N. Default 15.
_author_entries_from_row(row) Series list[(key, display_name, author_id)] Builds unique author identity keys for one record, preferring Scopus author IDs and falling back to normalized names.
_author_identity_series(df) DataFrame Series Returns author identity keys for author productivity laws, preferring Scopus IDs.
analyze_top_authors(df, top_n) DataFrame DataFrame (author, author_id, count) Counts authors by Scopus ID when available; falls back to normalized name identity. Default 20.
analyze_coauthor_pairs(df, top_n) DataFrame DataFrame (author_1, author_2, author_id_1, author_id_2, count) Generates itertools.combinations of author identity keys per document, counts pair frequencies, and displays names plus IDs. Default 20.
analyze_author_count_distribution(df) DataFrame DataFrame (author_count, num_articles) Distribution of the author_count field.
fetch_h_indices(author_data) list[(name, auid)] list[dict] Calls AuthorRetrieval(auid).h_index per author with retry wrapper.
analyze_keywords(df, top_n) DataFrame DataFrame (keyword, count) Parses |-separated authkeywords, lowercases, counts. Default 30.
analyze_citations(df) DataFrame dict Returns summary (total, mean, median, max), distribution (Series), top_cited (top 20 DataFrame), avg_per_year (DataFrame).
build_cooccurrence_graph(df, field, delimiter, min_cooccurrence, max_nodes, lowercase, dedup_per_row) DataFrame, str, str, int, int, bool, bool dict or None Builds a NetworkX co-occurrence graph from a multi-value field. Counts item frequencies and pair co-occurrences via itertools.combinations, filters by minimum threshold, prunes to top N nodes, runs community detection (greedy_modularity_communities). Returns {graph, node_sizes, communities, pairs_df}.
compute_network_layout(graph, algorithm, seed) nx.Graph, str, int dict Wraps NetworkX layout algorithms: Spring (Fruchterman-Reingold with adaptive k), Kamada-Kawai, or Circular. Returns node→(x,y) mapping.
compute_network_metrics(graph, communities) nx.Graph, list dict Computes summary metrics: nodes, edges, communities, density, average degree.
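
The counting step that build_cooccurrence_graph() is described as performing can be sketched with the standard library alone. This is a minimal illustration, not the code in app.py: the function name count_cooccurrences and the sample rows are hypothetical, and the NetworkX graph construction, node pruning, and community detection are left out.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(rows, delimiter="|", min_cooccurrence=2, lowercase=True):
    """Count item frequencies and unordered pair co-occurrences per record."""
    item_counts = Counter()
    pair_counts = Counter()
    for raw in rows:
        if not raw:
            continue  # skip records with a missing or empty field
        items = [part.strip() for part in raw.split(delimiter) if part.strip()]
        if lowercase:
            items = [item.lower() for item in items]
        # Deduplicate within one record and sort so each pair has one canonical order.
        unique_items = sorted(set(items))
        item_counts.update(unique_items)
        pair_counts.update(combinations(unique_items, 2))
    # Keep only pairs that reach the minimum co-occurrence threshold.
    kept = {pair: n for pair, n in pair_counts.items() if n >= min_cooccurrence}
    return item_counts, kept


# Hypothetical authkeywords values, "|"-separated as in the stored results.
keywords = [
    "machine learning|seismic|deep learning",
    "Machine Learning|Seismic",
    "deep learning|seismic",
    None,
]
item_counts, pairs = count_cooccurrences(keywords)
```

Sorting the items before calling combinations() means ("a", "b") and ("b", "a") are counted as the same edge, which is the property an undirected co-occurrence network needs.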

9.5 Text Analysis Module (Section 6) - 8 functions

Function Input Output Description
_get_abstract_texts(abstracts) dict list[str] Filters and returns non-empty abstract strings.
_filter_abstracts_for_df(df, abstracts) DataFrame, dict dict Subsets abstracts to EIDs present in the currently filtered DataFrame.
_abstracts_signature(abstracts, method, n_topics) dict, str, int str Builds a stable cache key so topic results do not leak across filters, datasets, or settings.
generate_abstract_wordcloud(abstracts) dict Figure or None Concatenates abstracts, generates word cloud using WordCloud.generate() with combined stopword list (293 stopwords total).
compute_tfidf_terms(abstracts, top_n) dict, int DataFrame or None TF-IDF with max_features=1000, max_df=0.85, min_df=2, ngram_range=(1,2). Returns top N terms by average score. Requires >= 3 abstracts.
compute_topic_model(abstracts, n_topics, method) dict, int, str dict or None Uses TF-IDF vectors for NMF and count vectors for LDA. Returns {topics, doc_topic_matrix, doc_ids}. Requires >= max(5, n_topics) abstracts.
build_topic_summary_table(topics) list[dict] DataFrame Converts topic terms and weights into a display/export table.
build_doc_topic_table(topic_result) dict DataFrame Builds document-level dominant topic assignments and per-topic weights for display/export.
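
One way to build the kind of stable cache key that _abstracts_signature() is described as producing is to hash a canonical serialization of the inputs. The README does not document the actual hashing scheme, so the sketch below is an assumption: the name abstracts_signature_sketch, the use of SHA-256, and the JSON payload layout are all illustrative.

```python
import hashlib
import json


def abstracts_signature_sketch(abstracts, method, n_topics):
    """Return a hex digest that changes when the corpus or the settings change."""
    # Ignore empty abstracts so they do not affect the key.
    non_empty = {eid: text for eid, text in abstracts.items() if text and text.strip()}
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"method": method, "n_topics": n_topics, "abstracts": non_empty},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


corpus = {"2-s2.0-1": "First abstract.", "2-s2.0-2": "Second abstract."}
reordered = {"2-s2.0-2": "Second abstract.", "2-s2.0-1": "First abstract."}
filtered = {"2-s2.0-1": "First abstract."}

sig_full = abstracts_signature_sketch(corpus, "NMF", 5)
sig_reordered = abstracts_signature_sketch(reordered, "NMF", 5)
sig_filtered = abstracts_signature_sketch(filtered, "NMF", 5)
sig_other_k = abstracts_signature_sketch(corpus, "NMF", 6)
```

Because the filtered corpus and the changed topic count both yield different digests, a cached topic model keyed this way would not be reused after the filters or settings change.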

9.6 Visualization Module (Section 7) - 17 functions

Styling and utilities (3):

| Function | Description |
| --- | --- |
| style_plotly_fig(fig) | Applies consistent theme: title 20pt, font 16pt, axes 18pt, ticks 16pt, legend 14pt, height 500px, plotly_white. |
| plotly_png_download(fig, filename) | Renders to PNG at 2100x1500 pixels, scale 2x (~300 DPI at 7x5 inches) via kaleido. |
| display_chart_with_download(fig, key, filename) | Composite: st.plotly_chart(width="stretch") + optional st.download_button() with PNG bytes. If Kaleido export fails, the chart still renders and the app shows a warning. |

Chart functions (14):

Function Chart Type Notes
plot_publications_per_year(data) Dual-axis bar + line make_subplots(secondary_y=True)
plot_horizontal_bar(data, x_col, y_col, ...) Horizontal bar Reusable for 6+ charts (journals, countries, affiliations, authors, keywords). Uses dynamic height and explicit category ticks so all labels render.
plot_choropleth_map(data) Choropleth world map Viridis scale; returns None if < 5 countries
plot_author_count_distribution(data) Vertical bar Authors per article
plot_h_index_bar(data) Horizontal bar h-index for top authors
generate_keyword_wordcloud(data) Word cloud (matplotlib) From frequency dict via generate_from_frequencies()
plot_citation_histogram(values) Histogram 30 bins
plot_avg_citations_per_year(data) Line + markers Average citations per year
plot_tfidf_bar(data) Horizontal bar TF-IDF terms ranked by score
plot_topic_heatmap(topics) Heatmap Topic-word weights; dynamic height max(400, 80*n_topics)
plot_comparison_timeline(search_data_list) Overlaid lines One trace per search
plot_comparison_keywords(search_data_list, top_n) Grouped bar Top 15 keywords globally, bars per search
plot_venn_diagram(eid_sets) Venn (matplotlib) 2-set (venn2) or 3-set (venn3)
plot_network_graph(graph_data, positions, title) Network graph (Plotly) Co-occurrence network with community-colored nodes, 3-tier edge widths, hover info (occurrence count, degree, top neighbors). Height 700px, hidden axes.

9.7 Export Module (Section 8) - 5 functions

| Function | Signature | Format | Description |
| --- | --- | --- | --- |
| sanitize_filename(value, fallback) | (str, str) -> str | N/A | Creates safe filename stems from search names before download buttons are rendered. |
| export_results_csv(df) | (DataFrame) -> bytes | CSV | All result fields, UTF-8 encoded. |
| export_results_excel(df, search_name) | (DataFrame, str) -> bytes | Excel | 4 sheets: Results (all fields), Authors (name + ID + count), Keywords (keyword + frequency), Summary (8 aggregate metrics). |
| export_abstracts_csv(df, abstracts) | (DataFrame, dict) -> bytes | CSV | 8 columns: EID, DOI, Title, Authors, Year, Journal, Abstract, Keywords. Falls back to description field if abstract not available. |
| export_abstracts_excel(df, abstracts) | (DataFrame, dict) -> bytes | Excel | Same 8 columns on an "Abstracts" sheet. |

9.8 Interface Module (Section 9) - 23 functions

Function Purpose
init_session_state() Initializes 11 session state variables with default values.
_load_search_into_state(search_id) Loads a search from disk, populates loaded_search and loaded_df, clears h_index_data, switches to Analysis page.
render_sidebar() Renders sidebar: app title, navigation radio (4 pages), saved searches quick-list with Load buttons, consolidations list with Load buttons.
render_new_search_page() Query input, estimate/run buttons, result summary with first 10 titles, "Load for Analysis" and "Download Abstracts" buttons.
render_saved_searches_page() Search cards with inline rename, Load, Download Abstracts, and Delete (with two-step confirmation).
render_analysis_page() Header, export buttons (4), global filters panel, 10-11 analysis tabs dispatching to individual tab renderers.
render_filters_panel(df, key_prefix) Collapsible filters (year range, doc type, country, min citations). Returns filtered DataFrame.
render_methods_expander(content) Reusable expander for reviewer-facing methods notes in analysis tabs.
render_tab_timeline(df) Summary metrics + dual-axis chart + data table.
render_tab_doc_types(df) Document type metrics + pie chart + bar chart + data table.
render_tab_sources(df) Top journals chart + data table + Bradford's Law (zone metrics, log-scale scatterplot, zone table).
render_tab_geography(df) Countries section (bar + choropleth) + affiliations section (bar).
render_tab_authors(df) Top authors chart + h-index fetch section + Lotka's Law (log-log scatter + fit) + Price's Law (metrics).
render_tab_coauthorship(df) Co-author pairs table + author count distribution chart.
render_tab_keywords(df) Two subtabs: Keyword Frequency (bar + word cloud + data table) and Keyword Evolution (heatmap with configurable N).
render_tab_citations(df) Summary metrics + histogram + top 20 table + avg per year chart.
render_tab_sankey(df) Three-field Sankey diagram with field selectors and top-N sliders.
render_tab_text_analysis(df, abstracts) Three nested sub-tabs: abstract word cloud, TF-IDF, topic modeling. Analyses only abstracts matching the currently filtered DataFrame.
render_consolidation_page() Consolidation creation form + existing consolidation management + dispatches to union or comparison renderer.
render_tab_networks(df) Networks tab with 3 subtabs: keyword co-occurrence, author co-authorship, country collaboration.
_render_network_subtab(df, field, delimiter, label, default_min, lowercase, dedup, key_prefix) Reusable renderer for one network subtab: controls, metrics, chart, data table.
render_union_analysis(df, abstracts) Export buttons + filters + full 10-11 tab analysis (reuses individual tab renderers).
render_comparison_analysis(consol, df) Three comparison tabs: timeline, keywords, overlap (table + Venn diagram).

10. Data Model and Persistence

10.1 Search Data Structure

Each search is stored as a JSON file with the following schema:

{
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Seismic ML Review",
    "query": "TITLE-ABS-KEY(seismic AND \"machine learning\")",
    "search_date": "2026-03-08T10:30:00.123456",
    "total_results": 245,
    "results": [
        {
            "eid": "2-s2.0-85012345678",
            "doi": "10.1016/j.example.2024.01.001",
            "title": "Machine learning for seismic data analysis",
            "coverDate": "2024-01-15",
            "publicationName": "Computers & Geosciences",
            "author_names": "Smith J.;Doe A.;Brown K.",
            "author_ids": "12345678;23456789;34567890",
            "author_count": "3",
            "affiliation_country": "United States;Germany",
            "affilname": "Massachusetts Institute of Technology;Technical University of Munich",
            "citedby_count": "15",
            "authkeywords": "machine learning|seismic|deep learning",
            "description": "This study proposes..."
        }
    ],
    "abstracts": {
        "2-s2.0-85012345678": "Full abstract text retrieved via AbstractRetrieval..."
    },
    "tags": ["ML", "seismic"]
}

Field origins: The results list contains dictionaries derived from pybliometrics.scopus.ScopusSearch.results, which returns Document namedtuples with 36 fields. Multi-value fields use ; as separator (authors, affiliations, countries) or | (keywords).

Abstracts: Stored as a dictionary mapping EID to abstract text. The description field in results contains a summary from the search API; the abstracts dictionary contains full texts from the AbstractRetrieval API.
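
Because the stored result fields are strings, analysis code has to split the multi-value fields and coerce the numeric ones. The sketch below shows this for a single record shaped like the schema above; the helper name split_field is hypothetical and the record is a trimmed copy of the example.

```python
def split_field(value, delimiter):
    """Split a delimited string, strip whitespace, and drop empty parts."""
    if not value:
        return []
    return [part.strip() for part in value.split(delimiter) if part.strip()]


# Trimmed copy of the example record shown in the schema above.
record = {
    "eid": "2-s2.0-85012345678",
    "coverDate": "2024-01-15",
    "author_names": "Smith J.;Doe A.;Brown K.",
    "author_ids": "12345678;23456789;34567890",
    "affiliation_country": "United States;Germany",
    "citedby_count": "15",
    "authkeywords": "machine learning|seismic|deep learning",
}

authors = split_field(record["author_names"], ";")
author_ids = split_field(record["author_ids"], ";")
countries = split_field(record["affiliation_country"], ";")
keywords = split_field(record["authkeywords"], "|")

# Pair each display name with its Scopus author ID (same order in both fields).
author_pairs = list(zip(authors, author_ids))

# Numeric fields arrive as strings and need explicit conversion.
year = int(record["coverDate"][:4])
citations = int(record["citedby_count"])
```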

10.2 Consolidation Data Structure

{
    "id": "660e8400-e29b-41d4-a716-446655440001",
    "type": "consolidation",
    "name": "ML + Seismic Combined Analysis",
    "mode": "union",
    "search_ids": [
        "550e8400-e29b-41d4-a716-446655440000",
        "550e8400-e29b-41d4-a716-446655440002"
    ],
    "created_date": "2026-03-09T14:00:00.000000"
}

Consolidations reference search IDs rather than duplicating data. The build_consolidation_dataframe() function loads the referenced searches at runtime, merges their results, and handles deduplication for union mode.

10.3 Storage Location

All files are stored in ./scopuslit_data/ relative to app.py. This directory is created automatically on first run. File naming convention: {uuid4}.json.
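
The file-per-search convention can be illustrated with a short round trip. This is a sketch under stated assumptions rather than the storage code in app.py: the function names carry a _sketch suffix and take an explicit data_dir argument (the real load_search takes only search_id), and a temporary directory stands in for ./scopuslit_data/.

```python
import json
import os
import tempfile
import uuid


def save_search_sketch(data_dir, search):
    """Write one search dict to {uuid}.json and return its ID."""
    os.makedirs(data_dir, exist_ok=True)  # create the directory on first use
    search_id = search.get("id") or str(uuid.uuid4())
    record = dict(search, id=search_id)
    path = os.path.join(data_dir, f"{search_id}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, ensure_ascii=False, indent=2)
    return search_id


def load_search_sketch(data_dir, search_id):
    """Read a search by ID; return None if the file does not exist."""
    path = os.path.join(data_dir, f"{search_id}.json")
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "scopuslit_data")
    new_id = save_search_sketch(data_dir, {"name": "Seismic ML Review", "results": []})
    loaded = load_search_sketch(data_dir, new_id)
    missing = load_search_sketch(data_dir, "no-such-id")
```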


11. Visualization Specifications

All Plotly charts follow a consistent visual specification enforced by the style_plotly_fig() helper function:

| Property | Value | Rationale |
| --- | --- | --- |
| Template | plotly_white | Clean, minimal background suitable for publication |
| Title font size | 20pt | Readable in Streamlit's wide layout |
| Base font size | 16pt | Compensates for Streamlit's default small rendering |
| Axis title font size | 18pt | Clearly distinguishable from tick labels |
| Tick label font size | 16pt | Legible for dense axis labels |
| Legend font size | 14pt | Compact but readable |
| Default chart height | 500px | Consistent vertical proportion; horizontal bars and heatmaps use dynamic height for dense labels |
| Color palette | px.colors.qualitative.Set2 | Color-blind friendly, visually distinct |

PNG Export Specification

Every chart can be exported as a high-resolution PNG:

| Property | Value (Plotly) | Value (matplotlib) |
| --- | --- | --- |
| Width | 2100 pixels | Auto (tight bbox) |
| Height | 1500 pixels | Auto (tight bbox) |
| Scale factor | 2x | N/A |
| DPI | ~300 (effective) | 300 (explicit) |
| Format | PNG (lossless) | PNG (lossless) |
| Engine | Kaleido (chromium-based) | matplotlib Agg backend |

If Kaleido or its browser backend is unavailable, Plotly charts still render interactively in the Streamlit page and the app displays a warning instead of failing the analysis tab.

Dense Label Handling

Horizontal bar charts and keyword-evolution heatmaps use dynamic heights and explicit categorical tick arrays so all labels are rendered. This is particularly important for top-author, top-keyword, and TF-IDF charts where default Plotly tick skipping can otherwise hide alternating labels.

Network Graph Specification

Co-occurrence network graphs use Plotly scatter traces to render NetworkX graphs:

| Property | Value | Rationale |
| --- | --- | --- |
| Chart height | 700px | Extra height for network readability |
| Axes | Hidden (no grid, ticks, or zero line) | Network layout is spatial, not quantitative |
| Edge rendering | go.Scatter(mode="lines") with None separators | Efficient single-trace approach per weight tier |
| Edge width tiers | 3 tiers (0.8, 2.0, 3.5 px) by weight percentiles | Distinguishes weak/medium/strong co-occurrences |
| Edge color | rgba(180,180,180,0.5) | Subtle, non-distracting |
| Node rendering | One go.Scatter(mode="markers+text") per community | Community coloring via 15-color palette |
| Node size range | 10-50 px (normalized by occurrence count) | Proportional to item frequency |
| Node labels | Top center, 10pt, truncated at 25 chars | Readable without excessive overlap |
| Community detection | greedy_modularity_communities | Fast, weight-aware, handles disconnected components |
| Layout algorithms | Spring (default, adaptive k), Kamada-Kawai, Circular | Spring best for clustering; KK for cleaner small graphs |
| Hover info | Name, occurrences, degree, top 3 neighbors | Detailed exploration without clutter |

12. Text Analysis Pipeline

The text analysis module (Section 6) implements a three-stage NLP pipeline for abstract corpus analysis. In the interface, this pipeline is filter-aware: after the global filters are applied, the Text Analysis tab subsets the abstract dictionary to EIDs present in the filtered DataFrame. This means the abstract word cloud, TF-IDF terms, and topic model reflect the same filtered corpus used by the bibliometric tabs.

12.1 Stopword Configuration

The combined stopword list contains 293 unique terms from three sources:

| Source | Count | Examples |
| --- | --- | --- |
| English stopwords (wordcloud.STOPWORDS) | 192 | the, is, at, which, from, have |
| Spanish stopwords (custom set) | 69 | de, la, que, el, en, para, con, como |
| Academic filler words (custom set) | 38 | study, results, method, proposed, analysis, approach, data, model |

The Spanish stopwords support analysis of multilingual abstract corpora common in Latin American and European research. The academic filler words remove high-frequency domain-agnostic terms that carry low discriminative value in bibliometric contexts.

12.2 TF-IDF Vectorization

The compute_tfidf_terms() function uses sklearn.feature_extraction.text.TfidfVectorizer:

| Parameter | Value | Rationale |
| --- | --- | --- |
| max_features | 1,000 | Vocabulary limit for computational efficiency |
| stop_words | "english" | scikit-learn's built-in English stopword list |
| max_df | 0.85 | Ignore terms appearing in > 85% of documents |
| min_df | 2 | Ignore terms appearing in fewer than 2 documents |
| ngram_range | (1, 2) | Include both unigrams and bigrams (e.g., "neural network") |

The function computes the mean TF-IDF score across all documents for each term, then returns the top N terms ranked by this average score. Requires a minimum of 3 non-empty abstracts.

12.3 Topic Modeling

The compute_topic_model() function supports two decomposition algorithms. Users select the method and number of topics in the interface. The number of topics is not optimized automatically; it is an exploratory parameter selected from 3 to 10, where lower values produce broader themes and higher values produce finer-grained themes.

Non-negative Matrix Factorization (NMF):

  • Vectorizer: TfidfVectorizer(max_features=2000, stop_words="english", max_df=0.85, min_df=2)
  • Parameters: n_components (user-configurable, 3-10), random_state=42, max_iter=300
  • Produces additive, parts-based decomposition
  • Generally produces more interpretable topics for scientific text

Latent Dirichlet Allocation (LDA):

  • Vectorizer: CountVectorizer(max_features=2000, stop_words="english", max_df=0.85, min_df=2)
  • Parameters: n_components (user-configurable, 3-10), random_state=42, max_iter=20
  • Probabilistic generative model

Both algorithms return: (1) a list of topics, each with its top 10 words and their weights, (2) a document-topic assignment matrix, and (3) the EIDs corresponding to that matrix. Requires a minimum of max(5, n_topics) non-empty abstracts.

12.4 Topic Output Interpretation

The Topic Modeling sub-tab displays:

| Output | Interpretation |
| --- | --- |
| Topic Summary table | One row per topic, listing the top weighted terms and their weights. Users interpret each topic by assigning a semantic label based on these terms. |
| Topic-word heatmap | Visualizes relative term weights across topics. Darker cells indicate stronger term-topic association. |
| Document Topic Assignments table | One row per abstract EID with dominant topic, dominant weight, and all per-topic weights. This can be exported as CSV for external validation or downstream analysis. |

Topic modeling is intended as exploratory thematic summarization, not definitive article classification. Results should be interpreted alongside the search strategy, active filters, and domain knowledge.
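
The Document Topic Assignments table can be derived from the document-topic matrix by taking the highest-weight topic per row. The sketch below uses plain lists in place of the array returned by scikit-learn; the function name dominant_topics, the column names, and the 1-based topic numbering are assumptions made for illustration.

```python
def dominant_topics(doc_ids, doc_topic_matrix):
    """Return one row per document with its dominant topic and all topic weights."""
    rows = []
    for eid, weights in zip(doc_ids, doc_topic_matrix):
        # Index of the largest weight in this document's row.
        best = max(range(len(weights)), key=lambda i: weights[i])
        row = {
            "eid": eid,
            "dominant_topic": best + 1,  # 1-based label (an assumption)
            "dominant_weight": weights[best],
        }
        # Keep every per-topic weight so the table can be validated externally.
        row.update({f"topic_{i + 1}": w for i, w in enumerate(weights)})
        rows.append(row)
    return rows


# Hypothetical EIDs and weights for a 3-topic model.
doc_ids = ["2-s2.0-1", "2-s2.0-2"]
matrix = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
assignments = dominant_topics(doc_ids, matrix)
```

A document whose top two weights are close (for example 0.41 and 0.39) is still assigned a single dominant topic, which is one reason the assignments should be read as exploratory.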


13. Search Consolidation

The consolidation feature (Phase 4) enables researchers to combine multiple Scopus searches for integrated analysis.

13.1 Union Mode

In union mode, the application:

  1. Loads all results from each selected search.
  2. Concatenates them into a single DataFrame with a _source_search column.
  3. Deduplicates by EID (Scopus unique identifier), keeping the first occurrence.
  4. Merges abstract dictionaries from all searches (preferring non-empty values).
  5. Applies the complete Phase 2 and Phase 3 analysis pipeline to the merged corpus.
  6. Provides dedicated "Export Consolidated Results" CSV and Excel buttons.

This mode is appropriate when the researcher wants to treat multiple searches as a single body of literature.
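
Steps 1 to 4 above can be sketched on plain dictionaries. This is an illustrative outline only: the real build_consolidation_dataframe() works on a pandas DataFrame, and the function name union_searches, the use of the search name as the _source_search value, and the skipping of records without an EID are assumptions.

```python
def union_searches(searches):
    """Merge result records by EID (first occurrence wins) and merge abstracts."""
    merged = []
    seen = set()
    abstracts = {}
    for search in searches:
        for rec in search.get("results", []):
            eid = rec.get("eid")
            if not eid or eid in seen:
                continue  # skip duplicates and records without an EID
            seen.add(eid)
            # Record which search contributed the kept copy.
            merged.append(dict(rec, _source_search=search.get("name", "")))
        for eid, text in search.get("abstracts", {}).items():
            # Prefer non-empty values: only fill if nothing usable is stored yet.
            if text and not abstracts.get(eid):
                abstracts[eid] = text
    return merged, abstracts


# Two hypothetical searches that share one document (e2).
search_a = {
    "name": "A",
    "results": [{"eid": "e1", "title": "T1"}, {"eid": "e2", "title": "T2"}],
    "abstracts": {"e1": "", "e2": "abs2"},
}
search_b = {
    "name": "B",
    "results": [{"eid": "e2", "title": "T2 (dup)"}, {"eid": "e3", "title": "T3"}],
    "abstracts": {"e1": "abs1 from B", "e3": "abs3"},
}
merged, abstracts = union_searches([search_a, search_b])
```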

13.2 Comparison Mode

In comparison mode, the application:

  1. Loads results from each search separately, preserving source labels.
  2. Generates three comparative visualization tabs:
    • Timeline comparison: Overlaid line charts showing publications per year for each search, with distinct colors per search.
    • Keyword comparison: Grouped bar chart showing the globally top 15 keywords, with per-search frequency bars side by side.
    • Document overlap: A pairwise statistics table (showing shared, only-in-A, only-in-B counts for each pair) and a Venn diagram for 2-3 searches using matplotlib_venn. For more than 3 searches, only the statistics table is shown.

This mode is appropriate when the researcher wants to compare research topics, methodologies, or search strategies.
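
The pairwise overlap statistics reduce to set operations on the EIDs of each search. A minimal sketch, with a hypothetical function name and hypothetical column names:

```python
from itertools import combinations


def pairwise_overlap(eid_sets):
    """Return shared / only-in-A / only-in-B counts for every pair of searches."""
    rows = []
    for (name_a, a), (name_b, b) in combinations(eid_sets.items(), 2):
        rows.append(
            {
                "search_a": name_a,
                "search_b": name_b,
                "shared": len(a & b),   # documents found by both searches
                "only_a": len(a - b),   # documents found only by the first
                "only_b": len(b - a),   # documents found only by the second
            }
        )
    return rows


# Hypothetical EID sets for three searches.
eid_sets = {
    "A": {"e1", "e2", "e3"},
    "B": {"e2", "e3", "e4"},
    "C": {"e3", "e5"},
}
overlap = pairwise_overlap(eid_sets)
```

With n searches the table has n(n-1)/2 rows, so it stays usable for the larger comparisons where a Venn diagram is not drawn.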


14. Export Capabilities

14.1 Results Export

CSV format: Flat file containing all fields from the ScopusSearch results. Encoded as UTF-8.

Excel format (.xlsx): Multi-sheet workbook:

| Sheet | Contents |
| --- | --- |
| Results | All search result fields (excluding internal columns prefixed with _) |
| Authors | Author name, Scopus author ID when available, and publication count, sorted by frequency descending |
| Keywords | Keyword and frequency count, lowercased, sorted by frequency descending |
| Summary | 8 aggregate metrics: search name, total results, year range, total citations, mean citations, unique journals, unique countries, unique authors |

14.2 Abstracts Export

Both CSV and Excel formats contain the same 8 columns:

Column Source
EID Scopus unique identifier from search results
DOI Digital Object Identifier
Title Document title
Authors Author names (;-separated)
Year Publication year (integer)
Journal Publication name
Abstract Full text from AbstractRetrieval, falling back to description field from search
Keywords Author keywords (|-separated)

14.3 Chart Export

Every Plotly chart: PNG at 2100x1500 pixels, scale 2x (~300 DPI). Every matplotlib figure: PNG at 300 DPI with tight bounding box. Download buttons appear directly below each chart.

Search names are sanitized before being used in filenames. Excel export failures and Plotly PNG export failures are caught in the interface and shown as warnings, so the rest of the analysis page remains usable.
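
The README does not list the exact sanitization rules, so the following is only one plausible approach: keep letters, digits, dots, underscores, and hyphens, collapse everything else into underscores, and fall back to a default stem when nothing usable remains. The function name and the default fallback value are hypothetical.

```python
import re


def sanitize_filename_sketch(value, fallback="scopuslit_export"):
    """Turn a free-text search name into a conservative filename stem."""
    text = (value or "").strip()
    # Replace each run of disallowed characters with a single underscore.
    text = re.sub(r"[^A-Za-z0-9._-]+", "_", text)
    # Drop leading/trailing separators so names cannot start with "." or "..".
    text = text.strip("._-")
    return text or fallback


stem_plain = sanitize_filename_sketch("ML + Seismic Combined Analysis")
stem_path = sanitize_filename_sketch("../../etc/passwd")
stem_empty = sanitize_filename_sketch("   ")
```

Here stem_plain is "ML_Seismic_Combined_Analysis", stem_path is "etc_passwd", and stem_empty falls back to "scopuslit_export".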

14.4 Topic Model Exports

When a topic model has been run, two additional CSV exports are available inside the Topic Modeling sub-tab:

| Export | Contents |
| --- | --- |
| Topic Summary | Topic label, top terms, and term weights |
| Document Topic Assignments | EID, dominant topic, dominant topic weight, and all topic weights |

15. Error Handling and API Rate Limiting

15.1 API Error Handling

All Scopus API calls are wrapped in try/except blocks that catch:

| Exception | Handling |
| --- | --- |
| Scopus429Error | Rate limit exceeded. Retry with exponential backoff up to 3 total attempts. Display warning during retries, error after exhaustion. |
| ScopusHtmlError | General API error. Display error with troubleshooting hints (check network, API key, InstToken). |
| Exception | Catch-all. Display error with exception details. |

Note: Exception classes are imported from pybliometrics.exception (not pybliometrics.scopus.exception).

15.2 Retry Mechanism

The _api_call_with_retry(callable_fn) function implements exponential backoff:

Attempt 1: Execute immediately
Attempt 2: Wait 2 seconds, then retry
Attempt 3: Wait 4 seconds, then retry
After 3 failures: Raise the original exception

Wait times are computed as BACKOFF_BASE_SECONDS * (2 ** attempt) where BACKOFF_BASE_SECONDS = 2 and attempt starts at 0. This yields waits of 2s and 4s before the second and third attempts respectively.
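
The schedule above can be sketched as a small wrapper. To keep the example runnable without pybliometrics, a local RateLimitError class stands in for Scopus429Error, and the sleep function is passed as a parameter so the demonstration can record the waits instead of actually pausing; both are assumptions made for illustration, as is the MAX_ATTEMPTS constant name.

```python
import time

BACKOFF_BASE_SECONDS = 2
MAX_ATTEMPTS = 3  # hypothetical constant name for the 3 total attempts


class RateLimitError(Exception):
    """Stand-in for pybliometrics' Scopus429Error in this sketch."""


def call_with_retry(callable_fn, sleep=time.sleep):
    """Call callable_fn, retrying on rate-limit errors with exponential backoff."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return callable_fn()
        except RateLimitError:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # third failure: re-raise the original exception
            # attempt 0 -> 2 s, attempt 1 -> 4 s
            sleep(BACKOFF_BASE_SECONDS * (2 ** attempt))


# Demonstration: a call that is rate-limited twice, then succeeds.
waits = []
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return "ok"


result = call_with_retry(flaky, sleep=waits.append)
```

After the demonstration runs, result is "ok" and waits is [2, 4], matching the two delays described above.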

15.3 Incremental Abstract Saving

The abstract download function saves progress to disk every 25 abstracts. If the process is interrupted, previously downloaded abstracts are preserved. Re-running the download skips already-downloaded EIDs (those with non-empty values in the abstracts dict).
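
The checkpointing behavior can be outlined as follows. This is a sketch, not the download_abstracts() implementation: fetch and save are hypothetical callables standing in for the AbstractRetrieval call and the JSON write, and the SAVE_EVERY constant name is assumed.

```python
SAVE_EVERY = 25  # checkpoint interval described above


def download_incrementally(eids, existing, fetch, save):
    """Fetch missing abstracts, checkpointing every SAVE_EVERY documents."""
    abstracts = dict(existing or {})
    # Skip EIDs that already have a non-empty abstract.
    pending = [eid for eid in eids if not abstracts.get(eid)]
    for done, eid in enumerate(pending, start=1):
        abstracts[eid] = fetch(eid)
        # Save at each checkpoint and once more after the last document.
        if done % SAVE_EVERY == 0 or done == len(pending):
            save(abstracts)
    return abstracts


# Demonstration with fake fetch/save functions.
all_eids = [f"eid-{i}" for i in range(70)]
already = {f"eid-{i}": "cached text" for i in range(10)}
fetched = []
save_sizes = []


def fake_fetch(eid):
    fetched.append(eid)
    return f"abstract for {eid}"


def fake_save(current):
    save_sizes.append(len(current))


final = download_incrementally(all_eids, already, fake_fetch, fake_save)
```

In this run the 10 cached EIDs are skipped, 60 abstracts are fetched, and the store is written three times (after 25, 50, and 60 new documents), so an interruption loses at most the documents fetched since the last checkpoint.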

15.4 User-Facing Error Messages

All error messages are displayed in English via st.error() and include actionable troubleshooting suggestions:

  • Connection failures suggest checking institutional network, API key, or InstToken.
  • Rate limit errors suggest waiting a few minutes before retrying.
  • Missing data warnings inform the user which analyses could not be performed.

16. Dependencies and Technology Stack

16.1 Core Framework

| Component | Technology | Role |
| --- | --- | --- |
| Web framework | Streamlit 1.55 | Reactive web UI with widgets, layout, session state |
| API client | pybliometrics 4.4.1 | Scopus API communication (ScopusSearch, AbstractRetrieval, AuthorRetrieval) |
| Data manipulation | pandas 2.3 | DataFrame operations, groupby, value counts, field parsing |
| Numerical computing | NumPy 2.4 | Array operations, NaN handling |

16.2 Visualization

| Component | Technology | Role |
| --- | --- | --- |
| Interactive charts | Plotly 6.6 | Bar charts, histograms, line charts, choropleth maps, heatmaps |
| Static figures | matplotlib 3.10 | Word cloud rendering, Venn diagram rendering |
| Word clouds | wordcloud 1.9 | Word cloud generation from text and frequency dictionaries |
| Venn diagrams | matplotlib-venn 1.1 | 2-set and 3-set proportional Venn diagrams |
| Image export | Kaleido 1.2 | Chromium-based headless rendering of Plotly figures to PNG |

16.3 Text Analysis

| Component | Technology | Role |
| --- | --- | --- |
| TF-IDF | scikit-learn 1.8 (TfidfVectorizer) | Term frequency-inverse document frequency vectorization |
| NMF | scikit-learn 1.8 (NMF) | Non-negative matrix factorization for topic modeling |
| LDA | scikit-learn 1.8 (LatentDirichletAllocation) | Latent Dirichlet allocation for topic modeling |
| Count vectors | scikit-learn 1.8 (CountVectorizer) | Raw count vectorization for LDA topic modeling |

16.4 Data Export

| Component | Technology | Role |
| --- | --- | --- |
| Excel writing | openpyxl 3.1 | Multi-sheet .xlsx file generation |
| CSV writing | pandas (built-in) | UTF-8 encoded CSV generation |

16.5 Standard Library Usage

The application uses the following Python standard library modules: os, json, uuid, time, io, hashlib, datetime, collections.Counter, itertools.combinations.


17. Software Metadata

| Field | Value |
| --- | --- |
| Software name | ScopusLit |
| Version | 1.0.0 |
| Programming language | Python (>= 3.10) |
| Tested Python version | 3.14 |
| Operating systems | macOS, Linux, Windows (any OS supporting Python and Streamlit) |
| Size of software | Single file, approximately 3,005 lines of Python code |
| Dependencies | 12 Python packages (see Section 16) |
| External API | Scopus API via pybliometrics (requires API key) |
| Interface | Web browser (served locally by Streamlit) |
| Parallelism | Single-threaded (Streamlit execution model) |
| Data storage | Local JSON files in ./scopuslit_data/ |
| Repository | [To be added] |
| License | [To be determined] |
| Development institution | Universidad Industrial de Santander (UIS), Bucaramanga, Colombia |

18. Impact

ScopusLit has the potential to benefit the research community in several ways:

Lowering the barrier to bibliometric analysis. By integrating search, analysis, and visualization into a single browser-based tool with no programming requirement, ScopusLit makes quantitative literature analysis accessible to researchers who lack programming skills or familiarity with specialized bibliometric software.

Enabling reproducible bibliometric workflows. Each search is persisted as a self-contained JSON file that captures the query, execution date, full result set, and downloaded abstracts. This enables exact reproduction of analyses and facilitates sharing of bibliometric datasets between collaborators.

Supporting multilingual research communities. The inclusion of Spanish stopwords alongside English ones reflects the tool's origin at a Latin American institution and supports analysis of bibliographic corpora where abstracts may contain Spanish text, a common scenario in engineering and geosciences literature from Latin America and Spain.

Accelerating systematic review preparation. The search consolidation feature, with both union and comparison modes, directly supports the multi-query workflow typical of systematic reviews (PRISMA methodology), where researchers must execute multiple search strings across different conceptual facets and then analyze the combined and overlapping result sets.

Providing publication-ready outputs. Every visualization can be exported at 300 DPI, meeting the minimum resolution requirements of most scientific journals (typically 300 DPI for color figures). The multi-sheet Excel export provides immediately usable supplementary materials.


19. Limitations and Future Work

19.1 Current Limitations

  1. Scopus-only: The tool is designed exclusively for the Scopus database. Support for Web of Science, PubMed, OpenAlex, or other databases is not included.
  2. API quota constraints: Scopus API imposes rate limits (typically 6-9 requests per second) and weekly quotas (typically 5,000-20,000 requests depending on the API key type). Large searches (> 5,000 results) or extensive abstract downloads may exhaust quotas.
  3. No co-citation or bibliographic coupling analysis: The current version does not implement reference-based analyses (co-citation networks, bibliographic coupling) which require cited reference data not available from ScopusSearch.results.
  4. Single-user, local deployment: The application runs locally and does not support concurrent multi-user access or cloud deployment out of the box.
  5. Abstract-dependent text analysis: TF-IDF, topic modeling, and abstract word clouds require downloading full abstracts, which consumes one API call per document.
  6. Venn diagram limit: Document overlap visualization is limited to 2-3 searches due to limitations of the matplotlib-venn library. Larger comparisons use only the overlap statistics table.
  7. No BibTeX export: Direct export to BibTeX format for integration with reference managers (Zotero, Mendeley, EndNote) is not yet supported.
  8. Monolithic codebase: This final pre-refactor version remains a single app.py file for portability. The next development stage should split the application into dedicated modules for storage, API access, analysis, plotting, exports, and UI.

19.2 Planned Future Enhancements

  • Co-citation and bibliographic coupling analysis using AbstractRetrieval.references.
  • Integration with OpenAlex or Semantic Scholar for open-access metadata enrichment.
  • Cloud deployment template (e.g., Streamlit Community Cloud, Docker).
  • BibTeX and RIS export for reference managers.
  • Modular refactor of the monolithic application into maintainable packages.
  • Automated tests for core analysis functions and export functions.
  • Topic-model validation aids such as coherence scoring or perplexity diagnostics.
  • Author collaboration internationalization metrics.

20. How to Cite

If you use ScopusLit in your research, please cite it as:

Arroyo, O. (2026). ScopusLit: An end-to-end Web-based tool for bibliometric analysis. SoftwareX, 34, 102733.


21. License

[License to be determined]


ScopusLit is developed at Universidad Industrial de Santander (UIS), Bucaramanga, Colombia.
