ScopusLit: An Interactive Bibliometric Analysis Tool for Scopus

ScopusLit is an open-source, single-file web application built with Python, Streamlit, and pybliometrics that enables researchers to conduct comprehensive bibliometric analyses of scientific literature indexed in the Scopus database. The tool provides an end-to-end workflow: from executing Scopus Advanced Search queries and persisting results locally, through generating over 20 types of bibliometric visualizations (including VOSviewer-style co-occurrence networks), to performing filter-aware natural language processing on abstracts and exporting publication-ready figures and datasets.

Status of this document: This README describes the final monolithic version of app.py before the planned modular refactor. The application is still intentionally delivered as a single Python file for portability, but its internal behavior now includes reviewer-facing methods notes, author-ID-aware analyses, filter-aware text analysis, safer exports, and richer topic-model interpretation outputs.

1. Motivation and Significance
2. Comparison with Related Software
3. Software Architecture
4. Features Overview
5. Installation
6. Configuration
7. Usage Guide
8. Illustrative Example
9. Functional Modules in Detail
10. Data Model and Persistence
11. Visualization Specifications
12. Text Analysis Pipeline
13. Search Consolidation
14. Export Capabilities
15. Error Handling and API Rate Limiting
16. Dependencies and Technology Stack
17. Software Metadata
18. Impact
19. Limitations and Future Work
20. How to Cite
21. License

1. Motivation and Significance

Bibliometric analysis is a cornerstone of systematic literature reviews, research trend identification, and science mapping. Researchers frequently rely on the Scopus database, one of the largest curated abstract and citation databases of peer-reviewed literature, encompassing over 27,000 titles from more than 7,000 publishers. However, conducting rigorous bibliometric studies typically requires combining multiple disconnected tools: Scopus's web interface for search, spreadsheet software for data cleaning, specialized bibliometric software (e.g., VOSviewer, Bibliometrix) for analysis, and graphic design tools for producing publication-quality figures.

ScopusLit addresses this fragmentation by providing a single, unified, browser-based application that handles the entire bibliometric workflow within one interface. The tool is designed for:

Researchers conducting systematic literature reviews who need rapid quantitative characterization of a body of literature.
Graduate students learning bibliometric methods who benefit from an interactive, visual environment.
Research groups that need to compare multiple search strategies or research topics side by side.
Authors preparing review articles who require publication-ready charts exported at 300 DPI.

Unlike desktop bibliometric tools that require manual data import/export steps, ScopusLit communicates directly with the Scopus API, enabling a seamless flow from query formulation to final analysis. Unlike purely programmatic approaches (e.g., writing custom Python scripts), ScopusLit provides an interactive graphical interface that requires no programming knowledge from the end user.

2. Comparison with Related Software

The following table compares ScopusLit with established bibliometric tools across key dimensions relevant to researchers:

Feature	ScopusLit	VOSviewer	Bibliometrix (R)	Publish or Perish	CiteSpace
Interface	Web browser (Streamlit)	Desktop GUI (Java)	RStudio / Shiny web app	Desktop GUI	Desktop GUI (Java)
Data source integration	Direct Scopus API	Manual file import	Manual file import (multiple formats)	Google Scholar, Scopus, WoS	Manual file import
Search execution	Built-in (Scopus Advanced Search)	External	External	Built-in (multiple sources)	External
Result count estimation	Yes (before downloading)	No	No	Yes	No
Abstract retrieval	Built-in (per-document API)	No	No	No	No
Data persistence	Automatic JSON storage	Manual save/load	R workspace	CSV export	Project files
Publications per year	Bar + cumulative line	Limited	Yes	Yes	Yes
Journal analysis	Top N horizontal bar	Limited	Yes	Yes	Limited
Country/affiliation analysis	Bar + choropleth map	Network map	Yes (world map)	No	Limited
Author analysis	Top N bar + h-index fetch	Network map	Yes	Yes	Yes
Co-authorship analysis	Frequency tables + network graph	Network visualization	Network visualization	No	Network visualization
Keyword analysis	Bar chart + word cloud	Network map + overlay	Yes + word cloud	No	Yes
Citation analysis	Histogram + top cited + trends	Citation network	Yes	Yes + citation metrics	Citation burst detection
TF-IDF on abstracts	Yes (unigrams + bigrams)	No	No	No	No
Topic modeling (NMF/LDA)	Yes (3-10 topics, heatmap, topic summary, document-topic export)	No	No	No	No
Search consolidation	Union + comparison modes	No	Multiple file merge	No	No
Venn diagram (overlap)	Yes (2-3 searches)	No	No	No	No
Publication-ready export	300 DPI PNG per chart	PNG/SVG	Multiple formats	No	PNG
Excel export	Multi-sheet (Results, Authors with IDs, Keywords, Summary)	CSV	Multiple formats	CSV	No
Co-citation / coupling	Not yet	Yes	Yes	No	Yes
Network visualization	Keyword, author, country co-occurrence (NetworkX + Plotly)	Primary strength	Yes (igraph)	No	Primary strength
Programming required	None	None	R programming	None	None
Cost	Free (requires Scopus API key)	Free	Free (R required)	Free	Free
Language	Python	Java	R	C#	Java

Key differentiators of ScopusLit:

Integrated API access: Among the tools compared here, ScopusLit is the only one that executes searches, downloads abstracts, and performs analysis within a single interface without manual file transfer.
Text analysis on abstracts: TF-IDF and topic modeling on full abstract corpora are not available in VOSviewer, CiteSpace, or Publish or Perish.
Search consolidation with Venn overlap: ScopusLit can merge multiple searches (union mode) or compare them side by side (comparison mode with Venn diagrams). Among the compared tools, only Bibliometrix offers a related capability, in the form of multi-file merging.
Zero-code browser interface: Unlike Bibliometrix (which requires R), ScopusLit runs entirely through a web browser with no programming required.

3. Software Architecture

ScopusLit is implemented as a single Python file (app.py, approximately 3,000 lines of code) organized into 10 clearly delineated sections. This monolithic architecture was a deliberate initial design choice to maximize portability, simplify deployment, and minimize configuration overhead. The file is written in a function-based style, with grouped utility functions rather than classes. This is the final single-file version before the codebase is split into dedicated modules.

3.1 Section Organization

Section	Name	Approx. Lines	Description
1	Imports and Configuration	65	All imports, `pybliometrics.init()`, constants, stopwords, Streamlit page config
2	Persistence Functions	175	JSON save/load for searches and consolidations
3	Scopus Search Functions	70	Query estimation, execution, DataFrame conversion
4	Abstract Download Functions	55	Retry wrapper, abstract downloading with incremental save
5	Bibliometric Analysis Functions	330	Pure data computations including author-ID-aware analysis and co-occurrence network analysis
6	Text Analysis Functions	150	TF-IDF, topic modeling, abstract word cloud generation, topic summary tables
7	Visualization Functions	390	Plotly and matplotlib chart creation including network graphs
8	Export Functions	110	CSV, multi-sheet Excel generation, filename sanitization
9	Streamlit Interface	1,000	All UI rendering, methods expanders, session state, page routing
10	Main Entry Point	25	`main()` function and `__main__` guard

3.2 Layered Design

The application follows a three-layer separation of concerns:

Data Layer (Sections 2-4): Handles all interactions with external systems. This includes Scopus API communication via pybliometrics (Section 3 for searches, Section 4 for abstract retrieval) and local file system operations for JSON persistence (Section 2). All API calls are wrapped in retry logic with exponential backoff to respect Scopus rate limits.
Analysis Layer (Sections 5-6): Contains computation functions that take pandas DataFrames or abstract dictionaries as input and return processed data structures (DataFrames, dictionaries, or lists). Apart from fetch_h_indices, which calls the AuthorRetrieval API through the retry wrapper, these functions have no side effects and no dependency on Streamlit, making them independently testable. Section 5 covers bibliometric computations (publication counts, author rankings, citation statistics), while Section 6 covers natural language processing (TF-IDF, topic modeling, word cloud generation).
Presentation Layer (Sections 7-9): Handles all visualization (Section 7 creates Plotly and matplotlib figures), data export (Section 8 generates CSV and Excel byte streams), and user interface rendering (Section 9 manages Streamlit components, session state, and page routing).

3.3 Function Inventory

The application comprises 98 functions distributed across sections:

Section	Functions	Key Examples
2 - Persistence	10	`save_search`, `load_all_searches`, `build_consolidation_dataframe`
3 - Search	3	`estimate_results`, `execute_search`, `results_to_dataframe`
4 - Abstracts	2	`_api_call_with_retry`, `download_abstracts`
5 - Analysis	24	`analyze_publications_per_year`, `analyze_top_authors`, `analyze_citations`, `build_cooccurrence_graph`, `analyze_document_types`, `analyze_bradford_zones`, `analyze_lotka_law`, `analyze_price_law`, `analyze_keyword_evolution`, `build_sankey_data`
6 - Text Analysis	8	`generate_abstract_wordcloud`, `compute_tfidf_terms`, `compute_topic_model`, `build_topic_summary_table`, `build_doc_topic_table`
7 - Visualization	22	`plot_publications_per_year`, `plot_horizontal_bar`, `plot_venn_diagram`, `plot_network_graph`, `plot_document_type_pie`, `plot_bradford_curve`, `plot_lotka_curve`, `plot_keyword_evolution`, `plot_sankey`
8 - Export	5	`sanitize_filename`, `export_results_excel`, `export_abstracts_csv`
9 - Interface	23	`render_sidebar`, `render_analysis_page`, `render_methods_expander`, `render_tab_networks`, `render_consolidation_page`, `render_filters_panel`, `render_tab_doc_types`, `render_tab_sankey`
10 - Main	1	`main`
Total	98

3.4 State Management

ScopusLit uses Streamlit's st.session_state mechanism to persist application state across user interactions (Streamlit re-executes the entire script on every widget interaction). The following state variables are managed:

State Variable	Type	Default	Purpose
`current_page`	`str`	`"New Search"`	Active navigation page
`loaded_search`	`dict` or `None`	`None`	Full search data currently loaded for analysis
`loaded_df`	`pd.DataFrame` or `None`	`None`	Cached DataFrame derived from loaded search results
`search_estimate`	`int` or `None`	`None`	Most recent result count estimate
`h_index_data`	`list[dict]` or `None`	`None`	Cached h-index data for top authors
`last_search_run`	`dict` or `None`	`None`	Most recently executed search (for display on New Search page)
`confirm_delete`	`str` or `None`	`None`	UUID of search pending deletion confirmation
`loaded_consolidation`	`dict` or `None`	`None`	Full consolidation data currently loaded
`loaded_consol_df`	`pd.DataFrame` or `None`	`None`	Cached DataFrame for loaded consolidation
`loaded_consol_abstracts`	`dict` or `None`	`None`	Merged abstracts for loaded consolidation
`topic_result`	`dict` or `None`	`None`	Cached topic modeling results keyed by abstracts, method, and number of topics

State lifecycle:

Loading a search populates loaded_search and loaded_df, and clears h_index_data.
Loading a consolidation populates loaded_consolidation, loaded_consol_df, and loaded_consol_abstracts.
Deleting a search or consolidation clears all associated state variables if the deleted item was currently loaded.
Topic modeling results persist in topic_result only while the current abstracts, method, and number of topics remain unchanged; changing filters or topic settings invalidates the cached result.
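The invalidation rule for topic_result can be illustrated with a minimal sketch. It assumes a plain dict standing in for st.session_state, and the helper names (init_state, topic_cache_key, get_topic_result) and the fingerprinting approach are hypothetical illustrations, not the functions used in app.py.

```python
import hashlib
import json

# Subset of the state table above; values are the documented defaults.
DEFAULT_STATE = {
    "current_page": "New Search",
    "loaded_search": None,
    "loaded_df": None,
    "h_index_data": None,
    "topic_result": None,
}


def init_state(state):
    """Fill in missing keys without overwriting values from earlier reruns."""
    for key, default in DEFAULT_STATE.items():
        state.setdefault(key, default)
    return state


def topic_cache_key(abstracts, method, n_topics):
    """Fingerprint the inputs that determine a topic-model result (assumed scheme)."""
    payload = json.dumps(
        {"abstracts": abstracts, "method": method, "n_topics": n_topics},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_topic_result(state, abstracts, method, n_topics, compute):
    """Return the cached result, recomputing only when the fingerprint changes."""
    key = topic_cache_key(abstracts, method, n_topics)
    cached = state.get("topic_result")
    if cached is None or cached["key"] != key:
        cached = {"key": key, "result": compute(abstracts, method, n_topics)}
        state["topic_result"] = cached
    return cached["result"]
```

Because the filtered abstracts are part of the fingerprint, changing a filter produces a different key and forces a recomputation, while an unrelated widget interaction reuses the cached result.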

4. Features Overview

ScopusLit implements 26 features organized into five functional phases:

Phase 1: Search and Base Storage

#	Feature	Description
1	New Search	Execute Scopus Advanced Search queries with result count estimation before downloading
2	Download Abstracts	Retrieve full abstracts via AbstractRetrieval API with incremental saving every 25 documents
3	Saved Searches Manager	List, load, rename, and delete persisted searches with two-step deletion confirmation

Phase 2: Bibliometric Analysis (15 analysis types across 10 tabs + global filters)

A Global Filters Panel (collapsible expander) is displayed above all analysis tabs, offering: year range slider, document type multiselect, country multiselect, and minimum citation threshold. Filters are applied before all analyses.
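As a rough illustration of how these four filters combine, the following dependency-free sketch applies them to a list of result dictionaries. The application itself works on pandas DataFrames. The function name apply_filters is hypothetical, and the sketch assumes that an empty selection means "no restriction" and that a document matches the country filter if any of its countries is selected.

```python
def apply_filters(records, year_range=None, doc_types=None,
                  countries=None, min_citations=0):
    """Return the records that pass every active filter (hypothetical helper)."""
    low, high = year_range if year_range else (None, None)
    wanted_countries = set(countries or [])
    kept = []
    for rec in records:
        year = rec.get("year")
        # Year range: records without a year are dropped once the filter is active.
        if low is not None and (year is None or not low <= year <= high):
            continue
        # Document type: compared against Scopus' subtypeDescription field.
        if doc_types and rec.get("subtypeDescription") not in doc_types:
            continue
        # Country: affiliation_country is a ';'-separated string.
        if wanted_countries:
            raw = rec.get("affiliation_country") or ""
            rec_countries = {c.strip() for c in raw.split(";") if c.strip()}
            if not rec_countries & wanted_countries:
                continue
        # Minimum citations: missing counts are treated as zero.
        if (rec.get("citedby_count") or 0) < min_citations:
            continue
        kept.append(rec)
    return kept
```

Running the filters once, before any analysis, keeps every downstream tab consistent with the same subset of documents.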

#	Feature	Tab	Key Metrics / Charts
4	Publications per Year	Timeline	Bar chart + cumulative trend line (dual y-axis). Summary: total count, year range, peak year.
4b	Document Type Analysis	Document Types	Pie chart + bar chart of document type distribution (`subtypeDescription`: Article, Review, Conference Paper, etc.). Metrics: distinct types, dominant type percentage.
5	Publications per Journal + Bradford's Law	Sources	Top 15 journals horizontal bar chart. Bradford's Law of Scattering: journals divided into 3 productivity zones (core, middle, peripheral), log-scale scatterplot with zone demarcation lines.
6	Publications per Country/Affiliation	Geography	Top 15 countries + top 15 affiliations, horizontal bar charts. Choropleth world map if >= 5 countries.
7	Top Authors + Lotka's/Price's Law	Authors	Top 20 authors horizontal bar chart using Scopus author IDs when available. Lotka's Law: log-log scatter of author productivity distribution with power law fit (exponent + R²). Price's Law: checks whether √n elite authors produce ≥ 50% of publications.
8	Co-authorship Analysis	Co-authorship	Frequent co-author pairs table using author IDs when available + author count distribution bar chart.
9	h-index of Top Authors	Authors	On-demand "Fetch h-index" button for top 10 authors' h-indices via AuthorRetrieval API. Results cached in session.
10	Keyword Analysis + Evolution	Keywords	Two subtabs: Keyword Frequency (top 30 bar chart + word cloud) and Keyword Evolution (heatmap of top keywords over publication years with configurable N).
11	Citation Analysis	Citations	Summary metrics (total, mean, median, max). Citation distribution histogram. Top 20 most cited documents table. Average citations per year line chart.
11b	Co-occurrence Networks	Networks	VOSviewer-style network analysis with 3 subtabs: Keyword co-occurrence (author keywords that co-appear in publications), Author co-authorship (collaboration network), Country collaboration (international co-publishing). Duplicate values inside each publication are deduplicated before edge counting. Each subtab features: min co-occurrence slider (1-20), max nodes slider (10-100), layout algorithm selector (Spring/Kamada-Kawai/Circular), community detection with colored clusters, edge width tiers by weight, node size by occurrence count, and network metrics (nodes, edges, communities, density). Built with NetworkX + Plotly.
11c	Three-Field Plot (Sankey)	Three-Field Plot	Interactive Sankey diagram connecting any 3 user-selected fields (Authors, Keywords, Journals, Countries, Affiliations, Document Types) with configurable top-N per field (3-20). Built with Plotly `go.Sankey`.
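The edge-counting step behind the co-occurrence networks, including the per-publication deduplication, can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the function name count_cooccurrences is hypothetical, and graph construction, node pruning, and community detection (which the application delegates to NetworkX) are left out.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(values, delimiter="|", lowercase=True, min_cooccurrence=1):
    """Count item frequencies and pair co-occurrences in a multi-value field."""
    item_counts = Counter()
    pair_counts = Counter()
    for raw in values:
        if not raw:
            continue  # skip publications with an empty field
        items = [part.strip() for part in raw.split(delimiter)]
        if lowercase:
            items = [item.lower() for item in items]
        # Deduplicate within the publication and sort so that each pair
        # is always stored in the same (alphabetical) order.
        unique = sorted({item for item in items if item})
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    pairs = {pair: n for pair, n in pair_counts.items() if n >= min_cooccurrence}
    return item_counts, pairs
```

A keyword repeated inside one record (for example with different capitalization) therefore contributes a single occurrence and cannot inflate an edge weight.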

Phase 3: Abstract Text Analysis (3 analyses in nested sub-tabs, visible only when abstracts are downloaded)

#	Feature	Sub-tab	Description
12	Abstract Word Cloud	Abstract Word Cloud	Word cloud from concatenated abstract texts, excluding English, Spanish, and academic filler stopwords.
13	TF-IDF Term Frequency	TF-IDF Terms	Top 30 terms by average TF-IDF score with unigram + bigram support. Horizontal bar chart.
14	Topic Modeling	Topic Modeling	NMF or LDA (user selects), configurable 3-10 topics via slider. Results include a topic summary table, topic-word weight heatmap, and document-topic assignment table with CSV exports.

Phase 4: Search Consolidation

#	Feature	Description
15	Create Consolidation	Select 2+ searches via checkboxes, choose union or comparison mode, name and save.
16	Union Mode Analysis	Deduplicate by EID, apply full Phase 2 + Phase 3 analysis pipeline to merged corpus. Includes consolidated CSV/Excel export.
17	Comparison Mode Analysis	Three comparison tabs: overlaid publication timeline, grouped keyword bar chart, Venn diagram with overlap statistics table (supports 2-3 searches).
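Both consolidation modes reduce to set operations on EIDs. The sketch below is a minimal illustration that assumes each search is a list of result dictionaries carrying an "eid" key. The names union_by_eid and pairwise_overlap are hypothetical, and keeping the first record seen for a duplicated EID is an assumption rather than documented behavior.

```python
from itertools import combinations


def union_by_eid(searches):
    """Merge result lists, keeping the first record seen for each EID."""
    merged = {}
    for results in searches.values():
        for rec in results:
            merged.setdefault(rec["eid"], rec)
    return list(merged.values())


def pairwise_overlap(searches):
    """Overlap statistics for every pair of searches, as used by a Venn diagram."""
    eid_sets = {
        name: {rec["eid"] for rec in results}
        for name, results in searches.items()
    }
    rows = []
    for a, b in combinations(eid_sets, 2):
        rows.append({
            "search_a": a,
            "search_b": b,
            "only_a": len(eid_sets[a] - eid_sets[b]),
            "only_b": len(eid_sets[b] - eid_sets[a]),
            "shared": len(eid_sets[a] & eid_sets[b]),
        })
    return rows
```

For three searches, pairwise_overlap returns three rows, one per pair, which is the information a statistics table next to the Venn diagram would need.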

Phase 5: Export

#	Feature	Formats	Contents
18	Export Results	CSV; Excel (4 sheets: Results, Authors, Keywords, Summary)	All search fields + aggregate statistics; author export includes Scopus IDs when available
19	Export Abstracts	CSV; Excel	EID, DOI, Title, Authors, Year, Journal, Abstract, Keywords

Additionally, every Plotly chart includes an "Export as PNG (300 DPI)" download button when static rendering is available, and every matplotlib figure (word clouds, Venn diagrams) includes an equivalent 300 DPI PNG download button.

5. Installation

5.1 Prerequisites

Python 3.10 or higher (tested with Python 3.14)
Scopus API access: Requires a valid Scopus API key. Institutional subscribers can also use an InstToken for off-campus access. API keys can be obtained from the Elsevier Developer Portal.

5.2 Setting Up the Environment

# Create and activate a virtual environment
python -m venv scopuslit_env
source scopuslit_env/bin/activate      # macOS / Linux
# scopuslit_env\Scripts\activate       # Windows

# Install all dependencies
pip install streamlit pybliometrics plotly pandas numpy wordcloud \
    matplotlib scikit-learn openpyxl kaleido matplotlib-venn networkx

5.3 Dependency List with Tested Versions

Package	Tested Version	Purpose
`streamlit`	1.55.0	Web application framework
`pybliometrics`	4.4.1	Scopus API client
`plotly`	6.6.0	Interactive charting library
`pandas`	2.3.3	Data manipulation and analysis
`numpy`	2.4.3	Numerical computing
`wordcloud`	1.9.6	Word cloud generation
`matplotlib`	3.10.8	Plotting backend for word clouds and Venn diagrams
`scikit-learn`	1.8.0	TF-IDF vectorization, NMF, LDA topic modeling
`openpyxl`	3.1.5	Excel file generation
`kaleido`	1.2.0	Plotly figure export to static PNG images
`matplotlib-venn`	1.1.2	Venn diagram visualization
`networkx`	3.6.1	Graph construction, layout algorithms, community detection for co-occurrence networks

5.4 Verify Installation

python -c "import streamlit, pybliometrics, plotly, pandas, numpy, \
    wordcloud, matplotlib, sklearn, openpyxl, kaleido, matplotlib_venn, \
    networkx; print('All dependencies installed successfully.')"
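When the one-liner above fails, it reports only the first missing package. A small script such as the following sketch lists all missing modules at once without importing them. The helper name missing_modules is hypothetical, and the module list mirrors the import names used in the command above.

```python
from importlib.util import find_spec

# Import names (not pip package names): scikit-learn imports as "sklearn",
# matplotlib-venn as "matplotlib_venn".
REQUIRED_MODULES = [
    "streamlit", "pybliometrics", "plotly", "pandas", "numpy", "wordcloud",
    "matplotlib", "sklearn", "openpyxl", "kaleido", "matplotlib_venn", "networkx",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies installed successfully.")
```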

6. Configuration

6.1 pybliometrics Configuration

pybliometrics requires a configuration file at ~/.config/pybliometrics.cfg (Linux/macOS) or %USERPROFILE%\.config\pybliometrics.cfg (Windows). On first run, pybliometrics will prompt for configuration interactively. Alternatively, create the file manually:

[Authentication]
APIKey = YOUR_API_KEY_HERE
InstToken = YOUR_INST_TOKEN_HERE

[Directories]
AbstractRetrieval = ~/.cache/pybliometrics/AbstractRetrieval
AuthorRetrieval = ~/.cache/pybliometrics/AuthorRetrieval
ScopusSearch = ~/.cache/pybliometrics/ScopusSearch

The APIKey is mandatory. Obtain one from the Elsevier Developer Portal.
The InstToken is optional but required for off-campus access when your institution provides one.
Directories define local cache paths. pybliometrics caches API responses to minimize redundant requests.
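Before launching the application, the configuration file can be sanity-checked with Python's standard configparser. The sketch below is an illustration only: check_config is a hypothetical helper, it inspects a string rather than the real file, and it verifies only that an API key has been filled in.

```python
import configparser

SAMPLE_CONFIG = """
[Authentication]
APIKey = YOUR_API_KEY_HERE
InstToken = YOUR_INST_TOKEN_HERE

[Directories]
AbstractRetrieval = ~/.cache/pybliometrics/AbstractRetrieval
"""


def check_config(text):
    """Return a list of problems found in a pybliometrics-style config string."""
    # Interpolation is disabled so that '%' in Windows paths is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("Authentication"):
        return ["missing [Authentication] section"]
    problems = []
    api_key = parser.get("Authentication", "APIKey", fallback="").strip()
    if not api_key or api_key == "YOUR_API_KEY_HERE":
        problems.append("APIKey is missing or still a placeholder")
    return problems
```

To check the real file, read its contents first, for example with pathlib.Path("~/.config/pybliometrics.cfg").expanduser().read_text(), and pass the result to check_config.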

6.2 Network Requirements

Scopus API access requires either:

A connection from an institutional network (campus VPN/IP range) recognized by Scopus, or
A valid InstToken for authentication from any network.

ScopusLit calls pybliometrics.init() at startup, which reads the configuration file and initializes the API client.

7. Usage Guide

7.1 Launching the Application

cd /path/to/ScopusLit
streamlit run app.py

The application opens in the default web browser at http://localhost:8501.

7.2 Navigation Structure

The application uses a four-page navigation structure accessible via radio buttons in the left sidebar:

Page	Purpose
New Search	Execute new Scopus searches and download abstracts
Saved Searches	Manage (load, rename, delete) previously saved searches
Analysis	View bibliometric and text analysis for a loaded search
Consolidation	Combine multiple searches and run comparative analysis

The sidebar also displays:

A quick-access list of all saved searches with result counts, dates, abstract indicators (abs), and individual "Load" buttons. Clicking "Load" immediately opens the Analysis page with that search.
A list of all saved consolidations (if any exist) with mode labels, search counts, and "Load" buttons. Clicking "Load" immediately opens the Consolidation page with that consolidation's analysis.

7.3 Workflow: Running a New Search

Navigate to the New Search page.
Enter a Scopus Advanced Search query in the text area (e.g., TITLE-ABS-KEY(seismic AND "machine learning")).
Optionally enter a descriptive name and comma-separated tags.
Click Estimate Results to preview the number of matching documents without downloading them. This uses ScopusSearch(query, download=False).get_results_size() and consumes minimal API quota.
Click Run Search to execute the full search. The application calls ScopusSearch(query, subscriber=True) and converts the resulting list of Document namedtuples into a serializable list of dictionaries.
Results are automatically saved to a JSON file in the ./scopuslit_data/ directory.
A summary is displayed: total documents, year range, and the first 10 titles.
Optionally click Download Abstracts to retrieve full abstract texts via AbstractRetrieval(eid) for each document. This is a separate step because it requires one API call per document and can be slow for large result sets. Progress is saved incrementally every 25 abstracts.
Click Load for Analysis to navigate to the Analysis page.

7.4 Workflow: Analyzing a Search

Load a search from the sidebar, the Saved Searches page, or after running a new search.
The Analysis page displays:
- Search metadata: name, query, date, result count.
- Export buttons at the top: Results CSV, Results Excel, Abstracts CSV, Abstracts Excel (the latter two appear only if abstracts have been downloaded).
- 10 analysis tabs (11 if abstracts are available):
  - Timeline: Publications per year with summary metrics (total, range, peak year) and dual-axis bar+line chart.
  - Document Types: Pie and bar charts of Scopus document categories.
  - Sources: Top 15 journals as horizontal bar chart with data table and Bradford's Law.
  - Geography: Top 15 countries and affiliations as bar charts, plus choropleth world map.
  - Authors: Top 20 author-ID-aware authors as bar chart, with "Fetch h-index for Top 10 Authors" button, Lotka's Law, and Price's Law.
  - Co-authorship: Most frequent co-author pairs table, authors per article distribution chart.
  - Networks: Keyword, author, and country co-occurrence networks.
  - Keywords: Top 30 keywords bar chart, keyword word cloud, and keyword evolution heatmap.
  - Citations: Summary metrics (total, mean, median, max), citation histogram, top 20 table, average per year line chart.
  - Three-Field Plot: Sankey diagram linking selected bibliometric fields.
  - Text Analysis (if abstracts available): Three nested sub-tabs for abstract word cloud, TF-IDF terms, and topic modeling.
Each analysis tab includes a Methods and parameters expander describing field sources, assumptions, and algorithm settings.
Each Plotly chart has an Export as PNG (300 DPI) button directly below it when Kaleido export is available.
Data tables behind each chart are accessible via expandable sections.
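The Price's Law check shown in the Authors tab asks whether the √n most productive authors account for at least half of the output, where n is the number of distinct authors. The sketch below is one possible reading, not the implementation in app.py: it measures the elite's share of authorship credits (one credit per author per document), rounds √n to the nearest integer, and uses the hypothetical name price_law_check.

```python
import math
from collections import Counter


def price_law_check(author_lists):
    """Test whether the sqrt(n) most productive authors hold >= 50% of credits."""
    counts = Counter()
    for authors in author_lists:
        counts.update(set(authors))  # one credit per author per document
    n_authors = len(counts)
    if n_authors == 0:
        return None
    elite_size = max(1, round(math.sqrt(n_authors)))  # rounding is an assumption
    total_credits = sum(counts.values())
    elite_credits = sum(n for _, n in counts.most_common(elite_size))
    share = elite_credits / total_credits
    return {
        "n_authors": n_authors,
        "elite_size": elite_size,
        "elite_share": share,
        "holds": share >= 0.5,
    }
```

Each element of author_lists is the list of author identity keys for one document, ideally Scopus author IDs so that name variants of the same person are not counted separately.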

7.5 Workflow: Managing Saved Searches

The Saved Searches page displays each search as a card with:

Name (editable inline with "Save Name" button), query preview, date, result count, tags, and abstract status.
Load for Analysis button to open the search in the Analysis page.
Download Abstracts button (if abstracts have not been downloaded yet).
Delete button with a two-step confirmation pattern: first click shows "Are you sure?" with "Yes, delete" and "Cancel" buttons.

7.6 Workflow: Consolidating Searches

Navigate to the Consolidation page.
Create a new consolidation: Select 2+ saved searches using checkboxes, choose a mode (union or comparison), enter a name, and click "Create Consolidation".
Or load an existing consolidation from the list shown at the bottom of the page (each with "Load" and "Delete" buttons), or from the sidebar.
For union mode, the full Phase 2 + Phase 3 analysis pipeline is displayed (same 10-11 tabs as individual search analysis, including filters), plus consolidated export buttons.
For comparison mode, three specialized tabs are displayed:
- Timeline Comparison: overlaid line charts.
- Keywords Comparison: grouped bar chart.
- Document Overlap: pairwise overlap statistics table + Venn diagram.

7.7 Workflow: Exporting Data

From the Analysis page, up to four export buttons are available:

Button	Format	Contents
Export Results (CSV)	`.csv`	All search result fields in a flat table
Export Results (Excel)	`.xlsx`	4 sheets: Results, Authors, Keywords, Summary
Export Abstracts (CSV)	`.csv`	EID, DOI, Title, Authors, Year, Journal, Abstract, Keywords
Export Abstracts (Excel)	`.xlsx`	Same columns as CSV, in Excel format

The Excel Results export includes a Summary sheet with aggregate statistics: search name, total results, year range, total/mean citations, unique journals, unique countries, and unique authors. The Authors sheet includes Scopus author IDs where available.

Additionally, every individual chart can be exported as a 300 DPI PNG via the download button below each figure when static rendering is available.

8. Illustrative Example

The following walkthrough demonstrates a typical bibliometric analysis session using ScopusLit.

8.1 Scenario

A researcher is preparing a review article on machine learning applications in seismology and needs to characterize the existing literature quantitatively.

8.2 Step-by-Step

Step 1: Define and run the search.

On the "New Search" page, the researcher enters:

Query: TITLE-ABS-KEY(seismic AND "machine learning")
Name: Seismic ML
Tags: seismology, ML, review

Clicking "Estimate Results" reveals approximately 2,400 documents. Satisfied with the scope, the researcher clicks "Run Search". After approximately 30 seconds, the search completes and is saved.

Step 2: Download abstracts.

The researcher clicks "Download Abstracts". With a progress bar indicating status, the application downloads full abstract texts for all 2,400 documents. This process takes approximately 20-40 minutes depending on API response times, with progress saved every 25 abstracts. If interrupted, the download can be resumed later without re-downloading completed abstracts.

Step 3: Analyze the results.

Clicking "Load for Analysis" opens the Analysis page with a global filters panel and 11 tabs:

The Timeline tab reveals a rapid growth trend starting around 2017, with peak publications in 2024.
The Sources tab shows Geophysics, Computers & Geosciences, and Geophysical Journal International as the top journals.
The Geography tab reveals the United States and China as the leading countries, with a choropleth map showing global distribution.
The Networks tab provides VOSviewer-style co-occurrence analysis. The keyword co-occurrence subtab reveals clusters of related terms: a "deep learning" cluster connected to "convolutional neural network" and "transfer learning", and a "seismology" cluster linking "earthquake", "seismic hazard", and "ground motion". The author co-authorship network shows research groups and their collaboration patterns, while the country collaboration network highlights the US-China research axis with secondary European hubs.
The Keywords tab shows "deep learning", "convolutional neural network", and "earthquake" as the most frequent author keywords. The word cloud provides a visual summary.
The Text Analysis tab's TF-IDF analysis identifies discriminative terms like "seismic waveform", "phase picking", and "transfer learning" using only abstracts that remain after active filters. Topic modeling with NMF (5 topics) reveals distinct research themes: earthquake detection, signal processing, hazard assessment, subsurface imaging, and ground motion prediction. The topic module also provides a topic summary table and document-topic assignment export.

Step 4: Run a comparison.

To compare with a related field, the researcher creates a second search for TITLE-ABS-KEY(volcanic AND "machine learning"). Then, on the Consolidation page, both searches are selected in comparison mode. The overlaid timeline reveals that seismic ML research started earlier and has grown faster. The Venn diagram shows 47 shared documents, indicating a meaningful but limited overlap.

Step 5: Export for the manuscript.

The researcher exports:

Individual 300 DPI PNG charts for the publications-per-year figure, the keyword word cloud, and the topic heatmap.
A multi-sheet Excel file with the complete results, author list with Scopus IDs where available, keyword frequencies, and summary statistics.
An abstracts CSV for use in further text mining outside the application.

9. Functional Modules in Detail

9.1 Persistence Module (Section 2) - 10 functions

Function	Signature	Description
`ensure_data_dir()`	`() -> None`	Creates `./scopuslit_data/` if it does not exist.
`save_search(search_data)`	`(dict) -> str`	Writes a complete search dict to `{uuid}.json`. Returns file path.
`load_search(search_id)`	`(str) -> dict or None`	Reads a search by its UUID. Returns `None` if file not found.
`load_all_searches()`	`() -> list[dict]`	Reads metadata (no full results list) from all non-consolidation JSON files, sorted by date descending.
`delete_search(search_id)`	`(str) -> bool`	Removes a JSON file by UUID.
`rename_search(search_id, new_name)`	`(str, str) -> bool`	Loads file, updates `name` field, re-saves.
`update_search_abstracts(search_id, abstracts)`	`(str, dict) -> bool`	Partial update: replaces only the `abstracts` field.
`save_consolidation(consolidation_data)`	`(dict) -> str`	Writes a consolidation dict to `{uuid}.json`.
`load_all_consolidations()`	`() -> list[dict]`	Reads metadata from all consolidation JSON files.
`build_consolidation_dataframe(consolidation)`	`(dict) -> tuple[DataFrame, dict]`	Loads all referenced searches, merges results into a single DataFrame (deduplicating by EID for union mode), merges abstracts. Returns `(df, abstracts_dict)`.
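The one-file-per-search layout can be illustrated with a minimal sketch of three of these functions. It is an approximation under stated assumptions: the "id" key, the extra data_dir parameter, and the write-to-temporary-file-then-rename step are choices made for this example and are not confirmed details of app.py.

```python
import json
import uuid
from pathlib import Path

DATA_DIR = Path("scopuslit_data")


def save_search(search_data, data_dir=DATA_DIR):
    """Write a search dict to {uuid}.json and return the file path."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    search_data.setdefault("id", str(uuid.uuid4()))  # "id" key is an assumption
    path = data_dir / f"{search_data['id']}.json"
    # Write to a temporary file first so an interrupted save cannot leave
    # a half-written JSON file behind (design choice of this sketch).
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(search_data, ensure_ascii=False, indent=2),
                   encoding="utf-8")
    tmp.replace(path)
    return str(path)


def load_search(search_id, data_dir=DATA_DIR):
    """Read a search by UUID, or return None if the file does not exist."""
    path = Path(data_dir) / f"{search_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def update_search_abstracts(search_id, abstracts, data_dir=DATA_DIR):
    """Replace only the 'abstracts' field of a stored search."""
    data = load_search(search_id, data_dir)
    if data is None:
        return False
    data["abstracts"] = abstracts
    save_search(data, data_dir)
    return True
```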

9.2 Scopus Search Module (Section 3) - 3 functions

Function	Signature	Description
`estimate_results(query)`	`(str) -> int or None`	Returns result count without downloading. Uses `ScopusSearch(query, download=False).get_results_size()`.
`execute_search(query)`	`(str) -> tuple[list[dict], int] or None`	Executes `ScopusSearch(query, subscriber=True)`, converts `Document` namedtuples to dicts via `._asdict()`, retries on `Scopus429Error`.
`results_to_dataframe(results)`	`(list[dict]) -> DataFrame`	Converts result dicts to DataFrame, adding `year` (int from `coverDate`), `citedby_count` (int), `author_count` (int).
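The conversion and type-casting steps described above can be sketched without pandas or an API connection. In this illustration, Document is a stand-in namedtuple with only a handful of fields (the real pybliometrics result has many more), and to_row and _to_int are hypothetical helpers showing the kind of normalization involved.

```python
from collections import namedtuple

# Stand-in for the namedtuples returned by a Scopus search (subset of fields).
Document = namedtuple(
    "Document", ["eid", "title", "coverDate", "citedby_count", "author_count"]
)


def _to_int(value, default=0):
    """Cast API strings such as '12' to int, tolerating None and bad input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def to_row(doc):
    """Turn one result namedtuple into a JSON-serializable, typed dict."""
    row = dict(doc._asdict())
    cover = row.get("coverDate") or ""
    # coverDate is an ISO-style date string such as '2021-05-01'.
    row["year"] = int(cover[:4]) if cover[:4].isdigit() else None
    row["citedby_count"] = _to_int(row.get("citedby_count"))
    row["author_count"] = _to_int(row.get("author_count"))
    return row
```

A list of such rows can be passed directly to pandas.DataFrame or written to JSON, since every value is a plain Python type.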

9.3 Abstract Download Module (Section 4) - 2 functions

Function	Signature	Description
`_api_call_with_retry(callable_fn)`	`(callable) -> Any`	Retry wrapper. Catches `Scopus429Error`, waits with exponential backoff (2s, then 4s), retries up to 3 total attempts.
`download_abstracts(search_id, eid_list, ...)`	`(str, list[str], dict or None, progress_bar) -> dict`	Downloads abstracts via `AbstractRetrieval(eid)`, falls back to `.description`, saves every 25 docs, skips already-downloaded EIDs.
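The retry schedule and the resumable download loop can be sketched as follows. This is a simplified illustration under stated assumptions: RateLimitError stands in for pybliometrics' Scopus429Error, the fetch and save callables are hypothetical injection points, and re-raising after the final attempt is an assumption about failure handling.

```python
import time


class RateLimitError(Exception):
    """Stand-in for pybliometrics' Scopus429Error (HTTP 429, quota exceeded)."""


def api_call_with_retry(callable_fn, max_attempts=3, base_delay=2.0,
                        sleep=time.sleep):
    """Call callable_fn, waiting 2s and then 4s between rate-limited attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return callable_fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # give up after the last attempt (assumed behavior)
            sleep(base_delay * 2 ** (attempt - 1))


def download_abstracts(eids, fetch, existing=None, save=None, save_every=25,
                       sleep=time.sleep):
    """Fetch missing abstracts, saving progress every `save_every` documents."""
    abstracts = dict(existing or {})
    # Skipping EIDs that already have an abstract makes the download resumable.
    pending = [eid for eid in eids if eid not in abstracts]
    for done, eid in enumerate(pending, start=1):
        abstracts[eid] = api_call_with_retry(lambda: fetch(eid), sleep=sleep)
        if save and done % save_every == 0:
            save(abstracts)
    # Persist the final partial batch, if any.
    if save and len(pending) % save_every:
        save(abstracts)
    return abstracts
```

Passing sleep as a parameter keeps the schedule testable without real waiting; in normal use the default time.sleep applies.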

9.4 Bibliometric Analysis Module (Section 5) - 24 functions

Function	Input	Output	Description
`_parse_delimited_field(series, delimiter)`	`Series, str`	`Series`	Splits delimited strings, strips whitespace, filters empties, explodes. Used for `;`-separated (authors, countries) and `\|`-separated (keywords) fields.
`analyze_publications_per_year(df)`	`DataFrame`	`DataFrame (year, count, cumulative)`	Groups by year, counts, computes running cumulative sum.
`analyze_publications_per_journal(df, top_n)`	`DataFrame`	`DataFrame (journal, count)`	Value counts on `publicationName`, top N. Default 15.
`analyze_publications_per_country(df, top_n)`	`DataFrame`	`DataFrame (country, count)`	Parses `;`-separated `affiliation_country`, top N. Default 15.
`analyze_publications_per_affiliation(df, top_n)`	`DataFrame`	`DataFrame (affiliation, count)`	Parses `;`-separated `affilname`, top N. Default 15.
`_author_entries_from_row(row)`	`Series`	`list[(key, display_name, author_id)]`	Builds unique author identity keys for one record, preferring Scopus author IDs and falling back to normalized names.
`_author_identity_series(df)`	`DataFrame`	`Series`	Returns author identity keys for author productivity laws, preferring Scopus IDs.
`analyze_top_authors(df, top_n)`	`DataFrame`	`DataFrame (author, author_id, count)`	Counts authors by Scopus ID when available; falls back to normalized name identity. Default 20.
`analyze_coauthor_pairs(df, top_n)`	`DataFrame`	`DataFrame (author_1, author_2, author_id_1, author_id_2, count)`	Generates `itertools.combinations` of author identity keys per document, counts pair frequencies, and displays names plus IDs. Default 20.
`analyze_author_count_distribution(df)`	`DataFrame`	`DataFrame (author_count, num_articles)`	Distribution of the `author_count` field.
`fetch_h_indices(author_data)`	`list[(name, auid)]`	`list[dict]`	Calls `AuthorRetrieval(auid).h_index` per author with retry wrapper.
`analyze_keywords(df, top_n)`	`DataFrame`	`DataFrame (keyword, count)`	Parses `|`-separated `authkeywords`, lowercases, counts. Default 30.
`analyze_citations(df)`	`DataFrame`	`dict`	Returns `summary` (total, mean, median, max), `distribution` (Series), `top_cited` (top 20 DataFrame), `avg_per_year` (DataFrame).
`build_cooccurrence_graph(df, field, delimiter, min_cooccurrence, max_nodes, lowercase, dedup_per_row)`	`DataFrame, str, str, int, int, bool, bool`	`dict or None`	Builds a NetworkX co-occurrence graph from a multi-value field. Counts item frequencies and pair co-occurrences via `itertools.combinations`, filters by minimum threshold, prunes to top N nodes, runs community detection (`greedy_modularity_communities`). Returns `{graph, node_sizes, communities, pairs_df}`.
`compute_network_layout(graph, algorithm, seed)`	`nx.Graph, str, int`	`dict`	Wraps NetworkX layout algorithms: Spring (Fruchterman-Reingold with adaptive `k`), Kamada-Kawai, or Circular. Returns node→(x,y) mapping.
`compute_network_metrics(graph, communities)`	`nx.Graph, list`	`dict`	Computes summary metrics: nodes, edges, communities, density, average degree.
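
The counting step that feeds `build_cooccurrence_graph` can be sketched with the standard library alone. This is a simplified illustration under assumptions: the function names `split_field` and `count_cooccurrences` are hypothetical, and graph construction with NetworkX, pruning to the top N nodes, and community detection are deliberately left out.

```python
from collections import Counter
from itertools import combinations


def split_field(value, delimiter):
    """Split a delimited string, strip whitespace, and drop empty items."""
    if not value:
        return []
    return [part.strip() for part in value.split(delimiter) if part.strip()]


def count_cooccurrences(values, delimiter="|", min_cooccurrence=2, lowercase=True):
    """Count item frequencies and unordered pair co-occurrences across records."""
    item_counts, pair_counts = Counter(), Counter()
    for value in values:
        items = split_field(value, delimiter)
        if lowercase:
            items = [item.lower() for item in items]
        unique_items = sorted(set(items))  # de-duplicate within one record
        item_counts.update(unique_items)
        # Sorting first makes (a, b) and (b, a) the same key.
        pair_counts.update(combinations(unique_items, 2))
    kept = {pair: n for pair, n in pair_counts.items() if n >= min_cooccurrence}
    return item_counts, kept


keywords = [
    "machine learning|seismic|deep learning",
    "Machine Learning | seismic",
    "seismic|inversion",
]
item_counts, pairs = count_cooccurrences(keywords)
```

With the default threshold of 2, only the pair ("machine learning", "seismic") survives, because it is the only pair that occurs in two records.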

9.5 Text Analysis Module (Section 6) - 8 functions

Function	Input	Output	Description
`_get_abstract_texts(abstracts)`	`dict`	`list[str]`	Filters and returns non-empty abstract strings.
`_filter_abstracts_for_df(df, abstracts)`	`DataFrame, dict`	`dict`	Subsets abstracts to EIDs present in the currently filtered DataFrame.
`_abstracts_signature(abstracts, method, n_topics)`	`dict, str, int`	`str`	Builds a stable cache key so topic results do not leak across filters, datasets, or settings.
`generate_abstract_wordcloud(abstracts)`	`dict`	`Figure or None`	Concatenates abstracts, generates word cloud using `WordCloud.generate()` with combined stopword list (293 stopwords total).
`compute_tfidf_terms(abstracts, top_n)`	`dict, int`	`DataFrame or None`	TF-IDF with `max_features=1000`, `max_df=0.85`, `min_df=2`, `ngram_range=(1,2)`. Returns top N terms by average score. Requires >= 3 abstracts.
`compute_topic_model(abstracts, n_topics, method)`	`dict, int, str`	`dict or None`	Uses TF-IDF vectors for NMF and count vectors for LDA. Returns `{topics, doc_topic_matrix, doc_ids}`. Requires >= max(5, n_topics) abstracts.
`build_topic_summary_table(topics)`	`list[dict]`	`DataFrame`	Converts topic terms and weights into a display/export table.
`build_doc_topic_table(topic_result)`	`dict`	`DataFrame`	Builds document-level dominant topic assignments and per-topic weights for display/export.
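
The filter-aware behavior of `_filter_abstracts_for_df` and `_abstracts_signature` can be approximated as follows. This is a sketch under assumptions, not the app.py implementation: the function names are simplified, the filtered DataFrame is represented by a plain list of EIDs, and SHA-256 over a JSON payload is one possible way to build a stable key.

```python
import hashlib
import json


def filter_abstracts(eids, abstracts):
    """Keep only non-empty abstracts whose EID is in the filtered result set."""
    wanted = set(eids)
    return {eid: text for eid, text in abstracts.items() if eid in wanted and text}


def abstracts_signature(abstracts, method, n_topics):
    """Hash the corpus and settings into a stable cache key."""
    payload = {
        "method": method,
        "n_topics": n_topics,
        # Sort by EID so dict insertion order does not change the key.
        "docs": sorted(abstracts.items()),
    }
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


abstracts = {"eid-1": "Seismic inversion with CNNs.", "eid-2": "", "eid-3": "Fault detection."}
subset = filter_abstracts(["eid-1", "eid-2"], abstracts)
key_nmf = abstracts_signature(subset, "NMF", 5)
key_lda = abstracts_signature(subset, "LDA", 5)
```

Because the method, the topic count, and the abstract texts all enter the hash, changing any of them produces a different key, which is what prevents cached topic results from being reused across filters or settings.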

9.6 Visualization Module (Section 7) - 17 functions

Styling and utilities (3):

Function	Description
`style_plotly_fig(fig)`	Applies consistent theme: title 20pt, font 16pt, axes 18pt, ticks 16pt, legend 14pt, height 500px, `plotly_white`.
`plotly_png_download(fig, filename)`	Renders to PNG at 2100x1500 pixels, scale 2x (~300 DPI at 7x5 inches) via kaleido.
`display_chart_with_download(fig, key, filename)`	Composite: `st.plotly_chart(width="stretch")` + optional `st.download_button()` with PNG bytes. If Kaleido export fails, the chart still renders and the app shows a warning.

Chart functions (14):

Function	Chart Type	Notes
`plot_publications_per_year(data)`	Dual-axis bar + line	`make_subplots(secondary_y=True)`
`plot_horizontal_bar(data, x_col, y_col, ...)`	Horizontal bar	Reusable for 6+ charts (journals, countries, affiliations, authors, keywords). Uses dynamic height and explicit category ticks so all labels render.
`plot_choropleth_map(data)`	Choropleth world map	Viridis scale; returns `None` if < 5 countries
`plot_author_count_distribution(data)`	Vertical bar	Authors per article
`plot_h_index_bar(data)`	Horizontal bar	h-index for top authors
`generate_keyword_wordcloud(data)`	Word cloud (matplotlib)	From frequency dict via `generate_from_frequencies()`
`plot_citation_histogram(values)`	Histogram	30 bins
`plot_avg_citations_per_year(data)`	Line + markers	Average citations per year
`plot_tfidf_bar(data)`	Horizontal bar	TF-IDF terms ranked by score
`plot_topic_heatmap(topics)`	Heatmap	Topic-word weights; dynamic height `max(400, 80*n_topics)`
`plot_comparison_timeline(search_data_list)`	Overlaid lines	One trace per search
`plot_comparison_keywords(search_data_list, top_n)`	Grouped bar	Top 15 keywords globally, bars per search
`plot_venn_diagram(eid_sets)`	Venn (matplotlib)	2-set (`venn2`) or 3-set (`venn3`)
`plot_network_graph(graph_data, positions, title)`	Network graph (Plotly)	Co-occurrence network with community-colored nodes, 3-tier edge widths, hover info (occurrence count, degree, top neighbors). Height 700px, hidden axes.

9.7 Export Module (Section 8) - 5 functions

Function	Signature	Format	Description
`sanitize_filename(value, fallback)`	`(str, str) -> str`	N/A	Creates safe filename stems from search names before download buttons are rendered.
`export_results_csv(df)`	`(DataFrame) -> bytes`	CSV	All result fields, UTF-8 encoded.
`export_results_excel(df, search_name)`	`(DataFrame, str) -> bytes`	Excel	4 sheets: Results (all fields), Authors (name + ID + count), Keywords (keyword + frequency), Summary (8 aggregate metrics).
`export_abstracts_csv(df, abstracts)`	`(DataFrame, dict) -> bytes`	CSV	8 columns: EID, DOI, Title, Authors, Year, Journal, Abstract, Keywords. Falls back to `description` field if abstract not available.
`export_abstracts_excel(df, abstracts)`	`(DataFrame, dict) -> bytes`	Excel	Same 8 columns on an "Abstracts" sheet.
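
A filename sanitizer with the same signature as `sanitize_filename` might look like the sketch below. The character whitelist, the 80-character cap, and the default fallback stem are assumptions chosen for illustration; the rules used in app.py may differ.

```python
import re


def sanitize_filename(value, fallback="scopuslit_export"):
    """Reduce an arbitrary search name to a filesystem-safe filename stem."""
    text = (value or "").strip()
    # Replace every run of characters other than word characters, dots,
    # and hyphens with a single underscore. \w keeps accented letters.
    text = re.sub(r"[^\w.-]+", "_", text)
    # Trim separators left at the edges, e.g. from leading slashes or dots.
    text = text.strip("._-")
    return text[:80] or fallback


stem = sanitize_filename('ML + Seismic: "review"/2026')
```

Using `\w` rather than an ASCII-only class keeps names such as "Revisión sísmica" readable, which matters for searches named in Spanish.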

9.8 Interface Module (Section 9) - 23 functions

Function	Purpose
`init_session_state()`	Initializes 11 session state variables with default values.
`_load_search_into_state(search_id)`	Loads a search from disk, populates `loaded_search` and `loaded_df`, clears `h_index_data`, switches to Analysis page.
`render_sidebar()`	Renders sidebar: app title, navigation radio (4 pages), saved searches quick-list with Load buttons, consolidations list with Load buttons.
`render_new_search_page()`	Query input, estimate/run buttons, result summary with first 10 titles, "Load for Analysis" and "Download Abstracts" buttons.
`render_saved_searches_page()`	Search cards with inline rename, Load, Download Abstracts, and Delete (with two-step confirmation).
`render_analysis_page()`	Header, export buttons (4), global filters panel, 10-11 analysis tabs dispatching to individual tab renderers.
`render_filters_panel(df, key_prefix)`	Collapsible filters (year range, doc type, country, min citations). Returns filtered DataFrame.
`render_methods_expander(content)`	Reusable expander for reviewer-facing methods notes in analysis tabs.
`render_tab_timeline(df)`	Summary metrics + dual-axis chart + data table.
`render_tab_doc_types(df)`	Document type metrics + pie chart + bar chart + data table.
`render_tab_sources(df)`	Top journals chart + data table + Bradford's Law (zone metrics, log-scale scatterplot, zone table).
`render_tab_geography(df)`	Countries section (bar + choropleth) + affiliations section (bar).
`render_tab_authors(df)`	Top authors chart + h-index fetch section + Lotka's Law (log-log scatter + fit) + Price's Law (metrics).
`render_tab_coauthorship(df)`	Co-author pairs table + author count distribution chart.
`render_tab_keywords(df)`	Two subtabs: Keyword Frequency (bar + word cloud + data table) and Keyword Evolution (heatmap with configurable N).
`render_tab_citations(df)`	Summary metrics + histogram + top 20 table + avg per year chart.
`render_tab_sankey(df)`	Three-field Sankey diagram with field selectors and top-N sliders.
`render_tab_text_analysis(df, abstracts)`	Three nested sub-tabs: abstract word cloud, TF-IDF, topic modeling. Analyzes only abstracts matching the currently filtered DataFrame.
`render_consolidation_page()`	Consolidation creation form + existing consolidation management + dispatches to union or comparison renderer.
`render_tab_networks(df)`	Networks tab with 3 subtabs: keyword co-occurrence, author co-authorship, country collaboration.
`_render_network_subtab(df, field, delimiter, label, default_min, lowercase, dedup, key_prefix)`	Reusable renderer for one network subtab: controls, metrics, chart, data table.
`render_union_analysis(df, abstracts)`	Export buttons + filters + full 10-11 tab analysis (reuses individual tab renderers).
`render_comparison_analysis(consol, df)`	Three comparison tabs: timeline, keywords, overlap (table + Venn diagram).

10. Data Model and Persistence

10.1 Search Data Structure

Each search is stored as a JSON file with the following schema:

{
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Seismic ML Review",
    "query": "TITLE-ABS-KEY(seismic AND \"machine learning\")",
    "search_date": "2026-03-08T10:30:00.123456",
    "total_results": 245,
    "results": [
        {
            "eid": "2-s2.0-85012345678",
            "doi": "10.1016/j.example.2024.01.001",
            "title": "Machine learning for seismic data analysis",
            "coverDate": "2024-01-15",
            "publicationName": "Computers & Geosciences",
            "author_names": "Smith J.;Doe A.;Brown K.",
            "author_ids": "12345678;23456789;34567890",
            "author_count": "3",
            "affiliation_country": "United States;Germany",
            "affilname": "Massachusetts Institute of Technology;Technical University of Munich",
            "citedby_count": "15",
            "authkeywords": "machine learning|seismic|deep learning",
            "description": "This study proposes..."
        }
    ],
    "abstracts": {
        "2-s2.0-85012345678": "Full abstract text retrieved via AbstractRetrieval..."
    },
    "tags": ["ML", "seismic"]
}

Field origins: The results list contains dictionaries derived from pybliometrics.scopus.ScopusSearch.results, which returns Document namedtuples with 36 fields. Multi-value fields use ; as separator (authors, affiliations, countries) or | (keywords).

Abstracts: Stored as a dictionary mapping EID to abstract text. The description field in results contains a summary from the search API; the abstracts dictionary contains full texts from the AbstractRetrieval API.
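
Reading the multi-value fields back out of a stored record amounts to splitting on the right separator. The sketch below also pairs author names with their Scopus IDs by position, in the spirit of `_author_entries_from_row`. The function names and the `id:` / `name:` key prefixes are hypothetical, and positional alignment of `author_names` with `author_ids` is an assumption.

```python
def split_multi(value, delimiter=";"):
    """Split a multi-value Scopus field into stripped, non-empty items."""
    return [part.strip() for part in (value or "").split(delimiter) if part.strip()]


def author_entries(record):
    """Pair author names with Scopus IDs; fall back to a normalized name key."""
    names = split_multi(record.get("author_names"))
    # Keep empty slots here so IDs stay aligned with names by position.
    ids = [part.strip() for part in (record.get("author_ids") or "").split(";")]
    entries, seen = [], set()
    for position, name in enumerate(names):
        author_id = ids[position] if position < len(ids) else ""
        if author_id:
            key = f"id:{author_id}"
        else:
            key = f"name:{' '.join(name.lower().split())}"
        if key not in seen:  # one entry per author per record
            seen.add(key)
            entries.append((key, name, author_id))
    return entries


record = {
    "author_names": "Smith J.;Doe A.;Brown K.",
    "author_ids": "12345678;23456789",
    "authkeywords": "machine learning|seismic|deep learning",
}
authors = author_entries(record)
keywords = split_multi(record["authkeywords"], "|")
```

In this example the third author has no ID, so the entry falls back to the normalized name key `name:brown k.`.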

10.2 Consolidation Data Structure

{
    "id": "660e8400-e29b-41d4-a716-446655440001",
    "type": "consolidation",
    "name": "ML + Seismic Combined Analysis",
    "mode": "union",
    "search_ids": [
        "550e8400-e29b-41d4-a716-446655440000",
        "550e8400-e29b-41d4-a716-446655440002"
    ],
    "created_date": "2026-03-09T14:00:00.000000"
}

Consolidations reference search IDs rather than duplicating data. The build_consolidation_dataframe() function loads the referenced searches at runtime, merges their results, and handles deduplication for union mode.

10.3 Storage Location

All files are stored in ./scopuslit_data/ relative to app.py. This directory is created automatically on first run. File naming convention: {uuid4}.json.
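
A minimal persistence round trip consistent with this layout is sketched below. It is illustrative only: the function bodies are not taken from app.py, `pathlib` is used here for brevity, and `DATA_DIR` is resolved relative to the working directory rather than to app.py.

```python
import json
import uuid
from datetime import datetime
from pathlib import Path

DATA_DIR = Path("scopuslit_data")  # assumption: relative to the working directory


def save_search(name, query, results, data_dir=DATA_DIR):
    """Write one search to {uuid4}.json and return its ID."""
    data_dir.mkdir(parents=True, exist_ok=True)  # created on first use
    search_id = str(uuid.uuid4())
    payload = {
        "id": search_id,
        "name": name,
        "query": query,
        "search_date": datetime.now().isoformat(),
        "total_results": len(results),
        "results": results,
        "abstracts": {},
        "tags": [],
    }
    path = data_dir / f"{search_id}.json"
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    return search_id


def load_search(search_id, data_dir=DATA_DIR):
    """Read a search back from disk, or return None if the file is missing."""
    path = data_dir / f"{search_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Calling `save_search("Seismic ML Review", query, results)` returns the new ID, and `load_search(that_id)` returns the same dictionary structure shown in Section 10.1.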

11. Visualization Specifications

All Plotly charts follow a consistent visual specification enforced by the style_plotly_fig() helper function:

Property	Value	Rationale
Template	`plotly_white`	Clean, minimal background suitable for publication
Title font size	20pt	Readable in Streamlit's wide layout
Base font size	16pt	Compensates for Streamlit's default small rendering
Axis title font size	18pt	Clearly distinguishable from tick labels
Tick label font size	16pt	Legible for dense axis labels
Legend font size	14pt	Compact but readable
Default chart height	500px	Consistent vertical proportion; horizontal bars and heatmaps use dynamic height for dense labels
Color palette	`px.colors.qualitative.Set2`	Color-blind friendly, visually distinct

PNG Export Specification

Every chart can be exported as a high-resolution PNG:

Property	Value (Plotly)	Value (matplotlib)
Width	2100 pixels	Auto (tight bbox)
Height	1500 pixels	Auto (tight bbox)
Scale factor	2x	N/A
DPI	~300 (effective)	300 (explicit)
Format	PNG (lossless)	PNG (lossless)
Engine	Kaleido (Chromium-based)	matplotlib Agg backend

If Kaleido or its browser backend is unavailable, Plotly charts still render interactively in the Streamlit page and the app displays a warning instead of failing the analysis tab.

Dense Label Handling

Horizontal bar charts and keyword-evolution heatmaps use dynamic heights and explicit categorical tick arrays so all labels are rendered. This is particularly important for top-author, top-keyword, and TF-IDF charts where default Plotly tick skipping can otherwise hide alternating labels.

Network Graph Specification

Co-occurrence network graphs use Plotly scatter traces to render NetworkX graphs:

Property	Value	Rationale
Chart height	700px	Extra height for network readability
Axes	Hidden (no grid, ticks, or zero line)	Network layout is spatial, not quantitative
Edge rendering	`go.Scatter(mode="lines")` with `None` separators	Efficient single-trace approach per weight tier
Edge width tiers	3 tiers (0.8, 2.0, 3.5 px) by weight percentiles	Distinguishes weak/medium/strong co-occurrences
Edge color	`rgba(180,180,180,0.5)`	Subtle, non-distracting
Node rendering	One `go.Scatter(mode="markers+text")` per community	Community coloring via 15-color palette
Node size range	10-50 px (normalized by occurrence count)	Proportional to item frequency
Node labels	Top center, 10pt, truncated at 25 chars	Readable without excessive overlap
Community detection	`greedy_modularity_communities`	Fast, weight-aware, handles disconnected components
Layout algorithms	Spring (default, adaptive k), Kamada-Kawai, Circular	Spring best for clustering; KK for cleaner small graphs
Hover info	Name, occurrences, degree, top 3 neighbors	Detailed exploration without clutter
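
The edge-width tiers and node-size range in the table can be expressed as two small mapping functions. This is a sketch under assumptions: the table states only that tiers follow weight percentiles, so the cut points used here (33rd and 66th percentiles, nearest rank) and the linear node scaling are illustrative choices, and the function names are hypothetical.

```python
def edge_width_tiers(weights, widths=(0.8, 2.0, 3.5)):
    """Assign each edge weight one of three line widths using two cut points."""
    if not weights:
        return []
    ordered = sorted(weights)
    # Assumption: tiers split at the 33rd and 66th percentiles (nearest rank).
    low_cut = ordered[int(0.33 * (len(ordered) - 1))]
    high_cut = ordered[int(0.66 * (len(ordered) - 1))]
    tiers = []
    for weight in weights:
        if weight <= low_cut:
            tiers.append(widths[0])
        elif weight <= high_cut:
            tiers.append(widths[1])
        else:
            tiers.append(widths[2])
    return tiers


def scale_node_sizes(counts, min_size=10, max_size=50):
    """Linearly map occurrence counts onto the 10-50 px marker range."""
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid division by zero when all counts are equal
    return {
        node: min_size + (count - low) / span * (max_size - min_size)
        for node, count in counts.items()
    }


edge_widths = edge_width_tiers([1, 2, 3, 4, 5, 6, 7])
sizes = scale_node_sizes({"seismic": 30, "inversion": 10, "deep learning": 20})
```

Grouping edges by tier is what allows one `go.Scatter` line trace per width, since a single Plotly line trace has a single width.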

12. Text Analysis Pipeline

The text analysis module (Section 6) implements a three-stage NLP pipeline for abstract corpus analysis. In the interface, this pipeline is filter-aware: after the global filters are applied, the Text Analysis tab subsets the abstract dictionary to EIDs present in the filtered DataFrame. This means the abstract word cloud, TF-IDF terms, and topic model reflect the same filtered corpus used by the bibliometric tabs.

12.1 Stopword Configuration

The combined stopword list contains 293 unique terms drawn from three sources. The per-source counts below sum to 299, so six entries are shared between sources:

Source	Count	Examples
English stopwords (`wordcloud.STOPWORDS`)	192	the, is, at, which, from, have
Spanish stopwords (custom set)	69	de, la, que, el, en, para, con, como
Academic filler words (custom set)	38	study, results, method, proposed, analysis, approach, data, model

The Spanish stopwords support analysis of multilingual abstract corpora common in Latin American and European research. The academic filler words remove high-frequency domain-agnostic terms that carry low discriminative value in bibliometric contexts.

12.2 TF-IDF Vectorization

The compute_tfidf_terms() function uses sklearn.feature_extraction.text.TfidfVectorizer:

Parameter	Value	Rationale
`max_features`	1,000	Vocabulary limit for computational efficiency
`stop_words`	`"english"`	scikit-learn's built-in English stopword list
`max_df`	0.85	Ignore terms appearing in > 85% of documents
`min_df`	2	Ignore terms appearing in fewer than 2 documents
`ngram_range`	(1, 2)	Include both unigrams and bigrams (e.g., "neural network")

The function computes the mean TF-IDF score across all documents for each term, then returns the top N terms ranked by this average score. Requires a minimum of 3 non-empty abstracts.
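
The ranking score can be written out explicitly. The formula below assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm="l2"`, raw term counts), which the parameter table above does not override; the symbols are introduced here for illustration and do not appear in app.py.

```latex
\[
\mathrm{idf}(t) = \ln\frac{1 + D}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{d,t} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\bigr)^{2}}},
\qquad
\bar{w}_t = \frac{1}{D}\sum_{d=1}^{D} w_{d,t}
\]
```

Here $D$ is the number of non-empty abstracts, $\mathrm{df}(t)$ is the number of abstracts containing term $t$, and $\mathrm{tf}(t,d)$ is the raw count of $t$ in abstract $d$. Terms are ranked by $\bar{w}_t$, so a term scores highly when it is both frequent within abstracts and present in many of them, subject to the `max_df` and `min_df` cut-offs.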

12.3 Topic Modeling

The compute_topic_model() function supports two decomposition algorithms. Users select the method and number of topics in the interface. The number of topics is not optimized automatically; it is an exploratory parameter selected from 3 to 10, where lower values produce broader themes and higher values produce finer-grained themes.

Non-negative Matrix Factorization (NMF):

Vectorizer: TfidfVectorizer(max_features=2000, stop_words="english", max_df=0.85, min_df=2)
Parameters: n_components (user-configurable, 3-10), random_state=42, max_iter=300
Produces additive, parts-based decomposition
Generally produces more interpretable topics for scientific text

Latent Dirichlet Allocation (LDA):

Vectorizer: CountVectorizer(max_features=2000, stop_words="english", max_df=0.85, min_df=2)
Parameters: n_components (user-configurable, 3-10), random_state=42, max_iter=20
Probabilistic generative model

Both algorithms return: (1) a list of topics, each with its top 10 words and their weights, (2) a document-topic assignment matrix, and (3) the EIDs corresponding to that matrix. Requires a minimum of max(5, n_topics) non-empty abstracts.

12.4 Topic Output Interpretation

The Topic Modeling sub-tab displays:

Output	Interpretation
Topic Summary table	One row per topic, listing the top weighted terms and their weights. Users interpret each topic by assigning a semantic label based on these terms.
Topic-word heatmap	Visualizes relative term weights across topics. Darker cells indicate stronger term-topic association.
Document Topic Assignments table	One row per abstract EID with dominant topic, dominant weight, and all per-topic weights. This can be exported as CSV for external validation or downstream analysis.

Topic modeling is intended as exploratory thematic summarization, not definitive article classification. Results should be interpreted alongside the search strategy, active filters, and domain knowledge.
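
The Document Topic Assignments table can be derived from the document-topic matrix by taking the largest weight in each row. The sketch below uses plain lists instead of a NumPy array and a DataFrame; the column names and the "Topic k" labels are assumptions for illustration.

```python
def build_doc_topic_rows(doc_ids, doc_topic_matrix):
    """Return one row per document with its dominant topic and all topic weights."""
    rows = []
    for eid, weights in zip(doc_ids, doc_topic_matrix):
        # Index of the largest weight; ties resolve to the lowest topic index.
        dominant = max(range(len(weights)), key=lambda i: weights[i])
        row = {
            "eid": eid,
            "dominant_topic": f"Topic {dominant + 1}",
            "dominant_weight": weights[dominant],
        }
        row.update({f"topic_{i + 1}": w for i, w in enumerate(weights)})
        rows.append(row)
    return rows


rows = build_doc_topic_rows(
    ["2-s2.0-1", "2-s2.0-2"],
    [[0.10, 0.70, 0.20], [0.45, 0.05, 0.50]],
)
```

The second document illustrates why the full weight vector is worth exporting: its dominant topic (0.50) is only marginally ahead of the runner-up (0.45), which a single label would hide.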

13. Search Consolidation

The consolidation feature (Phase 4) enables researchers to combine multiple Scopus searches for integrated analysis.

13.1 Union Mode

In union mode, the application:

Loads all results from each selected search.
Concatenates them into a single DataFrame with a _source_search column.
Deduplicates by EID (Scopus unique identifier), keeping the first occurrence.
Merges abstract dictionaries from all searches (preferring non-empty values).
Applies the complete Phase 2 and Phase 3 analysis pipeline to the merged corpus.
Provides dedicated "Export Consolidated Results" CSV and Excel buttons.

This mode is appropriate when the researcher wants to treat multiple searches as a single body of literature.
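
The union steps can be sketched on plain dictionaries as follows. This is an approximation of what `build_consolidation_dataframe()` is described as doing, not its actual code: it works on lists of dicts instead of DataFrames, and the function name `merge_union` is hypothetical.

```python
def merge_union(searches):
    """Merge searches into one de-duplicated result list and one abstracts dict."""
    merged_results, seen_eids, merged_abstracts = [], set(), {}
    for search in searches:
        for record in search.get("results", []):
            eid = record.get("eid")
            if eid in seen_eids:
                continue  # keep the first occurrence only
            seen_eids.add(eid)
            merged_results.append({**record, "_source_search": search["name"]})
        for eid, text in search.get("abstracts", {}).items():
            # Prefer a non-empty abstract over a missing or empty one.
            if text and not merged_abstracts.get(eid):
                merged_abstracts[eid] = text
            else:
                merged_abstracts.setdefault(eid, text)
    return merged_results, merged_abstracts


search_a = {
    "name": "A",
    "results": [{"eid": "e1"}, {"eid": "e2"}],
    "abstracts": {"e1": "", "e2": "Text A2"},
}
search_b = {
    "name": "B",
    "results": [{"eid": "e2"}, {"eid": "e3"}],
    "abstracts": {"e1": "Text B1", "e2": "Text B2"},
}
results, abstracts = merge_union([search_a, search_b])
```

Document `e2` is attributed to search A because A is processed first, while the empty abstract for `e1` from search A is replaced by the non-empty text from search B.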

13.2 Comparison Mode

In comparison mode, the application:

Loads results from each search separately, preserving source labels.
Generates three comparative visualization tabs:
- Timeline comparison: Overlaid line charts showing publications per year for each search, with distinct colors per search.
- Keyword comparison: Grouped bar chart showing the globally top 15 keywords, with per-search frequency bars side by side.
- Document overlap: A pairwise statistics table (showing shared, only-in-A, only-in-B counts for each pair) and a Venn diagram for 2-3 searches using matplotlib_venn. For more than 3 searches, only the statistics table is shown.

This mode is appropriate when the researcher wants to compare research topics, methodologies, or search strategies.
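
The pairwise statistics table reduces to set operations on the EIDs of each search. A minimal sketch, with hypothetical column names and a hypothetical function name `pairwise_overlap`:

```python
from itertools import combinations


def pairwise_overlap(eid_sets):
    """Compute shared and exclusive document counts for every pair of searches."""
    rows = []
    for (name_a, set_a), (name_b, set_b) in combinations(eid_sets.items(), 2):
        rows.append({
            "search_a": name_a,
            "search_b": name_b,
            "shared": len(set_a & set_b),
            "only_in_a": len(set_a - set_b),
            "only_in_b": len(set_b - set_a),
        })
    return rows


overlap = pairwise_overlap({
    "ML": {"e1", "e2", "e3"},
    "Seismic": {"e2", "e3", "e4", "e5"},
    "Inversion": {"e5"},
})
```

Because the table is built pair by pair, it scales to any number of searches, which is why it remains available when a Venn diagram is not.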

14. Export Capabilities

14.1 Results Export

CSV format: Flat file containing all fields from the ScopusSearch results. Encoded as UTF-8.

Excel format (.xlsx): Multi-sheet workbook:

Sheet	Contents
Results	All search result fields (excluding internal columns prefixed with `_`)
Authors	Author name, Scopus author ID when available, and publication count, sorted by frequency descending
Keywords	Keyword and frequency count, lowercased, sorted by frequency descending
Summary	8 aggregate metrics: search name, total results, year range, total citations, mean citations, unique journals, unique countries, unique authors

14.2 Abstracts Export

Both CSV and Excel formats contain the same 8 columns:

Column	Source
EID	Scopus unique identifier from search results
DOI	Digital Object Identifier
Title	Document title
Authors	Author names (`;`-separated)
Year	Publication year (integer)
Journal	Publication name
Abstract	Full text from `AbstractRetrieval`, falling back to `description` field from search
Keywords	Author keywords (`|`-separated)

14.3 Chart Export

Every Plotly chart: PNG at 2100x1500 pixels, scale 2x (~300 DPI). Every matplotlib figure: PNG at 300 DPI with tight bounding box. Download buttons appear directly below each chart.

Search names are sanitized before being used in filenames. Excel export failures and Plotly PNG export failures are caught in the interface and shown as warnings, so the rest of the analysis page remains usable.

14.4 Topic Model Exports

When a topic model has been run, two additional CSV exports are available inside the Topic Modeling sub-tab:

Export	Contents
Topic Summary	Topic label, top terms, and term weights
Document Topic Assignments	EID, dominant topic, dominant topic weight, and all topic weights

15. Error Handling and API Rate Limiting

15.1 API Error Handling

All Scopus API calls are wrapped in try/except blocks that catch:

Exception	Handling
`Scopus429Error`	Rate limit exceeded. Retry with exponential backoff up to 3 total attempts. Display warning during retries, error after exhaustion.
`ScopusHtmlError`	General API error. Display error with troubleshooting hints (check network, API key, InstToken).
`Exception`	Catch-all. Display error with exception details.

Note: Exception classes are imported from pybliometrics.exception (not pybliometrics.scopus.exception).

15.2 Retry Mechanism

The _api_call_with_retry(callable_fn) function implements exponential backoff:

Attempt 1: Execute immediately
Attempt 2: Wait 2 seconds, then retry
Attempt 3: Wait 4 seconds, then retry
After 3 failures: Raise the original exception

Wait times are computed as BACKOFF_BASE_SECONDS * (2 ** attempt) where BACKOFF_BASE_SECONDS = 2 and attempt starts at 0. This yields waits of 2s and 4s before the second and third attempts respectively.
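
The schedule above can be reproduced in a few lines. The sketch below is not the app.py implementation: `RateLimitError` is a local stand-in for pybliometrics' `Scopus429Error` so the example runs offline, `MAX_ATTEMPTS` is an assumed constant name, and the injectable `sleep` parameter is added only to make the waits observable.

```python
import time

MAX_ATTEMPTS = 3
BACKOFF_BASE_SECONDS = 2


class RateLimitError(Exception):
    """Stand-in for pybliometrics' Scopus429Error so the sketch runs offline."""


def api_call_with_retry(callable_fn, sleep=time.sleep):
    """Call callable_fn, retrying on rate-limit errors with exponential backoff."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return callable_fn()
        except RateLimitError:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # third failure: propagate the original exception
            sleep(BACKOFF_BASE_SECONDS * (2 ** attempt))  # 2s, then 4s


waits = []
calls = {"count": 0}


def flaky_call():
    """Fail twice with a rate-limit error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("429")
    return "ok"


result = api_call_with_retry(flaky_call, sleep=waits.append)
```

Passing `waits.append` in place of `time.sleep` records the requested delays without actually waiting, which makes the schedule easy to check in a test.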

15.3 Incremental Abstract Saving

The abstract download function saves progress to disk every 25 abstracts. If the process is interrupted, previously downloaded abstracts are preserved. Re-running the download skips already-downloaded EIDs (those with non-empty values in the abstracts dict).
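
The checkpointing logic can be sketched independently of the Scopus API. In the example below, `fetch` and `save` are hypothetical callables standing in for the `AbstractRetrieval` call and the on-disk update respectively; the final save for a partial batch is an assumption about how the last few abstracts are persisted.

```python
SAVE_EVERY = 25


def download_abstracts(eids, fetch, save, existing=None):
    """Fetch missing abstracts, checkpointing every SAVE_EVERY downloads."""
    abstracts = dict(existing or {})
    fetched = 0
    for eid in eids:
        if abstracts.get(eid):
            continue  # non-empty value: already downloaded in an earlier run
        abstracts[eid] = fetch(eid)
        fetched += 1
        if fetched % SAVE_EVERY == 0:
            save(abstracts)  # periodic checkpoint
    if fetched % SAVE_EVERY != 0:
        save(abstracts)  # final save for the last partial batch
    return abstracts


checkpoints = []
eids = [f"eid-{i}" for i in range(60)]
done = download_abstracts(
    eids,
    fetch=lambda eid: f"abstract for {eid}",
    save=lambda data: checkpoints.append(len(data)),
    existing={"eid-0": "already here", "eid-1": ""},
)
```

In this run `eid-0` is skipped because it already has text, while `eid-1` is fetched again because its stored value is empty.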

15.4 User-Facing Error Messages

All error messages are displayed in English via st.error() and include actionable troubleshooting suggestions:

Connection failures suggest checking institutional network, API key, or InstToken.
Rate limit errors suggest waiting a few minutes before retrying.
Missing data warnings inform the user which analyses could not be performed.

16. Dependencies and Technology Stack

16.1 Core Framework

Component	Technology	Role
Web framework	Streamlit 1.55	Reactive web UI with widgets, layout, session state
API client	pybliometrics 4.4.1	Scopus API communication (ScopusSearch, AbstractRetrieval, AuthorRetrieval)
Data manipulation	pandas 2.3	DataFrame operations, groupby, value counts, field parsing
Numerical computing	NumPy 2.4	Array operations, NaN handling
Graph analysis	NetworkX	Co-occurrence graph construction, layout algorithms, community detection

16.2 Visualization

Component	Technology	Role
Interactive charts	Plotly 6.6	Bar charts, histograms, line charts, choropleth maps, heatmaps
Static figures	matplotlib 3.10	Word cloud rendering, Venn diagram rendering
Word clouds	wordcloud 1.9	Word cloud generation from text and frequency dictionaries
Venn diagrams	matplotlib-venn 1.1	2-set and 3-set proportional Venn diagrams
Image export	Kaleido 1.2	Chromium-based headless rendering of Plotly figures to PNG

16.3 Text Analysis

Component	Technology	Role
TF-IDF	scikit-learn 1.8 (`TfidfVectorizer`)	Term frequency-inverse document frequency vectorization
NMF	scikit-learn 1.8 (`NMF`)	Non-negative matrix factorization for topic modeling
LDA	scikit-learn 1.8 (`LatentDirichletAllocation`)	Latent Dirichlet allocation for topic modeling
Count vectors	scikit-learn 1.8 (`CountVectorizer`)	Raw count vectorization for LDA topic modeling

16.4 Data Export

Component	Technology	Role
Excel writing	openpyxl 3.1	Multi-sheet `.xlsx` file generation
CSV writing	pandas (built-in)	UTF-8 encoded CSV generation

16.5 Standard Library Usage

The application uses the following Python standard library modules: os, json, uuid, time, io, hashlib, datetime, collections.Counter, itertools.combinations.

17. Software Metadata

Field	Value
Software name	ScopusLit
Version	1.0.0
Programming language	Python (>= 3.10)
Tested Python version	3.14
Operating systems	macOS, Linux, Windows (any OS supporting Python and Streamlit)
Size of software	Single file, approximately 3,005 lines of Python code
Dependencies	12 Python packages (see Section 16)
External API	Scopus API via pybliometrics (requires API key)
Interface	Web browser (served locally by Streamlit)
Parallelism	Single-threaded (Streamlit execution model)
Data storage	Local JSON files in `./scopuslit_data/`
Repository	[To be added]
License	[To be determined]
Development institution	Universidad Industrial de Santander (UIS), Bucaramanga, Colombia

18. Impact

ScopusLit has the potential to benefit the research community in several ways:

Lowering the barrier to bibliometric analysis. By integrating search, analysis, and visualization into a single browser-based tool with no programming requirement, ScopusLit makes quantitative literature analysis accessible to researchers who lack programming skills or familiarity with specialized bibliometric software.

Enabling reproducible bibliometric workflows. Each search is persisted as a self-contained JSON file that captures the query, execution date, full result set, and downloaded abstracts. This enables exact reproduction of analyses and facilitates sharing of bibliometric datasets between collaborators.

Supporting multilingual research communities. The inclusion of Spanish stopwords alongside English ones reflects the tool's origin at a Latin American institution and supports analysis of bibliographic corpora where abstracts may contain Spanish text, a common scenario in engineering and geosciences literature from Latin America and Spain.

Accelerating systematic review preparation. The search consolidation feature, with both union and comparison modes, directly supports the multi-query workflow typical of systematic reviews (PRISMA methodology), where researchers must execute multiple search strings across different conceptual facets and then analyze the combined and overlapping result sets.

Providing publication-ready outputs. Every visualization can be exported at approximately 300 DPI, the minimum resolution that most scientific journals require for color figures. The multi-sheet Excel export provides immediately usable supplementary materials.

19. Limitations and Future Work

19.1 Current Limitations

Scopus-only: The tool is designed exclusively for the Scopus database. Support for Web of Science, PubMed, OpenAlex, or other databases is not included.
API quota constraints: The Scopus API imposes rate limits (typically 6-9 requests per second) and weekly quotas (typically 5,000-20,000 requests, depending on the API key type). Large searches (> 5,000 results) or extensive abstract downloads may exhaust these quotas.
No co-citation or bibliographic coupling analysis: The current version does not implement reference-based analyses (co-citation networks, bibliographic coupling) which require cited reference data not available from ScopusSearch.results.
Single-user, local deployment: The application runs locally and does not support concurrent multi-user access or cloud deployment out of the box.
Abstract-dependent text analysis: TF-IDF, topic modeling, and abstract word clouds require downloading full abstracts, which consumes one API call per document.
Venn diagram limit: Document overlap visualization is limited to 2-3 searches because matplotlib-venn provides only 2-set and 3-set diagrams. Larger comparisons use only the overlap statistics table.
No BibTeX export: Direct export to BibTeX format for integration with reference managers (Zotero, Mendeley, EndNote) is not yet supported.
Monolithic codebase: This final pre-refactor version remains a single app.py file for portability. The next development stage should split the application into dedicated modules for storage, API access, analysis, plotting, exports, and UI.

19.2 Planned Future Enhancements

Co-citation and bibliographic coupling analysis using AbstractRetrieval.references.
Integration with OpenAlex or Semantic Scholar for open-access metadata enrichment.
Cloud deployment template (e.g., Streamlit Community Cloud, Docker).
BibTeX and RIS export for reference managers.
Modular refactor of the monolithic application into maintainable packages.
Automated tests for core analysis functions and export functions.
Topic-model validation aids such as coherence scoring or perplexity diagnostics.
Internationalization metrics for author collaboration.

20. How to Cite

If you use ScopusLit in your research, please cite it as:

Arroyo, O. (2026). ScopusLit: An end-to-end Web-based tool for bibliometric analysis. SoftwareX, 34, 102733.

21. License

[License to be determined]

ScopusLit is developed at Universidad Industrial de Santander (UIS), Bucaramanga, Colombia.
