ScopusLit is an open-source, single-file web application built with Python, Streamlit, and pybliometrics that enables researchers to conduct comprehensive bibliometric analyses of scientific literature indexed in the Scopus database. The tool provides an end-to-end workflow: from executing Scopus Advanced Search queries and persisting results locally, through generating over 20 types of bibliometric visualizations (including VOSviewer-style co-occurrence networks), to performing filter-aware natural language processing on abstracts and exporting publication-ready figures and datasets.
Status of this document: This README describes the final monolithic version of `app.py` before the planned modular refactor. The application is still intentionally delivered as a single Python file for portability, but its internal behavior now includes reviewer-facing methods notes, author-ID-aware analyses, filter-aware text analysis, safer exports, and richer topic-model interpretation outputs.
- Motivation and Significance
- Comparison with Related Software
- Software Architecture
- Features Overview
- Installation
- Configuration
- Usage Guide
- Illustrative Example
- Functional Modules in Detail
- Data Model and Persistence
- Visualization Specifications
- Text Analysis Pipeline
- Search Consolidation
- Export Capabilities
- Error Handling and API Rate Limiting
- Dependencies and Technology Stack
- Software Metadata
- Impact
- Limitations and Future Work
- How to Cite
- License
Bibliometric analysis is a cornerstone of systematic literature reviews, research trend identification, and science mapping. Researchers frequently rely on the Scopus database, one of the largest curated abstract and citation databases of peer-reviewed literature, encompassing over 27,000 titles from more than 7,000 publishers. However, conducting rigorous bibliometric studies typically requires combining multiple disconnected tools: Scopus's web interface for search, spreadsheet software for data cleaning, specialized bibliometric software (e.g., VOSviewer, Bibliometrix) for analysis, and graphic design tools for producing publication-quality figures.
ScopusLit addresses this fragmentation by providing a single, unified, browser-based application that handles the entire bibliometric workflow within one interface. The tool is designed for:
- Researchers conducting systematic literature reviews who need rapid quantitative characterization of a body of literature.
- Graduate students learning bibliometric methods who benefit from an interactive, visual environment.
- Research groups that need to compare multiple search strategies or research topics side by side.
- Authors preparing review articles who require publication-ready charts exported at 300 DPI.
Unlike desktop bibliometric tools that require manual data import/export steps, ScopusLit communicates directly with the Scopus API, enabling a seamless flow from query formulation to final analysis. Unlike purely programmatic approaches (e.g., writing custom Python scripts), ScopusLit provides an interactive graphical interface that requires no programming knowledge from the end user.
The following table compares ScopusLit with established bibliometric tools across key dimensions relevant to researchers:
| Feature | ScopusLit | VOSviewer | Bibliometrix (R) | Publish or Perish | CiteSpace |
|---|---|---|---|---|---|
| Interface | Web browser (Streamlit) | Desktop GUI (Java) | RStudio / Shiny web app | Desktop GUI | Desktop GUI (Java) |
| Data source integration | Direct Scopus API | Manual file import | Manual file import (multiple formats) | Google Scholar, Scopus, WoS | Manual file import |
| Search execution | Built-in (Scopus Advanced Search) | External | External | Built-in (multiple sources) | External |
| Result count estimation | Yes (before downloading) | No | No | Yes | No |
| Abstract retrieval | Built-in (per-document API) | No | No | No | No |
| Data persistence | Automatic JSON storage | Manual save/load | R workspace | CSV export | Project files |
| Publications per year | Bar + cumulative line | Limited | Yes | Yes | Yes |
| Journal analysis | Top N horizontal bar | Limited | Yes | Yes | Limited |
| Country/affiliation analysis | Bar + choropleth map | Network map | Yes (world map) | No | Limited |
| Author analysis | Top N bar + h-index fetch | Network map | Yes | Yes | Yes |
| Co-authorship analysis | Frequency tables + network graph | Network visualization | Network visualization | No | Network visualization |
| Keyword analysis | Bar chart + word cloud | Network map + overlay | Yes + word cloud | No | Yes |
| Citation analysis | Histogram + top cited + trends | Citation network | Yes | Yes + citation metrics | Citation burst detection |
| TF-IDF on abstracts | Yes (unigrams + bigrams) | No | No | No | No |
| Topic modeling (NMF/LDA) | Yes (3-10 topics, heatmap, topic summary, document-topic export) | No | No | No | No |
| Search consolidation | Union + comparison modes | No | Multiple file merge | No | No |
| Venn diagram (overlap) | Yes (2-3 searches) | No | No | No | No |
| Publication-ready export | 300 DPI PNG per chart | PNG/SVG | Multiple formats | No | PNG |
| Excel export | Multi-sheet (Results, Authors with IDs, Keywords, Summary) | CSV | Multiple formats | CSV | No |
| Co-citation / coupling | Not yet | Yes | Yes | No | Yes |
| Network visualization | Keyword, author, country co-occurrence (NetworkX + Plotly) | Primary strength | Yes (igraph) | No | Primary strength |
| Programming required | None | None | R programming | None | None |
| Cost | Free (requires Scopus API key) | Free | Free (R required) | Free | Free |
| Language | Python | Java | R | C# | Java |
Key differentiators of ScopusLit:
- Integrated API access: Among the tools compared above, ScopusLit is the only one that executes searches, downloads abstracts, and performs analysis within a single interface without manual file transfer.
- Text analysis on abstracts: TF-IDF and topic modeling on full abstract corpora are not available in VOSviewer, CiteSpace, or Publish or Perish.
- Search consolidation with Venn overlap: The ability to merge multiple searches (union mode) or compare them side-by-side (comparison mode with Venn diagrams) is unique to ScopusLit.
- Zero-code browser interface: Unlike Bibliometrix (which requires R), ScopusLit runs entirely through a web browser with no programming required.
ScopusLit is implemented as a single Python file (app.py, approximately 3,005 lines of code) organized into 10 clearly delineated sections. This monolithic architecture was a deliberate initial design choice to maximize portability, simplify deployment, and minimize configuration overhead. The file is structured using a functional programming paradigm with grouped utility functions rather than classes. This is the final single-file version before the codebase is split into dedicated modules.
| Section | Name | Approx. Lines | Description |
|---|---|---|---|
| 1 | Imports and Configuration | 65 | All imports, pybliometrics.init(), constants, stopwords, Streamlit page config |
| 2 | Persistence Functions | 175 | JSON save/load for searches and consolidations |
| 3 | Scopus Search Functions | 70 | Query estimation, execution, DataFrame conversion |
| 4 | Abstract Download Functions | 55 | Retry wrapper, abstract downloading with incremental save |
| 5 | Bibliometric Analysis Functions | 330 | Pure data computations including author-ID-aware analysis and co-occurrence network analysis |
| 6 | Text Analysis Functions | 150 | TF-IDF, topic modeling, abstract word cloud generation, topic summary tables |
| 7 | Visualization Functions | 390 | Plotly and matplotlib chart creation including network graphs |
| 8 | Export Functions | 110 | CSV, multi-sheet Excel generation, filename sanitization |
| 9 | Streamlit Interface | 1,000 | All UI rendering, methods expanders, session state, page routing |
| 10 | Main Entry Point | 25 | main() function and __main__ guard |
The application follows a three-layer separation of concerns:
- Data Layer (Sections 2-4): Handles all interactions with external systems. This includes Scopus API communication via pybliometrics (Section 3 for searches, Section 4 for abstract retrieval) and local file system operations for JSON persistence (Section 2). All API calls are wrapped in retry logic with exponential backoff to respect Scopus rate limits.
- Analysis Layer (Sections 5-6): Contains pure computation functions that take pandas DataFrames or abstract dictionaries as input and return processed data structures (DataFrames, dictionaries, or lists). These functions have no side effects and no dependency on Streamlit, making them independently testable. Section 5 covers bibliometric computations (publication counts, author rankings, citation statistics), while Section 6 covers natural language processing (TF-IDF, topic modeling, word cloud generation).
- Presentation Layer (Sections 7-9): Handles all visualization (Section 7 creates Plotly and matplotlib figures), data export (Section 8 generates CSV and Excel byte streams), and user interface rendering (Section 9 manages Streamlit components, session state, and page routing).
The application comprises 98 functions distributed across sections:
| Section | Functions | Key Examples |
|---|---|---|
| 2 - Persistence | 10 | save_search, load_all_searches, build_consolidation_dataframe |
| 3 - Search | 3 | estimate_results, execute_search, results_to_dataframe |
| 4 - Abstracts | 2 | _api_call_with_retry, download_abstracts |
| 5 - Analysis | 24 | analyze_publications_per_year, analyze_top_authors, analyze_citations, build_cooccurrence_graph, analyze_document_types, analyze_bradford_zones, analyze_lotka_law, analyze_price_law, analyze_keyword_evolution, build_sankey_data |
| 6 - Text Analysis | 8 | generate_abstract_wordcloud, compute_tfidf_terms, compute_topic_model, build_topic_summary_table, build_doc_topic_table |
| 7 - Visualization | 22 | plot_publications_per_year, plot_horizontal_bar, plot_venn_diagram, plot_network_graph, plot_document_type_pie, plot_bradford_curve, plot_lotka_curve, plot_keyword_evolution, plot_sankey |
| 8 - Export | 5 | sanitize_filename, export_results_excel, export_abstracts_csv |
| 9 - Interface | 23 | render_sidebar, render_analysis_page, render_methods_expander, render_tab_networks, render_consolidation_page, render_filters_panel, render_tab_doc_types, render_tab_sankey |
| 10 - Main | 1 | main |
| Total | 98 | |
ScopusLit uses Streamlit's st.session_state mechanism to persist application state across user interactions (Streamlit re-executes the entire script on every widget interaction). The following state variables are managed:
| State Variable | Type | Default | Purpose |
|---|---|---|---|
| `current_page` | `str` | `"New Search"` | Active navigation page |
| `loaded_search` | `dict` or `None` | `None` | Full search data currently loaded for analysis |
| `loaded_df` | `pd.DataFrame` or `None` | `None` | Cached DataFrame derived from loaded search results |
| `search_estimate` | `int` or `None` | `None` | Most recent result count estimate |
| `h_index_data` | `list[dict]` or `None` | `None` | Cached h-index data for top authors |
| `last_search_run` | `dict` or `None` | `None` | Most recently executed search (for display on New Search page) |
| `confirm_delete` | `str` or `None` | `None` | UUID of search pending deletion confirmation |
| `loaded_consolidation` | `dict` or `None` | `None` | Full consolidation data currently loaded |
| `loaded_consol_df` | `pd.DataFrame` or `None` | `None` | Cached DataFrame for loaded consolidation |
| `loaded_consol_abstracts` | `dict` or `None` | `None` | Merged abstracts for loaded consolidation |
| `topic_result` | `dict` or `None` | `None` | Cached topic modeling results keyed by abstracts, method, and number of topics |
State lifecycle:
- Loading a search populates `loaded_search` and `loaded_df`, and clears `h_index_data`.
- Loading a consolidation populates `loaded_consolidation`, `loaded_consol_df`, and `loaded_consol_abstracts`.
- Deleting a search or consolidation clears all associated state variables if the deleted item was currently loaded.
- Topic modeling results persist in `topic_result` only while the current abstracts, method, and number of topics remain unchanged; changing filters or topic settings invalidates the cached result.
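The invalidation rule for `topic_result` can be pictured as a fingerprint comparison. The following is a minimal sketch, not the code in `app.py`: the helper names `topic_cache_key` and `get_topic_result`, the `{"key": ..., "value": ...}` layout, and the use of a plain dict in place of `st.session_state` are all assumptions made for illustration.

```python
import hashlib
import json


def topic_cache_key(abstracts: dict, method: str, n_topics: int) -> str:
    """Return a stable fingerprint of the inputs that determine a topic model."""
    payload = json.dumps(
        {
            # Sort items so the key does not depend on dict insertion order.
            "abstracts": sorted(abstracts.items()),
            "method": method,
            "n_topics": n_topics,
        },
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_topic_result(state: dict, abstracts: dict, method: str, n_topics: int, compute):
    """Reuse state['topic_result'] only when its key matches the current inputs.

    `state` stands in for st.session_state (assumption); `compute` is any
    callable that fits the topic model and returns its results.
    """
    key = topic_cache_key(abstracts, method, n_topics)
    cached = state.get("topic_result")
    if cached is not None and cached["key"] == key:
        return cached["value"]
    value = compute(abstracts, method, n_topics)
    state["topic_result"] = {"key": key, "value": value}
    return value
```

Because the filtered abstracts are part of the key, changing a global filter changes the fingerprint and forces a recomputation, which matches the behavior described in the last bullet.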
ScopusLit implements 26 features organized into five functional phases:
| # | Feature | Description |
|---|---|---|
| 1 | New Search | Execute Scopus Advanced Search queries with result count estimation before downloading |
| 2 | Download Abstracts | Retrieve full abstracts via AbstractRetrieval API with incremental saving every 25 documents |
| 3 | Saved Searches Manager | List, load, rename, and delete persisted searches with two-step deletion confirmation |
A Global Filters Panel (collapsible expander) is displayed above all analysis tabs, offering: year range slider, document type multiselect, country multiselect, and minimum citation threshold. Filters are applied before all analyses.
| # | Feature | Tab | Key Metrics / Charts |
|---|---|---|---|
| 4 | Publications per Year | Timeline | Bar chart + cumulative trend line (dual y-axis). Summary: total count, year range, peak year. |
| 4b | Document Type Analysis | Document Types | Pie chart + bar chart of document type distribution (subtypeDescription: Article, Review, Conference Paper, etc.). Metrics: distinct types, dominant type percentage. |
| 5 | Publications per Journal + Bradford's Law | Sources | Top 15 journals horizontal bar chart. Bradford's Law of Scattering: journals divided into 3 productivity zones (core, middle, peripheral), log-scale scatterplot with zone demarcation lines. |
| 6 | Publications per Country/Affiliation | Geography | Top 15 countries + top 15 affiliations, horizontal bar charts. Choropleth world map if >= 5 countries. |
| 7 | Top Authors + Lotka's/Price's Law | Authors | Top 20 authors horizontal bar chart using Scopus author IDs when available. Lotka's Law: log-log scatter of author productivity distribution with power law fit (exponent + R²). Price's Law: checks whether √n elite authors produce ≥ 50% of publications. |
| 8 | Co-authorship Analysis | Co-authorship | Frequent co-author pairs table using author IDs when available + author count distribution bar chart. |
| 9 | h-index of Top Authors | Authors | On-demand "Fetch h-index" button for top 10 authors' h-indices via AuthorRetrieval API. Results cached in session. |
| 10 | Keyword Analysis + Evolution | Keywords | Two subtabs: Keyword Frequency (top 30 bar chart + word cloud) and Keyword Evolution (heatmap of top keywords over publication years with configurable N). |
| 11 | Citation Analysis | Citations | Summary metrics (total, mean, median, max). Citation distribution histogram. Top 20 most cited documents table. Average citations per year line chart. |
| 11b | Co-occurrence Networks | Networks | VOSviewer-style network analysis with 3 subtabs: Keyword co-occurrence (author keywords that co-appear in publications), Author co-authorship (collaboration network), Country collaboration (international co-publishing). Duplicate values within each publication are removed before edge counting. Each subtab features: min co-occurrence slider (1-20), max nodes slider (10-100), layout algorithm selector (Spring/Kamada-Kawai/Circular), community detection with colored clusters, edge width tiers by weight, node size by occurrence count, and network metrics (nodes, edges, communities, density). Built with NetworkX + Plotly. |
| 11c | Three-Field Plot (Sankey) | Three-Field Plot | Interactive Sankey diagram connecting any 3 user-selected fields (Authors, Keywords, Journals, Countries, Affiliations, Document Types) with configurable top-N per field (3-20). Built with Plotly go.Sankey. |
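The edge counting behind the co-occurrence networks (feature 11b) can be sketched with the standard library alone. This is an illustrative approximation under stated assumptions: `count_cooccurrences` is a hypothetical helper, not the `build_cooccurrence_graph` function in `app.py`, and it omits the max-nodes cut, layout, and community detection that the application performs with NetworkX.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(records, delimiter="|", min_cooccurrence=1, lowercase=True):
    """Count node occurrences and pairwise co-occurrences across publications.

    `records` is an iterable of delimited strings (one per publication),
    for example the author-keyword field of each search result.
    """
    node_counts = Counter()
    edge_counts = Counter()
    for raw in records:
        if not raw:
            continue  # skip publications with no value in this field
        items = [part.strip() for part in raw.split(delimiter)]
        if lowercase:
            items = [item.lower() for item in items]
        # Deduplicate within the publication so a repeated keyword
        # cannot inflate node or edge weights.
        unique = sorted({item for item in items if item})
        node_counts.update(unique)
        # Sorting first makes each pair canonical: (a, b) with a < b.
        edge_counts.update(combinations(unique, 2))
    edges = {pair: w for pair, w in edge_counts.items() if w >= min_cooccurrence}
    return node_counts, edges
```

The returned edge weights correspond to what the "min co-occurrence" slider filters on, and the node counts to what drives node size.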
Phase 3: Abstract Text Analysis (3 analyses in nested sub-tabs, visible only when abstracts are downloaded)
| # | Feature | Sub-tab | Description |
|---|---|---|---|
| 12 | Abstract Word Cloud | Abstract Word Cloud | Word cloud from concatenated abstract texts, excluding English, Spanish, and academic filler stopwords. |
| 13 | TF-IDF Term Frequency | TF-IDF Terms | Top 30 terms by average TF-IDF score with unigram + bigram support. Horizontal bar chart. |
| 14 | Topic Modeling | Topic Modeling | NMF or LDA (user selects), configurable 3-10 topics via slider. Results include a topic summary table, topic-word weight heatmap, and document-topic assignment table with CSV exports. |
| # | Feature | Description |
|---|---|---|
| 15 | Create Consolidation | Select 2+ searches via checkboxes, choose union or comparison mode, name and save. |
| 16 | Union Mode Analysis | Deduplicate by EID, apply full Phase 2 + Phase 3 analysis pipeline to merged corpus. Includes consolidated CSV/Excel export. |
| 17 | Comparison Mode Analysis | Three comparison tabs: overlaid publication timeline, grouped keyword bar chart, Venn diagram with overlap statistics table (supports 2-3 searches). |
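The two consolidation modes reduce to set operations on EIDs. The sketch below assumes each search is a list of result dicts with an `eid` key; the function names `union_by_eid` and `pairwise_overlap` and the output column names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def union_by_eid(searches: dict) -> list:
    """Union mode: merge result lists, keeping the first record seen per EID."""
    merged = {}
    for results in searches.values():
        for record in results:
            merged.setdefault(record["eid"], record)
    return list(merged.values())


def pairwise_overlap(searches: dict) -> list:
    """Comparison mode: overlap statistics for every pair of searches."""
    eid_sets = {name: {r["eid"] for r in results} for name, results in searches.items()}
    rows = []
    for a, b in combinations(eid_sets, 2):
        rows.append(
            {
                "search_a": a,
                "search_b": b,
                "shared": len(eid_sets[a] & eid_sets[b]),
                "only_a": len(eid_sets[a] - eid_sets[b]),
                "only_b": len(eid_sets[b] - eid_sets[a]),
            }
        )
    return rows
```

The `shared`, `only_a`, and `only_b` counts are the same quantities a two-set Venn diagram displays.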
| # | Feature | Formats | Contents |
|---|---|---|---|
| 18 | Export Results | CSV; Excel (4 sheets: Results, Authors, Keywords, Summary) | All search fields + aggregate statistics; author export includes Scopus IDs when available |
| 19 | Export Abstracts | CSV; Excel | EID, DOI, Title, Authors, Year, Journal, Abstract, Keywords |
Additionally, every Plotly chart includes an "Export as PNG (300 DPI)" download button when static rendering is available, and every matplotlib figure (word clouds, Venn diagrams) includes an equivalent 300 DPI PNG download button.
- Python 3.10 or higher (tested with Python 3.14)
- Scopus API access: Requires a valid Scopus API key. Institutional subscribers can also use an InstToken for off-campus access. API keys can be obtained from the Elsevier Developer Portal.
# Create and activate a virtual environment
python -m venv scopuslit_env
source scopuslit_env/bin/activate # macOS / Linux
# scopuslit_env\Scripts\activate # Windows
# Install all dependencies
pip install streamlit pybliometrics plotly pandas numpy wordcloud \
    matplotlib scikit-learn openpyxl kaleido matplotlib-venn networkx

| Package | Tested Version | Purpose |
|---|---|---|
| `streamlit` | 1.55.0 | Web application framework |
| `pybliometrics` | 4.4.1 | Scopus API client |
| `plotly` | 6.6.0 | Interactive charting library |
| `pandas` | 2.3.3 | Data manipulation and analysis |
| `numpy` | 2.4.3 | Numerical computing |
| `wordcloud` | 1.9.6 | Word cloud generation |
| `matplotlib` | 3.10.8 | Plotting backend for word clouds and Venn diagrams |
| `scikit-learn` | 1.8.0 | TF-IDF vectorization, NMF, LDA topic modeling |
| `openpyxl` | 3.1.5 | Excel file generation |
| `kaleido` | 1.2.0 | Plotly figure export to static PNG images |
| `matplotlib-venn` | 1.1.2 | Venn diagram visualization |
| `networkx` | 3.6.1 | Graph construction, layout algorithms, community detection for co-occurrence networks |
python -c "import streamlit, pybliometrics, plotly, pandas, numpy, \
wordcloud, matplotlib, sklearn, openpyxl, kaleido, matplotlib_venn, \
networkx; print('All dependencies installed successfully.')"

pybliometrics requires a configuration file at `~/.config/pybliometrics.cfg` (Linux/macOS) or `%USERPROFILE%\.config\pybliometrics.cfg` (Windows). On first run, pybliometrics will prompt for configuration interactively. Alternatively, create the file manually:
[Authentication]
APIKey = YOUR_API_KEY_HERE
InstToken = YOUR_INST_TOKEN_HERE
[Directories]
AbstractRetrieval = ~/.cache/pybliometrics/AbstractRetrieval
AuthorRetrieval = ~/.cache/pybliometrics/AuthorRetrieval
ScopusSearch = ~/.cache/pybliometrics/ScopusSearch

- The APIKey is mandatory. Obtain one from the Elsevier Developer Portal.
- The InstToken is optional but required for off-campus access when your institution provides one.
- Directories define local cache paths. pybliometrics caches API responses to minimize redundant requests.
Scopus API access requires either:
- A connection from an institutional network (campus VPN/IP range) recognized by Scopus, or
- A valid InstToken for authentication from any network.
ScopusLit calls pybliometrics.init() at startup, which reads the configuration file and initializes the API client.
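Before launching the app, it can help to confirm that a hand-written configuration file has the expected shape. The following is a minimal sketch using only the standard library; `check_pybliometrics_config` is a hypothetical helper that is not part of ScopusLit or pybliometrics, and it checks only the sections and the `APIKey` entry shown in the example file above.

```python
import configparser
from pathlib import Path


def check_pybliometrics_config(path: Path) -> list:
    """Return a list of human-readable problems found in the config file."""
    if not path.exists():
        return [f"Config file not found: {path}"]

    # interpolation=None so that '%' characters in values are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key case such as 'APIKey'
    parser.read(path, encoding="utf-8")

    problems = []
    if not parser.has_section("Authentication"):
        problems.append("Missing [Authentication] section")
    else:
        key = parser.get("Authentication", "APIKey", fallback="").strip()
        if not key or key == "YOUR_API_KEY_HERE":
            problems.append("APIKey is empty or still a placeholder")
    if not parser.has_section("Directories"):
        problems.append("Missing [Directories] section")
    return problems
```

For example, `check_pybliometrics_config(Path.home() / ".config" / "pybliometrics.cfg")` returns an empty list when the file passes these checks. It does not verify that the key is accepted by the Scopus API.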
cd /path/to/ScopusLit
streamlit run app.py

The application opens in the default web browser at http://localhost:8501.
The application uses a four-page navigation structure accessible via radio buttons in the left sidebar:
| Page | Purpose |
|---|---|
| New Search | Execute new Scopus searches and download abstracts |
| Saved Searches | Manage (load, rename, delete) previously saved searches |
| Analysis | View bibliometric and text analysis for a loaded search |
| Consolidation | Combine multiple searches and run comparative analysis |
The sidebar also displays:
- A quick-access list of all saved searches with result counts, dates, abstract indicators (`abs`), and individual "Load" buttons. Clicking "Load" immediately opens the Analysis page with that search.
- A list of all saved consolidations (if any exist) with mode labels, search counts, and "Load" buttons. Clicking "Load" immediately opens the Consolidation page with that consolidation's analysis.
- Navigate to the New Search page.
- Enter a Scopus Advanced Search query in the text area (e.g., `TITLE-ABS-KEY(seismic AND "machine learning")`).
- Optionally enter a descriptive name and comma-separated tags.
- Click Estimate Results to preview the number of matching documents without downloading them. This uses `ScopusSearch(query, download=False).get_results_size()` and consumes minimal API quota.
- Click Run Search to execute the full search. The application calls `ScopusSearch(query, subscriber=True)` and converts the resulting list of `Document` namedtuples into a serializable list of dictionaries.
- Results are automatically saved to a JSON file in the `./scopuslit_data/` directory.
- A summary is displayed: total documents, year range, and the first 10 titles.
- Optionally click Download Abstracts to retrieve full abstract texts via `AbstractRetrieval(eid)` for each document. This is a separate step because it requires one API call per document and can be slow for large result sets. Progress is saved incrementally every 25 abstracts.
- Click Load for Analysis to navigate to the Analysis page.
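The incremental, resumable abstract download described in the steps above can be sketched as a loop that skips completed EIDs and checkpoints every 25 documents. This is an illustration under stated assumptions, not the `download_abstracts` function itself: `download_incrementally` is a hypothetical name, `fetch` stands in for the per-document `AbstractRetrieval(eid)` call, and `save` stands in for writing the abstracts back to the search's JSON file.

```python
def download_incrementally(eids, fetch, save, existing=None, save_every=25):
    """Fetch one abstract per EID, skipping completed ones and saving in batches.

    eids:     iterable of EIDs to cover
    fetch:    callable(eid) -> abstract text (placeholder for the API call)
    save:     callable(abstracts_dict) -> None (placeholder for persistence)
    existing: abstracts already downloaded in a previous, interrupted run
    """
    abstracts = dict(existing or {})
    pending = [eid for eid in eids if eid not in abstracts]
    for done, eid in enumerate(pending, start=1):
        abstracts[eid] = fetch(eid)
        if done % save_every == 0:
            save(abstracts)  # checkpoint so an interruption loses < save_every docs
    if pending and len(pending) % save_every != 0:
        save(abstracts)  # persist the final partial batch
    return abstracts
```

Passing the previously saved abstracts as `existing` is what makes a resumed run avoid repeating API calls for documents that are already stored.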
- Load a search from the sidebar, the Saved Searches page, or after running a new search.
- The Analysis page displays:
- Search metadata: name, query, date, result count.
- Export buttons at the top: Results CSV, Results Excel, Abstracts CSV, Abstracts Excel (latter two visible only if abstracts have been downloaded).
- 10 analysis tabs (11 if abstracts are available):
- Timeline: Publications per year with summary metrics (total, range, peak year) and dual-axis bar+line chart.
- Document Types: Pie and bar charts of Scopus document categories.
- Sources: Top 15 journals as horizontal bar chart with data table and Bradford's Law.
- Geography: Top 15 countries and affiliations as bar charts, plus choropleth world map.
- Authors: Top 20 author-ID-aware authors as bar chart, with "Fetch h-index for Top 10 Authors" button, Lotka's Law, and Price's Law.
- Co-authorship: Most frequent co-author pairs table, authors per article distribution chart.
- Networks: Keyword, author, and country co-occurrence networks.
- Keywords: Top 30 keywords bar chart, keyword word cloud, and keyword evolution heatmap.
- Citations: Summary metrics (total, mean, median, max), citation histogram, top 20 table, average per year line chart.
- Three-Field Plot: Sankey diagram linking selected bibliometric fields.
- Text Analysis (if abstracts available): Three nested sub-tabs for abstract word cloud, TF-IDF terms, and topic modeling.
- Each analysis tab includes a Methods and parameters expander describing field sources, assumptions, and algorithm settings.
- Each Plotly chart has an Export as PNG (300 DPI) button directly below it when Kaleido export is available.
- Data tables behind each chart are accessible via expandable sections.
The Saved Searches page displays each search as a card with:
- Name (editable inline with "Save Name" button), query preview, date, result count, tags, and abstract status.
- Load for Analysis button to open the search in the Analysis page.
- Download Abstracts button (if abstracts have not been downloaded yet).
- Delete button with a two-step confirmation pattern: first click shows "Are you sure?" with "Yes, delete" and "Cancel" buttons.
- Navigate to the Consolidation page.
- Create a new consolidation: Select 2+ saved searches using checkboxes, choose a mode (union or comparison), enter a name, and click "Create Consolidation".
- Or load an existing consolidation from the list shown at the bottom of the page (each with "Load" and "Delete" buttons), or from the sidebar.
- For union mode, the full Phase 2 + Phase 3 analysis pipeline is displayed (same 10-11 tabs as individual search analysis, including filters), plus consolidated export buttons.
- For comparison mode, three specialized tabs are displayed:
- Timeline Comparison: overlaid line charts.
- Keywords Comparison: grouped bar chart.
- Document Overlap: pairwise overlap statistics table + Venn diagram.
From the Analysis page, up to four export buttons are available:
| Button | Format | Contents |
|---|---|---|
| Export Results (CSV) | `.csv` | All search result fields in a flat table |
| Export Results (Excel) | `.xlsx` | 4 sheets: Results, Authors, Keywords, Summary |
| Export Abstracts (CSV) | `.csv` | EID, DOI, Title, Authors, Year, Journal, Abstract, Keywords |
| Export Abstracts (Excel) | `.xlsx` | Same columns as CSV, in Excel format |
The Excel Results export includes a Summary sheet with aggregate statistics: search name, total results, year range, total/mean citations, unique journals, unique countries, and unique authors. The Authors sheet includes Scopus author IDs where available.
Additionally, every individual chart can be exported as a 300 DPI PNG via the download button below each figure when static rendering is available.
The following walkthrough demonstrates a typical bibliometric analysis session using ScopusLit.
A researcher is preparing a review article on machine learning applications in seismology and needs to characterize the existing literature quantitatively.
Step 1: Define and run the search.
On the "New Search" page, the researcher enters:
- Query: `TITLE-ABS-KEY(seismic AND "machine learning")`
- Name: `Seismic ML`
- Tags: `seismology, ML, review`
Clicking "Estimate Results" reveals approximately 2,400 documents. Satisfied with the scope, the researcher clicks "Run Search". After approximately 30 seconds, the search completes and is saved.
Step 2: Download abstracts.
The researcher clicks "Download Abstracts". With a progress bar indicating status, the application downloads full abstract texts for all 2,400 documents. This process takes approximately 20-40 minutes depending on API response times, with progress saved every 25 abstracts. If interrupted, the download can be resumed later without re-downloading completed abstracts.
Step 3: Analyze the results.
Clicking "Load for Analysis" opens the Analysis page with a global filters panel and 11 tabs:
- The Timeline tab reveals a rapid growth trend starting around 2017, with peak publications in 2024.
- The Sources tab shows Geophysics, Computers & Geosciences, and Geophysical Journal International as the top journals.
- The Geography tab reveals the United States and China as the leading countries, with a choropleth map showing global distribution.
- The Networks tab provides VOSviewer-style co-occurrence analysis. The keyword co-occurrence subtab reveals clusters of related terms: a "deep learning" cluster connected to "convolutional neural network" and "transfer learning", and a "seismology" cluster linking "earthquake", "seismic hazard", and "ground motion". The author co-authorship network shows research groups and their collaboration patterns, while the country collaboration network highlights the US-China research axis with secondary European hubs.
- The Keywords tab shows "deep learning", "convolutional neural network", and "earthquake" as the most frequent author keywords. The word cloud provides a visual summary.
- The Text Analysis tab's TF-IDF analysis identifies discriminative terms like "seismic waveform", "phase picking", and "transfer learning" using only abstracts that remain after active filters. Topic modeling with NMF (5 topics) reveals distinct research themes: earthquake detection, signal processing, hazard assessment, subsurface imaging, and ground motion prediction. The topic module also provides a topic summary table and document-topic assignment export.
Step 4: Run a comparison.
To compare with a related field, the researcher creates a second search for TITLE-ABS-KEY(volcanic AND "machine learning"). Then, on the Consolidation page, both searches are selected in comparison mode. The overlaid timeline reveals that seismic ML research started earlier and grows faster. The Venn diagram shows 47 shared documents, indicating a meaningful but limited overlap.
Step 5: Export for the manuscript.
The researcher exports:
- Individual 300 DPI PNG charts for the publications-per-year figure, the keyword word cloud, and the topic heatmap.
- A multi-sheet Excel file with the complete results, author list with Scopus IDs where available, keyword frequencies, and summary statistics.
- An abstracts CSV for use in further text mining outside the application.
| Function | Signature | Description |
|---|---|---|
| `ensure_data_dir()` | `() -> None` | Creates `./scopuslit_data/` if it does not exist. |
| `save_search(search_data)` | `(dict) -> str` | Writes a complete search dict to `{uuid}.json`. Returns file path. |
| `load_search(search_id)` | `(str) -> dict or None` | Reads a search by its UUID. Returns `None` if file not found. |
| `load_all_searches()` | `() -> list[dict]` | Reads metadata (no full results list) from all non-consolidation JSON files, sorted by date descending. |
| `delete_search(search_id)` | `(str) -> bool` | Removes a JSON file by UUID. |
| `rename_search(search_id, new_name)` | `(str, str) -> bool` | Loads file, updates name field, re-saves. |
| `update_search_abstracts(search_id, abstracts)` | `(str, dict) -> bool` | Partial update: replaces only the abstracts field. |
| `save_consolidation(consolidation_data)` | `(dict) -> str` | Writes a consolidation dict to `{uuid}.json`. |
| `load_all_consolidations()` | `() -> list[dict]` | Reads metadata from all consolidation JSON files. |
| `build_consolidation_dataframe(consolidation)` | `(dict) -> tuple[DataFrame, dict]` | Loads all referenced searches, merges results into a single DataFrame (deduplicating by EID for union mode), merges abstracts. Returns `(df, abstracts_dict)`. |
| Function | Signature | Description |
|---|---|---|
| `estimate_results(query)` | `(str) -> int or None` | Returns result count without downloading. Uses `ScopusSearch(query, download=False).get_results_size()`. |
| `execute_search(query)` | `(str) -> tuple[list[dict], int] or None` | Executes `ScopusSearch(query, subscriber=True)`, converts `Document` namedtuples to dicts via `._asdict()`, retries on `Scopus429Error`. |
| `results_to_dataframe(results)` | `(list[dict]) -> DataFrame` | Converts result dicts to DataFrame, adding `year` (int from `coverDate`), `citedby_count` (int), `author_count` (int). |
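The type coercion that `results_to_dataframe` applies can be illustrated on a single record without pandas. This sketch assumes that `coverDate` is a `YYYY-MM-DD` string and that missing or malformed counts should fall back to 0; `normalize_record` is a hypothetical helper, and the real function operates on a whole DataFrame.

```python
def _to_int(value, default=0):
    """Coerce a value to int, falling back to `default` on bad input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def normalize_record(record: dict) -> dict:
    """Return a copy of one result dict with typed year and count fields."""
    row = dict(record)
    cover_date = row.get("coverDate") or ""
    # Assumption: the year is the first four characters of coverDate.
    row["year"] = int(cover_date[:4]) if cover_date[:4].isdigit() else None
    row["citedby_count"] = _to_int(row.get("citedby_count"))
    row["author_count"] = _to_int(row.get("author_count"))
    return row
```

Applying this to every dict returned by the search and passing the list to `pandas.DataFrame` would yield columns of the kinds listed in the table.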
| Function | Signature | Description |
|---|---|---|
| `_api_call_with_retry(callable_fn)` | `(callable) -> Any` | Retry wrapper. Catches `Scopus429Error`, waits with exponential backoff (2s, then 4s), retries up to 3 total attempts. |
| `download_abstracts(search_id, eid_list, ...)` | `(str, list[str], dict or None, progress_bar) -> dict` | Downloads abstracts via `AbstractRetrieval(eid)`, falls back to `.description`, saves every 25 docs, skips already-downloaded EIDs. |
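The retry policy in the table (up to 3 attempts, waiting 2s and then 4s) can be sketched as below. To keep the example self-contained, `RateLimitError` is a hypothetical stand-in for pybliometrics' `Scopus429Error`, and the injectable `sleep` parameter is an addition for testability. Re-raising after the final attempt is also an assumption; the actual wrapper may handle exhaustion differently.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for pybliometrics' Scopus429Error."""


def api_call_with_retry(callable_fn, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call `callable_fn`, retrying on rate-limit errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return callable_fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each time: 2s after attempt 1, 4s after attempt 2.
            sleep(base_delay * 2 ** (attempt - 1))
```

A call site wraps the API request in a zero-argument callable, for example `api_call_with_retry(lambda: fetch_one(eid))`, where `fetch_one` is whatever function performs the request.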
| Function | Input | Output | Description |
|---|---|---|---|
| `_parse_delimited_field(series, delimiter)` | `Series, str` | `Series` | Splits delimited strings, strips whitespace, filters empties, explodes. Used for `;`-separated (authors, countries) and `\|`-separated (keywords) fields. |
| `analyze_publications_per_year(df)` | `DataFrame` | `DataFrame (year, count, cumulative)` | Groups by year, counts, computes running cumulative sum. |
| `analyze_publications_per_journal(df, top_n)` | `DataFrame` | `DataFrame (journal, count)` | Value counts on `publicationName`, top N. Default 15. |
| `analyze_publications_per_country(df, top_n)` | `DataFrame` | `DataFrame (country, count)` | Parses `;`-separated `affiliation_country`, top N. Default 15. |
| `analyze_publications_per_affiliation(df, top_n)` | `DataFrame` | `DataFrame (affiliation, count)` | Parses `;`-separated `affilname`, top N. Default 15. |
| `_author_entries_from_row(row)` | `Series` | `list[(key, display_name, author_id)]` | Builds unique author identity keys for one record, preferring Scopus author IDs and falling back to normalized names. |
| `_author_identity_series(df)` | `DataFrame` | `Series` | Returns author identity keys for author productivity laws, preferring Scopus IDs. |
| `analyze_top_authors(df, top_n)` | `DataFrame` | `DataFrame (author, author_id, count)` | Counts authors by Scopus ID when available; falls back to normalized name identity. Default 20. |
| `analyze_coauthor_pairs(df, top_n)` | `DataFrame` | `DataFrame (author_1, author_2, author_id_1, author_id_2, count)` | Generates `itertools.combinations` of author identity keys per document, counts pair frequencies, and displays names plus IDs. Default 20. |
| `analyze_author_count_distribution(df)` | `DataFrame` | `DataFrame (author_count, num_articles)` | Distribution of the `author_count` field. |
| `fetch_h_indices(author_data)` | `list[(name, auid)]` | `list[dict]` | Calls `AuthorRetrieval(auid).h_index` per author with retry wrapper. |
| `analyze_keywords(df, top_n)` | `DataFrame` | `DataFrame (keyword, count)` | Parses `\|`-separated `authkeywords`, lowercases, counts. Default 30. |
| `analyze_citations(df)` | `DataFrame` | `dict` | Returns summary (total, mean, median, max), distribution (Series), top_cited (top 20 DataFrame), avg_per_year (DataFrame). |
build_cooccurrence_graph(df, field, delimiter, min_cooccurrence, max_nodes, lowercase, dedup_per_row) |
DataFrame, str, str, int, int, bool, bool |
dict or None |
Builds a NetworkX co-occurrence graph from a multi-value field. Counts item frequencies and pair co-occurrences via itertools.combinations, filters by minimum threshold, prunes to top N nodes, runs community detection (greedy_modularity_communities). Returns {graph, node_sizes, communities, pairs_df}. |
compute_network_layout(graph, algorithm, seed) |
nx.Graph, str, int |
dict |
Wraps NetworkX layout algorithms: Spring (Fruchterman-Reingold with adaptive k), Kamada-Kawai, or Circular. Returns node→(x,y) mapping. |
compute_network_metrics(graph, communities) |
nx.Graph, list |
dict |
Computes summary metrics: nodes, edges, communities, density, average degree. |
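
The counting step behind `analyze_coauthor_pairs()` and `build_cooccurrence_graph()` can be illustrated with a minimal standard-library sketch. It is not the code in `app.py`: the helper name `count_cooccurrences` and its defaults are hypothetical, and the graph construction, top-N pruning, and community detection are left out.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(rows, delimiter="|", lowercase=True, min_cooccurrence=1):
    """Count item frequencies and unordered pair co-occurrences per document.

    Hypothetical helper for illustration; not taken from app.py.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for raw in rows:
        if not raw:
            continue  # skip missing or empty fields
        items = [part.strip() for part in raw.split(delimiter)]
        if lowercase:
            items = [item.lower() for item in items]
        # Deduplicate within a document and sort so (a, b) and (b, a) coincide.
        unique_items = sorted({item for item in items if item})
        item_counts.update(unique_items)
        pair_counts.update(combinations(unique_items, 2))
    # Keep only pairs that reach the minimum co-occurrence threshold.
    kept = {pair: n for pair, n in pair_counts.items() if n >= min_cooccurrence}
    return item_counts, kept


keywords = [
    "machine learning|seismic|deep learning",
    "Machine Learning|seismic",
    "deep learning|machine learning",
    None,
]
items, pairs = count_cooccurrences(keywords, min_cooccurrence=2)
```

Sorting the deduplicated items before calling `combinations` is what makes each pair order-independent, so the same two keywords always map to the same dictionary key.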
| Function | Input | Output | Description |
|---|---|---|---|
| `_get_abstract_texts(abstracts)` | dict | list[str] | Filters and returns non-empty abstract strings. |
| `_filter_abstracts_for_df(df, abstracts)` | DataFrame, dict | dict | Subsets abstracts to EIDs present in the currently filtered DataFrame. |
| `_abstracts_signature(abstracts, method, n_topics)` | dict, str, int | str | Builds a stable cache key so topic results do not leak across filters, datasets, or settings. |
| `generate_abstract_wordcloud(abstracts)` | dict | Figure or None | Concatenates abstracts, generates word cloud using `WordCloud.generate()` with combined stopword list (293 stopwords total). |
| `compute_tfidf_terms(abstracts, top_n)` | dict, int | DataFrame or None | TF-IDF with `max_features=1000`, `max_df=0.85`, `min_df=2`, `ngram_range=(1,2)`. Returns top N terms by average score. Requires >= 3 abstracts. |
| `compute_topic_model(abstracts, n_topics, method)` | dict, int, str | dict or None | Uses TF-IDF vectors for NMF and count vectors for LDA. Returns `{topics, doc_topic_matrix, doc_ids}`. Requires >= max(5, n_topics) abstracts. |
| `build_topic_summary_table(topics)` | list[dict] | DataFrame | Converts topic terms and weights into a display/export table. |
| `build_doc_topic_table(topic_result)` | dict | DataFrame | Builds document-level dominant topic assignments and per-topic weights for display/export. |
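
One plausible way to build the cache key described for `_abstracts_signature()` is to hash the corpus together with the model settings. The sketch below is an assumption about how such a key could work, not the actual implementation: the use of SHA-256 and the JSON payload layout are illustrative choices.

```python
import hashlib
import json


def abstracts_signature(abstracts, method, n_topics):
    """Return a deterministic hash of the corpus and topic-model settings.

    Illustrative sketch; the hashing scheme is an assumption.
    """
    payload = {
        # Sort by EID so dictionary insertion order does not change the key.
        "abstracts": sorted((eid, text or "") for eid, text in abstracts.items()),
        "method": method,
        "n_topics": n_topics,
    }
    encoded = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


corpus = {"e1": "alpha", "e2": "beta"}
same_corpus_reordered = {"e2": "beta", "e1": "alpha"}
filtered_corpus = {"e1": "alpha"}
key = abstracts_signature(corpus, "NMF", 5)
```

Because the key changes whenever the filtered set of EIDs, the method, or the number of topics changes, a cached topic model is only reused for the exact corpus and settings that produced it.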
Styling and utilities (3):
| Function | Description |
|---|---|
| `style_plotly_fig(fig)` | Applies consistent theme: title 20pt, font 16pt, axes 18pt, ticks 16pt, legend 14pt, height 500px, `plotly_white`. |
| `plotly_png_download(fig, filename)` | Renders to PNG at 2100x1500 pixels, scale 2x (~300 DPI at 7x5 inches) via kaleido. |
| `display_chart_with_download(fig, key, filename)` | Composite: `st.plotly_chart(width="stretch")` + optional `st.download_button()` with PNG bytes. If Kaleido export fails, the chart still renders and the app shows a warning. |
Chart functions (14):
| Function | Chart Type | Notes |
|---|---|---|
| `plot_publications_per_year(data)` | Dual-axis bar + line | `make_subplots(secondary_y=True)` |
| `plot_horizontal_bar(data, x_col, y_col, ...)` | Horizontal bar | Reusable for 6+ charts (journals, countries, affiliations, authors, keywords). Uses dynamic height and explicit category ticks so all labels render. |
| `plot_choropleth_map(data)` | Choropleth world map | Viridis scale; returns None if < 5 countries |
| `plot_author_count_distribution(data)` | Vertical bar | Authors per article |
| `plot_h_index_bar(data)` | Horizontal bar | h-index for top authors |
| `generate_keyword_wordcloud(data)` | Word cloud (matplotlib) | From frequency dict via `generate_from_frequencies()` |
| `plot_citation_histogram(values)` | Histogram | 30 bins |
| `plot_avg_citations_per_year(data)` | Line + markers | Average citations per year |
| `plot_tfidf_bar(data)` | Horizontal bar | TF-IDF terms ranked by score |
| `plot_topic_heatmap(topics)` | Heatmap | Topic-word weights; dynamic height `max(400, 80*n_topics)` |
| `plot_comparison_timeline(search_data_list)` | Overlaid lines | One trace per search |
| `plot_comparison_keywords(search_data_list, top_n)` | Grouped bar | Top 15 keywords globally, bars per search |
| `plot_venn_diagram(eid_sets)` | Venn (matplotlib) | 2-set (`venn2`) or 3-set (`venn3`) |
| `plot_network_graph(graph_data, positions, title)` | Network graph (Plotly) | Co-occurrence network with community-colored nodes, 3-tier edge widths, hover info (occurrence count, degree, top neighbors). Height 700px, hidden axes. |
| Function | Signature | Format | Description |
|---|---|---|---|
| `sanitize_filename(value, fallback)` | `(str, str) -> str` | N/A | Creates safe filename stems from search names before download buttons are rendered. |
| `export_results_csv(df)` | `(DataFrame) -> bytes` | CSV | All result fields, UTF-8 encoded. |
| `export_results_excel(df, search_name)` | `(DataFrame, str) -> bytes` | Excel | 4 sheets: Results (all fields), Authors (name + ID + count), Keywords (keyword + frequency), Summary (8 aggregate metrics). |
| `export_abstracts_csv(df, abstracts)` | `(DataFrame, dict) -> bytes` | CSV | 8 columns: EID, DOI, Title, Authors, Year, Journal, Abstract, Keywords. Falls back to `description` field if abstract not available. |
| `export_abstracts_excel(df, abstracts)` | `(DataFrame, dict) -> bytes` | Excel | Same 8 columns on an "Abstracts" sheet. |
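
As an illustration of what a filename sanitizer of this kind might do, the sketch below reduces a search name to a conservative stem. The allowed character set and the default fallback value are assumptions; only the `(value, fallback)` signature comes from the table above.

```python
import re


def sanitize_filename(value, fallback="scopuslit_export"):
    """Reduce an arbitrary search name to a conservative filename stem.

    Illustrative sketch; the character whitelist and fallback are assumptions.
    """
    # Replace every run of characters outside [A-Za-z0-9._-] with one underscore.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", value or "")
    # Trim dots, underscores, and hyphens left at the edges (e.g. from "../").
    stem = stem.strip("._-")
    return stem or fallback


plain = sanitize_filename("Seismic ML Review")
traversal = sanitize_filename("../etc/passwd")
empty = sanitize_filename("???")
```

Stripping path separators and leading dots keeps a user-supplied search name from escaping the download directory or producing a hidden file.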
| Function | Purpose |
|---|---|
| `init_session_state()` | Initializes 11 session state variables with default values. |
| `_load_search_into_state(search_id)` | Loads a search from disk, populates `loaded_search` and `loaded_df`, clears `h_index_data`, switches to Analysis page. |
| `render_sidebar()` | Renders sidebar: app title, navigation radio (4 pages), saved searches quick-list with Load buttons, consolidations list with Load buttons. |
| `render_new_search_page()` | Query input, estimate/run buttons, result summary with first 10 titles, "Load for Analysis" and "Download Abstracts" buttons. |
| `render_saved_searches_page()` | Search cards with inline rename, Load, Download Abstracts, and Delete (with two-step confirmation). |
| `render_analysis_page()` | Header, export buttons (4), global filters panel, 10-11 analysis tabs dispatching to individual tab renderers. |
| `render_filters_panel(df, key_prefix)` | Collapsible filters (year range, doc type, country, min citations). Returns filtered DataFrame. |
| `render_methods_expander(content)` | Reusable expander for reviewer-facing methods notes in analysis tabs. |
| `render_tab_timeline(df)` | Summary metrics + dual-axis chart + data table. |
| `render_tab_doc_types(df)` | Document type metrics + pie chart + bar chart + data table. |
| `render_tab_sources(df)` | Top journals chart + data table + Bradford's Law (zone metrics, log-scale scatterplot, zone table). |
| `render_tab_geography(df)` | Countries section (bar + choropleth) + affiliations section (bar). |
| `render_tab_authors(df)` | Top authors chart + h-index fetch section + Lotka's Law (log-log scatter + fit) + Price's Law (metrics). |
| `render_tab_coauthorship(df)` | Co-author pairs table + author count distribution chart. |
| `render_tab_keywords(df)` | Two subtabs: Keyword Frequency (bar + word cloud + data table) and Keyword Evolution (heatmap with configurable N). |
| `render_tab_citations(df)` | Summary metrics + histogram + top 20 table + avg per year chart. |
| `render_tab_sankey(df)` | Three-field Sankey diagram with field selectors and top-N sliders. |
| `render_tab_text_analysis(df, abstracts)` | Three nested sub-tabs: abstract word cloud, TF-IDF, topic modeling. Analyzes only abstracts matching the currently filtered DataFrame. |
| `render_consolidation_page()` | Consolidation creation form + existing consolidation management + dispatches to union or comparison renderer. |
| `render_tab_networks(df)` | Networks tab with 3 subtabs: keyword co-occurrence, author co-authorship, country collaboration. |
| `_render_network_subtab(df, field, delimiter, label, default_min, lowercase, dedup, key_prefix)` | Reusable renderer for one network subtab: controls, metrics, chart, data table. |
| `render_union_analysis(df, abstracts)` | Export buttons + filters + full 10-11 tab analysis (reuses individual tab renderers). |
| `render_comparison_analysis(consol, df)` | Three comparison tabs: timeline, keywords, overlap (table + Venn diagram). |
Each search is stored as a JSON file with the following schema:
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Seismic ML Review",
  "query": "TITLE-ABS-KEY(seismic AND \"machine learning\")",
  "search_date": "2026-03-08T10:30:00.123456",
  "total_results": 245,
  "results": [
    {
      "eid": "2-s2.0-85012345678",
      "doi": "10.1016/j.example.2024.01.001",
      "title": "Machine learning for seismic data analysis",
      "coverDate": "2024-01-15",
      "publicationName": "Computers & Geosciences",
      "author_names": "Smith J.;Doe A.;Brown K.",
      "author_ids": "12345678;23456789;34567890",
      "author_count": "3",
      "affiliation_country": "United States;Germany",
      "affilname": "Massachusetts Institute of Technology;Technical University of Munich",
      "citedby_count": "15",
      "authkeywords": "machine learning|seismic|deep learning",
      "description": "This study proposes..."
    }
  ],
  "abstracts": {
    "2-s2.0-85012345678": "Full abstract text retrieved via AbstractRetrieval..."
  },
  "tags": ["ML", "seismic"]
}
```

Field origins: The `results` list contains dictionaries derived from `pybliometrics.scopus.ScopusSearch.results`, which returns `Document` namedtuples with 36 fields. Multi-value fields use `;` as separator (authors, affiliations, countries) or `|` (keywords).
Abstracts: Stored as a dictionary mapping EID to abstract text. The description field in results contains a summary from the search API; the abstracts dictionary contains full texts from the AbstractRetrieval API.
```json
{
  "id": "660e8400-e29b-41d4-a716-446655440001",
  "type": "consolidation",
  "name": "ML + Seismic Combined Analysis",
  "mode": "union",
  "search_ids": [
    "550e8400-e29b-41d4-a716-446655440000",
    "550e8400-e29b-41d4-a716-446655440002"
  ],
  "created_date": "2026-03-09T14:00:00.000000"
}
```

Consolidations reference search IDs rather than duplicating data. The `build_consolidation_dataframe()` function loads the referenced searches at runtime, merges their results, and handles deduplication for union mode.
All files are stored in ./scopuslit_data/ relative to app.py. This directory is created automatically on first run. File naming convention: {uuid4}.json.
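
The storage convention described above (one `{uuid4}.json` file per search in a data directory created on demand) can be sketched as follows. The function signatures here are simplified and hypothetical, and `total_results` is taken from the length of the result list for brevity, whereas the application receives the total from the search call.

```python
import json
import tempfile
import uuid
from datetime import datetime
from pathlib import Path


def save_search(data_dir, name, query, results):
    """Write one search to <data_dir>/<uuid4>.json and return its ID."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # created on first use
    search_id = str(uuid.uuid4())
    record = {
        "id": search_id,
        "name": name,
        "query": query,
        "search_date": datetime.now().isoformat(),
        "total_results": len(results),  # simplification for this sketch
        "results": results,
        "abstracts": {},
        "tags": [],
    }
    path = data_dir / f"{search_id}.json"
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return search_id


def load_search(data_dir, search_id):
    """Read a stored search back, or return None if the file is missing."""
    path = Path(data_dir) / f"{search_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


# Round trip in a throwaway directory instead of ./scopuslit_data/.
with tempfile.TemporaryDirectory() as tmp:
    sid = save_search(tmp, "Seismic ML Review", "TITLE-ABS-KEY(seismic)",
                      [{"eid": "2-s2.0-85012345678"}])
    loaded = load_search(tmp, sid)
```

Using the UUID as both the `id` field and the filename means a consolidation can locate each referenced search without scanning file contents.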
All Plotly charts follow a consistent visual specification enforced by the style_plotly_fig() helper function:
| Property | Value | Rationale |
|---|---|---|
| Template | `plotly_white` | Clean, minimal background suitable for publication |
| Title font size | 20pt | Readable in Streamlit's wide layout |
| Base font size | 16pt | Compensates for Streamlit's default small rendering |
| Axis title font size | 18pt | Clearly distinguishable from tick labels |
| Tick label font size | 16pt | Legible for dense axis labels |
| Legend font size | 14pt | Compact but readable |
| Default chart height | 500px | Consistent vertical proportion; horizontal bars and heatmaps use dynamic height for dense labels |
| Color palette | `px.colors.qualitative.Set2` | Color-blind friendly, visually distinct |
Every chart can be exported as a high-resolution PNG:
| Property | Value (Plotly) | Value (matplotlib) |
|---|---|---|
| Width | 2100 pixels | Auto (tight bbox) |
| Height | 1500 pixels | Auto (tight bbox) |
| Scale factor | 2x | N/A |
| DPI | ~300 (effective) | 300 (explicit) |
| Format | PNG (lossless) | PNG (lossless) |
| Engine | Kaleido (chromium-based) | matplotlib Agg backend |
If Kaleido or its browser backend is unavailable, Plotly charts still render interactively in the Streamlit page and the app displays a warning instead of failing the analysis tab.
Horizontal bar charts and keyword-evolution heatmaps use dynamic heights and explicit categorical tick arrays so all labels are rendered. This is particularly important for top-author, top-keyword, and TF-IDF charts where default Plotly tick skipping can otherwise hide alternating labels.
Co-occurrence network graphs use Plotly scatter traces to render NetworkX graphs:
| Property | Value | Rationale |
|---|---|---|
| Chart height | 700px | Extra height for network readability |
| Axes | Hidden (no grid, ticks, or zero line) | Network layout is spatial, not quantitative |
| Edge rendering | `go.Scatter(mode="lines")` with `None` separators | Efficient single-trace approach per weight tier |
| Edge width tiers | 3 tiers (0.8, 2.0, 3.5 px) by weight percentiles | Distinguishes weak/medium/strong co-occurrences |
| Edge color | `rgba(180,180,180,0.5)` | Subtle, non-distracting |
| Node rendering | One `go.Scatter(mode="markers+text")` per community | Community coloring via 15-color palette |
| Node size range | 10-50 px (normalized by occurrence count) | Proportional to item frequency |
| Node labels | Top center, 10pt, truncated at 25 chars | Readable without excessive overlap |
| Community detection | `greedy_modularity_communities` | Fast, weight-aware, handles disconnected components |
| Layout algorithms | Spring (default, adaptive k), Kamada-Kawai, Circular | Spring best for clustering; KK for cleaner small graphs |
| Hover info | Name, occurrences, degree, top 3 neighbors | Detailed exploration without clutter |
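
The node-size and edge-width rows above can be read as two small mapping functions. The sketch below assumes a linear min-max normalization for node sizes and takes the two edge-weight cut-offs as explicit arguments, because the README states only the 10-50 px range and the three widths, not the exact formula or percentile values. Both function names are hypothetical.

```python
def scale_node_sizes(counts, min_px=10.0, max_px=50.0):
    """Linearly map occurrence counts onto the [min_px, max_px] marker range.

    Linear scaling is an assumption made for this sketch.
    """
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # All items equally frequent: use the midpoint to avoid dividing by zero.
        return {name: (min_px + max_px) / 2 for name in counts}
    span = max_px - min_px
    return {name: min_px + span * (n - lo) / (hi - lo) for name, n in counts.items()}


def edge_width(weight, low_cut, high_cut, widths=(0.8, 2.0, 3.5)):
    """Pick one of three line widths from two weight cut-offs.

    The cut-offs would come from percentiles of the edge weights;
    which percentiles the application uses is not stated here.
    """
    if weight <= low_cut:
        return widths[0]
    if weight <= high_cut:
        return widths[1]
    return widths[2]


sizes = scale_node_sizes({"seismic": 2, "machine learning": 10, "deep learning": 6})
widths = [edge_width(w, low_cut=2, high_cut=5) for w in (1, 3, 9)]
```

Grouping edges into three width tiers also fits the single-trace-per-tier rendering noted in the table, since a Plotly line trace carries one width for all of its segments.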
The text analysis module (Section 6) implements a three-stage NLP pipeline for abstract corpus analysis. In the interface, this pipeline is filter-aware: after the global filters are applied, the Text Analysis tab subsets the abstract dictionary to EIDs present in the filtered DataFrame. This means the abstract word cloud, TF-IDF terms, and topic model reflect the same filtered corpus used by the bibliometric tabs.
The combined stopword list contains 293 unique terms from three sources:
| Source | Count | Examples |
|---|---|---|
| English stopwords (`wordcloud.STOPWORDS`) | 192 | the, is, at, which, from, have |
| Spanish stopwords (custom set) | 69 | de, la, que, el, en, para, con, como |
| Academic filler words (custom set) | 38 | study, results, method, proposed, analysis, approach, data, model |
The Spanish stopwords support analysis of multilingual abstract corpora common in Latin American and European research. The academic filler words remove high-frequency domain-agnostic terms that carry low discriminative value in bibliometric contexts.
The compute_tfidf_terms() function uses sklearn.feature_extraction.text.TfidfVectorizer:
| Parameter | Value | Rationale |
|---|---|---|
| `max_features` | 1,000 | Vocabulary limit for computational efficiency |
| `stop_words` | `"english"` | scikit-learn's built-in English stopword list |
| `max_df` | 0.85 | Ignore terms appearing in > 85% of documents |
| `min_df` | 2 | Ignore terms appearing in fewer than 2 documents |
| `ngram_range` | (1, 2) | Include both unigrams and bigrams (e.g., "neural network") |
The function computes the mean TF-IDF score across all documents for each term, then returns the top N terms ranked by this average score. Requires a minimum of 3 non-empty abstracts.
The compute_topic_model() function supports two decomposition algorithms. Users select the method and number of topics in the interface. The number of topics is not optimized automatically; it is an exploratory parameter selected from 3 to 10, where lower values produce broader themes and higher values produce finer-grained themes.
Non-negative Matrix Factorization (NMF):
- Vectorizer: `TfidfVectorizer(max_features=2000, stop_words="english", max_df=0.85, min_df=2)`
- Parameters: `n_components` (user-configurable, 3-10), `random_state=42`, `max_iter=300`
- Produces additive, parts-based decomposition
- Generally produces more interpretable topics for scientific text
Latent Dirichlet Allocation (LDA):
- Vectorizer: `CountVectorizer(max_features=2000, stop_words="english", max_df=0.85, min_df=2)`
- Parameters: `n_components` (user-configurable, 3-10), `random_state=42`, `max_iter=20`
- Probabilistic generative model
Both algorithms return: (1) a list of topics, each with its top 10 words and their weights, (2) a document-topic assignment matrix, and (3) the EIDs corresponding to that matrix. Requires a minimum of max(5, n_topics) non-empty abstracts.
The Topic Modeling sub-tab displays:
| Output | Interpretation |
|---|---|
| Topic Summary table | One row per topic, listing the top weighted terms and their weights. Users interpret each topic by assigning a semantic label based on these terms. |
| Topic-word heatmap | Visualizes relative term weights across topics. Darker cells indicate stronger term-topic association. |
| Document Topic Assignments table | One row per abstract EID with dominant topic, dominant weight, and all per-topic weights. This can be exported as CSV for external validation or downstream analysis. |
Topic modeling is intended as exploratory thematic summarization, not definitive article classification. Results should be interpreted alongside the search strategy, active filters, and domain knowledge.
The consolidation feature (Phase 4) enables researchers to combine multiple Scopus searches for integrated analysis.
In union mode, the application:
- Loads all results from each selected search.
- Concatenates them into a single DataFrame with a `_source_search` column.
- Deduplicates by EID (Scopus unique identifier), keeping the first occurrence.
- Merges abstract dictionaries from all searches (preferring non-empty values).
- Applies the complete Phase 2 and Phase 3 analysis pipeline to the merged corpus.
- Provides dedicated "Export Consolidated Results" CSV and Excel buttons.
This mode is appropriate when the researcher wants to treat multiple searches as a single body of literature.
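
The union steps listed above can be sketched with plain dictionaries. This is an illustration under simplifying assumptions rather than the body of `build_consolidation_dataframe()`: it works on lists of records instead of a DataFrame, and the function name `merge_union` is hypothetical.

```python
def merge_union(searches):
    """Merge several stored searches into one deduplicated corpus."""
    merged_results = []
    seen_eids = set()
    merged_abstracts = {}
    for search in searches:
        for record in search.get("results", []):
            eid = record.get("eid")
            if eid in seen_eids:
                continue  # keep the first occurrence only
            seen_eids.add(eid)
            # Tag each kept record with the search it came from.
            merged_results.append({**record, "_source_search": search["name"]})
        for eid, text in search.get("abstracts", {}).items():
            if text and not merged_abstracts.get(eid):
                # A non-empty abstract replaces a missing or empty one.
                merged_abstracts[eid] = text
            else:
                # Otherwise keep whatever is already stored for this EID.
                merged_abstracts.setdefault(eid, text)
    return merged_results, merged_abstracts


search_a = {
    "name": "A",
    "results": [{"eid": "e1"}, {"eid": "e2"}],
    "abstracts": {"e1": "", "e2": "text two"},
}
search_b = {
    "name": "B",
    "results": [{"eid": "e2"}, {"eid": "e3"}],
    "abstracts": {"e1": "text one", "e3": "text three"},
}
rows, abstracts = merge_union([search_a, search_b])
```

Here `e2` appears in both searches but is kept once, attributed to search A, while the empty abstract for `e1` in search A is filled from search B.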
In comparison mode, the application:
- Loads results from each search separately, preserving source labels.
- Generates three comparative visualization tabs:
  - Timeline comparison: Overlaid line charts showing publications per year for each search, with distinct colors per search.
  - Keyword comparison: Grouped bar chart showing the globally top 15 keywords, with per-search frequency bars side by side.
  - Document overlap: A pairwise statistics table (showing shared, only-in-A, only-in-B counts for each pair) and a Venn diagram for 2-3 searches using `matplotlib_venn`. For more than 3 searches, only the statistics table is shown.
This mode is appropriate when the researcher wants to compare research topics, methodologies, or search strategies.
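
The pairwise statistics in the document overlap tab reduce to set operations on EIDs. A minimal sketch, with a hypothetical function name and output layout:

```python
from itertools import combinations


def pairwise_overlap(eid_sets):
    """Return shared / only-in-A / only-in-B counts for every pair of searches."""
    rows = []
    for (name_a, set_a), (name_b, set_b) in combinations(eid_sets.items(), 2):
        rows.append({
            "search_a": name_a,
            "search_b": name_b,
            "shared": len(set_a & set_b),   # documents found by both searches
            "only_a": len(set_a - set_b),   # documents found only by the first
            "only_b": len(set_b - set_a),   # documents found only by the second
        })
    return rows


overlap = pairwise_overlap({
    "A": {"e1", "e2", "e3"},
    "B": {"e2", "e3", "e4"},
    "C": {"e5"},
})
```

Unlike the Venn diagram, this table scales to any number of searches, because n searches simply yield n(n-1)/2 rows.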
CSV format: Flat file containing all fields from the ScopusSearch results. Encoded as UTF-8.
Excel format (.xlsx): Multi-sheet workbook:
| Sheet | Contents |
|---|---|
| Results | All search result fields (excluding internal columns prefixed with _) |
| Authors | Author name, Scopus author ID when available, and publication count, sorted by frequency descending |
| Keywords | Keyword and frequency count, lowercased, sorted by frequency descending |
| Summary | 8 aggregate metrics: search name, total results, year range, total citations, mean citations, unique journals, unique countries, unique authors |
Both CSV and Excel formats contain the same 8 columns:
| Column | Source |
|---|---|
| EID | Scopus unique identifier from search results |
| DOI | Digital Object Identifier |
| Title | Document title |
| Authors | Author names (;-separated) |
| Year | Publication year (integer) |
| Journal | Publication name |
| Abstract | Full text from AbstractRetrieval, falling back to description field from search |
| Keywords | Author keywords (`\|`-separated) |
Every Plotly chart: PNG at 2100x1500 pixels, scale 2x (~300 DPI). Every matplotlib figure: PNG at 300 DPI with tight bounding box. Download buttons appear directly below each chart.
Search names are sanitized before being used in filenames. Excel export failures and Plotly PNG export failures are caught in the interface and shown as warnings, so the rest of the analysis page remains usable.
When a topic model has been run, two additional CSV exports are available inside the Topic Modeling sub-tab:
| Export | Contents |
|---|---|
| Topic Summary | Topic label, top terms, and term weights |
| Document Topic Assignments | EID, dominant topic, dominant topic weight, and all topic weights |
All Scopus API calls are wrapped in try/except blocks that catch:
| Exception | Handling |
|---|---|
| `Scopus429Error` | Rate limit exceeded. Retry with exponential backoff up to 3 total attempts. Display warning during retries, error after exhaustion. |
| `ScopusHtmlError` | General API error. Display error with troubleshooting hints (check network, API key, InstToken). |
| `Exception` | Catch-all. Display error with exception details. |
Note: Exception classes are imported from pybliometrics.exception (not pybliometrics.scopus.exception).
The _api_call_with_retry(callable_fn) function implements exponential backoff:
- Attempt 1: Execute immediately
- Attempt 2: Wait 2 seconds, then retry
- Attempt 3: Wait 4 seconds, then retry
- After 3 failures: Raise the original exception
Wait times are computed as BACKOFF_BASE_SECONDS * (2 ** attempt) where BACKOFF_BASE_SECONDS = 2 and attempt starts at 0. This yields waits of 2s and 4s before the second and third attempts respectively.
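
The backoff schedule can be sketched as a small wrapper. To keep the example runnable without Scopus credentials, `RateLimitError` stands in for `Scopus429Error`, and the `sleep` parameter and `MAX_ATTEMPTS` constant are hypothetical additions. Only `BACKOFF_BASE_SECONDS` and the wait formula come from the description above.

```python
import time

BACKOFF_BASE_SECONDS = 2
MAX_ATTEMPTS = 3  # hypothetical name for the 3-attempt limit


class RateLimitError(Exception):
    """Stand-in for pybliometrics' Scopus429Error so the sketch runs offline."""


def api_call_with_retry(callable_fn, sleep=time.sleep):
    """Call callable_fn, retrying on rate-limit errors with exponential backoff."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return callable_fn()
        except RateLimitError:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # third failure: propagate the original exception
            # attempt 0 waits 2 s, attempt 1 waits 4 s
            sleep(BACKOFF_BASE_SECONDS * (2 ** attempt))


waits = []
calls = {"n": 0}


def flaky():
    """Fail twice with a rate-limit error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return "ok"


# Record the waits instead of sleeping so the demo finishes instantly.
result = api_call_with_retry(flaky, sleep=waits.append)
```

Passing the sleep function as an argument is a testing convenience in this sketch; it lets the schedule be checked without actually pausing.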
The abstract download function saves progress to disk every 25 abstracts. If the process is interrupted, previously downloaded abstracts are preserved. Re-running the download skips already-downloaded EIDs (those with non-empty values in the abstracts dict).
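
The resume-and-checkpoint behavior can be illustrated as follows. This sketch is not the application's `download_abstracts()`: the `fetch` and `save` callables are hypothetical stand-ins for the `AbstractRetrieval` call and the JSON write, and the progress bar is omitted.

```python
def download_abstracts_sketch(eid_list, fetch, save, existing=None, checkpoint_every=25):
    """Fetch missing abstracts, persisting progress at regular intervals."""
    abstracts = dict(existing or {})
    # Skip EIDs that already have a non-empty abstract.
    pending = [eid for eid in eid_list if not abstracts.get(eid)]
    for done, eid in enumerate(pending, start=1):
        abstracts[eid] = fetch(eid) or ""
        if done % checkpoint_every == 0:
            save(abstracts)  # periodic checkpoint
    if len(pending) % checkpoint_every != 0:
        save(abstracts)  # final save for the partial last batch
    return abstracts


eids = [f"e{i}" for i in range(60)]
existing = {"e0": "cached", "e1": ""}  # e1 is empty, so it is fetched again
fetched, snapshots = [], []


def fake_fetch(eid):
    """Pretend API call that records which EIDs were requested."""
    fetched.append(eid)
    return f"abstract for {eid}"


result = download_abstracts_sketch(
    eids, fake_fetch, lambda current: snapshots.append(len(current)), existing
)
```

With 59 pending documents, the sketch saves after the 25th and 50th fetch and once more at the end, so an interruption loses at most 24 downloaded abstracts.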
All error messages are displayed in English via st.error() and include actionable troubleshooting suggestions:
- Connection failures suggest checking institutional network, API key, or InstToken.
- Rate limit errors suggest waiting a few minutes before retrying.
- Missing data warnings inform the user which analyses could not be performed.
| Component | Technology | Role |
|---|---|---|
| Web framework | Streamlit 1.55 | Reactive web UI with widgets, layout, session state |
| API client | pybliometrics 4.4.1 | Scopus API communication (ScopusSearch, AbstractRetrieval, AuthorRetrieval) |
| Data manipulation | pandas 2.3 | DataFrame operations, groupby, value counts, field parsing |
| Numerical computing | NumPy 2.4 | Array operations, NaN handling |
| Component | Technology | Role |
|---|---|---|
| Interactive charts | Plotly 6.6 | Bar charts, histograms, line charts, choropleth maps, heatmaps |
| Static figures | matplotlib 3.10 | Word cloud rendering, Venn diagram rendering |
| Word clouds | wordcloud 1.9 | Word cloud generation from text and frequency dictionaries |
| Venn diagrams | matplotlib-venn 1.1 | 2-set and 3-set proportional Venn diagrams |
| Image export | Kaleido 1.2 | Chromium-based headless rendering of Plotly figures to PNG |
| Component | Technology | Role |
|---|---|---|
| TF-IDF | scikit-learn 1.8 (`TfidfVectorizer`) | Term frequency-inverse document frequency vectorization |
| NMF | scikit-learn 1.8 (`NMF`) | Non-negative matrix factorization for topic modeling |
| LDA | scikit-learn 1.8 (`LatentDirichletAllocation`) | Latent Dirichlet allocation for topic modeling |
| Count vectors | scikit-learn 1.8 (`CountVectorizer`) | Raw count vectorization for LDA topic modeling |
| Component | Technology | Role |
|---|---|---|
| Excel writing | openpyxl 3.1 | Multi-sheet .xlsx file generation |
| CSV writing | pandas (built-in) | UTF-8 encoded CSV generation |
The application uses the following Python standard library modules: os, json, uuid, time, io, hashlib, datetime, collections.Counter, itertools.combinations.
| Field | Value |
|---|---|
| Software name | ScopusLit |
| Version | 1.0.0 |
| Programming language | Python (>= 3.10) |
| Tested Python version | 3.14 |
| Operating systems | macOS, Linux, Windows (any OS supporting Python and Streamlit) |
| Size of software | Single file, approximately 3,005 lines of Python code |
| Dependencies | 12 Python packages (see Section 16) |
| External API | Scopus API via pybliometrics (requires API key) |
| Interface | Web browser (served locally by Streamlit) |
| Parallelism | Single-threaded (Streamlit execution model) |
| Data storage | Local JSON files in ./scopuslit_data/ |
| Repository | [To be added] |
| License | [To be determined] |
| Development institution | Universidad Industrial de Santander (UIS), Bucaramanga, Colombia |
ScopusLit has the potential to benefit the research community in several ways:
Lowering the barrier to bibliometric analysis. By integrating search, analysis, and visualization into a single browser-based tool with no programming requirement, ScopusLit makes quantitative literature analysis accessible to researchers who lack programming skills or familiarity with specialized bibliometric software.
Enabling reproducible bibliometric workflows. Each search is persisted as a self-contained JSON file that captures the query, execution date, full result set, and downloaded abstracts. This enables exact reproduction of analyses and facilitates sharing of bibliometric datasets between collaborators.
Supporting multilingual research communities. The inclusion of Spanish stopwords alongside English ones reflects the tool's origin at a Latin American institution and supports analysis of bibliographic corpora where abstracts may contain Spanish text, a common scenario in engineering and geosciences literature from Latin America and Spain.
Accelerating systematic review preparation. The search consolidation feature, with both union and comparison modes, directly supports the multi-query workflow typical of systematic reviews (PRISMA methodology), where researchers must execute multiple search strings across different conceptual facets and then analyze the combined and overlapping result sets.
Providing publication-ready outputs. Every visualization can be exported at 300 DPI, meeting the minimum resolution requirements of most scientific journals (typically 300 DPI for color figures). The multi-sheet Excel export provides immediately usable supplementary materials.
- Scopus-only: The tool is designed exclusively for the Scopus database. Support for Web of Science, PubMed, OpenAlex, or other databases is not included.
- API quota constraints: The Scopus API imposes rate limits (typically 6-9 requests per second) and weekly quotas (typically 5,000-20,000 requests depending on the API key type). Large searches (> 5,000 results) or extensive abstract downloads may exhaust quotas.
- No co-citation or bibliographic coupling analysis: The current version does not implement reference-based analyses (co-citation networks, bibliographic coupling), which require cited reference data not available from `ScopusSearch.results`.
- Single-user, local deployment: The application runs locally and does not support concurrent multi-user access or cloud deployment out of the box.
- Abstract-dependent text analysis: TF-IDF, topic modeling, and abstract word clouds require downloading full abstracts, which consumes one API call per document.
- Venn diagram limit: Document overlap visualization is limited to 2-3 searches due to limitations of the matplotlib-venn library. Larger comparisons use only the overlap statistics table.
- No BibTeX export: Direct export to BibTeX format for integration with reference managers (Zotero, Mendeley, EndNote) is not yet supported.
- Monolithic codebase: This final pre-refactor version remains a single `app.py` file for portability. The next development stage should split the application into dedicated modules for storage, API access, analysis, plotting, exports, and UI.
- Co-citation and bibliographic coupling analysis using `AbstractRetrieval.references`.
- Integration with OpenAlex or Semantic Scholar for open-access metadata enrichment.
- Cloud deployment template (e.g., Streamlit Community Cloud, Docker).
- BibTeX and RIS export for reference managers.
- Modular refactor of the monolithic application into maintainable packages.
- Automated tests for core analysis functions and export functions.
- Topic-model validation aids such as coherence scoring or perplexity diagnostics.
- Author collaboration internationalization metrics.
If you use ScopusLit in your research, please cite it as:
Arroyo, O. (2026). ScopusLit: An end-to-end Web-based tool for bibliometric analysis. SoftwareX, 34, 102733.
[License to be determined]
ScopusLit is developed at Universidad Industrial de Santander (UIS), Bucaramanga, Colombia.