diff --git a/CHANGELOG.md b/CHANGELOG.md deleted file mode 100644 index e69de29..0000000 diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 91c0ed6..330dc74 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -19,31 +19,26 @@ If you find a bug or have a feature request, please create an issue by following 1. **Fork the Repository**: Fork the repository to your own GitHub account. 2. **Create a Branch**: Create a new branch for your changes. - - Use a descriptive name for your branch, e.g., `feature/add-new-feature` or `bugfix/fix-issue`. 3. **Make Your Changes**: Implement your changes in your branch. 4. **Write Tests**: If applicable, write tests for your changes. 5. **Commit Your Changes**: Write clear and descriptive commit messages. - - **Commit Message Format**: Use the present tense. Example: `Add new feature` instead of `Added new feature`. ### Submitting a Pull Request -1. **Push Your Branch**: Push your branch to your forked repository. -2. **Open a Pull Request**: Open a pull request (PR) to the `main` branch of the original repository. +1. **Contact Jonathan Starr**: The project manager (jring-o), who can loop you into our regular Wednesday working sessions. Send him a message. +2. **Push Your Branch**: Push your branch to your forked repository. +3. **Open a Pull Request**: Open a pull request (PR) to the `main` branch of the original repository. - **Title**: Provide a descriptive title for your PR. - **Description**: Include a detailed description of your changes, the motivation behind them, and any related issues. -3. **Review Process**: +4. **Review Process**: - **Automatic Checks**: Your PR will undergo automated checks. - **Review by Maintainers**: Your PR will be reviewed by the maintainers. They may request changes or provide feedback. -### Code of Conduct - -Please note that this project adheres to a [Code of Conduct](link-to-code-of-conduct). By participating, you are expected to uphold this code. - ### Additional Information - **Human Verification**: During PR reviews, we strive to ensure that any contributions (especially those generated using language models) are thoroughly reviewed and verified by human maintainers for accuracy and relevance. - **Documentation**: Ensure that your changes are well-documented. Update any relevant documentation in the project. -- **Contact**: If you have any questions or need further assistance, feel free to [contact us](link-to-contact-information). +- **Contact**: If you have any questions or need further assistance, feel free to contact Jonathan Starr (jring-o). ## Acknowledgments diff --git a/LICENSE b/LICENSE index 261eeb9..f49a4e1 100644 --- a/LICENSE +++ b/LICENSE @@ -198,4 +198,4 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and - limitations under the License. + limitations under the License. \ No newline at end of file diff --git a/README.md b/README.md index 1d2dc56..9cb6e92 100644 --- a/README.md +++ b/README.md @@ -3,21 +3,66 @@ MOSS is a project of [OSSci](https://www.opensource.science/), an initiative of ## Overview - This project aims to visualize the intersection of open source software and scientific research. -> The Map of Open Source Science is a proof of concept and as such, nothing is accurate. +This Map of Open Source Science is a proof of concept right now and as such, nothing is accurate. 
+This project aims to map open source software and scientific (e.g., peer-reviewed) research via one comprehensive project. This repository houses the backend (e.g., database, API endpoints, etc.) as well as various front-end frameworks (in /frontends), which allow for cool visualizations. -## [Getting Started](./scripts/README.md) +## Running +To start the backend, which is deployed on a production basis at [backend.some-domain-we-need-to-buy.com](backend.some-domain-we-need-to-buy.com) and on a beta/development basis at [beta.some-domain-we-need-to-buy.com](beta.some-domain-we-need-to-buy.com), simply run -## Goal -Here is an earlier iteration built using Kumu. We want to build something similar but better. +> (Instructions for dependency installation go here; contact Mark Eyer for details.) +> python3 main.py + +To start the frontend(s), of which the primary web-based one can be found in production at [some-domain-we-need-to-buy.com](some-domain-we-need-to-buy.com), follow the instructions in the /src/frontends subdirectories. In general, that directory contains an early iteration of a front-end built using Kumu; we want to build something similar but better. - [kumu instance](https://embed.kumu.io/6cbeee6faebd8cc57590da7b83c4d457#default) - [demo video](https://www.youtube.com/watch?v=jZyLSRCba_M) -## Data Sources +## File Structure +``` +├── CONTRIBUTING.md **Outlines how to contribute to the project. Still under construction.** +├── LICENSE **Standard Apache2 license.** +├── README.md **Information about this repository.** +├── docs **This is the directory where all the documentation is stored.** +├── main.py **Launch point for the app's backend. Run python3 main.py** +├── mkdocs.yml **Documentation configuration settings. (TODO: Determine if this can be moved into /docs)** +├── pdm.lock **Project dependency file, generated via PDM.** +├── pyproject.toml **Project dependency settings, used by PDM.** +├── src **Source code directory.** +│ ├── backend **The backend, a standalone hub. Internally organized using [hexagonal architecture](https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)).** +│ │ ├── administration **Used by repository maintainers to hold code for internal administrative tools.** +│ │ ├── biz_logic **The main business logic "guts" of the application.
Organized around the ["Harvest, Bottle, Mix"](https://docs.google.com/presentation/d/1jE0-VBikgAd-E6XSRTEkt_RxI190uVlsWg11fB6YgXw/edit?usp=sharing) architecture developed by Schwartz et al.** +│ │ │ ├── bottle +│ │ │ ├── harvest +│ │ │ │ ├── endpoint.py **The app uses RESTful endpoints to connect with frontend spokes, via FastAPI.** +│ │ │ │ └── otherfiles.py **A bit tongue in cheek, otherfiles.py is a placeholder for the various other files related to business logic (such as ETL pipelines).** +│ │ │ ├── mix +│ │ │ └── scripts **Directory for miscellaneous stand-alone scripts which predate our overall architecture, primarily used for harvesting.** +│ │ ├── notification **The module for centralized notifications (e.g., sending emails when background scripts complete).** +│ │ └── persistence **The module for all things database and data persistence related.** +│ └── frontends +│ └── moss-react-app **A standalone React-based website "spoke" which makes RESTful API calls to the backend "hub".** +└── tests **A directory/module which contains all unit/integration tests for src/** +``` ## Contributing - -We are using the [fork and pull](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/getting-started/about-collaborative-development-models#fork-and-pull-model) collaborative development model, we welcome [pull requests](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork). -- Check issues for anything to work on + +We are still in the process of writing up all our formal procedures on how to contribute code to this repository. A rough draft is located [here](CONTRIBUTING.md). While we accept outside pull requests, the best way to get your contribution accepted is to contact Jon Starr (jring-o), who can connect you with our weekly technical meetings and give you a brief orientation. + +We follow NumFOCUS's [code of conduct](https://numfocus.org/code-of-conduct). + +## Core Maintainers + +Here are the people who regularly attend weekly meetings where we discuss the technical details of the project. People are listed alphabetically by last/family name. (TODO: Add contact email and/or GitHub username for each person.)
+ +* Dave Bunten +* Mark Eyer +* Victor Lu +* Guy Pavlov +* Sam Schwartz (samuel.d.schwartz@gmail.com) +* Jon Starr +* Peculiar Umeh +* Max Vasiliev +* Boris Veytsman +* Susie Yu diff --git a/readme-assets/bloom-setup0.png b/docs/readme-assets/bloom-setup0.png similarity index 100% rename from readme-assets/bloom-setup0.png rename to docs/readme-assets/bloom-setup0.png diff --git a/readme-assets/bloom-setup1.png b/docs/readme-assets/bloom-setup1.png similarity index 100% rename from readme-assets/bloom-setup1.png rename to docs/readme-assets/bloom-setup1.png diff --git a/readme-assets/bloom-setup2.png b/docs/readme-assets/bloom-setup2.png similarity index 100% rename from readme-assets/bloom-setup2.png rename to docs/readme-assets/bloom-setup2.png diff --git a/readme-assets/ecosystms-setup0.png b/docs/readme-assets/ecosystms-setup0.png similarity index 100% rename from readme-assets/ecosystms-setup0.png rename to docs/readme-assets/ecosystms-setup0.png diff --git a/readme-assets/ecosystms-setup1.png b/docs/readme-assets/ecosystms-setup1.png similarity index 100% rename from readme-assets/ecosystms-setup1.png rename to docs/readme-assets/ecosystms-setup1.png diff --git a/readme-assets/ecosystms-setup2.png b/docs/readme-assets/ecosystms-setup2.png similarity index 100% rename from readme-assets/ecosystms-setup2.png rename to docs/readme-assets/ecosystms-setup2.png diff --git a/readme-assets/moss-import0.png b/docs/readme-assets/moss-import0.png similarity index 100% rename from readme-assets/moss-import0.png rename to docs/readme-assets/moss-import0.png diff --git a/readme-assets/neo4j-allow-file-imports1.png b/docs/readme-assets/neo4j-allow-file-imports1.png similarity index 100% rename from readme-assets/neo4j-allow-file-imports1.png rename to docs/readme-assets/neo4j-allow-file-imports1.png diff --git a/readme-assets/neo4j-setup0.png b/docs/readme-assets/neo4j-setup0.png similarity index 100% rename from readme-assets/neo4j-setup0.png rename to docs/readme-assets/neo4j-setup0.png diff --git a/readme-assets/neo4j-setup1.png b/docs/readme-assets/neo4j-setup1.png similarity index 100% rename from readme-assets/neo4j-setup1.png rename to docs/readme-assets/neo4j-setup1.png diff --git a/readme-assets/neo4j-setup2.png b/docs/readme-assets/neo4j-setup2.png similarity index 100% rename from readme-assets/neo4j-setup2.png rename to docs/readme-assets/neo4j-setup2.png diff --git a/readme-assets/neo4j-setup3.png b/docs/readme-assets/neo4j-setup3.png similarity index 100% rename from readme-assets/neo4j-setup3.png rename to docs/readme-assets/neo4j-setup3.png diff --git a/readme-assets/neo4j-setup4.png b/docs/readme-assets/neo4j-setup4.png similarity index 100% rename from readme-assets/neo4j-setup4.png rename to docs/readme-assets/neo4j-setup4.png diff --git a/readme-assets/neo4j-setup5.png b/docs/readme-assets/neo4j-setup5.png similarity index 100% rename from readme-assets/neo4j-setup5.png rename to docs/readme-assets/neo4j-setup5.png diff --git a/readme-assets/neo4j-setup6.png b/docs/readme-assets/neo4j-setup6.png similarity index 100% rename from readme-assets/neo4j-setup6.png rename to docs/readme-assets/neo4j-setup6.png diff --git a/readme-assets/neo4j-setup7.png b/docs/readme-assets/neo4j-setup7.png similarity index 100% rename from readme-assets/neo4j-setup7.png rename to docs/readme-assets/neo4j-setup7.png diff --git a/main.py b/main.py new file mode 100644 index 0000000..725baf7 --- /dev/null +++ b/main.py @@ -0,0 +1,103 @@ +""" +Copyright 2025 the MOSS project. 
+Point person for this file: Sam Schwartz (samuel.d.schwartz@gmail.com) +Description: +This is the main.py file, which should be run to start the app. +This can be done by running the following command: + +python3 main.py +""" + +import uvicorn +from fastapi import FastAPI +from src.backend.biz_logic import router as biz_router + + +def _ini_api_app() -> FastAPI: + """Helper/factory function for initializing and returning a FastAPI instance. + + Returns: + FastAPI: a fresh FastAPI instance + """ + app = FastAPI() + return app + + +def _ini_hex_administration(app: FastAPI) -> FastAPI: + """Helper function for initiating anything to do with CLI administration. + + Args: + app (FastAPI): The FastAPI app + + Returns: + FastAPI: The app, possibly changed with CLI-related modifications. + """ + return app + + +def _ini_hex_biz(app: FastAPI) -> FastAPI: + """Helper function for initiating anything to do with business logic. + Specifically, adding RESTful endpoints, routers, and versioning. + + Args: + app (FastAPI): The application + + Returns: + FastAPI: The app, now with base routes added. + """ + app.include_router( + biz_router, + prefix="/v1", + tags=["v1"], + responses={404: {"description": "Not found"}}, + ) + + @app.get("/") + def read_root(): + return {"Hello": "World"} + + return app + + +def _ini_hex_notification(app: FastAPI) -> FastAPI: + """Helper function for setting up any notifications to the application. + + Args: + app (FastAPI): The application + + Returns: + FastAPI: The app, possibly changed with notification-related modifications. + """ + return app + + +def _ini_hex_persistence(app: FastAPI) -> FastAPI: + """Helper function for setting up database connections for the application. + + Args: + app (FastAPI): The application + + Returns: + FastAPI: The app, possibly changed with database-related modifications. + """ + return app + + + +def main() -> FastAPI: + """Initializes the app when called from the command line. + + Returns: + FastAPI: The FastAPI app for the API server to serve. + """ + app = _ini_api_app() + _ini_hex_persistence(app) + _ini_hex_notification(app) + _ini_hex_administration(app) + _ini_hex_biz(app) + return app + + +if __name__ == "__main__": + app = main() + uvicorn.run(app, host="0.0.0.0", port=8000) \ No newline at end of file diff --git a/pyproject.toml b/pyproject.toml index da15740..607c30e 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -17,6 +17,25 @@ readme = "README.md" license = {text = "MIT"} +[tool.ruff] +line-length = 88 +lint.select = [ + "F", # pyflakes rules + "E", # pycodestyle error rules + "W", # pycodestyle warning rules + "B", # flake8-bugbear rules + "I", # isort rules +] +lint.ignore = [ + "E501", # line-too-long +] + +[tool.ruff.format] +indent-style = "space" +quote-style = 'single' + + + [tool.pdm] distribution = false diff --git a/src/backend/administration/__init__.py b/src/backend/administration/__init__.py new file mode 100644 index 0000000..ba60d66 --- /dev/null +++ b/src/backend/administration/__init__.py @@ -0,0 +1,4 @@ +""" +The backend adheres to a hex architecture. See: https://en.wikipedia.org/wiki/Hexagonal_architecture_(software) +This module contains code for internal tools for the maintainers/administrators to do maintenance. +""" \ No newline at end of file diff --git a/src/backend/biz_logic/__init__.py b/src/backend/biz_logic/__init__.py new file mode 100644 index 0000000..0645fdf --- /dev/null +++ b/src/backend/biz_logic/__init__.py @@ -0,0 +1,21 @@ +""" +The backend adheres to a hex architecture.
See: https://en.wikipedia.org/wiki/Hexagonal_architecture_(software) +This module contains code for all business logic of the application. +The business logic of the application is primarily driven by the "Harvest, Bottle, Mix" architecture outlined by Sam Schwartz. +See this slide deck for more details: https://docs.google.com/presentation/d/1jE0-VBikgAd-E6XSRTEkt_RxI190uVlsWg11fB6YgXw/edit?usp=sharing +In particular, this module uses FastAPI to create RESTful endpoints. +The "unsightly cables behind the desk" which connect routers to various sections of the code are also included in these +__init__.py files. +""" + +from fastapi import APIRouter + +router = APIRouter( + prefix="", + responses={404: {"description": "Not found"}}, +) + +from ..biz_logic.harvest.endpoint import router as harvest_router + +router.include_router(harvest_router) + diff --git a/src/backend/biz_logic/bottle/__init__.py b/src/backend/biz_logic/bottle/__init__.py new file mode 100644 index 0000000..159b0bc --- /dev/null +++ b/src/backend/biz_logic/bottle/__init__.py @@ -0,0 +1,17 @@ +""" +The Bottle module is for anything related to extracting repository data from various sources, and, +specifically, providing RESTful CRUD (create, read, update, delete) operations relating to this repository data, +providing the interface with our persistence layer. + +These sources could include: + GitHub's multiple APIs (default repo information, contributor networks, SBOMs, etc.) + Google's BigQuery data about a repository + PyPI data about a repository + Data provided by cloning the repository and mining with PyDriller + Custom data provided by a user +and so on. + +Note: Each of these data bottles is based on a proper subset of repositories stored in a harvest. A bottle can contain +data for one repository, or it can contain the data for many repositories. Each bottle will contain the same fields for +all repositories. It will also contain meta information about when the bottling happened. +""" \ No newline at end of file diff --git a/src/backend/biz_logic/harvest/__init__.py b/src/backend/biz_logic/harvest/__init__.py new file mode 100644 index 0000000..f28bbf1 --- /dev/null +++ b/src/backend/biz_logic/harvest/__init__.py @@ -0,0 +1,16 @@ +""" +The Harvest module is for anything related to extracting raw lists of repositories from various sources, and, +specifically, providing RESTful CRUD (create, read, update, delete) operations relating to these lists of repositories, +providing the interface with our persistence layer. + +These sources could include: + GitHub search results + Spack and other build system configuration files + arXiv + University websites + URLs within scientific papers +and so on. + +Note: Harvesting only refers to repositories (and the metadata about how they were harvested) themselves; harvesting +does not include any data associated with the repository. (That comes in the bottling stage.)
+""" \ No newline at end of file diff --git a/src/backend/biz_logic/harvest/endpoint.py b/src/backend/biz_logic/harvest/endpoint.py new file mode 100644 index 0000000..b445964 --- /dev/null +++ b/src/backend/biz_logic/harvest/endpoint.py @@ -0,0 +1,103 @@ +from fastapi import APIRouter, Depends, HTTPException +from sqlalchemy import select +from sqlalchemy.ext.asyncio import AsyncSession +from src.backend.persistence.db_session import DBSession +from src.backend.biz_logic.harvest.harvest_crud_types import CreateHarvest, UpdateHarvest +from datetime import datetime +from random import random + +router = APIRouter( + prefix="/harvest", + tags=["harvest"], + responses={404: {"description": "Not found"}}, +) + +@router.post( + "/" +) +async def create_harvest( + harvest: CreateHarvest, + db_session: AsyncSession = Depends(DBSession.get_db_session) +): + # Do buisness processing, if any + # Add information to database via the db_session which connects with persistance data + # Return the full harvest (now with a database ID and/or timestamps) + + """ + Example: + processed_harvest = process_for_database_insertion(harvest) + db_session.add(processed_harvest) + await db_session.commit() + await db_session.refresh(db_transaction) + return db_transaction + """ + harvest = harvest.model_dump() + harvest["id"] = int(100*random()) + harvest["created_on"] = datetime.now() + harvest["last_update"] = datetime.now() + return harvest + + +@router.get( + "/{harvest_id}", +) +async def get_harvest( + harvest_id: int, session: AsyncSession = Depends(DBSession.get_db_session) +): + # Get data from database based on the passed ID + """ + Example: + stmt = select(DBHarvest).where(DBHarvest.id == harvest_id).distinct() + try: + result = await session.scalars(stmt) + result = result.one() + return result + except Exception as e: + raise HTTPException( + status_code=404, detail=f"Harvest '{harvest_id}' not found.\n {e}" + ) + """ + return f"Harvest {harvest_id} information from the db is returned here." 
+ + +@router.get( + "/{harvest_id}/repos", +) +async def get_harvest_repositories( + harvest_id: int, session: AsyncSession = Depends(DBSession.get_db_session) +): + # Get data from the database based on the passed ID + + return ["example/repo1", "otherexample/repo2"] + + +@router.put( + "/{harvest_id}" +) +async def update_harvest( + harvest_id: int, + harvest: UpdateHarvest, + session: AsyncSession = Depends(DBSession.get_db_session), +): + """ + Example: + + stmt = select(DBHarvest).filter(DBHarvest.id == harvest_id) + try: + result = await session.scalars(stmt) + db_harvest = result.one() + except: + raise HTTPException( + status_code=404, detail=f"Transaction {harvest_id} not found" + ) + + for key, value in harvest.model_dump(exclude_unset=True).items(): + setattr(db_harvest, key, value) + + await session.commit() + await session.refresh(db_harvest) + return db_harvest + """ + harvest = harvest.model_dump() + harvest["last_update"] = datetime.now() + return harvest diff --git a/src/backend/biz_logic/harvest/harvest_crud_types.py b/src/backend/biz_logic/harvest/harvest_crud_types.py new file mode 100644 index 0000000..30576e3 --- /dev/null +++ b/src/backend/biz_logic/harvest/harvest_crud_types.py @@ -0,0 +1,16 @@ +from typing import Optional +from pydantic import BaseModel +from datetime import datetime +from .status import Status + +class CreateHarvest(BaseModel): + name: Optional[str] = None + description: Optional[str] = None + status: Optional[Status] = None + initial_repositories: Optional[list[str]] = None + +class UpdateHarvest(BaseModel): + name: Optional[str] = None + status: Optional[Status] = None + repositories: Optional[list[str]] = None + \ No newline at end of file diff --git a/src/backend/biz_logic/harvest/status.py b/src/backend/biz_logic/harvest/status.py new file mode 100644 index 0000000..0a47511 --- /dev/null +++ b/src/backend/biz_logic/harvest/status.py @@ -0,0 +1,4 @@ +from enum import Enum +class Status(str, Enum): + in_progress = "In Progress" + finished = "Finished" \ No newline at end of file diff --git a/src/backend/biz_logic/mix/__init__.py b/src/backend/biz_logic/mix/__init__.py new file mode 100644 index 0000000..75445b8 --- /dev/null +++ b/src/backend/biz_logic/mix/__init__.py @@ -0,0 +1,12 @@ +""" +The Mix module is for anything related to extracting, joining, transforming, and loading data from one or more +harvests or bottles and, specifically, providing RESTful CRUD (create, read, update, delete) operations relating to this +transformed data, providing the interface with our persistence layer. + +Included in the mix module will be "cocktail recipes," which correspond to commonly requested datasets / user queries.
+Example recipes might include: + +* n bottles of GitHub contributor data from harvest x, corresponding to the last n bottles of data collected +* 1 bottle of GitHub base data from harvest x + 1 bottle of pypi data from harvest x +* The repositories in common between harvest x and harvest y, where x and y are both scrapes of university websites but at two different points in time +""" \ No newline at end of file diff --git a/scripts/README.md b/src/backend/biz_logic/scripts/jring_o/README.md similarity index 100% rename from scripts/README.md rename to src/backend/biz_logic/scripts/jring_o/README.md diff --git a/scripts/ecosyst.ms-api.py b/src/backend/biz_logic/scripts/jring_o/ecosyst.ms-api.py similarity index 100% rename from scripts/ecosyst.ms-api.py rename to src/backend/biz_logic/scripts/jring_o/ecosyst.ms-api.py diff --git a/scripts/import-db-aura b/src/backend/biz_logic/scripts/jring_o/import-db-aura similarity index 100% rename from scripts/import-db-aura rename to src/backend/biz_logic/scripts/jring_o/import-db-aura diff --git a/scripts/import-db-neo4j b/src/backend/biz_logic/scripts/jring_o/import-db-neo4j similarity index 100% rename from scripts/import-db-neo4j rename to src/backend/biz_logic/scripts/jring_o/import-db-neo4j diff --git a/scripts/repo_cite/README.md b/src/backend/biz_logic/scripts/jring_o/repo_cite/README.md similarity index 100% rename from scripts/repo_cite/README.md rename to src/backend/biz_logic/scripts/jring_o/repo_cite/README.md diff --git a/src/backend/biz_logic/scripts/jring_o/repo_cite/repo_cite.py b/src/backend/biz_logic/scripts/jring_o/repo_cite/repo_cite.py new file mode 100644 index 0000000..45c04a0 --- /dev/null +++ b/src/backend/biz_logic/scripts/jring_o/repo_cite/repo_cite.py @@ -0,0 +1,1712 @@ +import requests +import json +import csv +import time +import re +import base64 +import logging +import os +import argparse +import uuid +from dotenv import load_dotenv +from datetime import datetime, timedelta, timezone +from tqdm import tqdm +import urllib.parse + +# Logging Handler to work with tqdm +class TqdmLoggingHandler(logging.Handler): + def __init__(self, level=logging.NOTSET): + super().__init__(level) + + def emit(self, record): + try: + msg = self.format(record) + tqdm.write(msg) + self.flush() + except Exception: + self.handleError(record) + +# Remove all handlers associated with the root logger object. +for handler in logging.root.handlers[:]: + logging.root.removeHandler(handler) + +# Configure logging to use the custom handler +logging.basicConfig( + level=logging.INFO, # Set to DEBUG for more detailed logs + format='%(asctime)s - %(levelname)s - %(message)s', + handlers=[TqdmLoggingHandler()] +) + +# Constants +GITHUB_API_URL = "https://api.github.com" +OPENALEX_API_URL = "https://api.openalex.org" +MAX_RETRIES = 3 +RETRY_DELAY = 2 # seconds + +def github_api_request(url, headers, params=None): + """ + Sends a GET request to the GitHub API with rate limit handling. 
+ """ + for attempt in range(1, MAX_RETRIES + 1): + logging.debug(f"Attempt {attempt} for URL: {url}") + try: + response = requests.get( + url, + headers=headers, + params=params, + timeout=10 + ) + response.raise_for_status() + except requests.exceptions.Timeout: + logging.error(f"Timeout occurred for URL: {url}") + if attempt == MAX_RETRIES: + raise + else: + time.sleep(RETRY_DELAY) + continue + except requests.exceptions.RequestException as e: + logging.error(f"Request exception: {e}") + if attempt == MAX_RETRIES: + raise + else: + time.sleep(RETRY_DELAY) + continue + + logging.debug(f"Response status code: {response.status_code}") + if response.status_code == 200: + logging.debug("Successful response.") + return response.json(), response.headers + elif response.status_code == 403 and 'X-RateLimit-Remaining' in response.headers: + if response.headers['X-RateLimit-Remaining'] == '0': + reset_time = int(response.headers['X-RateLimit-Reset']) + sleep_time = max(reset_time - int(time.time()), 0) + 1 + logging.warning( + f"Rate limit exceeded. Sleeping for {sleep_time} seconds." + ) + time.sleep(sleep_time) + continue + else: + logging.error(f"Error: {response.status_code} - {response.reason}") + if attempt == MAX_RETRIES: + response.raise_for_status() + else: + time.sleep(RETRY_DELAY) + continue + raise Exception( + f"Failed to get a successful response after {MAX_RETRIES} attempts." + ) + +def get_next_link(headers): + """ + Parses the 'Link' header from GitHub API response to find the next page URL. + """ + link_header = headers.get('Link', '') + if not link_header: + return None + links = link_header.split(',') + for link in links: + parts = link.split(';') + if len(parts) < 2: + continue + url_part = parts[0].strip() + rel_part = parts[1].strip() + if rel_part == 'rel="next"': + next_url = url_part.lstrip('<').rstrip('>') + return next_url + return None + +def search_repositories_with_queries(query_terms, headers): + """ + Searches GitHub repositories based on query terms and records matching queries. + """ + repositories = {} + for query_term in query_terms: + params = {'q': query_term, 'per_page': 100} + url = f"{GITHUB_API_URL}/search/repositories" + while url: + logging.debug( + f"Searching repositories with URL: {url} and params: {params}" + ) + try: + data, headers_response = github_api_request(url, headers, params) + except Exception as e: + logging.error(f"Error searching repositories: {e}") + break + if data: + items = data.get('items', []) + logging.info( + f"Found {len(items)} repositories in this page for query '{query_term}'." + ) + for repo in items: + repo_id = repo.get('id') + if repo_id in repositories: + repositories[repo_id]['queries'].add(query_term) + else: + repositories[repo_id] = { + 'repo_data': repo, + 'queries': set([query_term]) + } + next_url = get_next_link(headers_response) + url = next_url + params = None # Parameters are only needed for the initial request + else: + break + return repositories + +def extract_doi_from_repo(owner, repo_name, headers): + """ + Attempts to extract the DOI of the associated paper from the repository. 
+ """ + # Try to get README content + readme_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/readme" + try: + readme_data, _ = github_api_request(readme_url, headers) + if readme_data and 'content' in readme_data: + readme_content = base64.b64decode(readme_data['content']).decode('utf-8', errors='ignore') + doi_match = re.search(r'(10\.\d{4,9}/[-._;()/:A-Z0-9]+)', readme_content, re.I) + if doi_match: + doi = doi_match.group(1) + logging.info(f"DOI found in README: {doi}") + return doi + except Exception as e: + logging.warning(f"Could not retrieve README for {owner}/{repo_name}: {e}") + # List repository contents to find CITATION.cff with any capitalization + contents_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/contents" + try: + contents, _ = github_api_request(contents_url, headers) + if contents and isinstance(contents, list): + for content in contents: + if content['name'].lower() == 'citation.cff': + citation_url = content['url'] + try: + citation_data, _ = github_api_request(citation_url, headers) + if citation_data and 'content' in citation_data: + citation_content = base64.b64decode(citation_data['content']).decode('utf-8', errors='ignore') + doi_match = re.search(r'(10\.\d{4,9}/[-._;()/:A-Z0-9]+)', citation_content, re.I) + if doi_match: + doi = doi_match.group(1) + logging.info(f"DOI found in CITATION.cff: {doi}") + return doi + except Exception as e: + logging.warning(f"Could not retrieve {content['name']} for {owner}/{repo_name}: {e}") + except Exception as e: + logging.warning(f"Could not retrieve contents for {owner}/{repo_name}: {e}") + # Try to get repository description + repo_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}" + try: + repo_data, _ = github_api_request(repo_url, headers) + description = repo_data.get('description', '') + doi_match = re.search(r'(10\.\d{4,9}/[-._;()/:A-Z0-9]+)', description, re.I) + if doi_match: + doi = doi_match.group(1) + logging.info(f"DOI found in repository description: {doi}") + return doi + except Exception as e: + logging.warning(f"Could not retrieve repository data for {owner}/{repo_name}: {e}") + logging.info(f"No DOI found for repository {owner}/{repo_name}") + return None + +def get_paper_details_from_openalex(doi): + """ + Fetches paper details from OpenAlex using the DOI. + """ + doi_formatted = doi.lower() + if not doi_formatted.startswith('10.'): + logging.warning(f"Invalid DOI format: {doi}") + return None + url = f"{OPENALEX_API_URL}/works/doi:{doi_formatted}" + try: + response = requests.get(url) + response.raise_for_status() + paper_data = response.json() + logging.info(f"Paper details retrieved from OpenAlex for DOI: {doi}") + # Extract domains (concepts with level 0) + concepts = paper_data.get('concepts', []) + domains = [concept['display_name'] for concept in concepts if concept.get('level') == 0] + paper_data['domains'] = domains + return paper_data + except requests.RequestException as e: + logging.error(f"Error fetching paper details from OpenAlex: {e}") + return None + +def get_authors_and_institutions(paper_data): + """ + Extracts authors' information including names, ORCID, institutions, author IDs, and initializes other_papers list. 
+ """ + authors_info = [] + authorships = paper_data.get('authorships', []) + for authorship in authorships: + author = authorship.get('author', {}) + institutions = authorship.get('institutions', []) + author_name = author.get('display_name') + orcid = author.get('orcid') + institution_names = [inst.get('display_name') for inst in institutions] + authors_info.append({ + 'author_name': author_name, + 'orcid': orcid, + 'institutions': institution_names, + 'author_id': author.get('id'), + 'other_papers': [] # Initialize an empty list for other papers + }) + return authors_info + +def get_other_papers_by_authors(authors_info, doi): + """ + Retrieves other papers for each author and updates the authors_info list. + """ + for author in authors_info: + author_id = author.get('author_id') + author_name = author.get('author_name') + if author_id: + url = f"{OPENALEX_API_URL}/works" + params = { + 'filter': f'authorships.author.id:{author_id}', + 'per-page': 200, + 'page': 1, + 'cursor': '*' + } + all_papers = [] + while True: + try: + response = requests.get(url, params=params) + response.raise_for_status() + data = response.json() + papers = data.get('results', []) + for paper in papers: + # Exclude the paper being analyzed + paper_doi = paper.get('doi') or '' + if paper_doi.lower() != doi.lower(): + paper_info = { + 'title': paper.get('title'), + 'publication_year': paper.get('publication_year'), + 'doi': paper_doi, + 'concepts': [concept['display_name'] for concept in paper.get('concepts', [])] + } + all_papers.append(paper_info) + if 'next_cursor' in data.get('meta', {}) and data['meta']['next_cursor']: + params['cursor'] = data['meta']['next_cursor'] + else: + break + except requests.RequestException as e: + logging.error(f"Error fetching works for author {author_name}: {e}") + break + author['other_papers'] = all_papers + logging.info(f"Retrieved {len(all_papers)} other papers for author: {author_name}") + return authors_info # Return updated authors_info + +def get_first_degree_citations(paper_data): + """ + Retrieves first-degree citations (papers citing the paper being analyzed). + """ + first_degree_citations = [] + cited_by_count = paper_data.get('cited_by_count', 0) + if cited_by_count > 0: + cited_by_api_url = paper_data.get('cited_by_api_url') + cursor = '*' + while True: + params = {'per-page': 200, 'cursor': cursor} + try: + response = requests.get(cited_by_api_url, params=params) + response.raise_for_status() + data = response.json() + papers = data.get('results', []) + for paper in papers: + paper_info = { + 'title': paper.get('title'), + 'authors': [auth['author']['display_name'] for auth in paper.get('authorships', [])], + 'publication_year': paper.get('publication_year'), + 'doi': paper.get('doi'), + 'concepts': [concept['display_name'] for concept in paper.get('concepts', [])], + 'cited_by_count': paper.get('cited_by_count', 0), + 'cited_by_api_url': paper.get('cited_by_api_url', '') + } + first_degree_citations.append(paper_info) + if data.get('meta', {}).get('next_cursor'): + cursor = data['meta']['next_cursor'] + else: + break + except requests.RequestException as e: + logging.error(f"Error fetching first-degree citations: {e}") + break + else: + logging.info("No first-degree citations found.") + logging.info(f"{len(first_degree_citations)} first-degree citations retrieved.") + return first_degree_citations + +def get_second_degree_citations(first_degree_citations): + """ + Retrieves second-degree citations and maps them to the first-degree papers they cite. 
+ """ + second_degree_citations = [] + for first_degree_paper in first_degree_citations: + first_paper_title = first_degree_paper.get('title') + first_paper_doi = first_degree_paper.get('doi') + cited_by_count = first_degree_paper.get('cited_by_count', 0) + cited_by_api_url = first_degree_paper.get('cited_by_api_url', '') + if cited_by_count > 0 and cited_by_api_url: + cursor = '*' + while True: + params = {'per-page': 200, 'cursor': cursor} + try: + response = requests.get(cited_by_api_url, params=params) + response.raise_for_status() + data = response.json() + papers = data.get('results', []) + for second_paper in papers: + paper_info = { + 'title': second_paper.get('title'), + 'authors': [auth['author']['display_name'] for auth in second_paper.get('authorships', [])], + 'publication_year': second_paper.get('publication_year'), + 'doi': second_paper.get('doi'), + 'concepts': [concept['display_name'] for concept in second_paper.get('concepts', [])], + 'cited_by_count': second_paper.get('cited_by_count', 0), + 'cited_by_api_url': second_paper.get('cited_by_api_url', ''), + 'cites_first_degree_paper': { + 'title': first_paper_title, + 'doi': first_paper_doi + } + } + second_degree_citations.append(paper_info) + if data.get('meta', {}).get('next_cursor'): + cursor = data['meta']['next_cursor'] + else: + break + except requests.RequestException as e: + logging.error(f"Error fetching second-degree citations: {e}") + break + logging.info(f"{len(second_degree_citations)} second-degree citations retrieved.") + return second_degree_citations + +def get_contributors_and_participants(owner, repo_name, headers): + # Get contributors + contributors_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/contributors" + contributors = [] + try: + contributors_data, _ = github_api_request(contributors_url, headers) + contributors.extend(contributors_data) + logging.info(f"{len(contributors)} contributors retrieved.") + except Exception as e: + logging.error(f"Error fetching contributors: {e}") + + # Get participants from issues and comments + issues_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/issues" + participants = set() + page = 1 + per_page = 100 + while True: + params = {'state': 'all', 'per_page': per_page, 'page': page} + try: + issues_data, headers_response = github_api_request(issues_url, headers, params=params) + if not issues_data: + break + for issue in issues_data: + user = issue.get('user', {}) + if user: + participants.add(user.get('login')) + # Get comments for the issue + comments_url = issue.get('comments_url') + comments_page = 1 + while True: + comments_params = {'per_page': per_page, 'page': comments_page} + try: + comments_data, _ = github_api_request(comments_url, headers, params=comments_params) + if not comments_data: + break + for comment in comments_data: + commenter = comment.get('user', {}) + if commenter: + participants.add(commenter.get('login')) + if len(comments_data) < per_page: + break + comments_page += 1 + except Exception as e: + logging.error(f"Error fetching comments: {e}") + break + if 'next' in headers_response.get('Link', ''): + page += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching issues: {e}") + break + logging.info(f"{len(participants)} participants retrieved from issues and comments.") + return contributors, participants + +def get_contributor_details(contributors, headers): + """ + Fetches real names of contributors from their GitHub profiles. 
+ """ + contributor_details = [] + for contributor in contributors: + username = contributor.get('login') + user_url = f"{GITHUB_API_URL}/users/{username}" + try: + user_data, _ = github_api_request(user_url, headers) + real_name = user_data.get('name') + contributor_details.append({ + 'username': username, + 'real_name': real_name + }) + except Exception as e: + logging.error(f"Error fetching user data for {username}: {e}") + contributor_details.append({ + 'username': username, + 'real_name': None + }) + return contributor_details + +def analyze_connections(contributors, participants, authors_info): + """ + Analyze how contributors and participants connect with the authors. + """ + connections = [] + author_names = {author['author_name'].lower() for author in authors_info} + # Analyze contributors + for contributor in contributors: + login = contributor.get('login', '').lower() + name = contributor.get('name', '').lower() if contributor.get('name') else '' + if login in author_names or name in author_names: + connections.append({'username': contributor.get('login'), 'role': 'Contributor', 'connection': 'Author'}) + # Analyze participants + for participant in participants: + participant_lower = participant.lower() + if participant_lower in author_names: + connections.append({'username': participant, 'role': 'Participant', 'connection': 'Author'}) + logging.info(f"{len(connections)} connections found between contributors/participants and authors.") + return connections + +def analyze_contributor_affiliations(contributors, headers): + """ + Analyzes contributors to determine their affiliations. + """ + affiliations = [] + for contributor in contributors: + username = contributor.get('login') + user_url = f"{GITHUB_API_URL}/users/{username}" + try: + user_data, _ = github_api_request(user_url, headers) + company = user_data.get('company') + email = user_data.get('email') + # Use company field as affiliation + affiliation = company.strip() if company else None + # Alternatively, use email domain to infer affiliation + if not affiliation and email: + email_domain = email.split('@')[-1] + affiliation = email_domain + affiliations.append({ + 'username': username, + 'affiliation': affiliation + }) + except Exception as e: + logging.error(f"Error fetching user data for {username}: {e}") + affiliations.append({ + 'username': username, + 'affiliation': None + }) + return affiliations + +def classify_contributor_roles(contributors): + """ + Classifies contributors into roles based on their activity. + """ + # Fetch total number of commits per contributor + contributor_commits = [] + for contributor in contributors: + username = contributor.get('login') + commits = contributor.get('contributions', 0) + contributor_commits.append((username, commits)) + + # Sort contributors by number of commits in descending order + contributor_commits.sort(key=lambda x: x[1], reverse=True) + total_contributors = len(contributor_commits) + roles = {} + for idx, (username, commits) in enumerate(contributor_commits): + percentile = (idx + 1) / total_contributors + if percentile <= 0.10: + role = 'Core Contributor' + elif percentile <= 0.50: + role = 'Occasional Contributor' + else: + role = 'One-time Contributor' + roles[username] = { + 'commits': commits, + 'role': role + } + return roles + +def get_total_issues(owner, repo_name, headers): + """ + Retrieves the total number of issues, open issues, and closed issues. 
+ """ + url = f"{GITHUB_API_URL}/search/issues" + query = f"repo:{owner}/{repo_name} is:issue" + params = {'q': query, 'per_page': 1} + total_issues = open_issues = closed_issues = None + try: + data, _ = github_api_request(url, headers, params) + total_issues = data.get('total_count', 0) + except Exception as e: + logging.error(f"Error fetching total issues for {owner}/{repo_name}: {e}") + # Open issues + params['q'] = query + ' is:open' + try: + data, _ = github_api_request(url, headers, params) + open_issues = data.get('total_count', 0) + except Exception as e: + logging.error(f"Error fetching open issues for {owner}/{repo_name}: {e}") + # Closed issues + params['q'] = query + ' is:closed' + try: + data, _ = github_api_request(url, headers, params) + closed_issues = data.get('total_count', 0) + except Exception as e: + logging.error(f"Error fetching closed issues for {owner}/{repo_name}: {e}") + return total_issues, open_issues, closed_issues + +def get_average_issue_close_time(owner, repo_name, headers): + """ + Calculates the average time to close issues. + """ + issues_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/issues" + params = {'state': 'closed', 'per_page': 100, 'page': 1} + total_time = 0 + issue_count = 0 + while True: + try: + issues_data, headers_response = github_api_request(issues_url, headers, params=params) + if not issues_data: + break + for issue in issues_data: + if 'pull_request' in issue: + continue # Skip pull requests + created_at = issue.get('created_at') + closed_at = issue.get('closed_at') + if created_at and closed_at: + created_time = datetime.strptime(created_at, '%Y-%m-%dT%H:%M:%SZ') + closed_time = datetime.strptime(closed_at, '%Y-%m-%dT%H:%M:%SZ') + time_to_close = (closed_time - created_time).total_seconds() + total_time += time_to_close + issue_count += 1 + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching closed issues for average close time: {e}") + break + if issue_count > 0: + average_time = total_time / issue_count + average_time_days = average_time / (60 * 60 * 24) # Convert seconds to days + else: + average_time_days = None + return average_time_days + +def get_issue_update_frequency(owner, repo_name, headers, days=30): + """ + Calculates the number of issues updated in the last 'days' days. + """ + since_date = (datetime.utcnow() - timedelta(days=days)).isoformat() + 'Z' + issues_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/issues" + params = {'since': since_date, 'per_page': 100, 'page': 1, 'state': 'all'} + update_count = 0 + while True: + try: + issues_data, headers_response = github_api_request(issues_url, headers, params=params) + if not issues_data: + break + update_count += len(issues_data) + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching issue updates: {e}") + break + return update_count + +def get_total_prs(owner, repo_name, headers): + """ + Retrieves the total number of PRs, open PRs, and closed PRs. 
+ """ + url = f"{GITHUB_API_URL}/search/issues" + query = f"repo:{owner}/{repo_name} is:pr" + params = {'q': query, 'per_page': 1} + total_prs = open_prs = closed_prs = None + try: + data, _ = github_api_request(url, headers, params) + total_prs = data.get('total_count', 0) + except Exception as e: + logging.error(f"Error fetching total PRs for {owner}/{repo_name}: {e}") + # Open PRs + params['q'] = query + ' is:open' + try: + data, _ = github_api_request(url, headers, params) + open_prs = data.get('total_count', 0) + except Exception as e: + logging.error(f"Error fetching open PRs for {owner}/{repo_name}: {e}") + # Closed PRs + params['q'] = query + ' is:closed' + try: + data, _ = github_api_request(url, headers, params) + closed_prs = data.get('total_count', 0) + except Exception as e: + logging.error(f"Error fetching closed PRs for {owner}/{repo_name}: {e}") + return total_prs, open_prs, closed_prs + +def get_average_pr_merge_time(owner, repo_name, headers): + """ + Calculates the average time to merge pull requests. + """ + prs_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/pulls" + params = {'state': 'closed', 'per_page': 100, 'page': 1} + total_time = 0 + pr_count = 0 + while True: + try: + prs_data, headers_response = github_api_request(prs_url, headers, params=params) + if not prs_data: + break + for pr in prs_data: + if pr.get('merged_at'): + created_at = pr.get('created_at') + merged_at = pr.get('merged_at') + if created_at and merged_at: + created_time = datetime.strptime(created_at, '%Y-%m-%dT%H:%M:%SZ') + merged_time = datetime.strptime(merged_at, '%Y-%m-%dT%H:%M:%SZ') + time_to_merge = (merged_time - created_time).total_seconds() + total_time += time_to_merge + pr_count += 1 + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching closed PRs for average merge time: {e}") + break + if pr_count > 0: + average_time = total_time / pr_count + average_time_days = average_time / (60 * 60 * 24) # Convert seconds to days + else: + average_time_days = None + return average_time_days + +def get_pr_update_frequency(owner, repo_name, headers, days=30): + """ + Calculates the number of PRs updated in the last 'days' days. + """ + since_date = (datetime.utcnow() - timedelta(days=days)).isoformat() + 'Z' + prs_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/pulls" + params = {'since': since_date, 'per_page': 100, 'page': 1, 'state': 'all'} + update_count = 0 + while True: + try: + prs_data, headers_response = github_api_request(prs_url, headers, params=params) + if not prs_data: + break + update_count += len(prs_data) + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching PR updates: {e}") + break + return update_count + +def get_total_downloads(owner, repo_name, headers): + """ + Retrieves the total number of downloads from releases. 
+ """ + releases_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/releases" + params = {'per_page': 100, 'page': 1} + total_downloads = 0 + recent_downloads = 0 + recent_releases_count = 0 + while True: + try: + releases_data, headers_response = github_api_request(releases_url, headers, params=params) + if not releases_data: + break + for release in releases_data: + assets = release.get('assets', []) + for asset in assets: + download_count = asset.get('download_count', 0) + total_downloads += download_count + # Check if the release is recent (last 30 days) + published_at = release.get('published_at') + if published_at: + published_time = datetime.strptime(published_at, '%Y-%m-%dT%H:%M:%SZ') + if published_time >= datetime.utcnow() - timedelta(days=30): + recent_downloads += download_count + recent_releases_count += 1 + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching releases: {e}") + break + return total_downloads, recent_downloads, recent_releases_count + +def get_discussion_activity_count(owner, repo_name, headers, days=30): + """ + Retrieves the number of discussions and comments in the last 'days' days. + """ + discussions_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/discussions" + params = {'per_page': 100, 'page': 1} + activity_count = 0 + while True: + try: + discussions_data, headers_response = github_api_request(discussions_url, headers, params=params) + if not discussions_data: + break + for discussion in discussions_data: + created_at = discussion.get('created_at') + if created_at: + created_time = datetime.strptime(created_at, '%Y-%m-%dT%H:%M:%SZ') + if created_time >= datetime.utcnow() - timedelta(days=days): + activity_count += 1 + # Get comments + comments_url = discussion.get('comments_url') + comments_params = {'per_page': 100, 'page': 1} + while True: + comments_data, _ = github_api_request(comments_url, headers, params=comments_params) + if not comments_data: + break + for comment in comments_data: + comment_created_at = comment.get('created_at') + if comment_created_at: + comment_time = datetime.strptime(comment_created_at, '%Y-%m-%dT%H:%M:%SZ') + if comment_time >= datetime.utcnow() - timedelta(days=days): + activity_count += 1 + if 'next' in headers_response.get('Link', ''): + comments_params['page'] += 1 + else: + break + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching discussions: {e}") + break + return activity_count + +def get_stars_forks_growth(owner, repo_name, headers): + """ + Estimates stars and forks growth over the repository's lifetime. 
+ """ + repo_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}" + try: + repo_data, _ = github_api_request(repo_url, headers) + created_at = repo_data.get('created_at') + stars_count = repo_data.get('stargazers_count', 0) + forks_count = repo_data.get('forks_count', 0) + if created_at: + created_time = datetime.strptime(created_at, '%Y-%m-%dT%H:%M:%SZ') + days_since_creation = (datetime.utcnow() - created_time).days + if days_since_creation > 0: + stars_growth = stars_count / days_since_creation + forks_growth = forks_count / days_since_creation + else: + stars_growth = stars_count + forks_growth = forks_count + else: + stars_growth = forks_growth = None + except Exception as e: + logging.error(f"Error fetching repository data for growth calculation: {e}") + stars_growth = forks_growth = None + return stars_growth, forks_growth + +def calculate_activity_score(repo_data): + """ + Calculates an activity score based on various metrics. + """ + score = 0 + score += repo_data.get('recent_commits_count', 0) * 1 + score += repo_data.get('recent_issues_opened_count', 0) * 0.5 + score += repo_data.get('recent_issues_closed_count', 0) * 0.5 + score += repo_data.get('recent_prs_opened_count', 0) * 1 + score += repo_data.get('recent_prs_merged_count', 0) * 1 + score += repo_data.get('discussion_activity_count', 0) * 0.5 + return score + +def get_active_contributors_count(owner, repo_name, headers, days=30): + """ + Returns the number of unique contributors who have made commits in the last 'days' days. + """ + since_date = (datetime.utcnow() - timedelta(days=days)).isoformat() + 'Z' + commits_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/commits" + params = {'since': since_date, 'per_page': 100, 'page': 1} + contributors = set() + while True: + try: + commits_data, headers_response = github_api_request(commits_url, headers, params=params) + if not commits_data: + break + for commit in commits_data: + author = commit.get('author') + if author and author.get('login'): + contributors.add(author['login']) + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching commits for active contributors: {e}") + break + return len(contributors) + +def get_recent_commits_count(owner, repo_name, headers, days=30): + """ + Returns the number of commits made in the last 'days' days. + """ + since_date = (datetime.utcnow() - timedelta(days=days)).isoformat() + 'Z' + commits_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/commits" + params = {'since': since_date, 'per_page': 100, 'page': 1} + commit_count = 0 + while True: + try: + commits_data, headers_response = github_api_request(commits_url, headers, params=params) + if not commits_data: + break + commit_count += len(commits_data) + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching recent commits: {e}") + break + return commit_count + +def get_recent_issues_counts(owner, repo_name, headers, days=30): + """ + Returns the number of issues opened and closed in the last 'days' days. 
+ """ + since_date = datetime.utcnow() - timedelta(days=days) + issues_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/issues" + params = {'state': 'all', 'per_page': 100, 'page': 1} + opened_count = 0 + closed_count = 0 + while True: + try: + issues_data, headers_response = github_api_request(issues_url, headers, params=params) + if not issues_data: + break + for issue in issues_data: + if 'pull_request' in issue: + continue # Skip pull requests + created_at_str = issue.get('created_at') + if created_at_str: + created_at = datetime.strptime(created_at_str, '%Y-%m-%dT%H:%M:%SZ') + if created_at >= since_date: + opened_count += 1 + if issue.get('state') == 'closed': + closed_at_str = issue.get('closed_at') + if closed_at_str: + closed_at = datetime.strptime(closed_at_str, '%Y-%m-%dT%H:%M:%SZ') + if closed_at >= since_date: + closed_count += 1 + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching recent issues: {e}") + break + return opened_count, closed_count + +def get_recent_prs_counts(owner, repo_name, headers, days=30): + """ + Returns the number of PRs opened and merged in the last 'days' days. + """ + since_date = datetime.utcnow() - timedelta(days=days) + prs_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/pulls" + params = {'state': 'all', 'per_page': 100, 'page': 1} + opened_count = 0 + merged_count = 0 + while True: + try: + prs_data, headers_response = github_api_request(prs_url, headers, params=params) + if not prs_data: + break + for pr in prs_data: + created_at_str = pr.get('created_at') + if created_at_str: + created_at = datetime.strptime(created_at_str, '%Y-%m-%dT%H:%M:%SZ') + if created_at >= since_date: + opened_count += 1 + merged_at_str = pr.get('merged_at') + if merged_at_str: + merged_at = datetime.strptime(merged_at_str, '%Y-%m-%dT%H:%M:%SZ') + if merged_at >= since_date: + merged_count += 1 + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching recent PRs: {e}") + break + return opened_count, merged_count + +def get_active_contributors_count(owner, repo_name, headers, days=30): + """ + Returns the number of unique contributors who have made commits in the last 'days' days. + """ + since_date = (datetime.utcnow() - timedelta(days=days)).isoformat() + 'Z' + commits_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/commits" + params = {'since': since_date, 'per_page': 100, 'page': 1} + contributors = set() + while True: + try: + commits_data, headers_response = github_api_request(commits_url, headers, params=params) + if not commits_data: + break + for commit in commits_data: + author = commit.get('author') + if author and author.get('login'): + contributors.add(author['login']) + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching commits for active contributors: {e}") + break + return len(contributors) + +def get_recent_commits_count(owner, repo_name, headers, days=30): + """ + Returns the number of commits made in the last 'days' days. 
+ """ + since_date = (datetime.utcnow() - timedelta(days=days)).isoformat() + 'Z' + commits_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/commits" + params = {'since': since_date, 'per_page': 100, 'page': 1} + commit_count = 0 + while True: + try: + commits_data, headers_response = github_api_request(commits_url, headers, params=params) + if not commits_data: + break + commit_count += len(commits_data) + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching recent commits: {e}") + break + return commit_count + +def get_recent_issues_counts(owner, repo_name, headers, days=30): + """ + Returns the number of issues opened and closed in the last 'days' days. + """ + since_date = datetime.utcnow() - timedelta(days=days) + issues_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/issues" + params = {'state': 'all', 'per_page': 100, 'page': 1} + opened_count = 0 + closed_count = 0 + while True: + try: + issues_data, headers_response = github_api_request(issues_url, headers, params=params) + if not issues_data: + break + for issue in issues_data: + if 'pull_request' in issue: + continue # Skip pull requests + created_at_str = issue.get('created_at') + if created_at_str: + created_at = datetime.strptime(created_at_str, '%Y-%m-%dT%H:%M:%SZ') + if created_at >= since_date: + opened_count += 1 + if issue.get('state') == 'closed': + closed_at_str = issue.get('closed_at') + if closed_at_str: + closed_at = datetime.strptime(closed_at_str, '%Y-%m-%dT%H:%M:%SZ') + if closed_at >= since_date: + closed_count += 1 + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching recent issues: {e}") + break + return opened_count, closed_count + +def get_recent_prs_counts(owner, repo_name, headers, days=30): + """ + Returns the number of PRs opened and merged in the last 'days' days. 
+ """ + since_date = datetime.utcnow() - timedelta(days=days) + prs_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/pulls" + params = {'state': 'all', 'per_page': 100, 'page': 1} + opened_count = 0 + merged_count = 0 + while True: + try: + prs_data, headers_response = github_api_request(prs_url, headers, params=params) + if not prs_data: + break + for pr in prs_data: + created_at_str = pr.get('created_at') + if created_at_str: + created_at = datetime.strptime(created_at_str, '%Y-%m-%dT%H:%M:%SZ') + if created_at >= since_date: + opened_count += 1 + merged_at_str = pr.get('merged_at') + if merged_at_str: + merged_at = datetime.strptime(merged_at_str, '%Y-%m-%dT%H:%M:%SZ') + if merged_at >= since_date: + merged_count += 1 + if 'next' in headers_response.get('Link', ''): + params['page'] += 1 + else: + break + except Exception as e: + logging.error(f"Error fetching recent PRs: {e}") + break + return opened_count, merged_count + +def analyze_repository(repo_info, idx, headers, people, papers, projects, institutions): + repo = repo_info['repo_data'] + queries = repo_info['queries'] + repo_full_name = repo.get('full_name') + owner = repo.get('owner', {}).get('login') + repo_name = repo.get('name') + description = repo.get('description') or '' + topics = repo.get('topics', []) + readme_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/readme" + logging.info(f"Analyzing repository [{idx}]: {repo_full_name}") + + # Fetch README content + try: + readme_data, _ = github_api_request(readme_url, headers) + except Exception as e: + logging.warning(f"Could not retrieve README for {repo_full_name}: {e}") + readme_data = None + readme_content = '' + if readme_data and readme_data.get('content'): + readme_content = base64.b64decode( + readme_data.get('content') + ).decode('utf-8', errors='ignore') + + # Get list of files in the repository + contents_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/contents" + try: + contents, _ = github_api_request(contents_url, headers) + except Exception as e: + logging.warning(f"Could not retrieve contents for {repo_full_name}: {e}") + contents = None + files = [] + if contents and isinstance(contents, list): + for content in contents: + files.append(content.get('name', '')) + + # Get license + license_info = repo.get('license') or {} + license_name = license_info.get('name', 'No license') + + # Fetch stars, forks, watchers + stars_count = repo.get('stargazers_count', 0) + forks_count = repo.get('forks_count', 0) + watchers_count = repo.get('watchers_count', 0) + open_issues_count = repo.get('open_issues_count', 0) + + # Get Languages + languages_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/languages" + try: + languages_data, _ = github_api_request(languages_url, headers) + except Exception as e: + logging.warning(f"Could not retrieve languages for {repo_full_name}: {e}") + languages_data = None + if languages_data: + total_bytes = sum(languages_data.values()) + languages_percentages = { + language: (bytes_count / total_bytes * 100) + for language, bytes_count in languages_data.items() + } + sorted_languages = sorted( + languages_percentages.items(), + key=lambda item: item[1], + reverse=True + ) + main_language = sorted_languages[0][0] if sorted_languages else 'Unknown' + else: + languages_data = {} + languages_percentages = {} + main_language = 'Unknown' + + # Extract DOI + doi = extract_doi_from_repo(owner, repo_name, headers) + if doi: + doi = doi.lower() + paper_data = get_paper_details_from_openalex(doi) + if paper_data: + # Collect required information 
+ paper_title = paper_data.get('title') + paper_domains = paper_data.get('domains', []) + authors_info = get_authors_and_institutions(paper_data) + # Update authors_info with other papers + authors_info = get_other_papers_by_authors(authors_info, doi) + # Get first-degree citations + first_degree_citations = get_first_degree_citations(paper_data) + total_first_degree_citations = len(first_degree_citations) + # Get second-degree citations + second_degree_citations = get_second_degree_citations(first_degree_citations) + total_second_degree_citations = len(second_degree_citations) + # Get contributors and participants + contributors, participants = get_contributors_and_participants(owner, repo_name, headers) + contributors_count = len(contributors) + # Get contributors' real names + contributors_details = get_contributor_details(contributors, headers) + # Analyze connections + connections = analyze_connections(contributors, participants, authors_info) + # Analyze contributor affiliations + contributor_affiliations = analyze_contributor_affiliations(contributors, headers) + # Classify contributor roles + contributor_roles = classify_contributor_roles(contributors) + # Compile data + paper_analysis = { + 'doi': doi, + 'paper_title': paper_title, + 'paper_domains': paper_domains, + 'authors_info': authors_info, + 'first_degree_citations': first_degree_citations, + 'second_degree_citations': second_degree_citations, + 'total_first_degree_citations': total_first_degree_citations, + 'total_second_degree_citations': total_second_degree_citations, + 'connections': connections, + 'contributor_affiliations': contributor_affiliations, + 'contributor_roles': contributor_roles, + 'contributors_details': contributors_details + } + else: + logging.warning(f"No paper data found for DOI {doi}") + paper_analysis = {} + contributors_count = 0 + else: + logging.warning(f"No DOI found for repository {repo_full_name}") + paper_analysis = {} + # Get contributors even if no DOI is found + contributors, participants = get_contributors_and_participants(owner, repo_name, headers) + contributors_count = len(contributors) + # Get contributors' real names + contributors_details = get_contributor_details(contributors, headers) + # Analyze contributor affiliations + contributor_affiliations = analyze_contributor_affiliations(contributors, headers) + # Classify contributor roles + contributor_roles = classify_contributor_roles(contributors) + paper_analysis['contributor_affiliations'] = contributor_affiliations + paper_analysis['contributor_roles'] = contributor_roles + paper_analysis['contributors_details'] = contributors_details + + # Get last commit date + commits_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/commits" + try: + commits_data, _ = github_api_request(commits_url, headers) + if commits_data: + last_commit_date = commits_data[0]['commit']['committer']['date'] + else: + last_commit_date = 'No commits found' + except Exception as e: + logging.warning(f"Could not retrieve commits for {repo_full_name}: {e}") + last_commit_date = 'Error retrieving commits' + + # Check for documentation files + has_readme = bool(readme_data) + code_of_conduct_url = f"{GITHUB_API_URL}/repos/{owner}/{repo_name}/community/code_of_conduct" + try: + code_of_conduct, _ = github_api_request(code_of_conduct_url, headers) + except Exception as e: + logging.warning(f"Could not retrieve code of conduct for {repo_full_name}: {e}") + code_of_conduct = None + has_code_of_conduct = code_of_conduct is not None and 'url' in code_of_conduct + 
files_to_check = ['CITATION.cff', 'CONTRIBUTING.md', 'GOVERNANCE.md', 'FUNDING.yml', 'funding.json'] + documentation = {file: False for file in files_to_check} + if contents and isinstance(contents, list): + for content in contents: + if content['name'] in documentation: + documentation[content['name']] = True + + # Get issue counts + total_issues, open_issues, closed_issues = get_total_issues(owner, repo_name, headers) + # Get average time to close issues + average_issue_close_time = get_average_issue_close_time(owner, repo_name, headers) + # Get issue update frequency + issue_update_frequency = get_issue_update_frequency(owner, repo_name, headers) + # Get PR counts + total_prs, open_prs, closed_prs = get_total_prs(owner, repo_name, headers) + # Get average time to merge PRs + average_pr_merge_time = get_average_pr_merge_time(owner, repo_name, headers) + # Get PR update frequency + pr_update_frequency = get_pr_update_frequency(owner, repo_name, headers) + # Get recent commits count (last 30 days) + recent_commits_count = get_recent_commits_count(owner, repo_name, headers) + # Get active contributors count (last 30 days) + active_contributors_count = get_active_contributors_count(owner, repo_name, headers) + # Get recent issues opened and closed counts (last 30 days) + recent_issues_opened_count, recent_issues_closed_count = get_recent_issues_counts(owner, repo_name, headers) + # Get recent PRs opened and merged counts (last 30 days) + recent_prs_opened_count, recent_prs_merged_count = get_recent_prs_counts(owner, repo_name, headers) + # Get total downloads and recent downloads + total_downloads, total_downloads_recent, recent_releases_count = get_total_downloads(owner, repo_name, headers) + # Get discussion activity count + discussion_activity_count = get_discussion_activity_count(owner, repo_name, headers) + # Get stars and forks growth + stars_growth, forks_growth = get_stars_forks_growth(owner, repo_name, headers) + # Calculate activity score + activity_score = calculate_activity_score({ + 'recent_commits_count': recent_commits_count, + 'recent_issues_opened_count': recent_issues_opened_count, + 'recent_issues_closed_count': recent_issues_closed_count, + 'recent_prs_opened_count': recent_prs_opened_count, + 'recent_prs_merged_count': recent_prs_merged_count, + 'discussion_activity_count': discussion_activity_count + }) + + # Create Unique IDs + project_id = f"project_{uuid.uuid4()}" + projects[project_id] = { + "id": project_id, + "full_name": repo_full_name, + "description": description, + "license": license_name, + "last_commit_date": last_commit_date, + "has_readme": has_readme, + "has_code_of_conduct": has_code_of_conduct, + "documentation": documentation, + "main_language": main_language, + "languages_percentages": languages_percentages, + "stars_count": stars_count, + "forks_count": forks_count, + "watchers_count": watchers_count, + "open_issues_count": open_issues_count, + "contributors": [], + "associated_papers": [], + "activity_metrics": { + "total_issues": total_issues, + "open_issues": open_issues, + "closed_issues": closed_issues, + "average_issue_close_time_days": average_issue_close_time, + "issue_update_frequency": issue_update_frequency, + "total_prs": total_prs, + "open_prs": open_prs, + "closed_prs": closed_prs, + "average_pr_merge_time_days": average_pr_merge_time, + "pr_update_frequency": pr_update_frequency, + "total_downloads": total_downloads, + "activity_score": activity_score, + "recent_commits_count": recent_commits_count, + "active_contributors_count": 
active_contributors_count, + "recent_issues_opened_count": recent_issues_opened_count, + "recent_issues_closed_count": recent_issues_closed_count, + "recent_prs_opened_count": recent_prs_opened_count, + "recent_prs_merged_count": recent_prs_merged_count, + "stars_growth": stars_growth, + "forks_growth": forks_growth, + "recent_releases_count": recent_releases_count, + "total_downloads_recent": total_downloads_recent, + "discussion_activity_count": discussion_activity_count + }, + "queries": list(queries), + "associated_papers": [] + } + + # Handle Paper Analysis + if doi and paper_analysis: + paper_id = f"paper_{uuid.uuid4()}" + papers[paper_id] = { + "id": paper_id, + "doi": paper_analysis.get('doi'), + "title": paper_analysis.get('paper_title'), + "domains": paper_analysis.get('paper_domains', []), + "authors": [], + "cites_papers": [], + "cited_by_papers": [], + "associated_projects": [project_id], + "concepts": [] # To be filled + } + projects[project_id]["associated_papers"].append(paper_id) + + # Process Authors + for author in paper_analysis.get('authors_info', []): + # Generate unique person ID + if author.get('author_id'): + person_id = f"person_{uuid.uuid4()}" + else: + person_id = f"person_{uuid.uuid4()}" + if person_id not in people: + people[person_id] = { + "id": person_id, + "name": author['author_name'], + "orcid": author.get('orcid'), + "github_username": None, # If available, otherwise None + "affiliations": [], + "authored_papers": [], + "contributed_projects": [], + "other_papers": [] + } + # Link paper to author + people[person_id]["authored_papers"].append(paper_id) + papers[paper_id]["authors"].append(person_id) + + # Handle affiliations + for institution_name in author['institutions']: + # Check if institution exists + existing_institution = next((inst for inst in institutions.values() if inst["name"] == institution_name), None) + if existing_institution: + institution_id = existing_institution["id"] + institutions[institution_id]["affiliated_people"].append(person_id) + else: + institution_id = f"institution_{uuid.uuid4()}" + institutions[institution_id] = { + "id": institution_id, + "name": institution_name, + "location": "", # Add if available + "affiliated_people": [person_id] + } + # Link institution to person + people[person_id]["affiliations"].append(institution_id) + + # Handle Cited Papers + for citation in paper_analysis.get('first_degree_citations', []): + cited_paper_id = f"paper_{uuid.uuid4()}" + papers[cited_paper_id] = { + "id": cited_paper_id, + "doi": citation.get('doi'), + "title": citation.get('title'), + "domains": citation.get('concepts', []), + "authors": [], # To be populated if available + "cites_papers": [], + "cited_by_papers": [paper_id], + "associated_projects": [], + "concepts": citation.get('concepts', []) + } + papers[paper_id]["cited_by_papers"].append(cited_paper_id) + # Optionally, handle authors of cited papers similarly + # ... + + # Handle Second-degree Citations + for second_citation in paper_analysis.get('second_degree_citations', []): + second_paper_id = f"paper_{uuid.uuid4()}" + papers[second_paper_id] = { + "id": second_paper_id, + "doi": second_citation.get('doi'), + "title": second_citation.get('title'), + "domains": second_citation.get('concepts', []), + "authors": [], # To be populated if available + "cites_papers": [paper_id], + "cited_by_papers": [], + "associated_projects": [], + "concepts": second_citation.get('concepts', []) + } + # Optionally, link back to the first-degree paper + # ... 
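+        # Note: each cited work above is stored as a new paper node keyed by a fresh
+        # UUID; citation nodes are not deduplicated by DOI, and the authors of cited
+        # papers are left unpopulated (see the placeholder comments above).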
+ + # Handle Connections (optional, depending on analysis needs) + # ... + + # Handle Contributors (even if no DOI is found) + if 'contributor_affiliations' in paper_analysis: + for contributor in paper_analysis['contributor_affiliations']: + username = contributor['username'] + affiliation = contributor['affiliation'] + # Check if person already exists + existing_person = next((p for p in people.values() if p["github_username"] == username), None) + if existing_person: + person_id = existing_person["id"] + if affiliation: + # Check if institution exists + existing_institution = next((inst for inst in institutions.values() if inst["name"] == affiliation), None) + if existing_institution: + institution_id = existing_institution["id"] + institutions[institution_id]["affiliated_people"].append(person_id) + else: + institution_id = f"institution_{uuid.uuid4()}" + institutions[institution_id] = { + "id": institution_id, + "name": affiliation, + "location": "", # Add if available + "affiliated_people": [person_id] + } + # Link institution to person + people[person_id]["affiliations"].append(institution_id) + else: + # Create new person entry + person_id = f"person_{uuid.uuid4()}" + people[person_id] = { + "id": person_id, + "name": contributor.get('real_name') or username, + "orcid": None, + "github_username": username, + "affiliations": [], + "authored_papers": [], + "contributed_projects": [project_id], + "other_papers": [] + } + projects[project_id]["contributors"].append(person_id) + if affiliation: + # Check if institution exists + existing_institution = next((inst for inst in institutions.values() if inst["name"] == affiliation), None) + if existing_institution: + institution_id = existing_institution["id"] + institutions[institution_id]["affiliated_people"].append(person_id) + else: + institution_id = f"institution_{uuid.uuid4()}" + institutions[institution_id] = { + "id": institution_id, + "name": affiliation, + "location": "", # Add if available + "affiliated_people": [person_id] + } + # Link institution to person + people[person_id]["affiliations"].append(institution_id) + + # Compile data + repo_data = { + 'repo_number': idx, + 'full_name': repo_full_name, + 'description': description, + 'license': license_name, + 'last_commit_date': last_commit_date, + 'has_readme': has_readme, + 'has_code_of_conduct': has_code_of_conduct, + 'documentation': documentation, + 'main_language': main_language, + 'languages_percentages': languages_percentages, + 'stars_count': stars_count, + 'forks_count': forks_count, + 'watchers_count': watchers_count, + 'open_issues_count': open_issues_count, + 'contributors_count': contributors_count, + 'activity_metrics': { + "total_issues": total_issues, + "open_issues": open_issues, + "closed_issues": closed_issues, + "average_issue_close_time_days": average_issue_close_time, + "issue_update_frequency": issue_update_frequency, + "total_prs": total_prs, + "open_prs": open_prs, + "closed_prs": closed_prs, + "average_pr_merge_time_days": average_pr_merge_time, + "pr_update_frequency": pr_update_frequency, + "total_downloads": total_downloads, + "activity_score": activity_score, + "recent_commits_count": recent_commits_count, + "active_contributors_count": active_contributors_count, + "recent_issues_opened_count": recent_issues_opened_count, + "recent_issues_closed_count": recent_issues_closed_count, + "recent_prs_opened_count": recent_prs_opened_count, + "recent_prs_merged_count": recent_prs_merged_count, + "stars_growth": stars_growth, + "forks_growth": forks_growth, 
+ "recent_releases_count": recent_releases_count, + "total_downloads_recent": total_downloads_recent, + "discussion_activity_count": discussion_activity_count + }, + 'associated_papers': [], + 'queries': list(queries) + } + + # If there's associated paper, link it to the project + if doi and paper_analysis: + paper_id = [pid for pid, pdata in papers.items() if pdata.get('doi') == doi] + if paper_id: + repo_data['associated_papers'].extend(paper_id) + + projects[project_id]['activity_metrics'] = repo_data['activity_metrics'] + projects[project_id]['contributors'] = [person['id'] for person in people.values() if project_id in person['contributed_projects']] + + logging.info(f"Repository analyzed: {repo_full_name}") + return + +def write_to_json(people, papers, projects, institutions, output_filename_json): + """ + Writes the collected entities to a structured JSON file. + """ + data = { + "people": list(people.values()), + "papers": list(papers.values()), + "projects": list(projects.values()), + "institutions": list(institutions.values()) + } + with open(output_filename_json, 'w', encoding='utf-8') as f: + json.dump(data, f, ensure_ascii=False, indent=4) + logging.info(f"JSON data written to {output_filename_json}") + +def write_entity_csv(entity_list, headers, filename): + """ + Writes a list of entities to a CSV file. + """ + with open(filename, 'w', newline='', encoding='utf-8') as csvfile: + writer = csv.DictWriter(csvfile, fieldnames=headers) + writer.writeheader() + for entity in entity_list: + # Convert lists to semicolon-separated strings for CSV compatibility + for key, value in entity.items(): + if isinstance(value, list): + entity[key] = '; '.join(value) + elif isinstance(value, dict): + entity[key] = json.dumps(value) + writer.writerow(entity) + logging.info(f"CSV data written to {filename}") + +def convert_sets_to_lists(obj): + """ + Recursively converts sets to lists in a data structure. + """ + if isinstance(obj, dict): + return {k: convert_sets_to_lists(v) for k, v in obj.items()} + elif isinstance(obj, list): + return [convert_sets_to_lists(element) for element in obj] + elif isinstance(obj, set): + return list(obj) + else: + return obj + +def parse_repository_input(repo_input): + """ + Parses repository input and extracts owner and repo_name. + Supports both 'owner/repo' format and full GitHub URLs. 
+ """ + repo_input = repo_input.strip() + if repo_input.startswith('http://') or repo_input.startswith('https://'): + # Parse URL + parsed_url = urllib.parse.urlparse(repo_input) + path_parts = parsed_url.path.strip('/').split('/') + if len(path_parts) >= 2: + owner = path_parts[0] + repo_name = path_parts[1] + return owner, repo_name + else: + raise ValueError(f"Invalid GitHub URL format: {repo_input}") + else: + # Assume 'owner/repo' format + if '/' not in repo_input: + raise ValueError(f"Invalid repository format: {repo_input}") + owner, repo_name = repo_input.split('/', 1) + return owner, repo_name + +# Main script +if __name__ == "__main__": + start_time = time.time() + + # Parse command-line arguments + parser = argparse.ArgumentParser(description='Repository Analysis Script') + parser.add_argument('--repos', nargs='+', help='List of repositories in the format owner/repo or full GitHub URLs') + parser.add_argument('--limit', '-l', type=int, help='Limit processing to the first N repositories') + args = parser.parse_args() + + # GitHub authentication + load_dotenv() + github_token = os.getenv('GITHUB_TOKEN') + if not github_token: + logging.error("GITHUB_TOKEN not found in .env file. Please create a .env file with your GitHub token.") + exit(1) + headers = { + 'Authorization': f'token {github_token}', + 'Accept': 'application/vnd.github.v3+json' + } + + # User input if repos are not provided via arguments + if not args.repos: + repositories_input = [] + while True: + repo_input = input("Enter a repository in the format owner/repo or full GitHub URL (or 'n' to stop): ").strip() + if repo_input.lower() == 'n': + break + repositories_input.append(repo_input) + if not repositories_input: + logging.error("No repositories provided. Exiting.") + exit(1) + else: + repositories_input = args.repos + + # Build query terms + query_terms = [] + for repo_full_name in repositories_input: + try: + owner, repo_name = parse_repository_input(repo_full_name) + query_terms.append(f'repo:{owner}/{repo_name}') + except ValueError as e: + logging.warning(str(e)) + continue + + # Search repositories + repositories = search_repositories_with_queries(query_terms, headers) + logging.info(f"Total repositories found: {len(repositories)}") + + # Limit processing if --limit flag is set + if args.limit: + limit_count = args.limit + logging.info(f"Limiting processing to the first {limit_count} repositories due to --limit flag.") + # Convert repositories dictionary to a list of items and take the first N + repositories_items = list(repositories.items())[:limit_count] + else: + repositories_items = list(repositories.items()) + + # Initialize entity collections + people = {} + papers = {} + projects = {} + institutions = {} + + # Analyze repositories with a progress bar + all_repo_data = [] + total_repos = len(repositories_items) + + with tqdm(total=total_repos, desc='Analyzing Repositories', unit='repo', position=0) as pbar: + for idx, (repo_id, repo_info) in enumerate(repositories_items, start=1): + logging.info(f"Processing repository {idx}/{total_repos}: {repo_info['repo_data'].get('full_name', '')}") + analyze_repository( + repo_info, + idx, + headers, + people, + papers, + projects, + institutions + ) + pbar.update(1) # Update the main repositories progress bar + + # Convert all sets in collections to lists if necessary + all_repo_data_serializable = { + "people": convert_sets_to_lists(list(people.values())), + "papers": convert_sets_to_lists(list(papers.values())), + "projects": 
convert_sets_to_lists(list(projects.values())), + "institutions": convert_sets_to_lists(list(institutions.values())) + } + + # Output results + output_filename_json = "analysis_results.json" + write_to_json(people, papers, projects, institutions, output_filename_json) + + # Optionally, write separate CSVs for each entity + # Example for people + people_headers = ['id', 'name', 'orcid', 'github_username', 'affiliations', 'authored_papers', 'contributed_projects', 'other_papers'] + write_entity_csv(list(people.values()), people_headers, "people.csv") + + # Example for papers + papers_headers = ['id', 'doi', 'title', 'domains', 'authors', 'cites_papers', 'cited_by_papers', 'associated_projects', 'concepts'] + write_entity_csv(list(papers.values()), papers_headers, "papers.csv") + + # Example for projects + projects_headers = [ + 'id', 'full_name', 'description', 'license', 'last_commit_date', 'has_readme', + 'has_code_of_conduct', 'documentation', 'main_language', 'languages_percentages', + 'stars_count', 'forks_count', 'watchers_count', 'open_issues_count', + 'contributors', 'associated_papers', 'activity_metrics', 'queries' + ] + write_entity_csv(list(projects.values()), projects_headers, "projects.csv") + + # Example for institutions + institutions_headers = ['id', 'name', 'location', 'affiliated_people'] + write_entity_csv(list(institutions.values()), institutions_headers, "institutions.csv") + + end_time = time.time() + total_runtime = end_time - start_time + logging.info(f"Total runtime: {total_runtime:.2f} seconds") diff --git a/scripts/repo_finder/README.md b/src/backend/biz_logic/scripts/jring_o/repo_finder/README.md similarity index 100% rename from scripts/repo_finder/README.md rename to src/backend/biz_logic/scripts/jring_o/repo_finder/README.md diff --git a/scripts/repo_finder/oaont.csv b/src/backend/biz_logic/scripts/jring_o/repo_finder/oaont.csv similarity index 100% rename from scripts/repo_finder/oaont.csv rename to src/backend/biz_logic/scripts/jring_o/repo_finder/oaont.csv diff --git a/scripts/repo_finder/oaont.json b/src/backend/biz_logic/scripts/jring_o/repo_finder/oaont.json similarity index 100% rename from scripts/repo_finder/oaont.json rename to src/backend/biz_logic/scripts/jring_o/repo_finder/oaont.json diff --git a/scripts/repo_finder/repofinder.py b/src/backend/biz_logic/scripts/jring_o/repo_finder/repofinder.py similarity index 100% rename from scripts/repo_finder/repofinder.py rename to src/backend/biz_logic/scripts/jring_o/repo_finder/repofinder.py diff --git a/src/moss/app/__init__.py b/src/backend/notification/__init__.py similarity index 100% rename from src/moss/app/__init__.py rename to src/backend/notification/__init__.py diff --git a/src/moss/cli/__init__.py b/src/backend/persistence/db_models/__init__.py similarity index 100% rename from src/moss/cli/__init__.py rename to src/backend/persistence/db_models/__init__.py diff --git a/src/moss/lib/models/base.py b/src/backend/persistence/db_models/base.py similarity index 100% rename from src/moss/lib/models/base.py rename to src/backend/persistence/db_models/base.py diff --git a/src/moss/lib/models/nodes.py b/src/backend/persistence/db_models/nodes.py similarity index 100% rename from src/moss/lib/models/nodes.py rename to src/backend/persistence/db_models/nodes.py diff --git a/src/moss/lib/models/ontology.py b/src/backend/persistence/db_models/ontology.py similarity index 100% rename from src/moss/lib/models/ontology.py rename to src/backend/persistence/db_models/ontology.py diff --git 
a/src/moss/lib/models/relationships.py b/src/backend/persistence/db_models/relationships.py similarity index 100% rename from src/moss/lib/models/relationships.py rename to src/backend/persistence/db_models/relationships.py diff --git a/src/backend/persistence/db_session.py b/src/backend/persistence/db_session.py new file mode 100644 index 0000000..ca3d1c7 --- /dev/null +++ b/src/backend/persistence/db_session.py @@ -0,0 +1,32 @@ +from collections.abc import AsyncGenerator + +from sqlalchemy import exc +from sqlalchemy.orm import sessionmaker +from sqlalchemy.ext.asyncio import AsyncSession +from sqlalchemy.ext.asyncio import async_sessionmaker +from sqlalchemy.ext.asyncio import create_async_engine + +def get_db_connection(use_async=True): + TEMP_DB_LOC = "dev.sqlight" + ASYNC_TEMP_DATABASE_URL = f"sqlite+aiosqlite:///./{TEMP_DB_LOC}" + TEMP_DATABASE_URL = f"sqlite:///./{TEMP_DB_LOC}" + url = ASYNC_TEMP_DATABASE_URL if use_async else TEMP_DATABASE_URL + return url + +class DBSession: + + # Must put connection info as a class variable so that pytests run. + connection_info = get_db_connection() + + @staticmethod + async def get_db_session() -> AsyncGenerator[AsyncSession, None]: + engine = create_async_engine(DBSession.connection_info, echo=False) + # factory = async_sessionmaker(engine) + factory = sessionmaker(bind=engine, class_=AsyncSession, expire_on_commit=False) + async with factory() as session: + try: + yield session + await session.commit() + except exc.SQLAlchemyError as error: + await session.rollback() + raise error \ No newline at end of file diff --git a/neo4j-docker/README.md b/src/backend/persistence/neo4j_docker/README.md similarity index 100% rename from neo4j-docker/README.md rename to src/backend/persistence/neo4j_docker/README.md diff --git a/neo4j-docker/docker-compose.override.yml b/src/backend/persistence/neo4j_docker/docker-compose.override.yml similarity index 100% rename from neo4j-docker/docker-compose.override.yml rename to src/backend/persistence/neo4j_docker/docker-compose.override.yml diff --git a/neo4j-docker/docker-compose.yml b/src/backend/persistence/neo4j_docker/docker-compose.yml similarity index 100% rename from neo4j-docker/docker-compose.yml rename to src/backend/persistence/neo4j_docker/docker-compose.yml diff --git a/neo4j-docker/env-example b/src/backend/persistence/neo4j_docker/env-example similarity index 100% rename from neo4j-docker/env-example rename to src/backend/persistence/neo4j_docker/env-example diff --git a/index.html b/src/frontends/index.html similarity index 100% rename from index.html rename to src/frontends/index.html diff --git a/moss-react-app/.eslintrc.cjs b/src/frontends/moss-react-app/.eslintrc.cjs similarity index 100% rename from moss-react-app/.eslintrc.cjs rename to src/frontends/moss-react-app/.eslintrc.cjs diff --git a/moss-react-app/.gitignore b/src/frontends/moss-react-app/.gitignore similarity index 100% rename from moss-react-app/.gitignore rename to src/frontends/moss-react-app/.gitignore diff --git a/moss-react-app/README.md b/src/frontends/moss-react-app/README.md similarity index 100% rename from moss-react-app/README.md rename to src/frontends/moss-react-app/README.md diff --git a/moss-react-app/index.html b/src/frontends/moss-react-app/index.html similarity index 100% rename from moss-react-app/index.html rename to src/frontends/moss-react-app/index.html diff --git a/moss-react-app/package-lock.json b/src/frontends/moss-react-app/package-lock.json similarity index 100% rename from 
moss-react-app/package-lock.json rename to src/frontends/moss-react-app/package-lock.json diff --git a/moss-react-app/package.json b/src/frontends/moss-react-app/package.json similarity index 100% rename from moss-react-app/package.json rename to src/frontends/moss-react-app/package.json diff --git a/moss-react-app/public/vite.svg b/src/frontends/moss-react-app/public/vite.svg similarity index 100% rename from moss-react-app/public/vite.svg rename to src/frontends/moss-react-app/public/vite.svg diff --git a/moss-react-app/src/App.css b/src/frontends/moss-react-app/src/App.css similarity index 100% rename from moss-react-app/src/App.css rename to src/frontends/moss-react-app/src/App.css diff --git a/moss-react-app/src/App.jsx b/src/frontends/moss-react-app/src/App.jsx similarity index 100% rename from moss-react-app/src/App.jsx rename to src/frontends/moss-react-app/src/App.jsx diff --git a/moss-react-app/src/assets/react.svg b/src/frontends/moss-react-app/src/assets/react.svg similarity index 100% rename from moss-react-app/src/assets/react.svg rename to src/frontends/moss-react-app/src/assets/react.svg diff --git a/moss-react-app/src/index.css b/src/frontends/moss-react-app/src/index.css similarity index 100% rename from moss-react-app/src/index.css rename to src/frontends/moss-react-app/src/index.css diff --git a/moss-react-app/src/main.jsx b/src/frontends/moss-react-app/src/main.jsx similarity index 100% rename from moss-react-app/src/main.jsx rename to src/frontends/moss-react-app/src/main.jsx diff --git a/moss-react-app/vite.config.js b/src/frontends/moss-react-app/vite.config.js similarity index 100% rename from moss-react-app/vite.config.js rename to src/frontends/moss-react-app/vite.config.js diff --git a/src/moss/cli/__main__.py b/src/moss/cli/__main__.py deleted file mode 100644 index e69de29..0000000 diff --git a/src/moss/environments/dev.env b/src/moss/environments/dev.env deleted file mode 100644 index e69de29..0000000 diff --git a/src/moss/environments/prod.env b/src/moss/environments/prod.env deleted file mode 100644 index e69de29..0000000 diff --git a/src/moss/environments/staging.env b/src/moss/environments/staging.env deleted file mode 100644 index e69de29..0000000 diff --git a/src/moss/lib/__init__.py b/src/moss/lib/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/src/moss/lib/config.py b/src/moss/lib/config.py deleted file mode 100644 index 9d281cd..0000000 --- a/src/moss/lib/config.py +++ /dev/null @@ -1 +0,0 @@ -"""Singleton configuration object""" diff --git a/src/moss/lib/models/__init__.py b/src/moss/lib/models/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/src/moss/lib/schema/__init__.py b/src/moss/lib/schema/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/tests/__init__.py b/tests/__init__.py index e69de29..6898ce2 100644 --- a/tests/__init__.py +++ b/tests/__init__.py @@ -0,0 +1,3 @@ +""" +This module contains all unit tests and integration tests for code in src/ +""" \ No newline at end of file