
Developer notes

Running IARI in AWS

Open a screen:

$ screen -RR

Run gunicorn (insert the secret key):

$ TESTDEADLINK_KEY={insert secret key here} /bin/bash -c 'source /srv/wcdimportbot/venv/bin/activate; gunicorn -w 30 --bind unix:/tmp/wikicitations-api/ipc.sock wsgi:app --timeout 1000' > output.log 2>&1

Open a new window in screen ("ctrl+a c") and follow the log:

$ tail -F output.log

Development workflow

  • Pick a story to work on
  • Create a new branch
  • Make some changes
  • Run pre-commit run --all-files
  • Run pytest -x
  • If everything looks good, update the README.md accordingly
  • Commit the changes using a clear message
  • Open a new pull request
  • Note which story is fixed in the PR
  • Check that the CI finishes with all tests green
  • Ask for review if there are any breaking changes
  • Merge the PR

Pre-commit

Pre-commit is a framework that helps developers set up and enforce code-quality rules and best practices in their projects. It is commonly used with version control systems like Git.

The main purpose of pre-commit is to run checks and validations on the codebase before a commit is allowed. It catches issues early and ensures that committed code meets certain standards. By integrating pre-commit into the development workflow, developers can automate tasks such as code formatting, linting, running tests, and more.

Here's how it typically works:

  • Developers define a set of hooks or scripts that should be run before committing code. These hooks are usually stored in a configuration file called .pre-commit-config.yaml in the project's repository.
  • When a developer attempts to make a commit, pre-commit is triggered, and it runs the defined hooks against the files being committed.
  • The hooks perform specific tasks such as checking for code style violations, running unit tests, detecting security vulnerabilities, or verifying documentation.
  • If any of the hooks fail, pre-commit prevents the commit from being made and displays the error messages or warnings generated by the hooks. Developers can then address the issues and reattempt the commit.
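A minimal .pre-commit-config.yaml illustrating the idea (the hooks and pinned versions here are generic examples, not necessarily the ones this project uses):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0        # example version pin
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 23.3.0        # example version pin
    hooks:
      - id: black
```

Each entry points at a repository of hooks and pins a version, so every developer runs the same checks.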

Pre-commit is highly configurable, allowing developers to choose from a wide range of pre-existing hooks or even write custom hooks to cater to their project's specific needs. It supports various programming languages and integrates with popular code analysis tools like Pylint, Flake8, ESLint, and more.

By incorporating pre-commit into the development workflow, teams can enforce consistent code quality, reduce manual effort, and catch potential problems early, leading to cleaner and more maintainable codebases.

Run pre-commit

$ pre-commit run --all-files

Install pre-commit

$ pre-commit install

Extract single page from pdf

Use browser -> print -> save to PDF

CLI Usage examples

Architecture design ideas for future graph generation

Graph generation architecture

WIP algorithm version 0:

Generation phase:

  1. Hash the article wikitext (article_wikitext_hash)
  2. Parse the article wikitext
  3. Generate the article_hash
  4. Generate the base item using WBI
  5. Store the JSON data under the hash (in SSDB)
  6. Hash the wikitext of every reference found (reference_wikitext_hash)
  7. Generate the reference item if an identifier was found
  8. Store the generated reference JSON in SSDB with the reference_hash as key
  9. Store the reference wikitext in SSDB with the reference_wikitext_hash as key
  10. Keep a record of which articles have which raw reference hashes in SSDB, with article_hash+"refs" as key and a list of reference_wikitext_hash values, if any
  11. Keep a record of hashed references for each article in SSDB, with article_hash+reference_hash as key and a list of identifier hashes as value, if any

We intentionally do not generate website items, nor handle the non-hashable references in this first iteration.
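The hashing and SSDB-key scheme above can be sketched as follows. This is a minimal illustration, not the actual implementation: md5 and the helper names are assumptions, and a plain dict stands in for SSDB.

```python
import hashlib

# A plain dict stands in for SSDB in this sketch.
ssdb: dict[str, object] = {}


def wikitext_hash(wikitext: str) -> str:
    """Hash a piece of wikitext (md5 is a stand-in for whatever IARI uses)."""
    return hashlib.md5(wikitext.encode("utf-8")).hexdigest()


def store_article_references(article_hash: str, reference_wikitexts: list[str]) -> list[str]:
    """Hypothetical helper for steps 6, 9 and 10: hash each reference, store
    its wikitext under the hash, and record which hashes belong to the article."""
    ref_hashes = []
    for ref in reference_wikitexts:
        h = wikitext_hash(ref)
        ssdb[h] = ref  # step 9: reference wikitext keyed by its hash
        ref_hashes.append(h)
    ssdb[article_hash + "refs"] = ref_hashes  # step 10: article -> reference hashes
    return ref_hashes
```

Because the keys are content hashes, identical references across articles collapse to a single stored entry.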

Upload phase:

  1. Open a connection to Wikibase using WikibaseIntegrator.
  2. Loop over all references and upload the JSON to Wikibase for each unique reference.
  3. Store the resulting wcdqid in SSDB (key=reference_hash+"wcdqid", value=wcdqid).
  4. Loop over all articles and finish generating the item using the unihash list, fetching the wcdqids for references from SSDB.
    • Upload at most 500 references per article in one go; discard any above that.
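The upload loop with the deduplication and the 500-reference cap can be sketched like this. The helper and the upload callback are hypothetical; a plain dict stands in for SSDB, and upload() represents the WikibaseIntegrator call.

```python
MAX_REFERENCES_PER_ARTICLE = 500  # cap from step 4; surplus references are discarded


def upload_article_references(ssdb: dict, reference_hashes: list, upload) -> list:
    """Hypothetical helper for steps 2-4: upload each unique reference once,
    cache its wcdqid in SSDB, and cap the per-article reference count."""
    wcdqids = []
    for ref_hash in reference_hashes[:MAX_REFERENCES_PER_ARTICLE]:
        key = ref_hash + "wcdqid"
        if key not in ssdb:  # only upload each unique reference once
            ssdb[key] = upload(ref_hash)  # upload() wraps the Wikibase call
        wcdqids.append(ssdb[key])
    return wcdqids
```

Caching the wcdqid under reference_hash+"wcdqid" means a reference shared by many articles is uploaded only on first sight; later articles read the id straight from SSDB.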

Improvements for next iteration:

  • Add any surplus references using addclaim to avoid throwing away good data.

Stream based architecture (abandoned)

Version 2.1.0+ of the bot used a stream-based architecture to distribute workloads efficiently and scale horizontally.

Decisions and principles guiding the design:

  • The KISS-principle
  • The IASandboxWikibase.cloud is the default Wikibase used.
  • Test coverage >90% is desired
  • CI integration is desired (currently SSDB is unavailable in GitHub Actions, so this does not work yet)
  • One class one concern (separation of concerns)
  • Docker compose is used to bring up most of the architecture
  • An updated diagram of all classes is desirable to get an overview
  • An updated diagram of the workflow is desirable to get an overview

Tests

Install requirements

$ poetry install --with dev

Run all tests, stopping at the first failure

$ pytest -x

Coverage

We have a helper script which updates TEST_COVERAGE.txt:

$ ./run-test-coverage.sh

Find slow tests

$ python -m pytest --durations=10