
Developer notes

Running IARI in AWS

Open a screen:

$ screen -RR

Run gunicorn (insert the secret key):

$ TESTDEADLINK_KEY={insert secret key here} /bin/bash -c 'source /srv/wcdimportbot/venv/bin/activate; gunicorn -w 30 --bind unix:/tmp/wikicitations-api/ipc.sock wsgi:app --timeout 1000' > output.log 2>&1

Open a new window in screen ("ctrl+a c") and follow the log:

$ tail -F output.log

Development workflow

  • Pick a story to work on
  • Create a new branch
  • Make some changes
  • Run pre-commit run --all-files
  • Run pytest -x
  • If everything looks good, update the README.md accordingly
  • Commit the changes using a clear message
  • Open a new pull request
  • Note which story is fixed in the PR
  • Check that the CI finishes with all tests green
  • Ask for review if there are any breaking changes
  • Merge the PR

Pre-commit

Pre-commit is a framework that helps developers set up and enforce code-quality rules and best practices in their projects. It is commonly used with version control systems like Git.

The main purpose of pre-commit is to run checks and validations on the codebase before a commit is allowed. It catches issues early and ensures that committed code meets certain standards. By integrating pre-commit into the development workflow, developers can automate tasks such as code formatting, linting, running tests, and more.

Here's how it typically works:

  • Developers define a set of hooks or scripts that should be run before committing code. These hooks are usually stored in a configuration file called .pre-commit-config.yaml in the project's repository.
  • When a developer attempts to make a commit, pre-commit is triggered, and it runs the defined hooks against the files being committed.
  • The hooks perform specific tasks such as checking for code style violations, running unit tests, detecting security vulnerabilities, or verifying documentation.
  • If any of the hooks fail, pre-commit prevents the commit from being made and displays the error messages or warnings generated by the hooks. Developers can then address the issues and reattempt the commit.
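A minimal .pre-commit-config.yaml illustrating the idea (the hooks and pinned versions here are generic examples, not necessarily the ones this project uses):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0        # example version pin
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 23.3.0        # example version pin
    hooks:
      - id: black
```

Each entry points at a repository of hooks and pins a version, so every developer runs the same checks.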

Pre-commit is highly configurable, allowing developers to choose from a wide range of pre-existing hooks or even write custom hooks to cater to their project's specific needs. It supports various programming languages and integrates with popular code analysis tools like Pylint, Flake8, ESLint, and more.

By incorporating pre-commit into the development workflow, teams can enforce consistent code quality, reduce manual effort, and catch potential problems early, leading to cleaner and more maintainable codebases.

Run pre-commit

$ pre-commit run --all-files

Install pre-commit

$ pre-commit install

Extract single page from pdf

Use browser -> print -> save to PDF

CLI Usage examples

Architecture design ideas for future graph generation

Graph generation architecture

WIP algorithm version 0:

Generation phase:

  1. Hash the article wikitext (article_wikitext_hash)
  2. Parse the article wikitext
  3. Generate the article_hash
  4. Generate the base item using WBI
  5. Store the JSON data under the hash (in SSDB)
  6. Hash the wikitext of every reference found (reference_wikitext_hash)
  7. Generate the reference item if an identifier was found
  8. Store the generated reference JSON in SSDB with the reference_hash as key
  9. Store the reference wikitext in SSDB with the reference_wikitext_hash as key
  10. Keep a record of which articles have which raw reference hashes in SSDB, with article_hash+"refs" as key and a list of reference_wikitext_hash values, if any
  11. Keep a record of hashed references for each article in SSDB, with article_hash+reference_hash as key and a list of identifier hashes as value, if any

We intentionally do not generate website items, nor handle the non-hashable references in this first iteration.
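The hashing and SSDB-key scheme above can be sketched as follows. This is a minimal illustration, not the actual implementation: md5 and the helper names are assumptions, and a plain dict stands in for SSDB.

```python
import hashlib

# A plain dict stands in for SSDB in this sketch.
ssdb: dict[str, object] = {}


def wikitext_hash(wikitext: str) -> str:
    """Hash a piece of wikitext (md5 is a stand-in for whatever IARI uses)."""
    return hashlib.md5(wikitext.encode("utf-8")).hexdigest()


def store_article_references(article_hash: str, reference_wikitexts: list[str]) -> list[str]:
    """Hypothetical helper for steps 6, 9 and 10: hash each reference, store
    its wikitext under the hash, and record which hashes belong to the article."""
    ref_hashes = []
    for ref in reference_wikitexts:
        h = wikitext_hash(ref)
        ssdb[h] = ref  # step 9: reference wikitext keyed by its hash
        ref_hashes.append(h)
    ssdb[article_hash + "refs"] = ref_hashes  # step 10: article -> reference hashes
    return ref_hashes
```

Because the keys are content hashes, identical references across articles collapse to a single stored entry.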

Upload phase:

  1. Open a connection to Wikibase using WikibaseIntegrator.
  2. Loop over all references and upload the JSON to Wikibase for each unique reference.
  3. Store the resulting wcdqid in SSDB (key=reference_hash+"wcdqid", value=wcdqid).
  4. Loop over all articles and finish generating the item using the unihash list, fetching the wcdqids for references from SSDB.
    • Upload at most 500 references per article in one go; discard any above that.
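The upload loop with the deduplication and the 500-reference cap can be sketched like this. The helper and the upload callback are hypothetical; a plain dict stands in for SSDB, and upload() represents the WikibaseIntegrator call.

```python
MAX_REFERENCES_PER_ARTICLE = 500  # cap from step 4; surplus references are discarded


def upload_article_references(ssdb: dict, reference_hashes: list, upload) -> list:
    """Hypothetical helper for steps 2-4: upload each unique reference once,
    cache its wcdqid in SSDB, and cap the per-article reference count."""
    wcdqids = []
    for ref_hash in reference_hashes[:MAX_REFERENCES_PER_ARTICLE]:
        key = ref_hash + "wcdqid"
        if key not in ssdb:  # only upload each unique reference once
            ssdb[key] = upload(ref_hash)  # upload() wraps the Wikibase call
        wcdqids.append(ssdb[key])
    return wcdqids
```

Caching the wcdqid under reference_hash+"wcdqid" means a reference shared by many articles is uploaded only on first sight; later articles read the id straight from SSDB.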

Improvements for next iteration:

  • Add any surplus references using addclaim to avoid throwing away good data.

Stream based architecture (abandoned)

Version 2.1.0+ of the bot used a stream-based architecture to distribute workloads efficiently and scale horizontally.

Decisions and principles guiding the design:

  • The KISS-principle
  • The IASandboxWikibase.cloud is the default Wikibase used.
  • Test coverage >90% is desired
  • CI integration is desired (currently SSDB is unavailable in GitHub Actions, so this does not work yet)
  • One class one concern (separation of concerns)
  • Docker compose is used to bring up most of the architecture
  • An updated diagram of all classes is desirable to get an overview
  • An updated diagram of the workflow is desirable to get an overview

Tests

Install requirements

$ poetry install --with dev

Run all tests, stopping at the first failure

$ pytest -x

Coverage

We have a helper script which updates TEST_COVERAGE.txt:

$ ./run-test-coverage.sh

Find slow tests

$ python -m pytest --durations=10