zarmstrong/autoshift-scraper
Overview

A script that scrapes SHiFT codes from websites, currently all sourced from the great work done at https://mentalmars.com. Webpages currently scraped include:

Rather than being published as part of Fabbi's autoshift, this project publishes a machine-readable file that autoshift can fetch. This reduces the load on mentalmars, as it is likely not OK for swarms of autoshift instances to scrape their website directly. Instead, codes are published to the repo here:

Intent

This script has been set up with the intent that other webpages could also be scraped. The Python dictionary webpages can be used to customise which webpages are scraped, which tables are parsed, and how their contents are interpreted. This may need adjusting as mentalmars' website changes over time.
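As a rough illustration, such a webpages dictionary might map each source page to the tables to parse. This is a sketch only; the key names and structure below are assumptions, not the script's actual schema:

```python
# Hypothetical sketch of the scraper configuration.
# Key names and structure are illustrative, not the script's actual schema.
webpages = [
    {
        "game": "Borderlands 3",
        "url": "https://mentalmars.com/game-news/borderlands-3-golden-keys/",
        "platform": "universal",
        # Hint for locating the code tables on the page
        "table_selector": "figure table",
        "code_column": 0,
        "expiry_column": 1,
    },
]

def pages_for_game(pages, game):
    """Return the configuration entries for a given game."""
    return [p for p in pages if p["game"] == game]
```

Adding a new source would then be a matter of appending another entry rather than changing parser code.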

TODO List:

  • Scrape mentalmars
  • output into an autoshift-compatible JSON file format
  • change to find table tags in figure tags to reduce noise in webpage
  • publish to GitHub here
  • dockerise and schedule
  • identify expired codes on website (strikethrough)
  • identify expired codes by date

Use

Command Line Use

# If only generating locally
python ./autoshift_scraper.py 

# If pushing to GitHub:
python ./autoshift_scraper.py --user GITHUB_USERNAME --repo GITHUB_REPOSITORY_NAME --token GITHUB_AUTHTOKEN

# If scheduling: 
python ./autoshift_scraper.py --schedule 5 # re-scrape every 5 hours
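The flags above could be wired up with argparse along these lines. The option names match the examples, but the defaults and help text are assumptions about the actual script:

```python
import argparse

def build_parser():
    """Build the scraper's command-line parser (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Scrape SHiFT codes and optionally push them to GitHub.")
    parser.add_argument("--user", help="GitHub username that owns the target repository")
    parser.add_argument("--repo", help="GitHub repository name to commit to")
    parser.add_argument("--token", help="GitHub fine-grained personal access token")
    parser.add_argument("--schedule", type=float,
                        help="Re-scrape interval in hours")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable verbose logging")
    return parser

# e.g. the scheduled example from above:
args = build_parser().parse_args(["--schedule", "5", "--verbose"])
```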

Docker Use

The following Docker environment variables are used:

Environment Variable   Use
GITHUB_USER            The username that owns the GitHub repo to commit to
GITHUB_REPO            The name of the GitHub repository to commit to
GITHUB_TOKEN           The GitHub fine-grained personal access token -- see below for more details
PARSER_ARGS            (Optional) Additional parameters to pass in, like "--schedule 2 --verbose"
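A container entrypoint might translate these variables into the CLI flags shown earlier, along these lines (a sketch; the actual entrypoint may differ):

```python
import os

def build_cli_args(env):
    """Translate container environment variables into scraper CLI arguments."""
    args = []
    if env.get("GITHUB_USER"):
        args += ["--user", env["GITHUB_USER"]]
    if env.get("GITHUB_REPO"):
        args += ["--repo", env["GITHUB_REPO"]]
    if env.get("GITHUB_TOKEN"):
        args += ["--token", env["GITHUB_TOKEN"]]
    if env.get("PARSER_ARGS"):
        # PARSER_ARGS is a plain string of extra flags, split on whitespace
        args += env["PARSER_ARGS"].split()
    return args

# In the container this would be: build_cli_args(dict(os.environ))
cli = build_cli_args({"GITHUB_USER": "example",
                      "PARSER_ARGS": "--schedule 2 --verbose"})
```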

Example:

docker run -d -t -i \
-e GITHUB_USER='ugoogalizer' \ 
-e GITHUB_REPO='autoshift-codes' \
-e GITHUB_TOKEN='github_pat_***' \
-e PARSER_ARGS='--verbose --schedule 2' \
-v autoshift:/autoshift/data \
--name autoshift-scraper \
zacharmstrong/autoshift-scraper:latest

Example using a locally built image:

docker run -d -t -i \
-e GITHUB_USER='zacharmstrong' \
-e GITHUB_REPO='autoshift-codes' \
-e GITHUB_TOKEN='github_pat_***' \
-e PARSER_ARGS='--verbose --schedule 2' \
-v autoshift:/autoshift/data \
--name autoshift-scraper \
localhost/autoshift-scraper:latest

Kubernetes Use

Example Deployment file

--- # deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: autoshift-scraper
  name: autoshift-scraper
#  namespace: autoshift
spec:
  selector:
    matchLabels:
      app: autoshift-scraper
  revisionHistoryLimit: 0
  template:
    metadata:
      labels:
        app: autoshift-scraper
    spec:
      containers:
        - name: autoshift-scraper
          image: zacharmstrong/autoshift-scraper:latest
          imagePullPolicy: IfNotPresent
          env:
            - name: GITHUB_USER
              value: "zarmstrong"
            - name: GITHUB_REPO
              value: "autoshift-codes"
            - name: GITHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: autoshift-scraper-secret
                  key: githubtoken
            - name: PARSER_ARGS
              value: "--schedule 2"
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
            limits:
              cpu: "100m"
              memory: "500Mi"
          volumeMounts:
            - mountPath: /autoshift-scraper/data
              name: autoshift-scraper-pv
      volumes:
        - name: autoshift-scraper-pv
          # If this is NFS backed, you may have to add the nolock mount option to the storage class
          persistentVolumeClaim:
            claimName: autoshift-scraper-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
# If this is NFS backed, you may have to add the nolock mount option to the storage class
metadata:
  name: autoshift-scraper-pvc
#  namespace: autoshift
spec:
  storageClassName: managed-nfs-storage-retain
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Mi


# kubectl create namespace autoshift
# kubectl config set-context --current --namespace=autoshift
# kubectl create secret generic autoshift-scraper-secret --from-literal=githubtoken='XXX' 

# To retrieve the stored token, use: 
# kubectl get secret autoshift-scraper-secret -o jsonpath="{.data.githubtoken}" | base64 -d

Configuring GitHub connectivity

You need to create a new fine-grained personal access token with access to only the destination repository, and Read & Write access to "Contents".

The token should look something like:

github_pat_11p9ou8easrhsgp98sepfg97gUS98hu7ASFuASFDNOANSFDASF ... (but much longer)

Setting up development environment

Original setup

# setup venv
python3 -m venv .venv
source ./.venv/bin/activate

# install packages
pip install requests bs4 html5lib PyGithub APScheduler

pip freeze > requirements.txt

Docker Container Image Build

# Once off setup: 
git clone TODO

# Personal parameters
export HARBORURL=harbor.test.com

git pull

# Set build parameters
export VERSIONTAG=0.7

# Build the image
docker build -t autoshift-scraper:latest -t autoshift-scraper:${VERSIONTAG} . 

# Get the image ID, it will be something like 41d81c9c2d99: 
export IMAGE=$(docker images -q autoshift-scraper:latest)
echo ${IMAGE}

# Tag and push the image into local Harbor
docker login ${HARBORURL}:443
docker tag ${IMAGE} ${HARBORURL}:443/autoshift/autoshift-scraper:latest
docker tag ${IMAGE} ${HARBORURL}:443/autoshift/autoshift-scraper:${VERSIONTAG}
docker push ${HARBORURL}:443/autoshift/autoshift-scraper:latest
docker push ${HARBORURL}:443/autoshift/autoshift-scraper:${VERSIONTAG}

# Tag and push the image to the public Docker Hub repo
docker login -u ugoogalizer docker.io/ugoogalizer/autoshift-scraper
docker tag ${IMAGE} docker.io/ugoogalizer/autoshift-scraper:latest
docker tag ${IMAGE} docker.io/ugoogalizer/autoshift-scraper:${VERSIONTAG}
docker push docker.io/ugoogalizer/autoshift-scraper:latest
docker push docker.io/ugoogalizer/autoshift-scraper:${VERSIONTAG}

Testing

Unit tests are provided for the main parser logic, including the MentalMars and Polygon Borderlands 4 scrapers.

Running the tests

  1. Install test dependencies (pytest):

    pip install pytest
  2. Run all tests from the project root:

    pytest
  3. To run a specific test file:

    pytest tests/test_parsers.py

What is tested

  • Extraction and normalization of codes from sample HTML for both MentalMars and Polygon BL4 sources.
  • Handling of invalid or duplicate codes.
  • Error handling for missing or malformed HTML.

Test files are located in the tests/ directory.
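A parser test might look something like the following. This is a sketch in the style of tests/test_parsers.py; normalize_code is a hypothetical helper, and the repository's actual test code may differ:

```python
# Illustrative parser test; normalize_code is a hypothetical helper,
# not necessarily the repository's actual function.
def normalize_code(raw):
    """Uppercase a SHiFT code; reject anything not matching XXXXX-XXXXX-XXXXX-XXXXX-XXXXX."""
    code = raw.strip().upper()
    parts = code.split("-")
    if len(parts) != 5 or any(len(p) != 5 or not p.isalnum() for p in parts):
        return None
    return code

def test_normalize_valid_code():
    assert normalize_code(" bhrbj-zwht3-w6jbk-bt3bb-cw3zk ") == \
        "BHRBJ-ZWHT3-W6JBK-BT3BB-CW3ZK"

def test_normalize_rejects_malformed():
    assert normalize_code("not-a-code") is None
```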

Mark codes as expired (local helper)

A small helper script is included to mark one or more codes as expired in data/shiftcodes.json:

Usage:

# mark a single code (sets expires to now UTC)
python mark_expired.py BHRBJ-ZWHT3-W6JBK-BT3BB-CW3ZK

# mark multiple codes
python mark_expired.py CODE1 CODE2 CODE3

# set explicit expires timestamp (ISO)
python mark_expired.py CODE1 --expires "2025-09-26T04:19:00+00:00"

Upload updated file to GitHub:

To have the script push the updated shiftcodes.json back to the repository, provide GitHub credentials when running:

python mark_expired.py CODE1 --user your-gh-username --repo your-repo-name --token your_fine_grained_token

The script will attempt to update shiftcodes.json on the main branch (it will create the file if missing).
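The core of such a helper boils down to rewriting the expires field on matching entries. This sketch assumes a simple {"codes": [{"code": ..., "expires": ...}]} layout; the real shiftcodes.json schema may differ:

```python
import json
from datetime import datetime, timezone

def mark_expired(codes_json, codes_to_expire, expires=None):
    """Set the 'expires' timestamp on every entry whose code is in codes_to_expire."""
    # Default to "now" in UTC, matching the single-code usage described above
    stamp = expires or datetime.now(timezone.utc).isoformat()
    data = json.loads(codes_json)
    for entry in data.get("codes", []):
        if entry.get("code") in codes_to_expire:
            entry["expires"] = stamp
    return json.dumps(data, indent=2)

sample = json.dumps({"codes": [{"code": "CODE1", "expires": None}]})
updated = json.loads(
    mark_expired(sample, {"CODE1"}, expires="2025-09-26T04:19:00+00:00"))
```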
