Collects all the scripts used to set up SolrCloud for images and text.
Clone the repo.
There are no Python requirements other than numpy for incremental_post_and_long.
There is an external Solr dependency. Use the same Solr version as the one used in the search indexes (currently Solr 8.8.1). Update the script constants to point to the location of the Solr bin.
Used to init a SolrCloud image index.
solr-configset/: contains the information needed to make a SolrCloud instance ready to host an Arquivo.pt image search index. These files are used in the solr-cloud-image-index Ansible roles, which are the preferred way to create an image index. Relevant files include:
images/conf/update-script.js: script that takes care of deduplication of images across collections. It performs the same role as the DupDigestMerger job from the image and page search indexer.
images/conf/managed-schema: Solr schema for the current image index.
send_config_set.sh: sends the images configset to ZooKeeper.
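As a rough illustration of what sending a configset to ZooKeeper involves, the sketch below wraps Solr's `bin/solr zk upconfig` command. It is not the content of send_config_set.sh: the function names, the Solr install path, and the ZooKeeper address are assumptions made for the example.

```python
import subprocess


def build_upconfig_command(solr_bin, configset_name, configset_dir, zk_host):
    """Build the `solr zk upconfig` command line as an argument list."""
    return [
        str(solr_bin), "zk", "upconfig",
        "-n", configset_name,      # name the configset will have in ZooKeeper
        "-d", str(configset_dir),  # local directory that contains conf/
        "-z", zk_host,             # ZooKeeper connection string
    ]


def send_config_set(solr_bin, configset_name, configset_dir, zk_host):
    """Upload the configset; raises CalledProcessError if the Solr CLI fails."""
    cmd = build_upconfig_command(solr_bin, configset_name, configset_dir, zk_host)
    subprocess.run(cmd, check=True)


# Example call (hypothetical paths and host, adjust to the local setup):
# send_config_set("/opt/solr/bin/solr", "images",
#                 "solr-configset/images", "localhost:2181")
```

Keeping the command construction separate from the call makes it easy to print or log the exact command before running it against a real ensemble.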
Used to send documents to Solr.
incremental_post.py: used to send the JSONL files created by the image indexing pipeline. More info on how this works here
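For orientation, here is a minimal sketch of posting a JSONL file to Solr in batches through the JSON update handler. It is not the code of incremental_post.py: the function names, batch size, and the update URL passed in are assumptions for the example.

```python
import json
import urllib.request


def read_batches(lines, batch_size):
    """Yield lists of parsed JSON documents, at most batch_size per list."""
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def post_json(update_url, payload, timeout=60):
    """POST a JSON payload to a Solr update handler and return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def post_jsonl(path, update_url, batch_size=1000):
    """Send every document in a JSONL file, then issue a single commit."""
    total = 0
    with open(path, encoding="utf-8") as handle:
        for docs in read_batches(handle, batch_size):
            post_json(update_url, docs)  # a JSON array adds all documents in it
            total += len(docs)
    post_json(update_url, {"commit": {}})  # make the new documents searchable
    return total


# Example call (hypothetical host, collection, and file name):
# post_jsonl("images.jsonl", "http://localhost:8983/solr/images/update")
```

Committing once at the end rather than per batch is a design choice for the sketch; it avoids opening a new searcher after every request.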
Used to test the capacity of a SolrCloud instance.
incremental_post_and_test.py: script that sends a set number of documents to Solr (e.g. 1000000, 5000000) and measures how fast retrieval is. It is used to estimate the maximum capacity of a server, in terms of number of documents indexed vs. retrieval time.
test_latency.py: auxiliary functions to measure latency.
WorkBench.jmx: JMeter script to measure
queries.txt: queries generated from random pairs of words
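The capacity test described above boils down to generating queries and timing them. The sketch below shows one way this could look; it is not the code of test_latency.py, and the function names, the injected `run_query` callable, and the chosen summary statistics are assumptions for the example.

```python
import random
import statistics
import time


def random_pair_queries(words, count, seed=0):
    """Build `count` two-word queries from random pairs of distinct words."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, 2)) for _ in range(count)]


def measure_latency(run_query, queries, clock=time.perf_counter):
    """Time each query and return summary statistics in seconds."""
    timings = []
    for query in queries:
        start = clock()
        run_query(query)  # e.g. a function that sends the query to Solr
        timings.append(clock() - start)
    return {
        "count": len(timings),
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Passing `run_query` in as a parameter keeps the timing logic independent of how Solr is actually queried, so the same function can be reused after each increment of indexed documents.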
Example scripts on how to manipulate documents in the Solr index after posting.
update_block.py: update documents according to the block list.
update_docs_by_collection.py: re-embargo collections.
update_nsfw.py: update the NSFW status of a set of documents.
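Changing a single field on documents that are already indexed is typically done with Solr atomic updates. The sketch below builds such a request body; it is not taken from the update scripts, and the function names and the `blocked` field are hypothetical, so check the managed-schema for the real field names.

```python
import json


def build_atomic_updates(doc_ids, field, value):
    """Return one atomic-update document per id, setting `field` to `value`."""
    return [{"id": doc_id, field: {"set": value}} for doc_id in doc_ids]


def to_request_body(updates):
    """Serialize the updates as the JSON array expected by the update handler."""
    return json.dumps(updates).encode("utf-8")


# Example (hypothetical ids and field name):
# body = to_request_body(build_atomic_updates(["doc-1", "doc-2"], "blocked", 1))
# The body would then be POSTed to /solr/<collection>/update
# with Content-Type: application/json.
```

Atomic updates only touch the named field, so the rest of each document does not have to be resent, provided the schema stores the other fields or keeps them as docValues.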
The page index was modeled after the image search index, so the structure is very similar.
Used to init a SolrCloud page index.
solr-configset/: contains the information needed to make a SolrCloud instance ready to host an Arquivo.pt page search index. These files are used in the solr-cloud-page-index Ansible roles, which are the preferred way to create a page index. Relevant files include:
pages/conf/update-script.js: script that takes care of deduplication of pages across collections. It performs the same role as the DocumentDupDigestMergerJob job from the image and page search indexer.
pages/conf/managed-schema: Solr schema for the current page index.
send_config_set.sh: sends the pages configset to ZooKeeper.
Used to send documents to Solr.
index_text.sh: used to send the JSONL files created by the pages indexing pipeline, using the Solr binary for posting.
index_text_bash.sh: same as the previous script, but without requiring the Solr binary.
Example scripts on how to manipulate documents in the Solr index after posting.
update_block.py: update documents according to the block list.