GitHub - SearchEngineDesign/SearchEngine

EECS 498 Search Engine - Code Father

The search engine consists of three principal processes: the crawling apparatus, the frontend, and the index server.

The crawling apparatus

This is the most complex part of the project. It is run via master.sh, which monitors the size and status of the crawling loop. The loop runs continuously and keeps producing index chunks indefinitely. The chunks are stored in the folder ./log/chunks.

The process is run with nohup. Its output can be monitored via "tail -f nohup.out".

When the process exits, its frontier and Bloom filter are written out to a specified list file and a bloomfilter.bin file, respectively. The list file is the same file of newline-delimited URLs that is specified on startup. master.sh uses log/frontier/list as the starting file.
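
The save-and-restore cycle described above can be sketched as follows. This is a minimal illustration in Python, not the project's actual implementation: the `BloomFilter` class, its bit and hash counts, the SHA-256 hashing, and the `save_state` / `load_state` helpers are all assumptions made for the example. Only the newline-delimited list file and the binary Bloom filter file come from the description above.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; num_bits and num_hashes are illustrative choices."""

    def __init__(self, num_bits=8192, num_hashes=3, bits=None):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed eight to a byte.
        self.bits = bytearray(bits) if bits is not None else bytearray(num_bits // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))


def save_state(frontier, bloom, list_path, bloom_path):
    """Write the frontier as newline-delimited URLs and the filter as raw bytes."""
    with open(list_path, "w", encoding="utf-8") as f:
        for url in frontier:
            f.write(url + "\n")
    with open(bloom_path, "wb") as f:
        f.write(bytes(bloom.bits))


def load_state(list_path, bloom_path, num_bits=8192, num_hashes=3):
    """Read both files back; the filter parameters must match those used on save."""
    with open(list_path, encoding="utf-8") as f:
        frontier = [line.strip() for line in f if line.strip()]
    with open(bloom_path, "rb") as f:
        bloom = BloomFilter(num_bits, num_hashes, bits=f.read())
    return frontier, bloom
```

Because the list file written on exit is the same one read on startup, a restarted crawl in this sketch resumes from the saved frontier and does not revisit URLs already recorded in the filter.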

The frontend

Currently hosted locally, the frontend takes in queries and distributes them to the index servers. Once it receives data back from the servers (in the format "{url}\t{score}\n"), it ranks the URLs according to their score, ignoring duplicates. It then outputs a list of URLs.
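
The merge step can be sketched like this. It is a hedged Python illustration rather than the frontend's real code: the function name `merge_results`, the `limit` parameter, and the choice to keep the highest score when a URL appears more than once are assumptions. The description above only says that duplicates are ignored.

```python
def merge_results(responses, limit=10):
    """Merge "{url}\\t{score}\\n" responses from several servers into one ranked list."""
    best = {}
    for response in responses:
        for line in response.splitlines():
            # Split on the last tab so a tab inside the URL cannot shift the score.
            url, _, score_text = line.rpartition("\t")
            try:
                score = float(score_text)
            except ValueError:
                continue  # skip malformed lines
            if not url:
                continue  # skip lines without a URL field
            # Assumption: when a URL is reported twice, keep its highest score.
            if url not in best or score > best[url]:
                best[url] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:limit]]
```

For example, merging the two responses `"a\t1.5\nb\t3.0\n"` and `"a\t2.0\nc\t0.5\n"` yields `["b", "a", "c"]`.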

The frontend is run via the "run_server.sh" script in ./frontend.

The index server

Each of our 10 virtual machines runs its own index server, which seeks through its index chunks based on the queries it receives. The results are ranked and then sent over a socket to the frontend.
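
The exchange between an index server and the frontend might look roughly like the Python sketch below. The "{url}\t{score}\n" line format is the one the frontend expects; everything else is an assumption for illustration, including the function names `send_results` and `receive_results`, the UTF-8 encoding, and signalling the end of a response by shutting down the write side of the socket.

```python
import socket


def send_results(conn, ranked):
    """Index-server side: send (url, score) pairs as "{url}\\t{score}\\n" lines."""
    payload = "".join(f"{url}\t{score}\n" for url, score in ranked)
    conn.sendall(payload.encode("utf-8"))
    # Assumption: half-closing the socket marks the end of the response.
    conn.shutdown(socket.SHUT_WR)


def receive_results(conn):
    """Frontend side: read until the server closes its end, then decode."""
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks).decode("utf-8")
```

A connected pair from `socket.socketpair()` is enough to try the two functions locally without starting a real server.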

The servers are run through the "run_indexserver.sh" script in ./frontend. Because the IP addresses of the Google Cloud VMs are not static, we have to update the addresses in the frontend code occasionally.

Repository contents (273 commits)

Crawler @ 9331c0f
distrib
dynamicRanker @ ece40bb
frontend @ 13f0553
frontier @ f01024a
include
index @ 976e091
isr @ 4d208fe
log
parser @ dae7b07
queryCompiler @ 4579b1e
ranker
threading
.env
.gitignore
.gitmodules
Makefile
README.MD
env.py
master.sh
run_script.sh
runner.cpp
update_vms.sh
useful