Customizable web crawler. It uses the cohttp client to fetch addresses, starting from a single URI and following the links found in each response. It saves data to irmin git storage, which can be browsed with a GraphQL client.
This tool is still in development, although it can already be used to fetch all pages in a TLD. Some pieces are still missing, such as jumping to global URIs, a crawl queue, and handling multiple lwt threads.
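For orientation, a single fetch-and-extract step could look roughly like the sketch below. It assumes lambdasoup for HTML parsing, which is an assumption on my part; the crawler's actual extraction code may differ.

(* Sketch only: fetch one page with cohttp and collect the hrefs of its
   anchors, resolved against the page URI. Uses lambdasoup for parsing,
   which is an assumption about the project's dependencies. *)
open Lwt.Infix

let fetch_links uri =
  Cohttp_lwt_unix.Client.get uri >>= fun (_resp, body) ->
  Cohttp_lwt.Body.to_string body >|= fun html ->
  Soup.parse html
  |> Soup.select "a[href]"
  |> Soup.to_list
  |> List.filter_map (Soup.attribute "href")
  |> List.map (fun href -> Uri.resolve "" uri (Uri.of_string href))

let () =
  Lwt_main.run
    (fetch_links (Uri.of_string "https://www.example.org") >|= fun links ->
     List.iter (fun u -> print_endline (Uri.to_string u)) links)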
There are two executables in the bin folder.
main starts crawling immediately. By default it does not jump to other TLDs; the reason is to fetch all pages that belong to the same website. graphql provides a GraphQL server to view details of the gathered data.
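The same-website restriction presumably comes down to comparing the host of each extracted link with the host of the page being crawled; a rough illustration (not the repository's actual code):

(* Illustrative only: split extracted links into locals (same host as the
   crawled page) and globals (everything else). The real crawler may
   compare domains rather than full hosts. *)
let partition_links ~page_uri links =
  List.partition (fun link -> Uri.host link = Uri.host page_uri) links

(* For example,
   partition_links ~page_uri:(Uri.of_string "https://www.example.org")
     [ Uri.of_string "https://www.example.org/about";
       Uri.of_string "https://www.iana.org/domains/example" ]
   puts the /about link in the first list and the iana.org link in the second. *)

Clone and build the project: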
$ git clone [email protected]:erhangundogan/webcamel.git
$ cd webcamel
$ dune build
Run the crawler:
$ ./_build/default/bin/main.exe https://www.example.org
main.exe: [INFO] Fetching: https://www.example.org
main.exe: [INFO] Total 1 URL addresses extracted in 0.976784 secs. locals: 0, globals: 1
main.exe: [INFO] Saving data repo:(example.org) to key:(/example.org/index.html)
main.exe: [INFO] Saving site repo:(example.org) to key:(/index.html)
Then run the irmin GraphQL server:
$ ./_build/default/bin/graphql.exe
Visit GraphiQL @ http://localhost:9876/graphql
Open your browser, navigate to http://localhost:9876/graphql, and run this GraphQL query:
{
master {
tree {
get_contents(key:"/example.org/index.html") {
key
value {
uri
redirect
secure
locals
globals
headers {
key
value
}
}
}
}
}
}
irmin uses git-fs mode to save data into the /tmp folder, so the data is disposable and may not survive. If you want to keep the data, change the location in config.ml.
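The exact shape of config.ml is not reproduced here; the relevant part is presumably just the root paths of the stores, along these lines (the names are hypothetical):

(* Hypothetical sketch of the storage locations in config.ml; the actual
   names in the repository may differ. Point these outside /tmp to keep
   the crawled data. *)
let data_root = "/tmp/irmin/data"    (* page details, all TLDs in one repo *)
let sites_root = "/tmp/irmin/sites"  (* page sources, one repo per TLD *)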
There are two irmin stores: one for page sources and one for page details. The query above shows the page details.
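Both stores are plain git-backed irmin key/value stores, so they can also be opened from OCaml. A minimal sketch against the Irmin 2.x API, using string contents just to keep it short (the project's own store setup may differ):

(* Minimal sketch: open the page-details store and print one stored entry.
   Assumes Irmin 2.x with the git filesystem backend. *)
open Lwt.Infix

module Store = Irmin_unix.Git.FS.KV (Irmin.Contents.String)

let () =
  Lwt_main.run
    (Store.Repo.v (Irmin_git.config "/tmp/irmin/data") >>= fun repo ->
     Store.master repo >>= fun t ->
     Store.get t ["example.org"; "index.html"] >|= print_endline)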
The page source store keeps one git repo per TLD. You can see the repos by changing into the directory:
$ cd /tmp/irmin/sites
$ ls
example.org/ x.com/
$ irmin list -s git --root /tmp/irmin/sites/example.org /
FILE index.html
In the data store, all top level domains are included under the same git repo. Aside from the HTML source code, there are 6 fields stored from each request:
- uri (string): the requested URI
- redirect (string option): the eventual URI if there was a redirection
- secure (bool): whether the URI provides a secure connection
- locals (string list): addresses extracted from the page that belong to the same TLD; this is a Uri.t Set of absolute URIs
- globals (string list): similar to locals, but these URIs belong to other TLDs
- headers (header list): key/value list of the received HTTP headers
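In OCaml terms, the stored record corresponds roughly to the following type (illustrative; the actual definition in the repository may differ):

(* Illustrative shape of a page-details record, mirroring the six fields
   above; not necessarily the exact type used in the repository. *)
type header = { key : string; value : string }

type page = {
  uri : string;
  redirect : string option;
  secure : bool;
  locals : string list;
  globals : string list;
  headers : header list;
}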
$ irmin list -s git --root /tmp/irmin/data /
DIR example.org
DIR x.com
$ irmin list -s git --root /tmp/irmin/data /example.org
FILE index.html
$ irmin get -s git --root /tmp/irmin/data /example.org/index.html
{
"uri": "https://www.example.org",
"secure": 1,
"headers": [
{
"key": "accept-ranges",
"value": "bytes"
},
{
"key": "age",
"value": "288506"
},
{
"key": "cache-control",
"value": "max-age=604800"
},
{
"key": "content-length",
"value": "1256"
},
{
"key": "content-type",
"value": "text/html; charset=UTF-8"
},
{
"key": "date",
"value": "Sun, 17 May 2020 12:32:49 GMT"
},
{
"key": "etag",
"value": "\"3147526947\""
},
{
"key": "expires",
"value": "Sun, 24 May 2020 12:32:49 GMT"
},
{
"key": "last-modified",
"value": "Thu, 17 Oct 2019 07:18:26 GMT"
},
{
"key": "server",
"value": "ECS (nyb/1D1F)"
},
{
"key": "vary",
"value": "Accept-Encoding"
},
{
"key": "x-cache",
"value": "HIT"
}
],
"globals": [
"https://www.iana.org/domains/example"
]
}
