Dockerfile for fscrawler
Published on docker hub here.
Mostly inspired by elasticsearch's alpine dockerfile
Supported tags
- 2.2 with fscrawler version 2.2 and alpine 3.5
- 2.4 with fscrawler 2.4 and alpine 3.5
- 2.5 with fscrawler 2.5 and ubuntu 16.04
- 2.6 with fscrawler 2.6 and ubuntu 20.04
  - Note: the binary name fscrawler-es5 is compatible with elasticsearch version 5, versus fscrawler and fscrawler-es6 with version 6
- (WIP) 2.7-SNAPSHOT-v20201204
  - Note: the binary name fscrawler-es6 is compatible with elasticsearch version 6, versus fscrawler and fscrawler-es7 with version 7
Dockerfile includes tesseract (via ubuntu 20.04)
Given you have good docker-fu skills,
to run fscrawler docker image in folder indexing mode:
docker run \
-it --rm --name my-fscrawler \
-v <data folder>:/usr/share/fscrawler/data/:ro \
-v <config folder>:/usr/share/fscrawler/config-mount/<project-name>:ro \
shadiakiki1986/fscrawler \
[CLI options]
where
- data folder is the path to the folder with the files to index
- config folder is the path to the host fscrawler config dir
  - make sure to use the proper URL reference in the config file to point to the elasticsearch instance
    - e.g. localhost:9200 if elasticsearch is running locally
  - if the config folder is not mounted from the host, the docker container will have an empty config folder, thus prompting the user for confirmation Y/N of creating the first project file
- CLI options are documented here
An example set of CLI options is to run fscrawler in REST API mode:
docker run \
...
-p <local port>:8080 \
shadiakiki1986/fscrawler \
--loop "0" --reset fscrawler_rest
Given you already have good docker-compose-fu skills, check docker-compose.yml.
To use it:
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
docker-compose pull
docker-compose build
docker-compose up
Docker-fscrawler can be used in coordination with an elasticsearch docker container or an elasticsearch instance running natively on the host machine. To make coordination between the ES and fscrawler containers easy, it is recommended to use docker-compose, as described here.
Make sure vm.max_map_count is set to 262144, either by adding vm.max_map_count=262144 to /etc/sysctl.conf and
running sudo sysctl -p, or by whatever other means is convenient for you. Elasticsearch requires this setting (see
Ref).
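As a rough sketch of how to verify the setting before starting elasticsearch, the helpers below compare a value against the 262144 minimum mentioned above. The function names map_count_ok and report_map_count are made up for this illustration and are not part of this repository; reading /proc/sys/vm/max_map_count assumes a Linux host.

```shell
#!/bin/sh
# Minimum vm.max_map_count quoted in this README.
REQUIRED_MAP_COUNT=262144

# map_count_ok VALUE: succeeds when VALUE is at least the required minimum.
# (hypothetical helper, not shipped with docker-fscrawler)
map_count_ok() {
  [ "$1" -ge "$REQUIRED_MAP_COUNT" ]
}

# report_map_count VALUE: print a one-line verdict for VALUE.
report_map_count() {
  if map_count_ok "$1"; then
    echo "ok: vm.max_map_count=$1"
  else
    echo "too low: vm.max_map_count=$1 (need $REQUIRED_MAP_COUNT)"
  fi
}

# Usage on a Linux host (reads the live kernel value):
#   report_map_count "$(cat /proc/sys/vm/max_map_count)"
```

If the report says the value is too low, apply the sysctl change described above and run the check again.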
Download the following files from this git repository. Cloning the whole repository is not necessary.
- docker-compose.yml (single-node) or docker-compose-deployment.yml (multi-node)
- build/elasticsearch/docker-healthcheck
Make a new empty folder and put these two files in it. This directory will be the home of your configurations, and the location from which you can control your containers and make changes.
If you downloaded docker-compose-deployment.yml, rename it to docker-compose.yml.
- Make a file here called .env. Here you can configure the docker containers.
  - Add the line TARGET_DIR=/path/to/directory/you/want/to/index. If you don't add this line, it will default to ./data/
  - Add the line JOB_NAME=name_to_give_your_index. This will be the name of the fscrawler job and the ES index. If you don't add this line, it will default to fscrawler_job.
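The .env file described above can also be written from the shell. This is only a sketch: write_env is a hypothetical helper, not something this repository provides, and the fallback values simply repeat the defaults stated above (./data/ and fscrawler_job).

```shell
#!/bin/sh
# write_env DIR [TARGET_DIR] [JOB_NAME]: create DIR/.env with both settings.
# (hypothetical helper; omitted arguments fall back to the documented defaults)
write_env() {
  env_dir=$1
  target_dir=${2:-./data/}
  job_name=${3:-fscrawler_job}
  # One KEY=value pair per line, the format docker-compose reads from .env.
  printf 'TARGET_DIR=%s\nJOB_NAME=%s\n' "$target_dir" "$job_name" > "$env_dir/.env"
}

# Usage, from the folder that holds docker-compose.yml:
#   write_env . /path/to/directory/you/want/to/index name_to_give_your_index
```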
Now run
docker-compose run fscrawler
Respond with Y to the question of whether to create a new config.
Edit the newly created config/fscrawler_job/_settings.json file (you may need to use sudo; the folder name may be
different if you set JOB_NAME in .env). Change the host under elasticsearch.nodes from 127.0.0.1 to
elasticsearch1, so that it reads as follows.
...
"elasticsearch" : {
"nodes" : [ {
"host" : "elasticsearch1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"bulk_size" : 100,
"flush_interval" : "5s"
},
...
For the rest of the settings in this file, you can choose your own based on
the options documented here. Do not change fs.url
unless you also change the corresponding line in docker-compose.yml, or else fscrawler won't be able to find your
files.
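If you prefer not to edit the file by hand, the host change can be sketched with sed. This assumes the generated file contains a line of the form "host" : "127.0.0.1" as in the excerpt above; set_es_host is a hypothetical helper name, and the file may need to be edited as root if the container created it.

```shell
#!/bin/sh
# set_es_host FILE NEW_HOST: rewrite every "host" : "127.0.0.1" entry in FILE.
# (hypothetical helper; assumes NEW_HOST is a plain hostname without slashes)
set_es_host() {
  settings_file=$1
  new_host=$2
  tmp_file=$(mktemp)
  # Match the key/value pair with flexible spacing around the colon.
  sed "s/\"host\"[[:space:]]*:[[:space:]]*\"127\.0\.0\.1\"/\"host\" : \"$new_host\"/" \
    "$settings_file" > "$tmp_file" &&
    # Copy the content back so the original file keeps its owner and mode.
    cat "$tmp_file" > "$settings_file"
  rm -f "$tmp_file"
}

# Usage:
#   set_es_host config/fscrawler_job/_settings.json elasticsearch1
```

Only the host value is touched; port, scheme and the other settings stay as generated.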
Populate data/ or the directory you specified in .env with some files you would like to index.
Run the following.
docker-compose up -d elasticsearch1 elasticsearch2
docker-compose up -d fscrawler
fscrawler should then upload the test files you put in data/. To check that all is well,
query elasticsearch over HTTP (replace fscrawler_job with your own job name if you set one in .env)
curl http://localhost:9200/fscrawler_job/_search | jq .
If you see all your documents here, you should be good to go!
If you don't see all your documents, use the following command to get more detailed logs.
docker-compose run fscrawler --config_dir /usr/share/fscrawler/config fscrawler_job --restart --debug
Hopefully these logs will make it clear what went wrong. Failing that, you can use
--trace instead of --debug for even more detailed logs. You can also use --restart whenever you want to re-index
everything (otherwise files are only reindexed when they are touched).
Additional options for docker-compose run fscrawler can be found
here.
Using docker-compose, start up elasticsearch and run fscrawler on files in test/data every 15 minutes:
docker-compose up elasticsearch1 fscrawler
For the remaining examples, the default config depends on having a running elasticsearch instance on localhost at port 9200. Start one with:
# [Ref](https://github.com/docker-library/elasticsearch/issues/111)
sudo sysctl -w vm.max_map_count=262144
docker-compose run -p 9200:9200 -d elasticsearch1
For the versions of the docker-compose file, docker-compose, and docker, check the travis builds
Notice that the docker-compose fscrawler service is wired to wait for a healthcheck in elasticsearch.
In the case of a manual launch of elasticsearch:
- wait for around 15 seconds,
- or watch the logs,
- or check http://$host:9200/_cat/health?h=status where you need to wait for yellow or green, depending on your application
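For a manual launch, the last option can be scripted. The following is a minimal sketch, assuming curl is installed and that the health endpoint above returns a single status word; status_ready and wait_for_es are hypothetical names, and the one-second interval and 60 tries are arbitrary choices for illustration.

```shell
#!/bin/sh
# status_ready STATUS [MINIMUM]: succeed when STATUS meets MINIMUM (yellow by default).
# (hypothetical helper, not part of this repository)
status_ready() {
  status=$1
  minimum=${2:-yellow}
  case "$minimum:$status" in
    yellow:yellow|yellow:green|green:green) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_es HOST [MINIMUM] [TRIES]: poll the health endpoint once per second.
wait_for_es() {
  host=$1
  minimum=${2:-yellow}
  tries=${3:-60}
  while [ "$tries" -gt 0 ]; do
    # curl -s hides the progress meter; tr strips whitespace around the status word.
    current=$(curl -s "http://$host:9200/_cat/health?h=status" | tr -d '[:space:]')
    if status_ready "$current" "$minimum"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for_es localhost yellow && echo "elasticsearch is ready"
```

Pass green as the second argument if your application needs a fully healthy cluster before fscrawler starts.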
To index the test files provided in this repo
docker run -it --rm \
--net="host" \
--name my-fscrawler \
-v $PWD/test/data/:/usr/share/fscrawler/data/:ro \
shadiakiki1986/fscrawler
Same example as above, but with loop=1 to run it only once
docker run -it --rm \
--net="host" \
--name my-fscrawler \
-v $PWD/test/data/:/usr/share/fscrawler/data/:ro \
-v $PWD/config/myjob:/usr/share/fscrawler/config-mount/myjob:ro \
shadiakiki1986/fscrawler \
--config_dir /usr/share/fscrawler/config \
--loop 1 \
--trace \
myjob
To build the docker image
git clone https://github.com/shadiakiki1986/docker-fscrawler
cd docker-fscrawler
docker build -t shadiakiki1986/fscrawler:local . # or use version instead of "local"
To test against elasticsearch locally, follow steps in .travis.yml
To update fscrawler in this docker container:
- install docker (instructions for linux: link)
- install docker-compose (instructions for linux: link)
- update the version numbers used in Dockerfile
  - (deprecated) also update the URL to the maven zip file to download
- test can build
  docker build -t shadiakiki1986/fscrawler:2.6 .
  docker build -t shadiakiki1986/fscrawler:2.7-SNAPSHOT-20201204 .
- test can run (check section above "Usage / with docker-compose (file 1)", or run tests in .travis.yml file)
- commit, tag, push to github
To update the automated build on hub.docker.com
- the "latest" tag will get re-built automatically with the push above
- to add a new version tag, go to build settings and add it manually, then click save and trigger
To update elasticsearch in the docker-compose for the purpose of testing (e.g. .travis.yml)
- edit build/elasticsearch/Dockerfile by changing the FROM image
- follow steps in .travis.yml
Version 2.6 (2020-12-04)
- update fscrawler from 2.6-SNAPSHOT to 2.6
- update ubuntu base image from 16.04 to 20.04, etc
- support fscrawler{,-es5,-es6}
Version 2.6-SNAPSHOT (2018-10-08)
- update fscrawler from 2.5 to 2.6-SNAPSHOT (master branch as of today)
Version 2.5.2 (2018-10-08)
- docker-compose.yml updates
  - update base elasticsearch image to be 6.4 from 6.1
  - bring back the file crawl service
  - elasticsearch healthcheck to target yellow as a "minimum" now that 6.4 shows green instead of yellow even if 1 node
Version 2.5.1 (2018-10-08)
- using fscrawler 2.5
Version 2.4.2 (2018-10-04)
- change the main base image to be ubuntu instead of alpine linux
- move the alpine linux image into a "alpine" folder
- move the ubuntu linux image out of the "ubuntu" folder
Version 2.4 (2017-12-27)
- update fscrawler from 2.2 to 2.4
- use config-mount for mounting config folder into fscrawler docker container
- update elasticsearch service from 5.1.2 to 6.1.1
  - elasticsearch 5.1.2 was not working with fscrawler 2.4 anyway because of dadoonet/fscrawler#472
- replace git submodule of my fork of elasticsearch-docker with just build/elasticsearch/Dockerfile
  - the purpose of the fork was to push healthchecks into upstream, but my PR was rejected
  - fork was at https://github.com/shadiakiki1986/elasticsearch-docker
  - PR was at elastic/elasticsearch-docker#27
  - argumentation at elastic/elasticsearch-docker#60
  - proposed solution of just using docker-compose healthcheck would be too long in order to wait for "green" status
Version 2.2 (2017-02-22)
- use fscrawler 2.2