Web Crawler for crawling any site through a form UI.
This project crawls the site you choose and outputs its sitemap, as shown below.
The form above generates a site-map.xml file from two parameters: the crawl URL and Max No of Pages.
Both server-side and client-side validation are in place to avoid error scenarios.
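The server-side check on the two form fields could look roughly like the sketch below. It is not taken from the project source: the class name `CrawlRequestValidator`, the error messages, and the upper bound `MAX_ALLOWED_PAGES` are assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class CrawlRequestValidator {

    // Assumed upper bound for the page limit; the real application may use another value.
    static final int MAX_ALLOWED_PAGES = 1000;

    /** Returns validation messages; an empty list means the input is acceptable. */
    static List<String> validate(String crawlUrl, String maxPages) {
        List<String> errors = new ArrayList<>();

        if (crawlUrl == null || crawlUrl.trim().isEmpty()) {
            errors.add("Crawl URL is required");
        } else if (!isHttpUrl(crawlUrl.trim())) {
            errors.add("Crawl URL must be an absolute http or https URL");
        }

        try {
            int pages = Integer.parseInt(maxPages == null ? "" : maxPages.trim());
            if (pages < 1 || pages > MAX_ALLOWED_PAGES) {
                errors.add("Max No of Pages must be between 1 and " + MAX_ALLOWED_PAGES);
            }
        } catch (NumberFormatException e) {
            errors.add("Max No of Pages must be a whole number");
        }
        return errors;
    }

    // Accepts only URLs that parse cleanly, use http or https, and carry a host.
    private static boolean isHttpUrl(String value) {
        try {
            URI uri = new URI(value);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(validate("http://wiprodigital.com/", "50"));
        System.out.println(validate("not a url", "abc"));
    }
}
```

Returning a list of messages rather than throwing on the first problem lets the form show every error at once. The client-side check can mirror the same rules.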
Once you submit the form, site-map.xml is downloaded through your browser. I have also included one site-map.xml that was generated by the application: download site-map
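As a rough illustration of what producing the file involves, the sketch below turns a list of crawled URLs into sitemap XML. It assumes the standard sitemaps.org `urlset` layout; the exact structure of the file this application writes is not shown here, and the class name `SitemapWriter` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class SitemapWriter {

    /** Builds a sitemap document with one <url><loc> entry per crawled URL. */
    static String toSitemapXml(List<String> urls) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            xml.append("  <url><loc>").append(escape(url)).append("</loc></url>\n");
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }

    // Escapes the five XML special characters; '&' must be replaced first.
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://wiprodigital.com/",
                "http://wiprodigital.com/?post_type%5B%5D=news&s");
        System.out.print(toSitemapXml(urls));
    }
}
```

Escaping matters in practice: a crawled URL with a query string such as `?post_type%5B%5D=news&s` contains a raw `&`, which would make the XML invalid if written unescaped.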
This is a Maven project, so building it requires Maven to be installed on the system or in the IDE. Running mvn clean install builds the application. To skip the JUnit tests, use mvn clean install -DskipTests=true. The build produces a jar file.
    [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ spring-boot-web-crawler ---
    [INFO] Changes detected - recompiling the module!
    [INFO] Compiling 3 source files to C:\Users\ramasrid\delete\Code-Repo\spring-boot-web-jsp\target\classes
    [INFO]
    [INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ spring-boot-web-crawler ---
    [INFO] Not copying test resources
    [INFO]
    [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ spring-boot-web-crawler ---
    [INFO] Not compiling test sources
    [INFO]
    [INFO] --- maven-surefire-plugin:2.18.1:test (default-test) @ spring-boot-web-crawler ---
    [INFO] Tests are skipped.
    [INFO]
    [INFO] --- maven-jar-plugin:2.6:jar (default-jar) @ spring-boot-web-crawler ---
    [INFO] Building jar: C:\Users\ramasrid\delete\Code-Repo\spring-boot-web-jsp\target\spring-boot-web-crawler-1.0.jar
    [INFO]
    [INFO] --- spring-boot-maven-plugin:1.4.2.RELEASE:repackage (default) @ spring-boot-web-crawler ---
    [INFO]
    [INFO] --- maven-install-plugin:2.5.2:install (default-install) @ spring-boot-web-crawler ---
    [INFO] Installing C:\Users\ramasrid\delete\Code-Repo\spring-boot-web-jsp\target\spring-boot-web-crawler-1.0.jar to C:\root\maven_repository\org\springframework\boot\spring-boot-web-crawler\1.0\spring-boot-web-crawler-1.0.jar
    [INFO] Installing C:\Users\ramasrid\delete\Code-Repo\spring-boot-web-jsp\pom.xml to C:\root\maven_repository\org\springframework\boot\spring-boot-web-crawler\1.0\spring-boot-web-crawler-1.0.pom
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 02:16 min
    [INFO] Finished at: 2017-05-06T19:37:21+05:30
    [INFO] Final Memory: 29M/245M
    [INFO] ------------------------------------------------------------------------
Running the project is straightforward: once the jar file is built, start it with java -jar spring-boot-web-crawler-1.0.jar
    2017-05-06 19:43:12.753 INFO 50276 --- [ main] o.s.w.s.handler.SimpleUrlHandlerMapping : Mapped URL path [/webjars/] onto handler of type [class org.springframework.web.servlet.resource.ResourceHttpRequestHandler]
    2017-05-06 19:43:12.754 INFO 50276 --- [ main] o.s.w.s.handler.SimpleUrlHandlerMapping : Mapped URL path [/] onto handler of type [class org.springframework.web.servlet.resource.ResourceHttpRequestHandler]
    2017-05-06 19:43:12.814 INFO 50276 --- [ main] o.s.w.s.handler.SimpleUrlHandlerMapping : Mapped URL path [/**/favicon.ico] onto handler of type [class org.springframework.web.servlet.resource.ResourceHttpRequestHandler]
    2017-05-06 19:43:13.408 INFO 50276 --- [ main] o.s.j.e.a.AnnotationMBeanExporter : Registering beans for JMX exposure on startup
    2017-05-06 19:43:16.728 INFO 50276 --- [ main] s.b.c.e.t.TomcatEmbeddedServletContainer : Tomcat started on port(s): 8080 (http)
    2017-05-06 19:43:16.733 INFO 50276 --- [ main] c.w.crawler.SpringBootWebApplication : Started SpringBootWebApplication in 37.536 seconds (JVM running for 41.615)
    2017-05-06 19:43:20.050 INFO 50276 --- [nio-8080-exec-1] o.a.c.c.C.[Tomcat].[localhost].[/] : Initializing Spring FrameworkServlet 'dispatcherServlet'
    2017-05-06 19:43:20.051 INFO 50276 --- [nio-8080-exec-1] o.s.web.servlet.DispatcherServlet : FrameworkServlet 'dispatcherServlet': initialization started
    2017-05-06 19:43:20.185 INFO 50276 --- [nio-8080-exec-1] o.s.web.servlet.DispatcherServlet : FrameworkServlet 'dispatcherServlet': initialization completed in 134 ms
After the form is submitted, the application crawls the site, logs each URL it visits, and then reports that the XML file download is done.
    2017-05-06 20:17:52.775 INFO 50276 --- [nio-8080-exec-4] e.u.i.crawler4j.crawler.CrawlController : Crawler 1 started
    2017-05-06 20:17:54.596 INFO 50276 --- [ Crawler 1] com.wipro.crawler.RealCrawler : visiting: http://wiprodigital.com/
    2017-05-06 20:17:58.244 INFO 50276 --- [ Crawler 1] com.wipro.crawler.RealCrawler : visiting: http://wiprodigital.com/news/wipro-joins-industrial-internet-consortium/
    2017-05-06 20:18:01.124 INFO 50276 --- [ Crawler 1] com.wipro.crawler.RealCrawler : visiting: http://wiprodigital.com/news/digital-transformation-idc-report-future-services-wipro-digital/
    2017-05-06 20:18:01.853 INFO 50276 --- [ Crawler 1] com.wipro.crawler.RealCrawler : visiting: http://wiprodigital.com/2017/03/21/digital-transformation-employees-designing-customer-journey/
    2017-05-06 20:18:02.266 INFO 50276 --- [ Crawler 1] com.wipro.crawler.RealCrawler : visiting: http://wiprodigital.com/?post_type%5B%5D=news&s
    2017-05-06 20:18:03.022 INFO 50276 --- [ Crawler 1] com.wipro.crawler.RealCrawler : visiting: http://wiprodigital.com/2017/02/09/extending-lifecycle-products-services-better-world/
    2017-05-06 20:18:03.394 INFO 50276 --- [ Crawler 1] com.wipro.crawler.RealCrawler : visiting: http://wiprodigital.com/cases/increasing-customer-value-through-iot-for-jcb-india/
    2017-05-06 20:18:03.810 INFO 50276 --- [ Crawler 1] com.wipro.crawler.RealCrawler : visiting: http://wiprodigital.com/2017/03/23/bringing-digital-design-thinking-wipro-designits-story/
    2017-05-06 20:18:12.787 INFO 50276 --- [ Thread-3] e.u.i.crawler4j.crawler.CrawlController : It looks like no thread is working, waiting for 10 seconds to make sure...
    2017-05-06 20:18:22.844 INFO 50276 --- [ Thread-3] e.u.i.crawler4j.crawler.CrawlController : No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
    2017-05-06 20:18:32.845 INFO 50276 --- [ Thread-3] e.u.i.crawler4j.crawler.CrawlController : All of the crawlers are stopped. Finishing the process...
    2017-05-06 20:18:32.846 INFO 50276 --- [ Thread-3] e.u.i.crawler4j.crawler.CrawlController : Waiting for 10 seconds before final clean up...
    2017-05-06 20:18:43.386 INFO 50276 --- [nio-8080-exec-4] com.wipro.crawler.WelcomeController : Done with Download the XML File
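The log above comes from crawler4j, which does the real fetching in this project. To illustrate how a Max No of Pages limit and a same-site restriction can bound a crawl, here is a plain-Java sketch that does not use crawler4j at all. It walks an in-memory link map instead of the network. The class name `BoundedCrawlSketch`, the `links` map, and the same-host rule are all assumptions for illustration, not the project's actual logic.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; the project itself delegates crawling to crawler4j.
final class BoundedCrawlSketch {

    /**
     * Breadth-first walk from the seed that stays on the seed's host and
     * stops once maxPages pages have been visited.
     * The links map stands in for "outgoing links found on each page".
     */
    static List<String> crawl(String seed, int maxPages, Map<String, List<String>> links) {
        String seedHost = URI.create(seed).getHost();
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        seen.add(seed);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            visited.add(url);
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                boolean sameHost = seedHost != null
                        && seedHost.equalsIgnoreCase(URI.create(next).getHost());
                // seen.add returns false for URLs already queued, so each URL is visited once.
                if (sameHost && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("http://wiprodigital.com/", Arrays.asList(
                "http://wiprodigital.com/news/wipro-joins-industrial-internet-consortium/",
                "http://example.com/external-page"));
        System.out.println(crawl("http://wiprodigital.com/", 10, links));
    }
}
```

The visited list returned here is the kind of input a sitemap writer would then serialize to XML.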
# Downloaded Site Map Screenshot
JUnit test cases are available in the project's test folder. There are 3 JUnit test cases, and the results are shown below.
Sridhar