|
| 1 | +# Laravel Intelligent Scraper |
| 2 | + |
| 8 | +This package offers a scraping solution that does not require knowing the page's HTML structure and that reconfigures |
| 9 | +itself when a change in the HTML structure is detected. This allows you to continue scraping without manual intervention |
| 10 | +for a long time. |
| 11 | + |
| 12 | +## Installation |
| 13 | + |
| 14 | +To install, use composer: |
| 15 | + |
| 16 | +```bash |
| 17 | +composer require softonic/laravel-intelligent-scraper |
| 18 | +``` |
| 19 | + |
| 20 | +To publish the scraper config, run: |
| 21 | +```bash |
| 22 | +php artisan vendor:publish --provider="Softonic\LaravelIntelligentScraper\ScraperProvider" --tag=config |
| 23 | +``` |
| 24 | + |
| 25 | +## Configuration |
| 26 | + |
| 27 | +There are two options for the initial setup. The package can be |
| 28 | +[configured using datasets](#configuration-based-in-dataset) or |
| 29 | +[configured using Xpath](#configuration-based-in-xpath). Both ways produce the same result, but |
| 30 | +depending on your Xpath knowledge you may prefer one or the other. |
| 31 | + |
| 32 | +### Configuration based in dataset |
| 33 | + |
| 34 | +The first step is to decide which data you want to obtain from a page: go to the page, choose all |
| 35 | +the texts, images, metas, etc. that you want to scrape, and label them so you can tell the scraper what you want. |
| 36 | + |
| 37 | +An example from the Microsoft Store could be: |
| 38 | +```php |
| 39 | +<?php |
| 40 | +use Softonic\LaravelIntelligentScraper\Scraper\Models\ScrapedDataset; |
| 41 | + |
| 42 | +ScrapedDataset::create([ |
| 43 | + 'url' => 'https://test.c/p/my-objective', |
| 44 | + 'type' => 'Item-definition-1', |
| 45 | + 'data' => [ |
| 46 | + 'title' => 'My title', |
| 47 | + 'body' => 'This is the body content I want to get', |
| 48 | + 'images' => [ |
| 49 | + 'https://test.c/images/1.jpg', |
| 50 | + 'https://test.c/images/2.jpg', |
| 51 | + 'https://test.c/images/3.jpg', |
| 52 | + ], |
| 53 | + ], |
| 54 | +]); |
| 55 | +``` |
| 56 | + |
| 57 | +In this example we request several fields that we labeled arbitrarily. Some of them have multiple |
| 58 | +values, so we can scrape lists of items from pages. |
| 59 | + |
| 60 | +With this single dataset we can train our scraper so that it can scrape any page with the same structure. |
| 61 | +Because pages usually have different structures depending on different variables, you should add several datasets |
| 62 | +that cover as many page variations as possible. The scraper WILL NOT BE ABLE to scrape page variations not included |
| 63 | +in the dataset. |
| 64 | + |
| 65 | +Once this is done, everything is ready to work. You do not need to worry about updates as long as the dataset has enough |
| 66 | +data to cover the new modifications on the page; the scraper will recalculate the configuration on the fly. You can |
| 67 | +check [how it works](how-it-works.md) to learn more about the internals. |
| 68 | + |
| 69 | +The next section looks in more detail at how to create a new dataset and which options are available. |
| 70 | + |
| 71 | +#### Dataset creation |
| 72 | + |
| 73 | +The dataset is composed of `url`, `type` and `data`. |
| 74 | +* The `url` part is simple: you just need to indicate the URL from which you obtained the data. |
| 75 | +* The `type` part gives an item name to the current dataset. This allows you to define multiple types. |
| 76 | +* The `data` part is where you indicate which data you want to get and assign a label to each item. |
| 77 | +The data could be a list of items or a single item. |
| 78 | + |
| 79 | +A basic example could be: |
| 80 | +```php |
| 81 | +<?php |
| 82 | +use Softonic\LaravelIntelligentScraper\Scraper\Models\ScrapedDataset; |
| 83 | + |
| 84 | +ScrapedDataset::create([ |
| 85 | + 'url' => 'https://test.c/p/my-objective', |
| 86 | + 'type' => 'Item-definition-1', |
| 87 | + 'data' => [ |
| 88 | + 'title' => 'My title', |
| 89 | + 'body' => 'This is the body content I want to get', |
| 90 | + 'images' => [ |
| 91 | + 'https://test.c/images/1.jpg', |
| 92 | + 'https://test.c/images/2.jpg', |
| 93 | + 'https://test.c/images/3.jpg', |
| 94 | + ], |
| 95 | + ], |
| 96 | +]); |
| 97 | +``` |
| 98 | + |
| 99 | +In this dataset we want the text `My title` to be labeled as `title`, and we also have a list of images that we want |
| 100 | +to be labeled as `images`. This gives us the flexibility to pick items one by one or in lists. |
| 101 | + |
| 102 | +Sometimes the text we want to label is not clean in the HTML because it includes invisible characters such as |
| 103 | +`\r\n`. To avoid dealing with that, the dataset allows you to use regular expressions. |
| 104 | + |
| 105 | +Example with `body` field as regexp: |
| 106 | + |
| 107 | +```php |
| 108 | +<?php |
| 109 | +use Softonic\LaravelIntelligentScraper\Scraper\Models\ScrapedDataset; |
| 110 | + |
| 111 | +ScrapedDataset::create([ |
| 112 | + 'url' => 'https://test.c/p/my-objective', |
| 113 | + 'data' => [ |
| 114 | + 'title' => 'My title', |
| 115 | + 'body' => regexp('/^Body starts here, but it is so long that.*$/si'), |
| 116 | + 'images' => [ |
| 117 | + 'https://test.c/images/1.jpg', |
| 118 | + 'https://test.c/images/2.jpg', |
| 119 | + 'https://test.c/images/3.jpg', |
| 120 | + ], |
| 121 | + ], |
| 122 | +]); |
| 123 | +``` |
| 124 | + |
| 125 | +With this change we will ensure that we detect the `body` even if it has hidden characters. |
| 126 | + |
| 127 | +**IMPORTANT** The scraper looks for the text in every tag, including the text of its children. If you define an |
| 128 | +unbounded regular expression such as `/.*Body starts.*/`, it will match the `<html>` element, because the |
| 129 | +text is inside some child element of `<html>`. Define your regular expressions carefully. |
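
The effect of an unbounded pattern can be reproduced outside the package. The following is a minimal sketch that uses only PHP's bundled DOM extension and `preg_match`; it is not the scraper's own matching code, and the markup and the `matchingTags` helper are invented for this illustration.

```php
<?php
// Hypothetical page; the markup is invented for this illustration.
$html = '<html><body><h1>My title</h1><div><p>Body starts here and goes on.</p></div></body></html>';

libxml_use_internal_errors(true); // keep parser notices out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

/**
 * Return the tag names of every element whose full text content
 * (including the text of its children) matches the pattern.
 */
function matchingTags(DOMDocument $dom, string $pattern): array
{
    $tags = [];
    foreach ($dom->getElementsByTagName('*') as $element) {
        if (preg_match($pattern, $element->textContent) === 1) {
            $tags[] = $element->nodeName;
        }
    }

    return $tags;
}

// Unbounded: every ancestor of the paragraph also contains the text.
$loose = matchingTags($dom, '/.*Body starts.*/');

// Anchored: only elements whose text begins with the expected words.
$strict = matchingTags($dom, '/^Body starts.*$/s');

var_export(['loose' => $loose, 'strict' => $strict]);
```

In this sketch the unbounded pattern also reports `html` and `body`, while the anchored one is limited to the elements whose text starts with `Body starts`.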
| 130 | + |
| 131 | +### Configuration based in Xpath |
| 132 | + |
| 133 | +Once you have collected all the Xpaths from the HTML, you just need to create the configuration models. They look like: |
| 134 | +```php |
| 135 | +<?php |
| 136 | +use Softonic\LaravelIntelligentScraper\Scraper\Models\Configuration; |
| 137 | + |
| 138 | +Configuration::create([ |
| 139 | + 'name' => 'title', |
| 140 | + 'type' => 'Item-definition-1', |
| 141 | + 'xpaths' => '//*[@id="title"]', |
| 142 | +]); |
| 143 | + |
| 144 | +Configuration::create([ |
| 145 | + 'name' => 'category', |
| 146 | + 'type' => 'Item-definition-1', |
| 147 | + 'xpaths' => ['//*[@id="cat"]', '//*[@id="long-cat"]'], |
| 148 | +]); |
| 149 | +``` |
| 150 | + |
| 151 | +In the definition, you should give a name to the field to be scraped and identify it as a type. The xpaths field can |
| 152 | +contain a string or an array of strings. This is because the HTML can contain different variations depending on the |
| 153 | +specific page, so you can write a list of Xpaths that will be checked in order, returning the first result found. |
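
The "checked in order" behaviour can be pictured with PHP's bundled `DOMXPath`. This is a minimal sketch of the idea, not the package's implementation; the markup and the `firstMatch` helper are invented for illustration. Note that the id values are written as quoted string literals inside the expressions.

```php
<?php
// Hypothetical markup: only the element with id "long-cat" exists here.
$html = '<html><body><h1 id="title">My title</h1><span id="long-cat">Games and apps</span></body></html>';

libxml_use_internal_errors(true); // keep parser notices out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

/**
 * Try each expression in order and return the text of the nodes
 * matched by the first expression that yields any result.
 */
function firstMatch(DOMXPath $xpath, array $expressions): array
{
    foreach ($expressions as $expression) {
        $nodes = $xpath->query($expression);
        if ($nodes !== false && $nodes->length > 0) {
            $values = [];
            foreach ($nodes as $node) {
                $values[] = $node->textContent;
            }

            return $values;
        }
    }

    return []; // none of the expressions matched
}

// '//*[@id="cat"]' finds nothing on this page, so the second expression is used.
$category = firstMatch($xpath, ['//*[@id="cat"]', '//*[@id="long-cat"]']);

var_export($category);
```

Listing the most specific expression first keeps the fallback for the page variations where the preferred element is missing.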
| 154 | + |
| 155 | +## Usage |
| 156 | + |
| 157 | +After configuring the scraper, you just need to execute it like this: |
| 158 | +```php |
| 159 | +<?php |
| 160 | +$scraper = resolve(\Softonic\LaravelIntelligentScraper\Scraper\Scraper::class); |
| 161 | +$data = $scraper->getData('https://test.c/p/my-objective', 'Item-definition-1'); |
| 162 | + |
| 163 | +/** |
| 164 | + * Item-definition-1 defined as 3 fields tagged as: title, body and images |
| 165 | + */ |
| 166 | +var_export($data); |
| 167 | +/** |
| 168 | + * Output: |
| 169 | + * [ |
| 170 | + * 'title' => ['My title'], |
| 171 | + * 'body' => ['This is the body content I want to get'], |
| 172 | + * 'images' => [ |
| 173 | + * 'https://test.c/images/1.jpg', |
| 174 | + * 'https://test.c/images/2.jpg', |
| 175 | + * 'https://test.c/images/3.jpg', |
| 176 | + * ], |
| 177 | + * ] |
| 178 | + */ |
| 179 | + |
| 180 | +``` |
| 181 | + |
| 182 | +All the output fields are arrays that can contain one or more results. |
| 183 | + |
| 184 | +## Testing |
| 185 | + |
| 186 | +`softonic/laravel-intelligent-scraper` has a [PHPUnit](https://phpunit.de) test suite and a coding style compliance test suite using [PHP CS Fixer](http://cs.sensiolabs.org/). |
| 187 | + |
| 188 | +To run the tests, run the following command from the project folder. |
| 189 | + |
| 190 | +``` bash |
| 191 | +$ docker-compose run test |
| 192 | +``` |
| 193 | + |
| 194 | +To run interactively using [PsySH](http://psysh.org/): |
| 195 | +``` bash |
| 196 | +$ docker-compose run psysh |
| 197 | +``` |
| 198 | + |
| 199 | +## How it works? |
| 200 | + |
| 201 | +The scraper is self-configuring, but it needs an initial dataset or an initial configuration. |
| 202 | +The dataset tells the configurator which data you want and how to label it. |
| 203 | + |
| 204 | + |
| 205 | + |
| 206 | +To be reconfigurable and keep the dataset fresh, the scraper stores the latest scraped data. |
| 207 | + |
| 208 | +``` |
| 209 | +# Powered by https://code2flow.com/app |
| 210 | +function calculate configuration { |
| 211 | + if(!Has dataset?) { |
| 212 | + goto fail; |
| 213 | + } |
| 214 | + Extract configuration using dataset; |
| 215 | + if(!Has extracted configuration?) { |
| 216 | + goto fail; |
| 217 | + } |
| 218 | +} |
| 219 | + |
| 220 | +Scrape url 'https://test.c/p/my-objective' using 'Item-definition-1'; |
| 221 | +try { |
| 222 | + load configuration; |
| 223 | +} |
| 224 | +catch(Missing config) { |
| 225 | + call calculate configuration; |
| 226 | +} |
| 227 | + |
| 228 | +extract data using configuration; |
| 229 | +// It could be produced by old configuration |
| 230 | +if(Error extracting data) { |
| 231 | + call calculate configuration; |
| 232 | +} |
| 233 | +extract data using configuration; |
| 234 | +// It could be produced because the dataset does not have all the page variations |
| 235 | +if(Error extracting data) { |
| 236 | + goto fail; |
| 237 | +} |
| 238 | + |
| 239 | +goto success |
| 240 | + |
| 241 | +fail: |
| 242 | +No scraped data; |
| 243 | +return; |
| 244 | +success: |
| 245 | +Scraped data; |
| 246 | +``` |
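
The flow above can also be read as plain PHP. The sketch below is only an illustration of the control flow under stated assumptions: `scrapeWithFallback` and its three callables are hypothetical stand-ins invented here, not classes or methods of the package.

```php
<?php
/**
 * Hypothetical helper mirroring the flow above.
 *
 * $load():      returns a configuration array, or null when it is missing.
 * $calculate(): returns a configuration built from the dataset, or null.
 * $extract($c): returns the scraped data, or null on an extraction error.
 */
function scrapeWithFallback(callable $load, callable $calculate, callable $extract): ?array
{
    // Missing configuration: calculate it from the dataset.
    $configuration = $load() ?? $calculate();
    if ($configuration === null) {
        return null; // fail: no dataset, or nothing could be extracted from it
    }

    $data = $extract($configuration);
    if ($data !== null) {
        return $data; // success with the stored configuration
    }

    // The error may come from an old configuration: recalculate once and retry.
    $configuration = $calculate();
    if ($configuration === null) {
        return null;
    }

    // A null result here suggests the dataset lacks this page variation.
    return $extract($configuration);
}

// Usage with invented values: the stored configuration is stale,
// the recalculated one works.
$stale = ['title' => '//*[@id="old-title"]'];
$fresh = ['title' => '//*[@id="title"]'];

$result = scrapeWithFallback(
    fn () => $stale,
    fn () => $fresh,
    fn (array $c) => $c === $fresh ? ['title' => ['My title']] : null
);

var_export($result);
```

The single retry matches the diagram: one recalculation covers an outdated configuration, and a second failure is reported instead of looping.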
| 247 | + |
| 248 | +## License |
| 249 | + |
| 250 | +The Apache 2.0 license. Please see [LICENSE](LICENSE) for more information. |
| 251 | + |