This repository has been archived by the owner. It is now read-only.

Commit c2dae0c

feature/move-to-an-event-based-system Event based scraper (#1)
Moved to an event-based system
1 parent 6016842 commit c2dae0c

26 files changed (+891 -569 lines changed)

Diff for: README.md

+90-43
Original file line number | Diff line number | Diff line change
@@ -27,7 +27,8 @@ php artisan vendor:publish --provider="Softonic\LaravelIntelligentScraper\Scrape
2727
There are two different options for the initial setup. The package can be
2828
[configured using datasets](#configuration-based-in-dataset) or
2929
[configured using Xpath](#configuration-based-in-xpath). Both ways produce the same result but
30-
depending on your Xpath knowledge you could prefer one or other.
30+
depending on your Xpath knowledge you could prefer one or other. We recommend using the
31+
[configured using Xpath](#configuration-based-in-xpath) approach.
3132

3233
### Configuration based in dataset
3334

@@ -99,7 +100,7 @@ ScrapedDataset::create([
99100
In this dataset we want the text `My title` to be labeled as title, and we also have a list of images that we want
100101
to be labeled as images. With this we have the flexibility to pick items one by one or in lists.
101102

102-
Sometimes we want to label some text that it is not clean in the HTML because it could include insivible characters like
103+
Sometimes we want to label some text that it is not clean in the HTML because it could include invisible characters like
103104
`\r\n`. To avoid dealing with that, the dataset allows you to add regular expressions.
104105

105106
Example with `body` field as regexp:
@@ -110,6 +111,7 @@ use Softonic\LaravelIntelligentScraper\Scraper\Models\ScrapedDataset;
110111

111112
ScrapedDataset::create([
112113
'url' => 'https://test.c/p/my-objective',
114+
'type' => 'Item-definition-1',
113115
'data' => [
114116
'title' => 'My title',
115117
'body' => regexp('/^Body starts here, but it is do long that.*$/si'),
@@ -154,33 +156,41 @@ specific page, you you can write a list of Xpath that will be checked in order g
154156

155157
## Usage
156158

157-
After configure the scraper, you just need to execute it like:
159+
After configuring the scraper, you can request a specific scrape using the `scrape` helper:
158160
```php
159161
<?php
160-
$scraper = resolve(\Softonic\LaravelIntelligentScraper\Scraper\Scraper::class);
161-
$data = $scraper->getData('https://test.c/p/my-objective', 'Item-definition-1');
162-
163-
/**
164-
* Item-definition-1 defined as 3 fields tagged as: title, body and images
165-
*/
166-
echo var_export($data);
167-
/**
168-
* Output:
169-
* [
170-
* 'title' => ['My title'].
171-
* 'body' => ['This is the body content I want to get'],
172-
* 'images' => [
173-
* 'https://test.c/images/1.jpg',
174-
* 'https://test.c/images/2.jpg',
175-
* 'https://test.c/images/3.jpg',
176-
* ],
177-
* ]
178-
*/
179162

163+
scrape('https://test.c/p/my-objective', 'Item-definition-1');
164+
```
165+
166+
The scrape produces a `\Softonic\LaravelIntelligentScraper\Scraper\Events\Scraped` event if everything worked as expected.
167+
Attach a listener to that event to receive the data.
168+
169+
```php
170+
$event->scrapeRequest->url // Url scraped
171+
$event->scrapeRequest->type // Request type
172+
$event->data // Contains all the data in a [ 'fieldName' => 'value' ] format.
180173
```
181174

182175
All the output fields are arrays that can contain one or more results.
183176

177+
If the scrape fails, a `\Softonic\LaravelIntelligentScraper\Scraper\Events\ScrapeFailed` event is fired with the
178+
scrape request information.
179+
```php
180+
$event->scrapeRequest->url // Url scraped
181+
$event->scrapeRequest->type // Request type
182+
```
183+
184+
### Queue workers
185+
186+
You need two workers: one for the default queue and another for the `configure` queue. The `configure` worker
187+
should be a single worker to avoid parallel configurations.
188+
189+
```bash
190+
php artisan queue:work # As many as you want
191+
php artisan queue:work --queue=configure # Just one
192+
```
193+
184194
## Testing
185195

186196
`softonic/laravel-intelligent-scraper` has a [PHPUnit](https://phpunit.de) test suite and a coding style compliance test suite using [PHP CS Fixer](http://cs.sensiolabs.org/).
@@ -201,50 +211,87 @@ $ docker-compose run psysh
201211
The scraper is auto-configurable, but it needs an initial dataset or a configuration.
202212
The dataset tells the configurator which data you want and how to label it.
203213

204-
![Scrape process](./docs/images/diagram.png "Scrape process")
214+
There are three services that have unique responsibilities and are connected using the event system.
215+
### Scrape
216+
217+
It is fired when the system receives a `\Softonic\LaravelIntelligentScraper\Scraper\Events\ScrapeRequest` event. It
218+
can be done using our `scrape($url, $type)` helper function.
205219

206-
To be reconfigurable and conserve the dataset freshness the scraper store the latest data scraped.
220+
![Scrape process](./docs/images/scrape_diagram.png "Scrape process")
207221

208222
```
209223
# Powered by https://code2flow.com/app
210-
function calculate configuration {
211-
if(!Has dataset?) {
212-
goto fail;
213-
}
214-
Extract configuration using dataset;
215-
if(!Has extracted configuration?) {
216-
goto fail;
217-
}
218-
}
219-
220-
Scrape url 'https://test.c/p/my-onjective' using 'Item-definition-1';
224+
Scrape Request 'https://test.c/p/my-objective' using 'Item-definition-1';
221225
try {
222226
load configuration;
223227
}
224228
catch(Missing config) {
225-
call calculate configuration;
229+
goto fail;
226230
}
227231
228232
extract data using configuration;
229233
// It could be produced by old configuration
230234
if(Error extracting data) {
231-
call calculate configuration;
235+
goto fail;
236+
}
237+
238+
fire Scraped Event;
239+
return;
240+
241+
fail:
242+
fire InvalidConfiguration Event;
243+
```
244+
245+
246+
### Update dataset
247+
248+
To stay reconfigurable and keep the dataset fresh, the scraper automatically stores the latest scraped data.
249+
250+
![Update dataset process](./docs/images/update_dataset_diagram.png "Update dataset process")
251+
252+
```
253+
# Powered by https://code2flow.com/app
254+
Receive Scraped event;
255+
Remove oldest scraped data;
256+
Store scraped data;
257+
Scrape dataset updated;
258+
```
259+
260+
### Configure Scraper
261+
262+
If an InvalidConfiguration event is fired, the system tries to calculate a new configuration to get the information from
263+
ScrapeRequest.
264+
265+
![Configuration process](./docs/images/configure_diagram.png "Configuration process")
266+
267+
```
268+
# Powered by https://code2flow.com/app
269+
Invalid Configuration for ScrapeRequest;
270+
271+
try {
272+
calculate configuration;
232273
}
274+
catch(Cannot be reconfigured) {
275+
goto fail;
276+
}
277+
233278
extract data using configuration;
234-
// It could be produced because the dataset does not have all the page variations
235279
if(Error extracting data) {
236280
goto fail;
237281
}
238282
239-
goto success
240-
283+
Store new configuration;
284+
Scraped data;
285+
return;
241286
fail:
287+
Fire ScrapeFailed Event;
242288
No scraped data;
243-
return;
244-
success:
245-
Scraped data;
246289
```
247290

291+
This process could produce two different events:
292+
* Scraped: All worked as expected and the page was scraped
293+
* ScrapeFailed: The scrape couldn't be done after recalculating the configuration, so a manual configuration action is needed to fix it.
294+
248295
## License
249296

250297
The Apache 2.0 license. Please see [LICENSE](LICENSE) for more information.
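
For the `Scraped` event described in the README changes above, a listener could look roughly like the sketch below. It is not taken from the package: `ScrapeRequest` and `Scraped` are minimal stand-ins that model only the public properties the README names (`scrapeRequest->url`, `scrapeRequest->type`, `data`), their constructors are assumptions, and `StoreScrapedItem` is a hypothetical listener name.

```php
<?php

// Stand-ins so the sketch runs on its own. Only the public properties
// named in the README are modeled; the real constructors may differ.
final class ScrapeRequest
{
    public $url;
    public $type;

    public function __construct(string $url, string $type)
    {
        $this->url  = $url;
        $this->type = $type;
    }
}

final class Scraped
{
    public $scrapeRequest;
    public $data;

    public function __construct(ScrapeRequest $scrapeRequest, array $data)
    {
        $this->scrapeRequest = $scrapeRequest;
        $this->data          = $data;
    }
}

// Hypothetical listener: Laravel calls handle() with the event instance.
final class StoreScrapedItem
{
    public $stored = [];

    public function handle(Scraped $event): void
    {
        // Every output field is an array of one or more results,
        // so a single-valued field is read from index 0.
        $this->stored[$event->scrapeRequest->url] = [
            'type'  => $event->scrapeRequest->type,
            'title' => $event->data['title'][0] ?? null,
        ];
    }
}

$listener = new StoreScrapedItem();
$listener->handle(new Scraped(
    new ScrapeRequest('https://test.c/p/my-objective', 'Item-definition-1'),
    ['title' => ['My title'], 'images' => ['https://test.c/images/1.jpg']]
));
```

In a Laravel application the listener would normally be mapped to the package's `Scraped` class in the `$listen` array of the application's `EventServiceProvider`, rather than being called by hand as it is here.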

Diff for: docs/images/configure_diagram.png

102 KB

Diff for: docs/images/diagram.png

-170 KB
Binary file not shown.

Diff for: docs/images/scrape_diagram.png

112 KB

Diff for: docs/images/update_dataset_diagram.png

41 KB

Diff for: src/Scraper.php

-96
This file was deleted.

Diff for: src/Scraper/Application/Configurator.php

+3-2
@@ -5,6 +5,7 @@
55
use Goutte\Client;
66
use Illuminate\Support\Collection;
77
use Illuminate\Support\Facades\Log;
8+
use Softonic\LaravelIntelligentScraper\Scraper\Exceptions\ConfigurationException;
89
use Softonic\LaravelIntelligentScraper\Scraper\Models\Configuration;
910
use Softonic\LaravelIntelligentScraper\Scraper\Models\ScrapedDataset;
1011
use Symfony\Component\DomCrawler\Crawler;
@@ -68,7 +69,7 @@ private function getCrawler($scrapedData)
6869
* If the data is not valid anymore, it is deleted from dataset.
6970
*
7071
* @param ScrapedDataset $scrapedData
71-
* @param Crawler $crawler
72+
* @param Crawler $crawler
7273
*
7374
* @return array
7475
*/
@@ -130,7 +131,7 @@ private function checkConfiguration($data, Collection $finalConfig)
130131
$fieldsExpected = array_keys($data);
131132

132133
$fieldsMissing = implode(',', array_diff($fieldsExpected, $fieldsFound));
133-
throw new \UnexpectedValueException("Field(s) \"{$fieldsMissing}\" not found.", 0);
134+
throw new ConfigurationException("Field(s) \"{$fieldsMissing}\" not found.", 0);
134135
}
135136
}
136137
}
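
The field check touched by this hunk can be sketched in isolation as follows. This is an approximation under stated assumptions: `ConfigurationException` is a stand-in whose real parent class is not shown in the diff, `assertAllFieldsConfigured` is a hypothetical name, and `$fieldsFound` is passed in directly because the code that computes it is outside the hunk.

```php
<?php

// Stand-in for the package's ConfigurationException; the real class
// hierarchy is not visible in this diff.
class ConfigurationException extends \Exception
{
}

/**
 * Throw when a labeled field has no configuration.
 *
 * @param array $data        Labeled data, keyed by field name.
 * @param array $fieldsFound Field names that already have a configuration.
 */
function assertAllFieldsConfigured(array $data, array $fieldsFound): void
{
    $fieldsExpected = array_keys($data);
    $missing        = array_diff($fieldsExpected, $fieldsFound);

    if ($missing !== []) {
        $fieldsMissing = implode(',', $missing);
        throw new ConfigurationException("Field(s) \"{$fieldsMissing}\" not found.", 0);
    }
}

try {
    assertAllFieldsConfigured(['title' => 'My title', 'body' => 'text'], ['title']);
    $message = null;
} catch (ConfigurationException $e) {
    $message = $e->getMessage();
}
```

A dedicated exception type presumably lets callers catch configuration failures specifically, instead of every `\UnexpectedValueException` raised during configuration.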

Diff for: src/Scraper/Application/XpathFinder.php

-3
@@ -3,7 +3,6 @@
33
namespace Softonic\LaravelIntelligentScraper\Scraper\Application;
44

55
use Goutte\Client as GoutteClient;
6-
use Softonic\LaravelIntelligentScraper\Scraper\Events\Scraped;
76
use Softonic\LaravelIntelligentScraper\Scraper\Exceptions\MissingXpathValueException;
87

98
class XpathFinder
@@ -49,8 +48,6 @@ public function extract(string $url, $configs): array
4948
});
5049
}
5150

52-
event(new Scraped($url, $configs[0]['type'], $result));
53-
5451
return $result;
5552
}
5653
}

Diff for: src/Scraper/Events/InvalidConfiguration.php

+16
@@ -0,0 +1,16 @@
1+
<?php
2+
3+
namespace Softonic\LaravelIntelligentScraper\Scraper\Events;
4+
5+
class InvalidConfiguration
6+
{
7+
/**
8+
* @var ScrapeRequest
9+
*/
10+
public $scrapeRequest;
11+
12+
public function __construct(ScrapeRequest $scrapeRequest)
13+
{
14+
$this->scrapeRequest = $scrapeRequest;
15+
}
16+
}

Diff for: src/Scraper/Events/ScrapeFailed.php

+26
@@ -0,0 +1,26 @@
1+
<?php
2+
3+
namespace Softonic\LaravelIntelligentScraper\Scraper\Events;
4+
5+
use Illuminate\Foundation\Events\Dispatchable;
6+
use Illuminate\Queue\SerializesModels;
7+
8+
class ScrapeFailed
9+
{
10+
use Dispatchable, SerializesModels;
11+
12+
/**
13+
* @var ScrapeRequest
14+
*/
15+
public $scrapeRequest;
16+
17+
/**
18+
* Create a new event instance.
19+
*
20+
* @param ScrapeRequest $scrapeRequest
21+
*/
22+
public function __construct(ScrapeRequest $scrapeRequest)
23+
{
24+
$this->scrapeRequest = $scrapeRequest;
25+
}
26+
}
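
The way the three services from the README (Scrape, Update dataset, Configure Scraper) hand work to each other through events can be illustrated with a small in-memory dispatcher. This is a toy model, not the package's code: event names are plain strings standing in for the event classes, `$configurations`, `$datasets` and `$latest` are hypothetical stores, and the data extraction step is left out.

```php
<?php

// Minimal in-memory dispatcher standing in for Laravel's event system.
$listeners = [];
$fired     = [];

$listen = function (string $event, callable $listener) use (&$listeners): void {
    $listeners[$event][] = $listener;
};

$fire = function (string $event, array $request) use (&$listeners, &$fired): void {
    $fired[] = $event;
    foreach ($listeners[$event] ?? [] as $listener) {
        $listener($request);
    }
};

// Hypothetical stores, keyed by scrape type.
$configurations = [];
$datasets       = ['Item-definition-1' => ['title' => '//h1']];
$latest         = [];

// Scrape: without a usable configuration, hand over to the configurator.
$listen('ScrapeRequest', function (array $request) use (&$configurations, $fire): void {
    if (!isset($configurations[$request['type']])) {
        $fire('InvalidConfiguration', $request);
        return;
    }
    $fire('Scraped', $request);
});

// Configure Scraper: try to build a configuration from the dataset.
$listen('InvalidConfiguration', function (array $request) use (&$configurations, $datasets, $fire): void {
    if (!isset($datasets[$request['type']])) {
        $fire('ScrapeFailed', $request);
        return;
    }
    $configurations[$request['type']] = $datasets[$request['type']];
    $fire('Scraped', $request);
});

// Update dataset: remember the most recent successful scrape per type.
$listen('Scraped', function (array $request) use (&$latest): void {
    $latest[$request['type']] = $request['url'];
});

$fire('ScrapeRequest', ['url' => 'https://test.c/p/my-objective', 'type' => 'Item-definition-1']);
$fire('ScrapeRequest', ['url' => 'https://test.c/p/other', 'type' => 'Unknown-type']);
```

The first request has a dataset, so it ends in `Scraped`; the second has none, so it ends in `ScrapeFailed`, which is the case the README says needs manual configuration.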
