This repository has been archived by the owner. It is now read-only.

Commit 9321568

Initial library base
0 parents  commit 9321568

34 files changed: +2638 -0 lines changed

Diff for: .gitignore

+4
@@ -0,0 +1,4 @@
/build
/vendor/
composer.lock
.php_cs.cache

Diff for: .php_cs

+35
@@ -0,0 +1,35 @@
<?php
$finder = PhpCsFixer\Finder::create()
    ->in(__DIR__ . '/src')
    ->in(__DIR__ . '/tests');
return PhpCsFixer\Config::create()
    ->setRules([
        '@PSR2' => true,
        'array_syntax' => ['syntax' => 'short'],
        'concat_space' => ['spacing' => 'one'],
        'new_with_braces' => true,
        'no_blank_lines_after_phpdoc' => true,
        'no_empty_phpdoc' => true,
        'no_empty_comment' => true,
        'no_leading_import_slash' => true,
        'no_trailing_comma_in_singleline_array' => true,
        'no_unused_imports' => true,
        'ordered_imports' => ['importsOrder' => null, 'sortAlgorithm' => 'alpha'],
        'phpdoc_add_missing_param_annotation' => ['only_untyped' => true],
        'phpdoc_align' => true,
        'phpdoc_no_empty_return' => true,
        'phpdoc_order' => true,
        'phpdoc_scalar' => true,
        'phpdoc_to_comment' => true,
        'psr0' => false,
        'psr4' => true,
        'return_type_declaration' => ['space_before' => 'none'],
        'single_blank_line_before_namespace' => true,
        'single_quote' => true,
        'space_after_semicolon' => true,
        'ternary_operator_spaces' => true,
        'trailing_comma_in_multiline_array' => true,
        'trim_array_spaces' => true,
        'whitespace_after_comma_in_array' => true,
    ])
    ->setFinder($finder);

Diff for: .travis.yml

+32
@@ -0,0 +1,32 @@
language: php

sudo: false

matrix:
  include:
    - php: 7.1
      env: STATIC_ANALYSIS=true VALIDATE_CODING_STYLE=true
    - php: 7.2
      env: STATIC_ANALYSIS=true VALIDATE_CODING_STYLE=true
    - php: master
      env: STATIC_ANALYSIS=true VALIDATE_CODING_STYLE=false
  allow_failures:
    - php: master
  fast_finish: true

cache:
  directories:
    - $HOME/.composer/cache

before_install:
  - travis_retry composer self-update

install:
  - travis_retry composer update --no-interaction --prefer-source

script:
  - composer phpunit

after_script:
  - if [ "$VALIDATE_CODING_STYLE" == "true" ]; then composer phpcs; fi
  - if [ "$STATIC_ANALYSIS" == "true" ]; then composer phpstan; fi

Diff for: LICENSE

+13
@@ -0,0 +1,13 @@
Copyright 2018 Softonic International S.A.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Diff for: README.md

+253
@@ -0,0 +1,253 @@
# Laravel Intelligent Scraper

[![Latest Version](https://img.shields.io/github/release/softonic/laravel-intelligent-scraper.svg?style=flat-square)](https://github.com/softonic/laravel-intelligent-scraper/releases)
[![Software License](https://img.shields.io/badge/license-Apache%202.0-blue.svg?style=flat-square)](LICENSE)
[![Build Status](https://img.shields.io/travis/softonic/laravel-intelligent-scraper/master.svg?style=flat-square)](https://travis-ci.org/softonic/laravel-intelligent-scraper)
[![Total Downloads](https://img.shields.io/packagist/dt/softonic/laravel-intelligent-scraper.svg?style=flat-square)](https://packagist.org/packages/softonic/laravel-intelligent-scraper)

This package offers a scraping solution that does not require you to know the HTML structure of the page and that
reconfigures itself automatically when a change in that structure is detected. This allows you to keep scraping for
a long time without manual intervention.

## Installation

To install, use composer:

```bash
composer require softonic/laravel-intelligent-scraper
```

To publish the scraper config, run:
```bash
php artisan vendor:publish --provider="Softonic\LaravelIntelligentScraper\ScraperProvider" --tag=config
```

## Configuration

There are two different options for the initial setup. The package can be
[configured using datasets](#configuration-based-in-dataset) or
[configured using Xpath](#configuration-based-in-xpath). Both ways produce the same result, but
depending on your Xpath knowledge you may prefer one or the other.

### Configuration based in dataset

The first step is to decide which data you want to obtain from a page. Go to the page, choose all the texts, images,
metas, etc. that you want to scrape, and label them so the scraper knows what you want.

An example from the Microsoft Store could be:
```php
<?php
use Softonic\LaravelIntelligentScraper\Scraper\Models\ScrapedDataset;

ScrapedDataset::create([
    'url' => 'https://test.c/p/my-objective',
    'type' => 'Item-definition-1',
    'data' => [
        'title' => 'My title',
        'body' => 'This is the body content I want to get',
        'images' => [
            'https://test.c/images/1.jpg',
            'https://test.c/images/2.jpg',
            'https://test.c/images/3.jpg',
        ],
    ],
]);
```

In this example we can see that we want different fields, which we labeled arbitrarily. Some of them have multiple
values, so we can scrape lists of items from pages.

With this single dataset we can train our scraper and scrape any page with the same structure.
Because pages usually have different structures depending on different variables, you should add several datasets
that cover as many page variations as possible. The scraper WILL NOT BE ABLE to scrape page variations not incorporated
in the dataset.

Once this is done, everything is ready to work. You do not need to worry about page updates as long as you have enough
data in the dataset to cover the new modifications, because the scraper recalculates its configuration on the fly.
You can check [how it works](how-it-works.md) to learn more about the internals.

The next section looks more closely at how to create a new dataset and which options are available.

#### Dataset creation

The dataset is composed of `url`, `type` and `data`.
* The `url` part is simple: you just need to indicate the URL from which you obtained the data.
* The `type` part gives an item name to the current dataset. This allows you to define multiple types.
* The `data` part is where you indicate the data you want to get and assign a label to each value.
  The data could be a list of items or a single item.

A basic example could be:
```php
<?php
use Softonic\LaravelIntelligentScraper\Scraper\Models\ScrapedDataset;

ScrapedDataset::create([
    'url' => 'https://test.c/p/my-objective',
    'type' => 'Item-definition-1',
    'data' => [
        'title' => 'My title',
        'body' => 'This is the body content I want to get',
        'images' => [
            'https://test.c/images/1.jpg',
            'https://test.c/images/2.jpg',
            'https://test.c/images/3.jpg',
        ],
    ],
]);
```

In this dataset we want the text `My title` to be labeled as title, and we also have a list of images that we want
to be labeled as images. This gives us the flexibility to pick items one by one or in lists.

Sometimes we want to label text that is not clean in the HTML because it could include invisible characters like
`\r\n`. To avoid dealing with that, the dataset allows you to add regular expressions.

Example with `body` field as regexp:

```php
<?php
use Softonic\LaravelIntelligentScraper\Scraper\Models\ScrapedDataset;

ScrapedDataset::create([
    'url' => 'https://test.c/p/my-objective',
    'data' => [
        'title' => 'My title',
        'body' => regexp('/^Body starts here, but it is so long that.*$/si'),
        'images' => [
            'https://test.c/images/1.jpg',
            'https://test.c/images/2.jpg',
            'https://test.c/images/3.jpg',
        ],
    ],
]);
```

With this change we ensure that the `body` is detected even if it has hidden characters.

**IMPORTANT** The scraper tries to find the text in all the tags including children, so if you define an unbounded
regular expression, for example `/.*Body starts.*/`, it will match the `<html>` element because the
text is inside one of its child elements. Define your regular expressions carefully.
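
The difference between an unbounded and an anchored pattern can be checked with plain `preg_match`. This is only a sketch: the two strings are made-up stand-ins for the text of the target element and of an ancestor such as `<html>`, and it does not reproduce how the scraper itself walks the document.

```php
<?php
// Hypothetical text of the element we want to label.
$bodyText = "Body starts here, and it goes on for a while.\r\n";
// Hypothetical text of an ancestor: it also contains the text of every descendant.
$htmlText = "My title\n" . $bodyText . "Footer";

$loose = '/.*Body starts.*/';
$anchored = '/^Body starts here.*$/si';

// The unbounded pattern matches the target and the ancestor alike.
$looseMatchesBody = preg_match($loose, $bodyText) === 1;
$looseMatchesHtml = preg_match($loose, $htmlText) === 1;

// The anchored pattern only matches text that begins with the expected words.
$anchoredMatchesBody = preg_match($anchored, $bodyText) === 1;
$anchoredMatchesHtml = preg_match($anchored, $htmlText) === 1;

var_dump($looseMatchesBody, $looseMatchesHtml, $anchoredMatchesBody, $anchoredMatchesHtml);
```

Only the anchored pattern tells the two strings apart, which is why anchoring with `^` and `$` is the safer choice for a dataset field.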

### Configuration based in Xpath

After you have collected all the Xpath expressions from the HTML, you just need to create the configuration models.
They look like:
```php
<?php
use Softonic\LaravelIntelligentScraper\Scraper\Models\Configuration;

Configuration::create([
    'name' => 'title',
    'type' => 'Item-definition-1',
    'xpaths' => '//*[@id="title"]',
]);

Configuration::create([
    'name' => 'category',
    'type' => 'Item-definition-1',
    'xpaths' => ['//*[@id="cat"]', '//*[@id="long-cat"]'],
]);
```

In the definition, you should give a name to the field to be scraped and identify it as a type. The xpaths field can
contain a string or an array of strings. This is because the HTML can contain different variations depending on the
specific page, so you can write a list of Xpath expressions that are checked in order, and the first result found is returned.
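
The "checked in order, first result wins" behaviour can be illustrated with PHP's built-in `DOMDocument` and `DOMXPath`. The helper name `firstMatchingXpath` and the sample HTML are hypothetical and exist only for this sketch; the package's own lookup may be implemented differently.

```php
<?php
/**
 * Return the text of the nodes matched by the first Xpath that yields results.
 * Hypothetical helper for illustration; it is not part of the package.
 */
function firstMatchingXpath(string $html, array $xpaths): array
{
    $document = new DOMDocument();
    // Silence parser warnings caused by imperfect real-world HTML.
    @$document->loadHTML($html);
    $finder = new DOMXPath($document);

    foreach ($xpaths as $xpath) {
        $nodes = $finder->query($xpath);
        if ($nodes !== false && $nodes->length > 0) {
            $values = [];
            foreach ($nodes as $node) {
                $values[] = trim($node->textContent);
            }

            return $values;
        }
    }

    // None of the expressions matched this page variation.
    return [];
}

$html = '<html><body><span id="long-cat">Games and entertainment</span></body></html>';
$category = firstMatchingXpath($html, ['//*[@id="cat"]', '//*[@id="long-cat"]']);
var_export($category);
```

Here the first expression finds nothing, so the second one supplies the value. Note that the attribute value must be quoted inside the expression (`@id="cat"`); unquoted, Xpath reads `cat` as a child element name.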

## Usage

After configuring the scraper, you just need to execute it like:
```php
<?php
$scraper = resolve(\Softonic\LaravelIntelligentScraper\Scraper\Scraper::class);
$data = $scraper->getData('https://test.c/p/my-objective', 'Item-definition-1');

/**
 * Item-definition-1 defined as 3 fields tagged as: title, body and images
 */
var_export($data);
/**
 * Output:
 * [
 *     'title' => ['My title'],
 *     'body' => ['This is the body content I want to get'],
 *     'images' => [
 *         'https://test.c/images/1.jpg',
 *         'https://test.c/images/2.jpg',
 *         'https://test.c/images/3.jpg',
 *     ],
 * ]
 */
```

All the output fields are arrays that can contain one or more results.
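
Because every field is an array, calling code has to decide how to read single-valued and multi-valued fields. A minimal sketch, using a hard-coded array shaped like the output above instead of a real `getData()` call; the `price` label is hypothetical and deliberately absent:

```php
<?php
// Sample result shaped like the documented output; in real use it would come from $scraper->getData().
$data = [
    'title' => ['My title'],
    'body' => ['This is the body content I want to get'],
    'images' => [
        'https://test.c/images/1.jpg',
        'https://test.c/images/2.jpg',
        'https://test.c/images/3.jpg',
    ],
];

// Single-valued field: take the first element, or null when nothing was scraped.
$title = $data['title'][0] ?? null;

// Multi-valued field: use the whole list.
$imageCount = count($data['images'] ?? []);

// 'price' is a hypothetical label that is missing from the result, so this falls back to null.
$price = $data['price'][0] ?? null;
```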

## Testing

`softonic/laravel-intelligent-scraper` has a [PHPUnit](https://phpunit.de) test suite and a coding style compliance test suite using [PHP CS Fixer](http://cs.sensiolabs.org/).

To run the tests, run the following command from the project folder.

``` bash
$ docker-compose run test
```

To run interactively using [PsySH](http://psysh.org/):
``` bash
$ docker-compose run psysh
```

## How it works?

The scraper configures itself automatically, but it needs an initial dataset or a manually added configuration.
The dataset tells the configurator which data you want and how to label it.

![Scrape process](./docs/images/diagram.png "Scrape process")

To remain reconfigurable and keep the dataset fresh, the scraper stores the latest scraped data.

```
# Powered by https://code2flow.com/app
function calculate configuration {
  if(!Has dataset?) {
    goto fail;
  }
  Extract configuration using dataset;
  if(!Has extracted configuration?) {
    goto fail;
  }
}

Scrape url 'https://test.c/p/my-objective' using 'Item-definition-1';
try {
  load configuration;
}
catch(Missing config) {
  call calculate configuration;
}

extract data using configuration;
// It could be produced by old configuration
if(Error extracting data) {
  call calculate configuration;
}
extract data using configuration;
// It could be produced because the dataset does not have all the page variations
if(Error extracting data) {
  goto fail;
}

goto success

fail:
No scraped data;
return;
success:
Scraped data;
```
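
The same flow can be sketched in PHP. This is an illustration under stated assumptions, not the package's code: the function `scrape` and its three callables are hypothetical, and a missing configuration or a failed extraction is modelled as a `null` return rather than an exception.

```php
<?php
/**
 * Hypothetical sketch of the flow above.
 *
 * @param callable $loadConfig      returns the stored configuration or null when missing
 * @param callable $calculateConfig rebuilds the configuration from the dataset, or returns null
 * @param callable $extract         receives a configuration and returns the data, or null on error
 */
function scrape(callable $loadConfig, callable $calculateConfig, callable $extract): ?array
{
    // Load the stored configuration, or calculate it from the dataset when it is missing.
    $config = $loadConfig() ?? $calculateConfig();
    if ($config === null) {
        return null; // No dataset, or no configuration could be extracted from it.
    }

    $data = $extract($config);
    if ($data === null) {
        // The configuration may be outdated: recalculate it once and retry.
        $config = $calculateConfig();
        if ($config === null) {
            return null;
        }
        $data = $extract($config);
    }

    // A null result here means the dataset does not cover this page variation.
    return $data;
}

// Usage with stand-in callables: the stored configuration is stale, the recalculated one works.
$stale = ['title' => '//*[@id="old-title"]'];
$fresh = ['title' => '//*[@id="title"]'];
$result = scrape(
    function () use ($stale) {
        return $stale;
    },
    function () use ($fresh) {
        return $fresh;
    },
    function (array $config) use ($fresh) {
        return $config === $fresh ? ['title' => ['My title']] : null;
    }
);
```

The single retry mirrors the diagram: one recalculation is attempted, and a second failure is reported as no scraped data.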

## License

The Apache 2.0 license. Please see [LICENSE](LICENSE) for more information.

[PSR-2]: http://www.php-fig.org/psr/psr-2/
[PSR-4]: http://www.php-fig.org/psr/psr-4/
