Commit 2a9130f

Merge pull request #3 from masterT/params-schema
Params schema
2 parents 2a3157f + baf014c commit 2a9130f

5 files changed

Lines changed: 185 additions & 41 deletions

File tree

README.md

Lines changed: 30 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -96,48 +96,22 @@ var options = {
9696
var scraper = yoloScraper.createScraper(options);
9797
```
9898

99-
#### Returned scraper function
99+
#### `options.paramsSchema`
100100

101-
To use your scraper function, pass the params of your scraping request, and a callback function.
102-
103-
```js
104-
scraper(params, function (error, data) {
105-
if (error) {
106-
// handle the `error`
107-
} else {
108-
// do something with `data`
109-
}
110-
});
111-
```
112-
113-
When a request error occurred, the callback `error` argument will be an instance of _Error_ and the `data` will be _null_.
114-
115-
##### Case `options.validateList = false`
116-
117-
When an validation error occurred, the callback `error` argument will be an instance of _ValidationError_ and the `data` will be _null_.
118-
119-
Otherwise, the `error` will be _null_ and `data` will be the returned value of `options.extract`.
120-
121-
122-
##### Case `options.validateList = true`
123-
124-
When an validation errors occurred, the callback `error` argument will be an instance of _ListValidationError_, otherwise it will be _null_.
125-
126-
If the value returned by `options.extract` is not an Array, `error` will be an instance of _Error_.
127-
128-
The `data` always be an _Array_ that only contains the **valid** item returned by `options.extract`.
101+
The [JSON schema](https://spacetelescope.github.io/understanding-json-schema/) that defines the shape of the params accepted by your scraper function, which are then passed to `options.request`. When the params do not match the schema, the scraper function throws an _Error_.
129102

103+
Optional
130104

131105
#### `options.request = function(params)`
132106

133-
Function that takes the *same argument* passed to your scraper function. It returns the options to pass to the [request ](https://www.npmjs.com/package/request) module to make the request.
107+
Function that takes the arguments passed to your scraper function and returns the options to pass to the [request](https://www.npmjs.com/package/request) module to make the network request.
134108

135109
**Required**
136110

137111

138112
#### `options.extract = function(response, body, $)`
139113

140-
Function that takes [request](https://www.npmjs.com/package/request) response, the response body and a [cheerio](https://www.npmjs.com/package/cheerio) instance. It returns the extracted data you want.
114+
Function that takes the [request](https://www.npmjs.com/package/request) response, the response body (_String_) and a [cheerio](https://www.npmjs.com/package/cheerio) instance. It returns the extracted data you want.
141115

142116
**Required**
143117

@@ -158,7 +132,7 @@ Optional, default: `{}`
158132

159133
#### `options.ajvOptions`
160134

161-
The option to pass to [ajv](https://www.npmjs.com/package/ajv) when it compiles the schema.
135+
The options to pass to [ajv](https://www.npmjs.com/package/ajv) when it compiles the JSON schemas.
162136

163137
Optional, default: `{allErrors: true}` - it checks all rules and collects all errors
164138

@@ -170,6 +144,30 @@ Use this option to validate each item of the data extracted **individually**. Wh
170144
Optional, default: `false`
171145

172146

147+
#### scraper function
148+
149+
To use your scraper function, pass the params to send to `options.request`, and a callback function.
150+
151+
```js
152+
scraper(params, function (error, data) {
153+
if (error) {
154+
// handle the `error`
155+
} else {
156+
// do something with `data`
157+
}
158+
});
159+
```
160+
161+
##### callback(error, data)
162+
163+
- When a network request error occurs, the callback `error` argument will be an _Error_ and `data` will be _null_.
164+
165+
- When `options.validateList = false` and a validation error occurs, `error` will be a _ValidationError_ and `data` will be _null_. Otherwise, `error` will be _null_ and `data` will be the value returned by `options.extract`.
166+
167+
- When `options.validateList = true` and validation errors occur, `error` will be a _ListValidationError_; otherwise it will be _null_. If the value returned by `options.extract` is not an Array, `error` will be an instance of _Error_. The `data` will always be an _Array_ that contains only the **valid** items returned by `options.extract`. An `error` that is a _ListValidationError_ does not mean there is no `data`!
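The three callback outcomes listed above can be told apart with `instanceof` checks. The following is only a sketch: `ValidationError` and `ListValidationError` here are hypothetical stand-ins declared locally so the snippet runs on its own, and `describeResult` is an illustrative helper, not part of the library.

```javascript
// Stand-in error classes (assumption): the library ships its own
// ValidationError and ListValidationError; these copies only make
// the sketch self-contained.
class ValidationError extends Error {}
class ListValidationError extends Error {}

// Hypothetical helper that summarizes what a scraper callback received.
function describeResult(error, data) {
  if (error instanceof ListValidationError) {
    // validateList = true: some items failed, but `data` still holds the valid ones.
    return 'partial: ' + data.length + ' valid item(s)';
  }
  if (error instanceof ValidationError) {
    // validateList = false: the extracted data as a whole was rejected.
    return 'invalid: no data';
  }
  if (error) {
    // Any other Error, for example a network request failure.
    return 'failed: ' + error.message;
  }
  return 'ok';
}

console.log(describeResult(new ListValidationError('2 invalid items'), [{ name: 'a' }]));
console.log(describeResult(new Error('timeout'), null));
```

Checking for _ListValidationError_ first keeps the partial `data` from being discarded by a generic `if (error)` branch.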
168+
169+
170+
173171
## dependencies
174172

175173
- [request](https://www.npmjs.com/package/request) - Simplified HTTP request client.

examples/usingParamsSchema.js

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
var yoloScraper = require('../lib/index.js');
2+
3+
4+
var scraper = yoloScraper.createScraper({
5+
6+
paramsSchema: {
7+
"$schema": "http://json-schema.org/draft-04/schema#",
8+
"type": "string",
9+
"minLength": 1
10+
},
11+
12+
request: function (username) {
13+
return 'https://www.npmjs.com/~' + username.toLowerCase();
14+
},
15+
16+
extract: function (response, body, $) {
17+
return $('.collaborated-packages li').toArray().map(function (element) {
18+
var $element = $(element);
19+
return {
20+
name: $element.find('a').text(),
21+
url: $element.find('a').attr('href'),
22+
version: $element.find('strong').text()
23+
};
24+
});
25+
},
26+
27+
schema: {
28+
"$schema": "http://json-schema.org/draft-04/schema#",
29+
"type" : "array",
30+
"items": {
31+
"type": "object",
32+
"additionalProperties": false,
33+
"properties": {
34+
"name": { "type": "string" },
35+
"url": { "type": "string", "format": "uri" },
36+
"version": { "type": "string", "pattern": "^v\\d+\\.\\d+\\.\\d+$" }
37+
},
38+
"required": [ "name", "url", "version" ]
39+
}
40+
}
41+
42+
});
43+
44+
var validParams = "masterT";
45+
var invalidParams = "";
46+
47+
scraper(validParams, function (error, data) {
48+
// scraper(invalidParams, function (error, data) {
49+
if (error) {
50+
console.log('error:', error);
51+
} else {
52+
console.log('data:', data);
53+
}
54+
});
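In this commit, invalid params cause the scraper function to throw before any request is made, rather than passing the error to the callback. Swapping in `invalidParams` above would therefore raise an exception. A minimal sketch of guarding the call with `try`/`catch` follows; `fakeScraper` is a hypothetical stand-in that uses a hand-written length check instead of ajv and makes no network request, so the snippet runs without the library.

```javascript
// Hypothetical stand-in for a scraper created with the paramsSchema
// {type: "string", minLength: 1}. The real library validates with ajv.
function fakeScraper(params, callback) {
  if (typeof params !== 'string' || params.length < 1) {
    // Thrown synchronously, mirroring how the commit handles invalid params.
    throw new Error('params is invalid');
  }
  // No network request here; pretend the scrape succeeded.
  callback(null, ['data for ' + params]);
}

// Illustrative helper: reports whether the call threw or reached the callback.
// It can return the outcome directly only because fakeScraper is synchronous.
function scrapeOrReport(params) {
  var outcome;
  try {
    fakeScraper(params, function (error, data) {
      outcome = error ? 'callback error' : 'data: ' + data.join(', ');
    });
  } catch (error) {
    outcome = 'thrown: ' + error.message;
  }
  return outcome;
}

console.log(scrapeOrReport('masterT'));
console.log(scrapeOrReport(''));
```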

lib/createScraper.js

Lines changed: 40 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -5,41 +5,73 @@ var cheerio = require('cheerio'),
55
ListValidationError = require('./ListValidationError.js');
66

77

8+
function isObject(value) {
9+
return typeof value === 'object' && value !== null && !Array.isArray(value);
10+
}
11+
12+
13+
function isFunction(value) {
14+
return typeof value === 'function';
15+
}
16+
17+
18+
function isArray(value) {
19+
return Array.isArray(value);
20+
}
21+
22+
23+
function isBoolean(value) {
24+
return typeof value === 'boolean';
25+
}
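One reason a helper like `isObject` is stricter than the `typeof value === 'object'` checks it replaces: `typeof` also reports `'object'` for `null` and for arrays. The sketch below copies `isObject` from this diff and prints both results side by side for a few sample values.

```javascript
// Same helper as in the diff above.
function isObject(value) {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Compare the old `typeof` check with the new helper.
var candidates = [{}, null, [], 'text'];
candidates.forEach(function (value) {
  console.log(JSON.stringify(value), typeof value === 'object', isObject(value));
});
```

With the old check, `schema: null` would not have triggered the "Expect options.schema to be an object" error; with `isObject` it does, which is what the new spec cases exercise.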
26+
27+
828
module.exports = function (options) {
929

10-
if (typeof options.request !== 'function') {
30+
if (!isFunction(options.request)) {
1131
throw new Error("Expect options.request to be a function");
1232
}
13-
if (typeof options.extract !== 'function') {
33+
if (!isFunction(options.extract)) {
1434
throw new Error("Expect options.extract to be a function");
1535
}
16-
if (typeof options.schema !== 'object') {
36+
if (!isObject(options.schema)) {
1737
throw new Error("Expect options.schema to be an object");
1838
}
19-
if (options.hasOwnProperty('validateList') && typeof options.validateList !== 'boolean') {
39+
if (options.hasOwnProperty('validateList') && !isBoolean(options.validateList)) {
2040
throw new Error("Expect options.validateList to be a boolean");
2141
}
42+
if (options.hasOwnProperty('paramsSchema') && !isObject(options.paramsSchema)) {
43+
throw new Error("Expect options.paramsSchema to be an object");
44+
}
2245

2346
var cheerioOptions = {};
24-
if (typeof options.cheerioOptions === 'object') {
47+
if (isObject(options.cheerioOptions)) {
2548
cheerioOptions = options.cheerioOptions;
2649
}
2750

2851
var ajvOptions = {allErrors: true};
29-
if (typeof options.ajvOptions === 'object') {
52+
if (isObject(options.ajvOptions)) {
3053
ajvOptions = options.ajvOptions;
3154
}
3255

3356
// compile the JSON schema
3457
var ajv = new Ajv(ajvOptions);
58+
var validateParamsSchema;
3559
var validateSchema = ajv.compile(options.schema);
60+
if (options.paramsSchema) {
61+
validateParamsSchema = ajv.compile(options.paramsSchema);
62+
}
3663

3764
return function (params, callback) {
3865

39-
if (typeof callback !== 'function') {
66+
if (!isFunction(callback)) {
4067
throw new Error("Expect callback to be a function");
4168
}
4269

70+
if (validateParamsSchema && !validateParamsSchema(params)) {
71+
var paramsError = ajv.errorsText(validateParamsSchema.errors, {dataVar: 'params'});
72+
throw new Error(paramsError);
73+
}
74+
4375
var requestOption = options.request(params);
4476

4577
request(requestOption, function (error, response, body) {
@@ -55,7 +87,7 @@ module.exports = function (options) {
5587
if (options.validateList) {
5688
var validationErrors = [];
5789
var validItems = [];
58-
if (!Array.isArray(extractedData)) {
90+
if (!isArray(extractedData)) {
5991
callbackError = new Error('Expect the extracted data to be an array when using options.validateList');
6092
} else {
6193
extractedData.forEach(function (item) {

package.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
{
22
"name": "yolo-scraper",
3-
"version": "0.1.0",
3+
"version": "0.2.0",
44
"description": "A simple way to structure your web scraper.",
55
"main": "lib/index.js",
66
"keywords": [

spec/yoloScraperSpec.js

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,16 @@ describe("createScraper", function () {
2323
}).toThrowError(Error, "Expect options.request to be a function");
2424
});
2525

26+
it("throws an error when property `request` is not a function", function () {
27+
expect(function () {
28+
createScraper({
29+
request: null,
30+
extract: function () {},
31+
schema: {}
32+
});
33+
}).toThrowError(Error, "Expect options.request to be a function");
34+
});
35+
2636
it("throws an error without function property `extract`", function () {
2737
expect(function () {
2838
createScraper({
@@ -32,6 +42,16 @@ describe("createScraper", function () {
3242
}).toThrowError(Error, "Expect options.extract to be a function");
3343
});
3444

45+
it("throws an error when property `extract` is not a function", function () {
46+
expect(function () {
47+
createScraper({
48+
request: function () {},
49+
extract: null,
50+
schema: {}
51+
});
52+
}).toThrowError(Error, "Expect options.extract to be a function");
53+
});
54+
3555
it("throws an error without object property `schema`", function () {
3656
expect(function () {
3757
createScraper({
@@ -41,6 +61,16 @@ describe("createScraper", function () {
4161
}).toThrowError(Error, "Expect options.schema to be an object");
4262
});
4363

64+
it("throws an error when property `schema` is not an object", function () {
65+
expect(function () {
66+
createScraper({
67+
request: function () {},
68+
extract: function () {},
69+
schema: null
70+
});
71+
}).toThrowError(Error, "Expect options.schema to be an object");
72+
});
73+
4474
it("throws an error when property `validateList` is not a boolean", function () {
4575
expect(function () {
4676
createScraper({
@@ -52,6 +82,17 @@ describe("createScraper", function () {
5282
}).toThrowError(Error, "Expect options.validateList to be a boolean");
5383
});
5484

85+
it("throws an error when property `paramsSchema` is not an object", function () {
86+
expect(function () {
87+
createScraper({
88+
request: function () {},
89+
extract: function () {},
90+
schema: {},
91+
paramsSchema: null
92+
});
93+
}).toThrowError(Error, "Expect options.paramsSchema to be an object");
94+
});
95+
5596
it("returns a function with properties `request`, `extract` and `schema`, and without `validateList`", function () {
5697
var scraper = createScraper({
5798
request: function () {},
@@ -70,6 +111,25 @@ describe("createScraper", function () {
70111
.pend("Don't know how to: mock request module and expect it to receive options.cheerioOptions");
71112

72113

114+
describe("when using paramsSchema", function () {
115+
116+
it("validates the params", function () {
117+
var options = scraperOptions();
118+
options.paramsSchema = {
119+
"type": "string",
120+
"minLength": 1
121+
};
122+
var invalidParams = "";
123+
var scraper = createScraper(options);
124+
125+
expect(function () {
126+
scraper(invalidParams, function(error, data) {});
127+
}).toThrowError(Error, /params/);
128+
});
129+
130+
});
131+
132+
73133
describe("when validateList is false", function () {
74134
var requestBody = fixture("list.html"),
75135
params = 'numbers',
