## parameters
Below are some of the parameters you can use to initialize the `PreParser` object from the `preparser` package:
| Parameters | Type | Description |
| stop_when_task_failed | bool | Whether to stop when a request to a URL fails. Default is `True`. |
| threading_numbers | int | The maximum number of threads in the threading pool. Default is `3`. |
| checked_same_site | bool | Whether to add extra headers to pretend the request comes from the same site when parsing data, to avoid `CORS` blocks. Default is `True`. |
| html_dynamic_scope | list or None | Selects a specific scope DOM of the whole page html. Default is `None`, which stands for the whole page.<br />If this value is set, the parameter should be a 2-item list. <br /> 1. The first value is a tag <a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector">selector</a>. <br /> For example, 'div#main' means a div tag with 'id=main', and 'div.test' will match the first div tag with 'class=test'. Don't make the selector too complex or let it match multiple parent DOM nodes, otherwise you can't get their inner_html() correctly, or the wait may time out. You can then get the BeautifulSoup object of the inner_html of the selected tag in the `request_call_back_func`. <br /> 2. The second value should be one of the values below: <br />`attached`: wait for the element to be present in the DOM. <br />`detached`: wait for the element to not be present in the DOM. <br />`visible`: wait for the element to have a non-empty bounding box and no 'visibility:hidden'. Note that an element without any content or with 'display:none' has an empty bounding box and is not considered visible. <br />`hidden`: wait for the element to be either detached from the DOM, or to have an empty bounding box or 'visibility:hidden'. This is the opposite of the 'visible' option. |
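The expected shape of `html_dynamic_scope` from the table above can be sketched as a small validation helper (the function name is illustrative, not part of the package):

```python
# Hypothetical helper illustrating the html_dynamic_scope contract:
# either None (whole page) or a 2-item list of [css_selector, wait_state].
VALID_STATES = {'attached', 'detached', 'hidden', 'visible'}

def check_html_dynamic_scope(scope):
    if scope is None:
        return True  # None means: parse the whole page
    return (isinstance(scope, list) and len(scope) == 2
            and isinstance(scope[0], str)
            and scope[1] in VALID_STATES)

print(check_html_dynamic_scope(['div#main', 'attached']))  # True
print(check_html_dynamic_scope(['div#main', 'loaded']))    # False
```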
## example
```python
if __name__ == "__main__":
    parser = PreParser(
        url_list=url_list,
        request_call_back_func=handle_preparser_result,
        parser_mode='api',  # choose the mode you need: 'api', 'html', or 'html_dynamic'
        start_threading=True,
        threading_mode='single',
        cached_data=True,
        # ... (remaining arguments continue in the full example)
    )
```
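The `handle_preparser_result` callback used above must follow the contract described in the parameters table: return `None` to mark a task as failed, and any not-`None` object to mark it as succeeded. A minimal sketch, assuming `'api'` mode (where the second argument is the parsed JSON dict) and an illustrative `'items'` key:

```python
# Sketch of a request_call_back_func for 'api' mode. The 'items' key
# is a placeholder for whatever field your API actually returns.
def handle_preparser_result(url, data):
    if not isinstance(data, dict) or 'items' not in data:
        return None          # None tells PreParser this task failed
    return data['items']     # any not-None value means success

print(handle_preparser_result("https://api.example.com", {"items": [1, 2]}))  # [1, 2]
print(handle_preparser_result("https://api.example.com", {}))                 # None
```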
Get help ➡️ [Github issue](https://github.com/BertramYe/preparser/issues)
# Update logs
* `version 2.0.6`: add the `html_dynamic_scope` parameter so users can specify the scope of the dynamic parse, which speeds up the preparser when `parser_mode` is `html_dynamic`; also move the additional tools into the `ToolsHelper` package.

* `version 2.0.5`: move the dynamic-mode browser core install from setup into the package call.

* `version 2.0.4`: test the installing process command.
A lightweight PreParser object to handle parsing tasks, with threading pools or other methods, from webpage URLs or API URLs.

Parameters:
    url_list(list): The list of URLs to parse from. Default is an empty list.
    request_call_back_func(Callable[[str, BeautifulSoup | Dict[str, Any]], Any] | None): A callback function that, depending on the parser_mode, handles the `BeautifulSoup` object or the request `json` object. If you want to signal that your business process failed, return `None`; otherwise return a not-`None` object.
    parser_mode(Literal['html','api','html_dynamic']): the pre-parsing data mode, default is `html`.
        `html`: parse the content from static html and return a `BeautifulSoup` object.
        `api`: parse the data from an api and return the `json` object.
        `html_dynamic`: parse the whole webpage html content, even content generated by dynamic js code, and return a `BeautifulSoup` object.
    cached_data(bool): whether to cache the parsed data, default is False.
    start_threading(bool): Whether to use a threading pool for parsing the data. Default is False.
    threading_mode(Literal['map','single']): the task-running mode, default is `single`.
        `map`: use the `map` func of the threading pool to distribute tasks.
        `single`: use the `submit` func to distribute tasks one by one into the threading pool.
    stop_when_task_failed(bool): whether to stop when a request to a URL fails, default is True.
    threading_numbers(int): The maximum number of threads in the threading pool. Default is 3.
    checked_same_site(bool): whether to add extra headers to pretend the request comes from the same site, default is True, to avoid the CORS block.
    html_dynamic_scope(list[str, Literal['attached', 'detached', 'hidden', 'visible']] | None): select a specific scope DOM of the whole page html; default is None, which stands for the whole page DOM.
        If this value is set, the parameter should be a 2-item list:
        1. The first value is a tag selector (see https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector).
           For example, 'div#main' means a div tag with 'id=main'; 'div.test' will match the first div tag with 'class=test'.
           Don't make the selector too complex or let it match multiple parent DOM nodes, otherwise you can't get their inner_html() correctly, or the wait may time out.
           Finally you can get the BeautifulSoup object of the inner_html of the selected tag in the `request_call_back_func`.
        2. The second value should be one of the values below:
           `attached`: wait for the element to be present in the DOM.
           `detached`: wait for the element to not be present in the DOM.
           `visible`: wait for the element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
           `hidden`: wait for the element to be either detached from the DOM, or to have an empty bounding box or `visibility:hidden`. This is the opposite of the `visible` option.

Attributes:
    url_list(list): The list of URLs to parse from.
    request_call_back_func(Callable[[str, BeautifulSoup | Dict[str, Any]], bool] | None): The callback function to process the BeautifulSoup or Json object.
    parser_mode(Literal['html','api','html_dynamic']): the pre-parsing data mode.
    cached_data(bool): whether to cache the parsed data.
    start_threading(bool): Whether to use a threading pool.
    threading_mode(Literal['map','single']): the task-running mode.
    stop_when_task_failed(bool): whether to stop when a request to a URL fails.
    threading_numbers(int): The maximum number of threads.
    checked_same_site(bool): whether to add extra headers to pretend the request comes from the same site, to avoid the CORS block.
    html_dynamic_scope(list[str, Literal['attached', 'detached', 'hidden', 'visible']] | None): get and load the specified scope's html node resources.
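The `map` and `single` threading modes described above correspond to the two standard ways of handing tasks to a Python thread pool. A minimal stdlib sketch (the URLs and worker function are placeholders, not the package's internals):

```python
from concurrent.futures import ThreadPoolExecutor

urls = ['u1', 'u2', 'u3']

def fetch(url):
    # placeholder worker standing in for the per-URL parse task
    return url.upper()

with ThreadPoolExecutor(max_workers=3) as pool:
    # 'map' mode: hand all tasks over at once; results come back in input order
    map_results = list(pool.map(fetch, urls))
    # 'single' mode: submit tasks one by one and collect their futures
    futures = [pool.submit(fetch, u) for u in urls]
    single_results = [f.result() for f in futures]

print(map_results)     # ['U1', 'U2', 'U3']
print(single_results)  # ['U1', 'U2', 'U3']
```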
This function helps find the element nodes between two specified same-level element nodes, and finally returns a new BeautifulSoup object.

Parameters:
    start_node(BeautifulSoup | None): The start element node; default is None, which means start collecting from the first matched element.
    end_node(BeautifulSoup | None): The end element node; default is None, which means collect up to the last element.
    include_start_node(bool): whether to include the start node in the collected element nodes, default is False.
    include_end_node(bool): whether to include the end node in the collected element nodes, default is False.
    parent_node(BeautifulSoup | None): the parent element node that contains the start_node and end_node.
        If you set it, we only search among this node's child elements.
        By default it can be None, in which case the parent node of the start_node or end_node is used.
"""

if (not start_node) and (not end_node):
    print("error: start_node and end_node are both None !!!")
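The between-nodes selection this docstring describes can be sketched generically. The following is a minimal illustration over a plain list of sibling nodes (the function name and list-based representation are illustrative, not the package's actual implementation):

```python
def nodes_between(siblings, start_node=None, end_node=None,
                  include_start_node=False, include_end_node=False):
    # Generic sketch of "find the nodes between two same-level nodes":
    # siblings is any ordered list; start/end of None fall back to the
    # ends of the list, mirroring the None defaults described above.
    if start_node is None:
        lo = 0
    else:
        i = siblings.index(start_node)
        lo = i if include_start_node else i + 1
    if end_node is None:
        hi = len(siblings)
    else:
        j = siblings.index(end_node)
        hi = j + 1 if include_end_node else j
    return siblings[lo:hi]

print(nodes_between(['a', 'b', 'c', 'd'], 'a', 'd'))  # ['b', 'c']
```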