# Description

This is a slight parser that helps you pre-parse data from a specified website URL or API. It gets rid of the duplicated code needed to fetch responses from the specified URLs, speeds up the process with a threading pool, and lets you focus on your business logic once you have the response from the specified web pages or API URLs.

# Attention

Since this slight pre-parser is based only on the `requests` and `beautifulsoup4` modules, it is mainly meant to parse data from APIs and from pages rendered as static HTML. It cannot directly parse data that requires waiting for the whole page to finish loading (e.g. JavaScript-rendered content); that functionality may be added in the future.

```bash
python version >= 3.9
```
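To make the distinction concrete: a page "rendered as static HTML" already contains its data in the HTML source the server returns, so a plain HTML parser can extract it without running JavaScript. The sketch below uses only the standard library's `html.parser` (not `beautifulsoup4`, and not this package) purely to illustrate why a JavaScript-rendered page yields nothing to such a parser:

```python
from html.parser import HTMLParser

# static HTML: the data is already present in the source as delivered
STATIC_HTML = '<html><body><h1>Hello</h1></body></html>'

# "dynamic" HTML: the data only appears after JavaScript runs in a browser,
# so a plain request sees an empty placeholder
DYNAMIC_HTML = '<html><body><div id="app"></div><script>/* fills #app */</script></body></html>'


class TitleExtractor(HTMLParser):
    """Collects the text inside <h1> tags."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.titles.append(data)


def extract_titles(html: str) -> list:
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles


print(extract_titles(STATIC_HTML))   # ['Hello'] -- the static page yields its data
print(extract_titles(DYNAMIC_HTML))  # []        -- the JS-rendered page yields nothing
```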

# How to use

## install

```bash
$ pip install preparser
```


> GitHub source ➡️ [Github Repos](https://github.com/BertramYe/preparser)

> Also feel free to fork and modify this code. If you like the current project, please star ⭐ it, uwu.

> PyPI ➡️ [PyPI Publish](https://pypi.org/project/preparser/)

## parameters

Here below are the parameters you can use when initializing the `PreParser` object from the `preparser` package:

| Parameter | Type | Description |
| --------------------- | ----------------- | -------------------------------------------------------- |
| url_list | list | The list of URLs to parse. Default is an empty list. |
| request_call_back_func | Callable or None | A callback function that, depending on `parser_mode`, handles the `BeautifulSoup` object or the request's `json` object. If you want to mark your business processing as failed, return `None`; otherwise return any non-`None` object. |
| parser_mode | `'html'` or `'api'` | The pre-parsing mode, default is `'html'`.<br/>`'html'`: use bs4 to parse the data and return a `BeautifulSoup` object.<br/>`'api'`: use `requests` only and return a `json` object.<br/>In both modes you receive the parsed object in `request_call_back_func` if you set it; otherwise get it via `PreParser(....).cached_request_datas`. |
| cached_data | bool | Whether to cache the parsed data. Default is `False`. |
| start_threading | bool | Whether to use a threading pool for parsing the data. Default is `False`. |
| threading_mode | `'map'` or `'single'` | The task distribution mode, default is `'single'`.<br/>`'map'`: use the threading pool's `map` function to distribute tasks.<br/>`'single'`: use the `submit` function to distribute tasks one by one into the threading pool. |
| stop_when_task_failed | bool | Whether to stop when a request to a URL fails. Default is `True`. |
| threading_numbers | int | The maximum number of threads in the threading pool. Default is `3`. |
| checked_same_site | bool | Whether to add extra request headers to pretend the request comes from the same site, to work around `CORS` blocking. Default is `True`. |

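For a feel of the difference between the two `threading_mode` values, here is a sketch using the standard library's `ThreadPoolExecutor`. The package's internals may differ; this only illustrates the `map` versus `submit` distribution styles, with a stand-in `fetch` function instead of real HTTP requests:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

URLS = [
    'https://example.com/api/1',
    'https://example.com/api/2',
    'https://example.com/api/3',
]


def fetch(url: str) -> str:
    # stand-in for a real HTTP request, so the sketch runs offline
    return f'response for {url}'


# 'map' mode: hand the whole URL list to the pool at once;
# results come back in the same order as the input list
with ThreadPoolExecutor(max_workers=3) as pool:
    map_results = list(pool.map(fetch, URLS))

# 'single' mode: submit tasks one by one and collect futures;
# each result can be consumed as soon as its task completes
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(fetch, url): url for url in URLS}
    single_results = {futures[f]: f.result() for f in as_completed(futures)}

print(map_results[0])           # response for https://example.com/api/1
print(single_results[URLS[2]])  # response for https://example.com/api/3
```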

## example

```python
# test.py
from preparser import PreParser, BeautifulSoup, Json_Data, Filer


def handle_preparser_result(url: str, preparser_object: BeautifulSoup | Json_Data) -> bool:
    # here you can write whatever business logic you need

    # attention:
    # the type of preparser_object depends on the `parser_mode` of the `PreParser`:
    # 'api'  : preparser_object is a Json_Data object
    # 'html' : preparser_object is a BeautifulSoup object

    ...

    # for the final return:
    # if you want to mark the current result as failed, return None;
    # otherwise return any object which is not None.
    return preparser_object


if __name__ == "__main__":

    # the URLs to parse
    url_list = [
        'https://example.com/api/1',
        'https://example.com/api/2',
        # ... more URLs
    ]

    parser = PreParser(
        url_list=url_list,
        request_call_back_func=handle_preparser_result,
        parser_mode='api',  # depends on what you parse: 'api' or 'html'
        start_threading=True,
        threading_mode='single',
        cached_data=True,
        stop_when_task_failed=False,
        threading_numbers=3,
        checked_same_site=True
    )

    # start parsing
    parser.start_parse()

    # when all tasks have finished, you can get all the results like below:
    all_results = parser.cached_request_datas

    # if you want to terminate, just execute the function below
    # parser.stop_parse()

    # you can also use the Filer to save the final results above,
    # and then find the data in `result/test.json`
    filer = Filer('json')
    filer.write_data_into_file('result/test', [all_results])
```

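The callback's return-value convention (return `None` to mark a URL as failed, anything else as success) can be simulated without the package at all. The driver loop below is only an illustration of how `stop_when_task_failed` is described to behave, not the package's actual implementation, and `run_tasks` is a hypothetical helper written for this sketch:

```python
def run_tasks(urls, callback, stop_when_task_failed=True):
    """Illustrative driver: calls the callback per URL and treats a None
    return as a failed task, optionally stopping at the first failure."""
    results, failed = {}, []
    for url in urls:
        outcome = callback(url, {'url': url})  # fake parsed object
        if outcome is None:
            failed.append(url)
            if stop_when_task_failed:
                break
        else:
            results[url] = outcome
    return results, failed


def callback(url, parsed):
    # pretend URLs containing 'bad' fail the business check
    return None if 'bad' in url else parsed


urls = ['https://example.com/ok', 'https://example.com/bad', 'https://example.com/ok2']

results, failed = run_tasks(urls, callback, stop_when_task_failed=False)
print(len(results), failed)  # 2 ['https://example.com/bad'] -- keeps going past the failure

results, failed = run_tasks(urls, callback, stop_when_task_failed=True)
print(len(results), failed)  # 1 ['https://example.com/bad'] -- stops at the first failure
```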


# Get Help

Get help ➡️ [Github issue](https://github.com/BertramYe/preparser/issues)