
Commit 05c6613

committed
finished the basic development of the preparser
1 parent 29ad087 commit 05c6613

File tree

9 files changed

+593
-0
lines changed


.gitignore

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
*__pycache__
Pipfile
result
test.py
*.egg-info
.pypirc
dist

MANIFEST.in

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
include README.md
include LICENSE

# include setup.py  # no need: setup.py is included automatically when the package is uploaded to PyPI

# exclude the Pipfile, as it is only used to build the virtual environment
exclude Pipfile
exclude test.py
# exclude setup.py
recursive-exclude *.egg-info *.*  # files generated when running `python setup.py develop`
recursive-exclude preparser/__pycache__/ *.pyc  # files generated when running `python setup.py develop`
# remove all __pycache__ content from every directory
# global-exclude *.pyc

README.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Description

This is a lightweight parser that helps you pre-parse data from a specified website URL or API. It removes the duplicated request-handling code for each URL, speeds the process up with a threading pool, and lets you focus on your business logic once you have the response from the specified web page or API URL.

# Attention

This lightweight pre-parser is built only on the `requests` and `beautifulsoup4` modules, so it is mainly suited to parsing data from APIs and pages rendered as static HTML. It cannot directly parse data that requires waiting for the whole page to load (e.g. JavaScript-rendered pages); that feature may be added in the future.

```bash
python version >= 3.9
```

# How to use

## install

```bash
$ pip install preparser
```

> GitHub source ➡️ [Github Repos](https://github.com/BertramYe/preparser)

> Feel free to fork and modify this code. If you like the project, please star ⭐ it, uwu.

> PyPI: ➡️ [PyPI Publish](https://pypi.org/project/preparser/)

## parameters

Here are the parameters you can use to initialize the `PreParser` object from the `preparser` package:

| Parameters | Type | Description |
| --------------------- | ----------------- | -------------------------------------------------------- |
| url_list | list | The list of URLs to parse. Default is an empty list. |
| request_call_back_func | Callable or None | A callback that, depending on `parser_mode`, receives the `BeautifulSoup` object or the request's `json` object. Return `None` to mark your business processing as failed; otherwise return any non-`None` object. |
| parser_mode | `'html'` or `'api'` | The pre-parsing mode. Default is `'html'`.<br/> `html`: use bs4 to parse the data and return a `BeautifulSoup` object. <br/> `api`: use requests only and return a `json` object. <br/> Either way, you receive the parsed object in `request_call_back_func` if it is set; otherwise retrieve it via `PreParser(....).cached_request_datas`. |
| cached_data | bool | Whether to cache the parsed data. Default is `False`. |
| start_threading | bool | Whether to use a threading pool for parsing. Default is `False`. |
| threading_mode | `'map'` or `'single'` | How tasks are dispatched. Default is `'single'`. <br/> `map`: use the threading pool's `map` function to distribute tasks. <br/> `single`: use `submit` to hand tasks to the threading pool one by one. |
| stop_when_task_failed | bool | Whether to stop when a URL request fails. Default is `True`. |
| threading_numbers | int | The maximum number of threads in the threading pool. Default is `3`. |
| checked_same_site | bool | Whether to add extra header info to pretend the request comes from the same site, to get around `CORS` blocking. Default is `True`. |

## example

```python

# test.py
from preparser import PreParser, BeautifulSoup, Json_Data, Filer


def handle_preparser_result(url: str, preparser_object: BeautifulSoup | Json_Data) -> bool:
    # write whatever business logic you want here

    # attention:
    # the type of preparser_object depends on the `parser_mode` of the `PreParser`:
    #   'api'  : preparser_object is a Json_Data
    #   'html' : preparser_object is a BeautifulSoup
    ...

    # for the final return value:
    # return None to mark the current result as failed; otherwise return any non-None object.
    return preparser_object


if __name__ == "__main__":

    # start the parser
    url_list = [
        'https://example.com/api/1',
        'https://example.com/api/2',
        # .....
    ]

    parser = PreParser(
        url_list=url_list,
        request_call_back_func=handle_preparser_result,
        parser_mode='api',  # depends on what you need: 'api' or 'html'
        start_threading=True,
        threading_mode='single',
        cached_data=True,
        stop_when_task_failed=False,
        threading_numbers=3,
        checked_same_site=True
    )

    # start parsing
    parser.start_parse()

    # when all tasks are finished, you can get all the results like below:
    all_results = parser.cached_request_datas

    # if you want to terminate the parser early, just call the function below
    # parser.stop_parse()

    # you can also use the Filer to save the final result above,
    # then find the data in `result/test.json`
    filer = Filer('json')
    filer.write_data_into_file('result/test', [all_results])

```

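The example above uses `parser_mode='api'`. With `parser_mode='html'` the callback receives a `BeautifulSoup` object instead; here is a minimal sketch of such a callback (the name `handle_html_result` and the link-extraction logic are illustrative, not part of the package):

```python
from bs4 import BeautifulSoup


def handle_html_result(url: str, soup: BeautifulSoup):
    # collect the text of every link on the page;
    # returning None signals a failed result to PreParser
    titles = [a.get_text(strip=True) for a in soup.find_all("a")]
    return titles or None
```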
# Get Help

Get help ➡️ [Github issue](https://github.com/BertramYe/preparser/issues)

preparser/FileHelper.py

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
from typing import Literal, Any
from os import makedirs
from os.path import dirname
from json import dump


FileType = Literal['txt', 'json']

class Filer():
    """
    a simple file object to help process file results

    Parameters:
        file_type ( Literal['txt','json'] ): the file type to save or read; so far mainly `txt` and `json` are supported, but more will be added in the future
    """
    def __init__(self, file_type: FileType = 'txt') -> None:
        self.file_type: FileType = file_type

    def write_data_into_file(self, new_file_name: str, datas: list[Any], ensure_json_ascii: bool = False):
        """
        a function to help save the result from the PreParser object into a txt or json file

        Parameters:
            new_file_name ( str ): the filename to save the result to; if the file already exists, its content is emptied and `datas` is rewritten into it, otherwise a new file is created automatically.
            datas ( list[Any] ): the data to save, which should be a `list` object
            ensure_json_ascii ( bool ): for json file writing; the `PreParser` result does not need ensure_ascii (forcing it may leave non-ASCII content undecodable), so just keep the default False.
        """

        file_name = f'{new_file_name}.{self.file_type}'
        # get the directory part of the target path
        dir_path = dirname(file_name)

        # make sure the directory exists; create it if it does not
        # (guard against an empty dir_path, where makedirs would raise)
        if dir_path:
            makedirs(dir_path, exist_ok=True)
        try:
            print(f'begin to save datas into file: {file_name} !')
            if self.file_type == 'txt' or self.file_type == 'json':
                with open(file_name, 'w', encoding="utf-8") as file:
                    if self.file_type == 'json':
                        dump(datas, file, indent=4, ensure_ascii=ensure_json_ascii)
                    else:
                        file.writelines(datas)
                print(f'succeeded to save datas into file: {file_name} !')
            else:
                print(f'failed to save datas into file: {file_name}, as current file type is not supported, file_type: {self.file_type}')
        except Exception as err:
            print(f'failed to save datas into file: {file_name}, error: {err}')
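
A self-contained sketch of the saving flow that `write_data_into_file` implements for the `'json'` file type (stdlib only; the helper name `write_json` is illustrative, not part of the package):

```python
import json
import os


def write_json(new_file_name: str, datas: list, ensure_ascii: bool = False) -> str:
    # mirror of the Filer flow: append the extension,
    # create the parent directory if needed, then dump the JSON
    file_name = f"{new_file_name}.json"
    dir_path = os.path.dirname(file_name)
    if dir_path:  # makedirs('') would raise, so guard the bare-filename case
        os.makedirs(dir_path, exist_ok=True)
    with open(file_name, "w", encoding="utf-8") as f:
        json.dump(datas, f, indent=4, ensure_ascii=ensure_ascii)
    return file_name


path = write_json("result/demo", [{"url": "https://example.com/api/1", "ok": True}])
```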
