Commit 5c20e0e

V1
This is a complete rewrite of the library to use xmltodict and pydantic

Notable changes:
- Ditched bs4
- Now using xmltodict and pydantic
- Removed limit option
- Parser now uses classmethods

* 6cc67f9 Uncomment ci stuff
* fb139f0 Add better Tag docs
* 9b14325 Fix tests after refactor
* e0b6a3a Rewrite Parser to classmethods, add basic docs
* 7708c77 Update Tag docstring and run doctests in ci.yml
* 3130ca1 Rename RSSFeed->RSS, RSSBaseModel->XMLBaseModel
* 8f763d5 Scrap all of the wrap/unwrap work Improve conftest fixtures Add support for self-closing tags Set every field to be a Tag Add json/dict_plain and tests for it Ignore unused imports for all inits
* e9e841a Update sample jsons
* fc02cf1 Add wrap/unwrap population tests
* e02a007 Add tests for wrap/unwrap chaining (renamed from with/without)
* c436ce4 Add autogenerated dunder methods to Tag
* c88388c Fix windows charmap for tests
* 329765a Fix datetime tests
* 2147f9a Remove push rule from ci until V2 is done
* 1e44298 Add with/without_tags factory to all schemas
* bd31f3c Fix tests with item, add apology_line tests
* d5a80f4 Add items to channel [WIP]
* 49db408 Add datetime comparison tests Refactor CI a bit Allow schema object mutation Add current and future todos Add IPython to dev deps Clean up README a bit [WIP] Add more rss samples for test
* 5a2fcb4 Remove 3.10 syntax
* a07aa9c bump setup python to v4
* 955b1ff Fix 3.12 version
* b9d64c6 Replace flake8 with ruff
* 908d2b0 Fix ci.yml
* dd75c66 Update cron
* 461eb82 Add no category attr test, remove unused file
* c99b985 More updates to V2
* 1a1d20e Backup before os reinstall
* 2cad195 Temp commit, reword later
* e96faba Intermediate commit, added models, fixing linting and them
2 parents a13e8fa + 6cc67f9 commit 5c20e0e

35 files changed: +3308 -323 lines changed

.flake8

Lines changed: 0 additions & 5 deletions
This file was deleted.

.github/dependabot.yml

Lines changed: 0 additions & 8 deletions
This file was deleted.

.github/workflows/ci.yml

Lines changed: 11 additions & 13 deletions
@@ -1,21 +1,22 @@
11
name: Lint and test
22

33
on:
4+
schedule:
5+
- cron: "0 0 1 * *"
6+
# TODO: Uncomment after V2 is finished
47
push:
58
paths-ignore:
6-
- '.github/**'
7-
- '!.github/workflows/ci.yml'
8-
- '.gitignore'
9-
- 'README.md'
9+
- ".gitignore"
10+
- "README.md"
1011
pull_request:
1112

1213
jobs:
13-
build:
14+
test:
1415
strategy:
1516
max-parallel: 6
1617
matrix:
1718
os: [ "ubuntu-latest", "windows-latest", "macos-latest" ]
18-
python-version: [ 3.7, 3.8, 3.9, '3.10' ]
19+
python-version: [ "3.8", "3.9", "3.10", "3.11"]
1920

2021
runs-on: ${{ matrix.os }}
2122

@@ -27,7 +28,7 @@ jobs:
2728

2829
- name: Set up Python ${{ matrix.python-version }} on ${{ matrix.os }}
2930
id: setup-python
30-
uses: actions/setup-python@v3
31+
uses: actions/setup-python@v4
3132
with:
3233
python-version: ${{ matrix.python-version }}
3334
cache-dependency-path: pyproject.toml
@@ -37,14 +38,11 @@ jobs:
3738
if: steps.setup-python.outputs.cache-hit != 'true'
3839
run: poetry install
3940

40-
- name: Lint code with flake8
41-
run: poetry run flake8
42-
4341
- name: Lint code with black
4442
run: poetry run black --check .
4543

46-
- name: Lint code with isort
47-
run: poetry run isort --check-only .
44+
- name: Lint code with ruff
45+
run: poetry run ruff check .
4846

4947
- name: Test code with pytest
50-
run: poetry run pytest
48+
run: poetry run pytest --doctest-modules

.github/workflows/publish_to_pypi.yml

Lines changed: 4 additions & 3 deletions
@@ -1,9 +1,10 @@
1-
name: Publish Package to PyPI with poetry
1+
name: Publish to PyPI
22

33
on:
44
push:
55
tags:
6-
- 'v*'
6+
- "v*"
7+
# TODO: Only on CI success
78

89
jobs:
910
build-and-test-publish:
@@ -13,4 +14,4 @@ jobs:
1314
- name: Build and publish to pypi
1415
uses: JRubics/[email protected]
1516
with:
16-
pypi_token: ${{ secrets.pypi_password }}
17+
pypi_token: ${{ secrets.pypi_password }}

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -113,4 +113,4 @@ venv.bak/
113113
.mypy_cache/
114114

115115
.rss-parser
116-
poetry.lock
116+
.ruff_cache

.pre-commit-config.yaml

Lines changed: 6 additions & 18 deletions
@@ -14,7 +14,7 @@ repos:
1414
- repo: local
1515

1616
hooks:
17-
- id: black
17+
- id: black-format-staged
1818
name: black
1919
entry: poetry
2020
args:
@@ -23,26 +23,14 @@ repos:
2323
language: system
2424
types: [ python ]
2525
stages: [ commit ]
26-
# Black should use the config from the pyproject.toml file
2726

28-
- id: isort
29-
name: isort
27+
- id: ruff-check-global
28+
name: ruff
3029
entry: poetry
3130
args:
3231
- run
33-
- isort
32+
- ruff
33+
- check
3434
language: system
3535
types: [ python ]
36-
stages: [ commit ]
37-
# isort's config is also stored in pyproject.toml
38-
39-
- id: flake8
40-
name: flake8
41-
entry: poetry
42-
args:
43-
- run
44-
- flake8
45-
language: system
46-
always_run: true
47-
pass_filenames: false
48-
stages: [ push ]
36+
stages: [ commit, push ]

README.md

Lines changed: 134 additions & 14 deletions
@@ -10,11 +10,12 @@
1010
[![License](https://img.shields.io/pypi/l/rss-parser?color=success)](https://github.com/dhvcc/rss-parser/blob/master/LICENSE)
1111
[![GitHub Pages](https://badgen.net/github/status/dhvcc/rss-parser/gh-pages?label=docs)](https://dhvcc.github.io/rss-parser#documentation)
1212

13-
[![Pypi publish](https://github.com/dhvcc/rss-parser/workflows/Pypi%20publish/badge.svg)](https://github.com/dhvcc/rss-parser/actions?query=workflow%3A%22Pypi+publish%22)
13+
![CI](https://github.com/dhvcc/rss-parser/actions/workflows/ci.yml/badge.svg?branch=master)
14+
![PyPi publish](https://github.com/dhvcc/rss-parser/actions/workflows/publish_to_pypi.yml/badge.svg?branch=master)
1415

1516
## About
1617

17-
`rss-parser` is typed python RSS parsing module built using `BeautifulSoup` and `pydantic`
18+
`rss-parser` is a typed Python RSS parsing module built using [pydantic](https://github.com/pydantic/pydantic) and [xmltodict](https://github.com/martinblech/xmltodict)
1819

1920
## Installation
2021

@@ -27,34 +28,153 @@ or
2728
```bash
2829
git clone https://github.com/dhvcc/rss-parser.git
2930
cd rss-parser
30-
pip install .
31+
poetry build
32+
pip install dist/*.whl
3133
```
3234

3335
## Usage
3436

37+
### Quickstart
38+
3539
```python
3640
from rss_parser import Parser
3741
from requests import get
3842

39-
rss_url = "https://feedforall.com/sample.xml"
40-
xml = get(rss_url)
43+
rss_url = "https://rss.art19.com/apology-line"
44+
response = get(rss_url)
4145

42-
# Limit feed output to 5 items
43-
# To disable limit simply do not provide the argument or use None
44-
parser = Parser(xml=xml.content, limit=5)
45-
feed = parser.parse()
46+
rss = Parser.parse(response.text)
4647

47-
# Print out feed meta data
48-
print(feed.language)
49-
print(feed.version)
48+
# Print out rss meta data
49+
print("Language", rss.channel.language)
50+
print("RSS", rss.version)
5051

5152
# Iteratively print feed items
52-
for item in feed.feed:
53+
for item in rss.channel.items:
5354
print(item.title)
54-
print(item.description)
55+
print(item.description[:50])
56+
57+
# Language en
58+
# RSS 2.0
59+
# Wondery Presents - Flipping The Bird: Elon vs Twitter
60+
# <p>When Elon Musk posted a video of himself arrivi
61+
# Introducing: The Apology Line
62+
# <p>If you could call a number and say you’re sorry
63+
```
64+
65+
Here we can see that the description still contains `<p>` tags. This is because the content is wrapped in a [CDATA](https://www.w3resource.com/xml/CDATA-sections.php) section, like so
66+
67+
```xml
68+
<![CDATA[<p>If you could call ...</p>]]>
69+
```
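
If the markup inside the CDATA section is not wanted, it can be stripped after parsing. The following is a minimal sketch using only the standard library, not something `rss-parser` provides: `TextExtractor` and `strip_html` are hypothetical helper names, and the sample string is a shortened version of the description above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of an HTML fragment (assumed helper)."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.parts)


xml = "<description><![CDATA[<p>If you could call a number</p>]]></description>"

# The XML parser hands back the CDATA payload as plain text, tags included
raw = ET.fromstring(xml).text
clean = strip_html(raw)

print(raw)
print(clean)
```

The same `strip_html` call could be applied to `item.description.content` from the quickstart, assuming the content is a plain string.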
70+
71+
### Overriding schema
72+
73+
If you want to customize the schema or provide a custom one - use `schema` keyword argument of the parser
74+
75+
```python
76+
from rss_parser.models import XMLBaseModel
77+
from rss_parser.models.rss import RSS
78+
from rss_parser.models.types import Tag
79+
80+
class CustomSchema(RSS, XMLBaseModel):
81+
channel: None = None # Removing previous channel field
82+
custom: Tag[str]
83+
84+
with open("tests/samples/custom.xml") as f:
85+
data = f.read()
86+
87+
rss = Parser.parse(data, schema=CustomSchema)
88+
89+
print("RSS", rss.version)
90+
print("Custom", rss.custom)
91+
92+
# RSS 2.0
93+
# Custom Custom tag data
94+
```
95+
96+
### xmltodict
97+
98+
This library uses [xmltodict](https://github.com/martinblech/xmltodict) to parse XML data. You can see the detailed documentation [here](https://github.com/martinblech/xmltodict#xmltodict)
99+
100+
The basic thing you should know is that your data is processed into dictionaries
101+
102+
For example, this data
103+
104+
```xml
105+
<tag>content</tag>
106+
```
107+
108+
will result in the following
109+
110+
```python
111+
{
112+
"tag": "content"
113+
}
114+
```
115+
116+
*But*, when handling attributes, the content of the tag will also be a dictionary
117+
118+
```xml
119+
<tag attr="1" data-value="data">content</tag>
120+
```
121+
122+
Turns into
123+
124+
```python
125+
{
126+
"tag": {
127+
"@attr": "1",
128+
"@data-value": "data",
129+
"#text": "content"
130+
}
131+
}
132+
```
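
To make the convention concrete, here is a minimal sketch that mimics it with the standard library. It is not xmltodict's implementation, `element_to_dict` is a hypothetical helper, and it only covers leaf elements that contain text.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Mimic the xmltodict convention for a leaf element (hypothetical helper)."""
    text = (element.text or "").strip()
    if not element.attrib:
        # No attributes: the tag maps straight to its text content
        return {element.tag: text}
    # Attributes present: names get an "@" prefix, text moves under "#text"
    content = {f"@{name}": value for name, value in element.attrib.items()}
    content["#text"] = text
    return {element.tag: content}


plain = element_to_dict(ET.fromstring("<tag>content</tag>"))
with_attrs = element_to_dict(
    ET.fromstring('<tag attr="1" data-value="data">content</tag>')
)

print(plain)
print(with_attrs)
```

The point of the sketch is the shape change: the same tag yields a string in one case and a dictionary in the other, so a schema field has to accept both.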
133+
134+
### Tag field
135+
136+
This is a generic field that handles tags as raw data or a dictionary returned with attributes
137+
138+
*Although this is a complex class, it forwards most of the methods to its content attribute, so you won't notice a difference if you're only after the .content value*
139+
140+
Example
55141

142+
```python
143+
from rss_parser.models import XMLBaseModel
144+
class Model(XMLBaseModel):
145+
number: Tag[int]
146+
string: Tag[str]
147+
148+
m = Model(
149+
number=1,
150+
string={'@attr': '1', '#text': 'content'},
151+
)
152+
153+
m.number.content == 1 # Content value is an integer, as per the generic type
154+
155+
m.number.content + 10 == m.number + 10 # But you're still able to use the Tag itself in common operators
156+
157+
m.number.bit_length() == 1 # As it's the case for methods/attributes not found in the Tag itself
158+
159+
type(m.number), type(m.number.content) == (<class 'rss_parser.models.image.Tag[int]'>, <class 'int'>) # types are NOT the same, however, the interfaces are very similar most of the time
160+
161+
m.number.attributes == {} # The attributes are empty by default
162+
163+
m.string.attributes == {'attr': '1'} # But are populated when provided. Note that the @ symbol is trimmed from the beginning, however, camelCase is not converted
164+
165+
# Generic argument types are handled by pydantic - let's try to provide a string for a Tag[int] number
166+
167+
m = Model(number='not_a_number', string={'@customAttr': 'v', '#text': 'str tag value'}) # This will lead to the following traceback
168+
169+
# Traceback (most recent call last):
170+
# ...
171+
# pydantic.error_wrappers.ValidationError: 1 validation error for Model
172+
# number -> content
173+
# value is not a valid integer (type=type_error.integer)
56174
```
57175

176+
**If you wish to avoid all of the method/attribute forwarding "magic" - you should use `rss_parser.models.types.TagRaw`**
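
The forwarding behaviour can be pictured with a small sketch. This is an illustration of the general technique under stated assumptions, not the library's `Tag` class: `ForwardingTag` and `from_raw` are hypothetical names, there is no pydantic validation, and only two operators are forwarded.

```python
class ForwardingTag:
    """Hypothetical wrapper that forwards unknown attributes to its content."""

    def __init__(self, content, attributes=None):
        self.content = content
        self.attributes = attributes or {}

    @classmethod
    def from_raw(cls, raw):
        # Dictionaries carry "@"-prefixed attributes and the text under "#text"
        if isinstance(raw, dict):
            attributes = {k[1:]: v for k, v in raw.items() if k.startswith("@")}
            return cls(raw.get("#text"), attributes)
        return cls(raw)

    def __getattr__(self, name):
        # Only called when normal lookup fails, so content/attributes win
        if name == "content":
            raise AttributeError(name)  # Guard against infinite recursion
        return getattr(self.content, name)

    def __add__(self, other):
        return self.content + other

    def __eq__(self, other):
        return self.content == other


number = ForwardingTag.from_raw(1)
string = ForwardingTag.from_raw({"@attr": "1", "#text": "content"})

print(number + 10)          # Operator forwarded to the wrapped int
print(number.bit_length())  # Method found on the content, not the wrapper
print(string.attributes)    # "@" prefix trimmed from attribute names
print(string.upper())       # str method forwarded to the content
```

Operators need explicit dunder methods because Python looks them up on the type and bypasses `__getattr__`, which is presumably why the commit log mentions autogenerated dunder methods on `Tag`.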
177+
58178
## Contributing
59179

60180
Pull requests are welcome. For major changes, please open an issue first
