Commit 483f769

fix: update eval and docs, check code robustness
1 parent a066bd4 commit 483f769

21 files changed

Lines changed: 168 additions & 154 deletions

.readthedocs.yaml

Lines changed: 6 additions & 2 deletions
@@ -6,9 +6,9 @@ version: 2

 # Set the OS, Python version and other tools you might need
 build:
-  os: ubuntu-22.04
+  os: ubuntu-24.04
   tools:
-    python: "3.11"
+    python: "3.13"
     # You can also specify other tool versions:
     # nodejs: "20"
     # rust: "1.70"
@@ -33,3 +33,7 @@ sphinx:
 python:
   install:
     - requirements: docs/requirements.txt
+    # install the checked-out source so autodoc and the version reflect this
+    # branch/tag rather than the released package from PyPI
+    - method: pip
+      path: .

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 ## Changelog

+## 1.10.0
+- maintenance: modernize typing, packaging and code
+- evaluation: review and correct benchmark ground-truth labels, update and speed up alternatives
+- performance: stable day-granular cache key and reduced copying
+- fixes: preserve tails in element cleaning
+
 ## 1.9.4
 - maintenance: remove LXML version constraint (#184)

README.md

Lines changed: 12 additions & 11 deletions
@@ -54,7 +54,7 @@ $ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
 YMD](https://en.wikipedia.org/wiki/ISO_8601)).
 - Detection of both original and updated dates.
 - Multilingual.
-- Compatible with all recent versions of Python.
+- Compatible with Python 3.10 and later.

 ### How it works

@@ -77,31 +77,32 @@ Finally, the output is validated and converted to the chosen format.

 ## Performance

-1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
+1000 web pages containing identifiable dates (as of 2026-06-01 on Python 3.13)

 | Python Package | Precision | Recall | Accuracy | F-Score | Time |
 | -------------- | --------- | ------ | -------- | ------- | ---- |
-| articleDateExtractor 0.20 | 0.803 | 0.734 | 0.622 | 0.767 | 5x |
-| date_guesser 2.1.4 | 0.781 | 0.600 | 0.514 | 0.679 | 18x |
-| goose3 3.1.17 | 0.869 | 0.532 | 0.493 | 0.660 | 15x |
-| htmldate\[all\] 1.6.0 (fast) | **0.883** | 0.924 | 0.823 | 0.903 | **1x** |
-| htmldate\[all\] 1.6.0 (extensive) | 0.870 | **0.993** | **0.865** | **0.928** | 1.7x |
-| newspaper3k 0.2.8 | 0.769 | 0.667 | 0.556 | 0.715 | 15x |
-| news-please 1.5.35 | 0.801 | 0.768 | 0.645 | 0.784 | 34x |
+| articleDateExtractor 0.20 | 0.846 | 0.745 | 0.656 | 0.792 | 3x |
+| date_guesser 2.1.4 | 0.832 | 0.611 | 0.544 | 0.705 | 11x |
+| goose3 3.1.21 | **0.930** | 0.568 | 0.545 | 0.706 | 14x |
+| htmldate\[all\] 1.10.0 (fast) | 0.924 | 0.927 | 0.861 | 0.925 | **1x** |
+| htmldate\[all\] 1.10.0 (extensive) | 0.908 | **0.993** | **0.903** | **0.949** | 1.8x |
+| newspaper4k 0.9.5 | 0.912 | 0.728 | 0.680 | 0.810 | 2.5x |
+| news-please 1.6.16 | 0.845 | 0.777 | 0.680 | 0.810 | 29x |

 For the complete results and explanations see [evaluation
 page](https://htmldate.readthedocs.io/en/latest/evaluation.html).

 ## Installation

 Htmldate is tested on Linux, macOS and Windows systems, it is compatible
-with Python 3.8 upwards. It can notably be installed with `pip` (`pip3`
+with Python 3.10 upwards. It can notably be installed with `pip` (`pip3`
 where applicable) from the PyPI package repository:

 - `pip install htmldate`
 - (optionally) `pip install htmldate[speed]`

-The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`.
+The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`; for
+Python 3.8 and 3.9 use the `1.9.x` series.

 ## Documentation

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
 # -- Project information -----------------------------------------------------

 project = 'htmldate'
-copyright = '2023, <a href="https://adrien.barbaresi.eu/">Adrien Barbaresi</a>'
+copyright = '2017-2026, <a href="https://adrien.barbaresi.eu/">Adrien Barbaresi</a>'
 author = 'Adrien Barbaresi'

 # -- General configuration ---------------------------------------------------

docs/evaluation.rst

Lines changed: 21 additions & 2 deletions
@@ -18,7 +18,7 @@ There are comparable software solutions in Python, the following date extraction
 - `date_guesser <https://github.com/mitmedialab/date_guesser>`_ extracts publication dates from a web pages along with an accuracy measure (not used here),
 - `goose3 <https://github.com/goose3/goose3>`_ can extract information for embedded content,
 - `htmldate <https://github.com/adbar/htmldate>`_ is the software package described here, it is designed to extract original and updated publication dates of web pages,
-- `newspaper <https://github.com/codelucas/newspaper>`_ is mostly geared towards newspaper texts,
+- `newspaper4k <https://github.com/AndyTheFactory/newspaper4k>`_ (the maintained successor of newspaper3k) is mostly geared towards newspaper texts,
 - `news-please <https://github.com/fhamborg/news-please>`_ is a news crawler that extracts structured information.

 Two alternative packages are not tested here but could be used in addition:
@@ -36,7 +36,7 @@ Description

 **Time**: the execution time cannot be easily compared in all cases as some solutions perform a whole series of operations which are irrelevant to this task.

-**Errors:** *goose3*'s output isn't always meaningful and/or in a standardized format, these cases were discarded. *news-please* seems to have trouble with some encodings (e.g. in Chinese), in which case it leads to an exception.
+**Errors:** *goose3*'s output isn't always meaningful and/or in a standardized format, these cases were discarded.


 Results
@@ -45,6 +45,23 @@ Results
 The results below show that **date extraction is not a completely solved task** but one for which extractors have to resort to heuristics and guesses. The figures documenting recall and accuracy capture the real-world performance of the tools as the absence of a date output impacts the result.


+================================ ========= ========= ========= ========= =======
+1000 web pages containing identifiable dates (as of 2026-06-01 on Python 3.13)
+--------------------------------------------------------------------------------
+Python Package                   Precision Recall    Accuracy  F-Score   Time
+================================ ========= ========= ========= ========= =======
+articleDateExtractor 0.20        0.846     0.745     0.656     0.792     3x
+date_guesser 2.1.4               0.832     0.611     0.544     0.705     11x
+goose3 3.1.21                    **0.930** 0.568     0.545     0.706     14x
+htmldate[all] 1.10.0 (fast)      0.924     0.927     0.861     0.925     **1x**
+htmldate[all] 1.10.0 (extensive) 0.908     **0.993** **0.903** **0.949** 1.8x
+newspaper4k 0.9.5                0.912     0.728     0.680     0.810     2.5x
+news-please 1.6.16               0.845     0.777     0.680     0.810     29x
+================================ ========= ========= ========= ========= =======
+
+This run uses a reviewed version of the ground-truth labels (publication-date corrections) and the maintained *newspaper4k* fork in place of the now-unmaintained *newspaper3k*.
+
+
 =============================== ========= ========= ========= ========= =======
 1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
 -------------------------------------------------------------------------------
@@ -62,6 +79,8 @@ news-please 1.5.35 0.801 0.768 0.645 0.784 34x

 Additional data for new pages in English collected by the `Data Culture Group <https://dataculturegroup.org>`_ at Northeastern University.

+The discussion below refers to the most recent run (top table), measured against a reviewed version of the publication-date labels.
+
 Precision describes if the dates given as output are correct: *goose3* fares well precision-wise but it fails to extract dates in a large majority of cases (poor recall). The difference in accuracy between *date_guesser* and *newspaper* is consistent with tests described on the `website of the former <https://github.com/mitmedialab/date_guesser>`_.

 It turns out that *htmldate* performs better than the other solutions overall. It is also noticeably faster than the strictly comparable packages (*articleDateExtractor* and most certainly *date_guesser*). Despite being measured on a sample, **the higher accuracy and faster processing time are highly significant**. Especially for smaller news outlets, websites and blogs, as well as pages written in languages other than English (in this case mostly but not exclusively German), *htmldate* greatly extends date extraction coverage without sacrificing precision.
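
As a quick sanity check on the updated table, the F-Score column can be recomputed from the precision and recall columns. The sketch below assumes that "F-Score" means the usual F1 (harmonic mean of precision and recall), which this diff does not state explicitly. The names `ROWS`, `f1` and `max_deviation` are illustrative and not part of htmldate.

```python
# Recompute F1 from the precision/recall figures of the 2026-06-01 table.
# Assumption: the "F-Score" column is F1, the harmonic mean of precision and recall.

# package -> (precision, recall, reported F-Score), copied from the table above
ROWS = {
    "articleDateExtractor 0.20": (0.846, 0.745, 0.792),
    "date_guesser 2.1.4": (0.832, 0.611, 0.705),
    "goose3 3.1.21": (0.930, 0.568, 0.706),
    "htmldate[all] 1.10.0 (fast)": (0.924, 0.927, 0.925),
    "htmldate[all] 1.10.0 (extensive)": (0.908, 0.993, 0.949),
    "newspaper4k 0.9.5": (0.912, 0.728, 0.810),
    "news-please 1.6.16": (0.845, 0.777, 0.810),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def max_deviation(rows: dict) -> float:
    """Largest absolute gap between the recomputed and the reported F-Score."""
    return max(abs(f1(p, r) - reported) for p, r, reported in rows.values())


if __name__ == "__main__":
    for name, (p, r, reported) in ROWS.items():
        print(f"{name:34} computed={f1(p, r):.3f} reported={reported:.3f}")
```

Under that assumption the recomputed values stay within 0.001 of the reported column, which is what one would expect when the inputs are themselves rounded to three decimals.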

docs/index.rst

Lines changed: 8 additions & 18 deletions
@@ -80,7 +80,7 @@ Features
 - URLs, HTML files, or HTML trees are given as input (includes batch processing)
 - Output as string in any date format (defaults to `ISO 8601 YMD <https://en.wikipedia.org/wiki/ISO_8601>`_)
 - Detection of both original and updated dates
-- Compatible with all recent versions of Python
+- Compatible with Python 3.10 and later


 ``htmldate`` can examine markup and text. It provides the following ways to date an HTML document:
@@ -94,7 +94,7 @@ Features

 The output is thoroughly verified in terms of plausibility and adequateness. If a valid date has been found the library outputs a date string corresponding to either the last update or the original publishing statement (the default), in the desired format.

-Markup-based extraction is multilingual by nature, text-based refinements for better coverage currently support German, English and Turkish.
+Markup-based extraction is multilingual by nature, text-based refinements for better coverage currently support English, French, German, Indonesian and Turkish.


 Installation
@@ -103,16 +103,16 @@ Installation
 Main package
 ~~~~~~~~~~~~

-This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.8 upwards. It is available on the package repository `PyPI <https://pypi.org/>`_ and can notably be installed with ``pip`` or ``pipenv``:
+This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.10 upwards. It is available on the package repository `PyPI <https://pypi.org/>`_ and can notably be installed with ``pip`` or ``pipenv``:

 .. code-block:: bash

-    $ pip install htmldate # pip3 install on systems where both Python 2 and 3 are installed
+    $ pip install htmldate
     $ pip install --upgrade htmldate # to make sure you have the latest version
     $ pip install git+https://github.com/adbar/htmldate.git # latest available code (see build status above)


-The last version to support Python 3.6 and 3.7 is ``htmldate==1.8.1``.
+The last version to support Python 3.6 and 3.7 is ``htmldate==1.8.1``; for Python 3.8 and 3.9 use the ``1.9.x`` series.


 Optional
@@ -131,16 +131,6 @@ The ``dateparser`` package is noticeably slower in its latest versions, version
 *For infos on dependency management of Python packages see* `this discussion thread <https://stackoverflow.com/questions/41573587/what-is-the-difference-between-venv-pyvenv-pyenv-virtualenv-virtualenvwrappe>`_.


-Experimental
-~~~~~~~~~~~~
-
-Experimental compilation with ``mypyc``, as using pre-compiled library may shorten processing speed:
-
-1. Install ``mypy``: ``pip3 install mypy``
-2. Compile the package: ``python setup.py --use-mypyc bdist_wheel``
-3. Use the newly created wheel: ``pip3 install dist/...``
-
-
 With Python
 -----------

@@ -162,7 +152,7 @@ In case the web page features easily readable metadata in the header, the extrac
 .. code-block:: python

     >>> find_date('https://creativecommons.org/about/')
-    '2017-08-11' # has been updated since
+    '2017-08-11' # may change
     >>> find_date('https://creativecommons.org/about/', extensive_search=False)
     >>>

@@ -189,7 +179,7 @@ Change the output to a format known to Python's ``datetime`` module, the default
 .. code-block:: python

     >>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
-    '18 November 2016' # may have changed since
+    '18 November 2016' # may change


 Original vs. updated dates
@@ -200,7 +190,7 @@ Although the time delta between original publication and "last modified" info is
 .. code-block:: python

     >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True) # modified behavior
-    '2016-06-23'
+    '2016-06-23' # may change

 For more information see `options page <options.html>`_.

docs/options.rst

Lines changed: 5 additions & 7 deletions
@@ -27,15 +27,15 @@ An external module can be used for download, as described in versions anterior t
     >>> import requests
     >>> r = requests.get('https://creativecommons.org/about/')
     >>> find_date(r.text)
-    '2017-11-28' # may have changed since
+    '2017-11-28' # may change
     # using htmldate's own fetch_url function
     >>> from htmldate.utils import fetch_url
     >>> htmldoc = fetch_url('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/')
     >>> find_date(htmldoc)
-    '2018-06-28'
+    '2018-06-28' # may change
     # or simply
     >>> find_date('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/') # URL detected
-    '2018-06-28'
+    '2018-06-28' # may change


 Date format
@@ -46,7 +46,7 @@ Change the output to a format known to Python's ``datetime`` module, the default
 .. code-block:: python

     >>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
-    '18 November 2016' # may have changed since
+    '18 November 2016' # may change
     >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html', outputformat='%Y-%m-%dT%H:%M:%S%z')
     '2016-12-23T05:11:00-0500'

@@ -62,7 +62,7 @@ Although the time delta between the original publication and the "last modified"
 .. code-block:: python

     >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/') # default setting
-    '2019-06-24'
+    '2019-06-24' # may change
     >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True) # modified behavior
     '2016-06-23'

@@ -77,8 +77,6 @@ See ``settings.py`` file:
    :show-inheritance:
    :undoc-members:

-The module can then be re-compiled locally to apply changes to the settings.
-

 Clearing caches
 ~~~~~~~~~~~~~~~
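
The `outputformat` strings used in the docs/options.rst examples above are ordinary `strftime` patterns. As a minimal sketch using only the standard library (it does not call htmldate, and the `render` helper is hypothetical), this shows how the patterns from the documentation render a known timestamp. The default pattern `%Y-%m-%d` is taken here as the "ISO 8601 YMD" default the docs mention.

```python
from datetime import datetime, timedelta, timezone


def render(moment: datetime, outputformat: str = "%Y-%m-%d") -> str:
    """Format a datetime with a strftime pattern (hypothetical helper, not htmldate API)."""
    return moment.strftime(outputformat)


# Timestamp matching the blog.python.org example, with a UTC-05:00 offset.
released = datetime(2016, 12, 23, 5, 11, 0, tzinfo=timezone(timedelta(hours=-5)))

print(render(released))                         # year-month-day pattern
print(render(released, "%Y-%m-%dT%H:%M:%S%z"))  # full timestamp with UTC offset
print(render(released, "%d %B %Y"))             # month name depends on the active locale
```

Because `%B` is locale-dependent, documentation output such as `'18 November 2016'` presumes an English locale.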

docs/requirements.txt

Lines changed: 2 additions & 3 deletions
@@ -1,4 +1,3 @@
 # version required
-sphinx>=8.1.3
-# without version specifier
-htmldate
+sphinx>=9.1.0
+# htmldate itself is installed from the repo root (see .readthedocs.yaml)

htmldate/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 __author__ = "Adrien Barbaresi"
 __license__ = "Apache-2.0"
 __copyright__ = "Copyright 2017-present, Adrien Barbaresi"
-__version__ = "1.9.4"
+__version__ = "1.10.0"


 import logging

htmldate/cli.py

Lines changed: 2 additions & 2 deletions
@@ -81,13 +81,13 @@ def process_args(args: argparse.Namespace) -> None:
         if args.URL:
             htmlstring = fetch_url(args.URL)
             if htmlstring is None:
-                sys.exit(f"No data for URL: {args.URL}" + "\n")
+                sys.exit(f"No data for URL: {args.URL}\n")
         # unicode check
         else:
             try:
                 htmlstring = sys.stdin.read()
             except UnicodeDecodeError as err:
-                sys.exit(f"Wrong buffer encoding: {str(err)}" + "\n")
+                sys.exit(f"Wrong buffer encoding: {err}\n")
         result = cli_examine(htmlstring, args)
         if result is not None:
             sys.stdout.write(result + "\n")
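
The two `sys.exit` edits in this hunk look behavior-preserving: an f-string placeholder already formats an exception the same way `str()` does, and folding the newline into the literal yields the same string. A minimal sketch to check this, assuming nothing about htmldate beyond the two lines shown in the diff (the `exit_message` helper and the sample exception are made up for illustration):

```python
import sys


def exit_message(call) -> str:
    """Run a callable that is expected to call sys.exit and return its message."""
    try:
        call()
    except SystemExit as exc:
        # sys.exit(<str>) stores the string in SystemExit.code
        return str(exc.code)
    raise AssertionError("sys.exit was not called")


# Sample exception standing in for a failed sys.stdin.read()
err = UnicodeDecodeError("utf-8", b"\xff", 0, 1, "invalid start byte")

old = exit_message(lambda: sys.exit(f"Wrong buffer encoding: {str(err)}" + "\n"))
new = exit_message(lambda: sys.exit(f"Wrong buffer encoding: {err}\n"))

print(old == new)
```

When `sys.exit` receives a string, the interpreter prints it to stderr and exits with status 1, so the message text is the only thing these lines could have changed.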
