docs/evaluation.rst
21 additions & 2 deletions
@@ -18,7 +18,7 @@ There are comparable software solutions in Python, the following date extraction
 - `date_guesser <https://github.com/mitmedialab/date_guesser>`_ extracts publication dates from a web pages along with an accuracy measure (not used here),
 - `goose3 <https://github.com/goose3/goose3>`_ can extract information for embedded content,
 - `htmldate <https://github.com/adbar/htmldate>`_ is the software package described here, it is designed to extract original and updated publication dates of web pages,
-- `newspaper<https://github.com/codelucas/newspaper>`_ is mostly geared towards newspaper texts,
+- `newspaper4k <https://github.com/AndyTheFactory/newspaper4k>`_ (the maintained successor of newspaper3k) is mostly geared towards newspaper texts,
 - `news-please <https://github.com/fhamborg/news-please>`_ is a news crawler that extracts structured information.
 
 Two alternative packages are not tested here but could be used in addition:
@@ -36,7 +36,7 @@ Description
 
 **Time**: the execution time cannot be easily compared in all cases as some solutions perform a whole series of operations which are irrelevant to this task.
 
-**Errors:** *goose3*'s output isn't always meaningful and/or in a standardized format, these cases were discarded. *news-please* seems to have trouble with some encodings (e.g. in Chinese), in which case it leads to an exception.
+**Errors:** *goose3*'s output isn't always meaningful and/or in a standardized format, these cases were discarded.
 
 
 Results
@@ -45,6 +45,23 @@ Results
 The results below show that **date extraction is not a completely solved task** but one for which extractors have to resort to heuristics and guesses. The figures documenting recall and accuracy capture the real-world performance of the tools as the absence of a date output impacts the result.
This run uses a reviewed version of the ground-truth labels (publication-date corrections) and the maintained *newspaper4k* fork in place of the now-unmaintained *newspaper3k*.
Additional data for new pages in English collected by the `Data Culture Group <https://dataculturegroup.org>`_ at Northeastern University.
 
+The discussion below refers to the most recent run (top table), measured against a reviewed version of the publication-date labels.
+
 Precision describes if the dates given as output are correct: *goose3* fares well precision-wise but it fails to extract dates in a large majority of cases (poor recall). The difference in accuracy between *date_guesser* and *newspaper* is consistent with tests described on the `website of the former <https://github.com/mitmedialab/date_guesser>`_.
 
 It turns out that *htmldate* performs better than the other solutions overall. It is also noticeably faster than the strictly comparable packages (*articleDateExtractor* and most certainly *date_guesser*). Despite being measured on a sample, **the higher accuracy and faster processing time are highly significant**. Especially for smaller news outlets, websites and blogs, as well as pages written in languages other than English (in this case mostly but not exclusively German), *htmldate* greatly extends date extraction coverage without sacrificing precision.
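To make the distinction between precision, recall and accuracy in the passage above concrete, here is a minimal sketch in Python. It assumes every evaluated page has a reference date, so each tool result is either a correct date, a wrong date, or no date at all. The ``Counts`` structure, the ``scores`` helper and the numbers are hypothetical illustrations, not the project's evaluation code or its measured results.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-tool tallies over the evaluated pages (hypothetical structure)."""
    correct: int  # a date was returned and it matches the reference label
    wrong: int    # a date was returned but it differs from the label
    missing: int  # no date was returned at all


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely, returning 0.0 when there is nothing to count."""
    return numerator / denominator if denominator else 0.0


def scores(c: Counts) -> dict[str, float]:
    """Compute the three figures under the assumption stated above."""
    return {
        # Share of returned dates that are right; missing outputs are ignored.
        "precision": _ratio(c.correct, c.correct + c.wrong),
        # Share of pages for which the right date was found; wrong dates are ignored.
        "recall": _ratio(c.correct, c.correct + c.missing),
        # Share of all pages with the right date; wrong and missing both count against it.
        "accuracy": _ratio(c.correct, c.correct + c.wrong + c.missing),
    }


# Made-up counts for a cautious tool that rarely answers but is usually right.
cautious = Counts(correct=10, wrong=1, missing=89)
print(scores(cautious))
```

With counts like these, precision stays high while recall and accuracy drop, which is the pattern the text describes for a tool that fails to output a date in most cases.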
docs/index.rst
8 additions & 18 deletions
@@ -80,7 +80,7 @@ Features
 - URLs, HTML files, or HTML trees are given as input (includes batch processing)
 - Output as string in any date format (defaults to `ISO 8601 YMD <https://en.wikipedia.org/wiki/ISO_8601>`_)
 - Detection of both original and updated dates
-- Compatible with all recent versions of Python
+- Compatible with Python 3.10 and later
 
 
 ``htmldate`` can examine markup and text. It provides the following ways to date an HTML document:
@@ -94,7 +94,7 @@ Features
 
 The output is thoroughly verified in terms of plausibility and adequateness. If a valid date has been found the library outputs a date string corresponding to either the last update or the original publishing statement (the default), in the desired format.
 
-Markup-based extraction is multilingual by nature, text-based refinements for better coverage currently support German, English and Turkish.
+Markup-based extraction is multilingual by nature, text-based refinements for better coverage currently support English, French, German, Indonesian and Turkish.
 
 
 Installation
@@ -103,16 +103,16 @@ Installation
 Main package
 ~~~~~~~~~~~~
 
-This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.8 upwards. It is available on the package repository `PyPI <https://pypi.org/>`_ and can notably be installed with ``pip`` or ``pipenv``:
+This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.10 upwards. It is available on the package repository `PyPI <https://pypi.org/>`_ and can notably be installed with ``pip`` or ``pipenv``:
 
 .. code-block:: bash
 
-    $ pip install htmldate # pip3 install on systems where both Python 2 and 3 are installed
+    $ pip install htmldate
     $ pip install --upgrade htmldate # to make sure you have the latest version
     $ pip install git+https://github.com/adbar/htmldate.git # latest available code (see build status above)
 
 
-The last version to support Python 3.6 and 3.7 is ``htmldate==1.8.1``.
+The last version to support Python 3.6 and 3.7 is ``htmldate==1.8.1``; for Python 3.8 and 3.9 use the ``1.9.x`` series.
 
 
 Optional
@@ -131,16 +131,6 @@ The ``dateparser`` package is noticeably slower in its latest versions, version
 *For infos on dependency management of Python packages see* `this discussion thread <https://stackoverflow.com/questions/41573587/what-is-the-difference-between-venv-pyvenv-pyenv-virtualenv-virtualenvwrappe>`_.
 
 
-Experimental
-~~~~~~~~~~~~
-
-Experimental compilation with ``mypyc``, as using pre-compiled library may shorten processing speed:
-
-1. Install ``mypy``: ``pip3 install mypy``
-2. Compile the package: ``python setup.py --use-mypyc bdist_wheel``
-3. Use the newly created wheel: ``pip3 install dist/...``
-
-
 With Python
 -----------
 
@@ -162,7 +152,7 @@ In case the web page features easily readable metadata in the header, the extrac