Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pd.read_html(...): Unable to handle malformed colspan / rowspan #59721

Open
3 tasks done
ahmedelq opened this issue Sep 5, 2024 · 4 comments
Open
3 tasks done

BUG: pd.read_html(...): Unable to handle malformed colspan / rowspan #59721

ahmedelq opened this issue Sep 5, 2024 · 4 comments
Labels
Bug Closing Candidate May be closeable, needs more eyeballs IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@ahmedelq
Copy link

ahmedelq commented Sep 5, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/Kyrgyz_alphabets")

Issue Description

The problem arises because of a malformed colspan found in a Wikipedia table having a value of 2; instead of 2.
image

The issue stems from this line:

colspan = int(self._attr_getter(td, "colspan") or 1)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 tables = pd.read_html("https://en.wikipedia.org/wiki/Kyrgyz_alphabets", converters=defaultdict(lambda: str))

File ~/.local/lib/python3.12/site-packages/pandas/io/html.py:1240, in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only, extract_links, dtype_backend, storage_options)
   1224 if isinstance(io, str) and not any(
   1225     [
   1226         is_file_like(io),
   (...)
   1230     ]
   1231 ):
   1232     warnings.warn(
   1233         "Passing literal html to 'read_html' is deprecated and "
   1234         "will be removed in a future version. To read from a "
   (...)
   1237         stacklevel=find_stack_level(),
   1238     )
-> 1240 return _parse(
   1241     flavor=flavor,
   1242     io=io,
   1243     match=match,
   1244     header=header,
   1245     index_col=index_col,
   1246     skiprows=skiprows,
   1247     parse_dates=parse_dates,
   1248     thousands=thousands,
   1249     attrs=attrs,
   1250     encoding=encoding,
   1251     decimal=decimal,
   1252     converters=converters,
   1253     na_values=na_values,
   1254     keep_default_na=keep_default_na,
   1255     displayed_only=displayed_only,
   1256     extract_links=extract_links,
   1257     dtype_backend=dtype_backend,
   1258     storage_options=storage_options,
   1259 )

File ~/.local/lib/python3.12/site-packages/pandas/io/html.py:1006, in _parse(flavor, io, match, attrs, encoding, displayed_only, extract_links, storage_options, **kwargs)
   1003     raise retained
   1005 ret = []
-> 1006 for table in tables:
   1007     try:
   1008         df = _data_to_frame(data=table, **kwargs)

File ~/.local/lib/python3.12/site-packages/pandas/io/html.py:250, in <genexpr>(.0)
    242 """
    243 Parse and return all tables from the DOM.
    244 
   (...)
    247 list of parsed (header, body, footer) tuples from tables.
    248 """
    249 tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
--> 250 return (self._parse_thead_tbody_tfoot(table) for table in tables)

File ~/.local/lib/python3.12/site-packages/pandas/io/html.py:465, in _HtmlFrameParser._parse_thead_tbody_tfoot(self, table_html)
    462         header_rows.append(body_rows.pop(0))
    464 header = self._expand_colspan_rowspan(header_rows, section="header")
--> 465 body = self._expand_colspan_rowspan(body_rows, section="body")
    466 footer = self._expand_colspan_rowspan(footer_rows, section="footer")
    468 return header, body, footer

File ~/.local/lib/python3.12/site-packages/pandas/io/html.py:521, in _HtmlFrameParser._expand_colspan_rowspan(self, rows, section)
    519     text = (text, href)
    520 rowspan = int(self._attr_getter(td, "rowspan") or 1)
--> 521 colspan = int(self._attr_getter(td, "colspan") or 1)
    523 for _ in range(colspan):
    524     texts.append(text)

ValueError: invalid literal for int() with base 10: '2;'

Expected Behavior

Expected behavior is to sanitize the malformed colspan value proceed with 2.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.5.final.0
python-bits : 64
OS : Linux
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.5.1
pip : 24.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.24.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.9.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

@ahmedelq ahmedelq added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 5, 2024
@ahmedelq ahmedelq changed the title BUG: pd.read_html(...): Unable to handle malformed colspan / rowspan while using BUG: pd.read_html(...): Unable to handle malformed colspan / rowspan Sep 6, 2024
@bruhnugget-nice
Copy link

I believe that we can use a bit of regex for this?

Here is my example:

import re

def convert_to_int(s):
    # Make the pattern
    match = re.search(r'\d+', s)
    if match:
        return int(match.group())
    else:
        raise ValueError("No numbers found")

@bruhnugget-nice
Copy link

Then we can do something like this:

s="2;"
r=convert_to_int(s)
print(r)

It should come out as 2!

@bruhnugget-nice
Copy link

However, can we also do this for rowspan?

@rhshadrach
Copy link
Member

Thanks for the report!

The problem arises because of a malformed colspan

I confirmed using the W3C validator that the value is indeed valid. I do not think pandas should guess at the intention of invalid HTML. Raising an error seems appropriate here.

@rhshadrach rhshadrach added IO HTML read_html, to_html, Styler.apply, Styler.applymap Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Closing Candidate May be closeable, needs more eyeballs IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

No branches or pull requests

3 participants