BUG: read_html does not properly structure some html table elements (possible rowspan or colspan issues)
#58461
Comments
@jowens If you use another tool for the extraction other than pandas, do you get a different result? |
I didn't implement any of this and haven't checked the implementation, but my guess is going to be that: a) reading a grid-based table is straightforward. Can anyone confirm? |
Suggestion for that other tool? I'm happy to try. |
@jowens A quick search on Google gives this html-extractor; I haven't used it, though (caveat). I asked the earlier question to see whether there is a tool that does it right that we can compare against. It seems @attack68 has looked into your question more and may have figured out the possible bug? |
Just for posterity, here's the specific Wikipedia revision we're discussing here, in case it gets edited: https://en.wikipedia.org/w/index.php?title=Template:AMD_Radeon_Pro_V_series&oldid=1220301074 and here's a gist where I extracted everything between https://gist.github.com/jowens/8e42fa17a5af4bc16284cfab56ef1473 |
html_table_extractor has similar behavior (same errors). Here's a quick test: https://gist.github.com/jowens/bd15b42accaa20e9c403af89719a5256 (which just has the table manually in the source code). Here's the last line of the output, which corresponds to what's in the issue description.
|
FWIW I just tested with |
So is the summary here that all tools you have tested for parsing this table, including pandas, return the same results, and that those results are all incorrect? |
Well, two tools (pandas and html_table_extractor), and those two tools return consistent but incorrect results, where incorrect is compared to how a web browser renders it. Since these two tools both (appear to) have different code that parses the table's cells / rowspans / colspans, it seems like a possibility that web browsers (I looked at Chrome/Firefox/Safari, each of which [I think] uses a different back end [Chromium/Gecko/WebKit]) might interpret the table differently than these two tools. Web browsers are surely more forgiving of HTML errors. |
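To make the rowspan/colspan discussion concrete, here is a minimal sketch of how spanned cells can be expanded into a rectangular grid. It uses only the Python standard library. It is not the pandas or html_table_extractor implementation, and it does not reproduce the exact error handling of any browser. The names `TableGridParser`, `expand_spans` and `SAMPLE` are hypothetical, and the sample table is made up for illustration (it is not the Wikipedia table from this issue).

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect the cells of a single table as (text, rowspan, colspan) per row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell tuples per <tr>
        self._cell = None  # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            # Missing or empty span attributes default to 1.
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell[0] = self._cell[0].strip()
            self.rows[-1].append(tuple(self._cell))
            self._cell = None


def expand_spans(html):
    """Return a list of rows in which every spanned cell is repeated per slot."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> cell text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = len(parser.rows)  # clip rowspans that run past the last <tr>
    n_cols = max(col for _, col in grid) + 1
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# Made-up table: "A" spans two rows, "B" spans two columns,
# "f" spans two rows, and "h" spans two columns.
SAMPLE = """
<table>
  <tr><th rowspan="2">A</th><th colspan="2">B</th><th>C</th></tr>
  <tr><td>d</td><td>e</td><td rowspan="2">f</td></tr>
  <tr><td>g</td><td colspan="2">h</td></tr>
</table>
"""

for row in expand_spans(SAMPLE):
    print(row)
```

The step that matters for this issue is the `while (r, c) in grid` loop: a cell in a later row has to skip every slot that an earlier rowspan already occupies. Parsers that handle that step differently, or that treat overlapping spans differently, can agree on simple tables and still disagree on tables like the one discussed here.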
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The bottom right of the ingested table puts entries in the wrong columns near the right side for the last two rows. I did some checking of the HTML source and even though it's got some complex rowspan and colspan directives, it appears to be properly constructed.
I acknowledge that I'm using a slightly older pandas than is installed, but I looked through recent issues on and checkins to read_html and I don't believe this is fixed/reported.
Expected Behavior
I expect the column called "Memory / L3 Cache" to only be populated in the last row.
I expect the two power entries in the last two rows to be placed in the "TDP" column.
Most of the right side of the bottom two rows is misplaced.
Installed Versions
INSTALLED VERSIONS
commit : bdc79c1
python : 3.12.3.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.5.1
pip : 24.0
Cython : 3.0.10
pytest : 7.4.3
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0.dev0+gb7d2f7ffc.d20240415
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None