BUG: thousands separator in read_html alters data even though converter is set for a specific column #39005

Open
2 of 3 tasks
GGegenhuber opened this issue Jan 6, 2021 · 0 comments
Labels
Bug, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)

GGegenhuber commented Jan 6, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

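def convert_date_fix(date_str):
  # Used in ex3: the converter receives the date with its dots already
  # stripped (05.01.2021 arrives as 5012021), so the format matches that form.
  return pd.to_datetime(date_str, format='%d0%m%Y')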

html_table = """
<table class="gridview" style="width:600px;border-collapse:collapse;" cellspacing="0" cellpadding="4" border="0">
  <tbody>
    <tr>
      <th scope="col" align="left">Portfolio</th>
      <th scope="col" align="right">Waehrung</th>
      <th scope="col" align="right">Volumen</th>
      <th scope="col" align="right">Datum</th>
    </tr>
    <tr>
        <td align="left">Aktie A</td>
        <td align="right">EUR</td>
        <td align="right">20.320.945,77</td>
        <td align="right">05.01.2021</td>
    </tr>
    <tr>
      <td align="left">Aktie B</td>
      <td align="right">EUR</td>
      <td align="right">4.133.996,41</td>
      <td align="right">05.01.2021</td>
    </tr>
    <tr>
        <td align="left">Aktie C</td>
        <td align="right">EUR</td>
        <td align="right">3.855.218,70</td>
        <td align="right">05.01.2021</td>
      </tr>
  </tbody>
</table>"""

# ex1: the date string is converted to int64 because '.' is treated as the thousands separator (05.01.2021 becomes 5012021)
data_frame = pd.read_html(html_table, thousands='.', decimal=',')[0]
print(data_frame)
print(data_frame.dtypes)

# ex2: the date string is converted to object, but the string is altered in the same way as above (05.01.2021 becomes 5012021)
data_frame = pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : str})[0]
print(data_frame)
print(data_frame.dtypes)

# ex3: ugly fix that reverts the conversion and actually turns the column into a date
data_frame = pd.read_html(html_table, thousands='.', decimal=',', converters={'Datum' : convert_date_fix})[0]
print(data_frame)
print(data_frame.dtypes)

Problem description

I set the thousands and decimal separator parameters to match the number format that is commonly used in Germany and that appears in my HTML table example.

ex1: although the date column is wrongly interpreted as an integer, the behaviour is understandable since no converter was defined for that column.

ex2 (the actual problem/bug): when I set a converter for the date column, thereby defining it as a string, I'd expect the parser to leave the value untouched and return '05.01.2021'. Instead, it still interprets the string as an integer and applies the same conversion as above.

ex3: even when a custom function is set as the converter, the original value is still altered, and the integer conversion has to be reverted in my helper function.
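The `%d0%m%Y` format used in ex3 relies on a literal `0` sitting between the day and the month, so it only fits months 01 to 09. For October to December the stripped digits either fail to match or can be read as a different date. A somewhat more robust revert is sketched below, on the assumption that the converter receives the date's digits with the dots stripped (with or without the leading zero). The name `revert_mangled_date` is made up for this illustration, and the sketch uses the standard library's `datetime` rather than `pd.to_datetime`.

```python
from datetime import datetime


def revert_mangled_date(value):
    """Rebuild a dd.mm.yyyy date from its digits after the dots were stripped.

    Assumption: `value` is a str or int such as '5012021' or 25122021.
    """
    # A dd.mm.yyyy date always has eight digits; restore a lost leading zero.
    digits = str(value).zfill(8)
    return datetime.strptime(digits, "%d%m%Y")


print(revert_mangled_date("5012021"))   # 2021-01-05 00:00:00
print(revert_mangled_date(25122021))    # 2021-12-25 00:00:00
```

It could be passed as `converters={'Datum': revert_mangled_date}` in place of `convert_date_fix`. It remains a workaround for the behaviour reported here, not a fix.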

Expected Output

When a converter is defined for a specific column, the original cell value should be passed to the conversion function, and the thousands-separator handling meant for numeric values should not be applied to that column.
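The precedence I have in mind can be modelled in a few lines of plain Python. This is only an illustration of the expected semantics, not how pandas is implemented: `convert_cell` and its parameters are hypothetical names, and the numeric fallback is deliberately simplistic.

```python
def convert_cell(column, raw_text, converters, thousands=".", decimal=","):
    """Toy model: a column's converter wins over separator handling."""
    # A user-supplied converter sees the untouched cell text.
    if column in converters:
        return converters[column](raw_text)
    # Only columns without a converter get the numeric interpretation.
    candidate = raw_text.replace(thousands, "").replace(decimal, ".")
    try:
        return float(candidate)
    except ValueError:
        return raw_text  # not a number, keep the text as it was


row = {
    "Portfolio": "Aktie A",
    "Waehrung": "EUR",
    "Volumen": "20.320.945,77",
    "Datum": "05.01.2021",
}
converters = {"Datum": str}

parsed = {name: convert_cell(name, text, converters) for name, text in row.items()}
print(parsed["Volumen"])  # 20320945.77
print(parsed["Datum"])    # 05.01.2021
```

With this ordering, ex2 would return '05.01.2021' unchanged while Volumen is still parsed as a number.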

Additional improvement/feature request

Adapt read_html to accept a date_parser, as read_csv does, which would make it easier to parse dates in various formats automatically. The read_html function currently has a boolean parse_dates param (same as read_csv) but, unlike read_csv, does not support conversion of unconventional date formats.
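Without such a parameter, one option is to convert the columns explicitly after loading. The sketch below assumes a frame whose cells still hold the original strings (built by hand here as `raw`; how to get read_html to leave them untouched is not shown and may depend on the pandas version). The helper name `parse_german_numbers` is made up for this example.

```python
import pandas as pd

# Assumption: the cells are still the raw strings from the HTML table.
raw = pd.DataFrame({
    "Portfolio": ["Aktie A", "Aktie B", "Aktie C"],
    "Volumen": ["20.320.945,77", "4.133.996,41", "3.855.218,70"],
    "Datum": ["05.01.2021", "05.01.2021", "05.01.2021"],
})


def parse_german_numbers(column):
    """Drop '.' thousands separators, then turn the ',' decimal mark into '.'."""
    cleaned = column.str.replace(".", "", regex=False)
    cleaned = cleaned.str.replace(",", ".", regex=False)
    return pd.to_numeric(cleaned)


converted = raw.copy()
converted["Volumen"] = parse_german_numbers(raw["Volumen"])
# An explicit format avoids any guessing about day/month order.
converted["Datum"] = pd.to_datetime(raw["Datum"], format="%d.%m.%Y")

print(converted)
print(converted.dtypes)
```

Each column is handled on its own, so the separator logic never touches the date column.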

See also #10684 or this Stack Overflow post.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : German_Austria.1252

pandas : 1.2.0
numpy : 1.20.0rc2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@GGegenhuber added the Bug and Needs Triage labels on Jan 6, 2021
@simonjayhawkins added the IO HTML label and removed the Needs Triage label on Jan 26, 2021
@simonjayhawkins added this to the Contributions Welcome milestone on Jan 26, 2021
@mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022