Problem converting .docx file with to_html method and example from documentation #251

Open
@louloup22

Hello,

I followed the example from your documentation to convert my docx file to html:

from pydocx import PyDocX
# Pass in a path
html = PyDocX.to_html('file.docx')
# Pass in a file object
html = PyDocX.to_html(open('file.docx', 'rb'))
# Pass in a file-like object
from cStringIO import StringIO
buf = StringIO()
with open('file.docx') as f:
    buf.write(f.read())
html = PyDocX.to_html(buf)

As I am using Python 3.6, I changed cStringIO to io. However, I always get the same error with my .docx file at the line buf.write(f.read()):

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-24-598c617210d8> in <module>()
     10 buf = StringIO()
     11 with open('file.docx') as f:
---> 12     buf.write(f.read())
     13 html = PyDocX.to_html(buf)

~/anaconda3/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 14: invalid start byte

This happens with every .docx file I have tried. Can anybody suggest what is wrong?
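
One likely explanation, offered as a sketch rather than a confirmed fix: a .docx file is a ZIP archive, so its contents are binary. On Python 3, open('file.docx') with no mode opens the file in text mode and tries to decode it (here as UTF-8), which would produce this UnicodeDecodeError. io.StringIO also only accepts text. The version below opens the file with 'rb' and uses io.BytesIO instead. The helper name read_into_buffer is hypothetical and not part of PyDocX, and it is an assumption that PyDocX.to_html accepts a binary file-like object on Python 3 in the way the documentation example suggests.

```python
import io


def read_into_buffer(path):
    """Copy a file's raw bytes into an in-memory, file-like buffer.

    Hypothetical helper for illustration only; not part of PyDocX.
    """
    buf = io.BytesIO()
    # 'rb' stops Python 3 from trying to decode the binary .docx as text
    with open(path, 'rb') as f:
        buf.write(f.read())
    buf.seek(0)  # rewind so a consumer reads from the start
    return buf


# Intended usage (assumes pydocx is installed and accepts this buffer):
# from pydocx import PyDocX
# html = PyDocX.to_html(read_into_buffer('file.docx'))
```

If this is the cause, the first two forms in the documentation example (passing a path, or a file object opened with 'rb') should not hit the same error, since neither decodes the file as text.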
