UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte #4

angelo337 · 2016-06-21T12:04:00Z

Hi there. I am trying to run your code, but I get this error every time and I am not sure how to solve it. Could you please help me out?
This is the output:

2016-06-20 21:49:27,802 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
/webdav/storage/wikipedia/title_tokens.txt.gz
Traceback (most recent call last):
File "./prepare_shootout.py", line 158, in <module>
corpus = ShootoutCorpus(gensim.utils.smart_open(preprocessed_file))
File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/textcorpus.py", line 61, in __init__
self.dictionary.add_documents(self.get_texts())
File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 127, in add_documents
self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 154, in doc2bow
counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte
gzip: /webdav/storage/wikipedia/lsi_vectors.mm: No such file or directory

THANKS

piskvorky · 2016-06-21T12:41:53Z

Looks like your title_tokens.txt.gz file contains invalid utf8 -- can you check?

That is, ignore gensim and any machine learning. Just iterate through the gzip file and check that every line is valid utf8.
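
A minimal sketch of that check, for illustration only. It is written for Python 3 (the traceback above is from Python 2.7), and the function name `find_invalid_utf8_lines` is made up here; it is not part of gensim or sim-shootout. It reads the gzipped file as raw bytes and tries to decode each line as UTF-8, collecting the lines that fail.

```python
import gzip


def find_invalid_utf8_lines(path):
    """Return (line_number, raw_bytes, error) for every line that is not valid UTF-8.

    Hypothetical helper for illustration; not part of gensim or sim-shootout.
    """
    bad = []
    # Open in binary mode so nothing is decoded behind our back.
    with gzip.open(path, "rb") as fin:
        for lineno, raw in enumerate(fin, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                bad.append((lineno, raw, err))
    return bad
```

Calling `find_invalid_utf8_lines("/webdav/storage/wikipedia/title_tokens.txt.gz")` and printing the first few results should show which lines trip the decoder and at which byte offset, which is usually enough to guess what encoding the file was really written in.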

angelo337 · 2016-06-24T23:30:14Z

Thanks a lot. How can I fix that?

I am trying to test it on English and also on Spanish text with accents (like á, é, í, ó, ú). I have to keep all the accents in Spanish, because a word with an accent is not the same as the word without it.

example:

si - if
sí - yes

mi - my
mí - me

el - the
él - he

tu - your
tú - you

thanks
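
One possible way to repair the file while keeping the accents, sketched under a loud assumption: that the lines which fail to decode were written in a single-byte encoding such as Latin-1. The thread does not confirm this, so inspect the failing lines before trusting the result. The function name `reencode_to_utf8` and its `fallback_encoding` parameter are hypothetical, and the code targets Python 3.

```python
import gzip


def reencode_to_utf8(src_path, dst_path, fallback_encoding="latin-1"):
    """Copy a gzipped text file line by line, writing every line as UTF-8.

    Lines that are already valid UTF-8 are copied unchanged. Lines that are
    not are decoded with `fallback_encoding` (an assumption, not a detected
    fact) and re-encoded. Returns the number of lines that needed the fallback.
    """
    repaired = 0
    with gzip.open(src_path, "rb") as fin, gzip.open(dst_path, "wb") as fout:
        for raw in fin:
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Latin-1 maps every byte to a character, so this never fails,
                # but it yields wrong characters if the guess is wrong.
                text = raw.decode(fallback_encoding)
                repaired += 1
            fout.write(text.encode("utf-8"))
    return repaired
```

If the fallback guess is right, accented words such as "sí" and "mí" come out intact in the new file. If it is wrong, the output will be valid UTF-8 but contain garbled characters, so it is worth spot-checking a few repaired lines.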

