UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte #4

Open
angelo337 opened this issue Jun 21, 2016 · 2 comments

Comments

@angelo337

Hi there, I am trying to run your code, but I get this error every time and I am not sure how to solve it. Could you please help me out?
This is the output:

2016-06-20 21:49:27,802 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
/webdav/storage/wikipedia/title_tokens.txt.gz
Traceback (most recent call last):
  File "./prepare_shootout.py", line 158, in <module>
    corpus = ShootoutCorpus(gensim.utils.smart_open(preprocessed_file))
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/textcorpus.py", line 61, in __init__
    self.dictionary.add_documents(self.get_texts())
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 127, in add_documents
    self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 154, in doc2bow
    counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte
gzip: /webdav/storage/wikipedia/lsi_vectors.mm: No such file or directory

THANKS

@piskvorky
Owner

Looks like your title_tokens.txt.gz file contains invalid UTF-8. Can you check?

That is, set gensim and any machine learning aside. Just iterate through the gzip file and check that every line is valid UTF-8.
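
A minimal sketch of that check, using only the standard library. The function name `find_invalid_utf8_lines` is made up for illustration and is not part of gensim or sim-shootout; the code reads the gzip file as raw bytes and tries to decode each line on its own.

```python
import gzip


def find_invalid_utf8_lines(path, encoding="utf-8"):
    """Return (line_number, error_message) pairs for lines that fail to decode.

    Hypothetical helper for illustration; not part of gensim.
    """
    bad = []
    # Open in binary mode so that no decoding happens behind our back.
    with gzip.open(path, "rb") as fin:
        for lineno, raw in enumerate(fin, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                # Remember where the problem is and what the codec complained about.
                bad.append((lineno, str(err)))
    return bad
```

Calling `find_invalid_utf8_lines("/webdav/storage/wikipedia/title_tokens.txt.gz")` would list the offending line numbers; an empty list would mean every line decodes cleanly as UTF-8.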

@angelo337
Author

Thanks a lot, how can I fix that?

I am trying to test it on English and also on Spanish with accents (like á, é, í, ú, ó). I have to keep all the accents in Spanish, because a word with an accent is not the same as the word without it.

example:

si - if
sí - yes

mi - my
mí - me

el - the
él - he

tu - your
tú - you

thanks
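
One possible way to repair such a file while keeping the accents, sketched under a loud assumption: the thread does not establish which encoding the bad lines actually use, so the Latin-1 default below is a guess that would need to be verified against the real data. The function name `reencode_gzip` and its parameters are hypothetical. Lines that already decode as UTF-8 are copied unchanged; only the lines that fail are reinterpreted in the assumed source encoding and written back as UTF-8.

```python
import gzip


def reencode_gzip(src_path, dst_path, src_encoding="latin-1"):
    """Write a UTF-8 copy of a gzipped text file, preserving accented characters.

    Hypothetical helper; src_encoding is an ASSUMPTION about the bad lines.
    """
    with gzip.open(src_path, "rb") as fin, gzip.open(dst_path, "wb") as fout:
        for raw in fin:
            try:
                # Line is already valid UTF-8: keep it as is.
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Otherwise reinterpret the bytes in the assumed source encoding.
                text = raw.decode(src_encoding)
            fout.write(text.encode("utf-8"))
```

With this approach "sí" and "si" stay distinct tokens, because no character is dropped or replaced. If the real encoding is something other than Latin-1, the accents would come out wrong, so it is worth inspecting a few repaired lines by eye before rebuilding the corpus.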


