Hi there, I am trying to run your code, but I get this error every time and I am not sure how to solve it. Could you please help me out?
This is the output:
2016-06-20 21:49:27,802 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
/webdav/storage/wikipedia/title_tokens.txt.gz
Traceback (most recent call last):
File "./prepare_shootout.py", line 158, in <module>
corpus = ShootoutCorpus(gensim.utils.smart_open(preprocessed_file))
File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/textcorpus.py", line 61, in __init__
self.dictionary.add_documents(self.get_texts())
File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 127, in add_documents
self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 154, in doc2bow
counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte
gzip: /webdav/storage/wikipedia/lsi_vectors.mm: No such file or directory
THANKS
I am trying to test it on English and also on Spanish with accents (like á, é, í, ú, ó). I have to keep all the accents in Spanish, because a word with an accent is not the same as the word without it.
example:
si - if
sí - yes
mi - my
mí - me
el - the
él - he
tu - your
tú - you
thanks
From: Radim Rehurek
Sent: Tuesday, June 21, 2016 7:41 AM
To: piskvorky/sim-shootout
Cc: Angelo Rodriguez; Author
Subject: Re: [piskvorky/sim-shootout] UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte (#4)
Looks like your title_tokens.txt.gz file contains invalid utf8 -- can you check?
That is, ignore gensim and any machine learning. Just iterate through the zip file, check that every line is in utf8.
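The check suggested above could look roughly like the sketch below. It is only an illustration, not part of sim-shootout or gensim: the function name `find_invalid_utf8_lines` and the `max_report` parameter are made up for this example, and it assumes the file is a gzip archive with one document per line. It reads the raw bytes and tries to decode each line as UTF-8, recording where decoding fails.

```python
import gzip


def find_invalid_utf8_lines(path, max_report=20):
    """Return (line_number, byte_offset, raw_line) for lines that are not valid UTF-8.

    Hypothetical helper for illustration only; stops after `max_report` bad lines.
    """
    bad = []
    # Open in binary mode so nothing is decoded implicitly.
    with gzip.open(path, "rb") as handle:
        for line_no, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset of the first offending byte in this line.
                bad.append((line_no, err.start, raw))
                if len(bad) >= max_report:
                    break
    return bad
```

Calling `find_invalid_utf8_lines("/webdav/storage/wikipedia/title_tokens.txt.gz")` and printing the returned tuples should show which lines need to be re-encoded. If the list comes back empty, the file itself is valid UTF-8 and the problem is probably elsewhere.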