Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.0k views
in Technique[技术] by (71.8m points)

sqlite - Python Encoding - Could not decode to utf8

I have an sqlite database that was populated by an external program. Im trying to read the data with python. When I attempt to read the data I get the following error:

OperationalError: Could not decode to UTF-8

If I open the database in sqlite manager and look at the data in the offending record(s) using the inbuilt browse and search it looks fine, however if I export the table as csv, I notice the character £ in the offending records has become ?£

If I read the csv in python, the £ in the offending records is still read as ?£ but its not a problem I can parse this manually. However I need to be able to read the data direct from the database, without the intermediate step of converting to csv.

I have looked at some answers online for similar questions, I have so far tried setting "text_factory = str" and I have also tried changing the datatype of the column from TEXT to BLOB using sqlite manager, but still get the error.

My code below results in the OperationalError: Could not decode to UTF-8

conn = sqlite3.connect('test.db')
conn.text_factory = str
curr = conn.cursor()
curr.execute('''SELECT xml_dump FROM hands_1 LIMIT  5000  , 5001''')
row = curr.fetchone()

All the records above 5000 in the database have this character problem and hence produce the error.

Any help appreciated.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Python is trying to be helpful by converting pieces of text (stored as bytes in a database) into a python str object for you. In order to do this conversion, python has to guess what letter each byte (or group of bytes) returned by your query represents. The default guess is an encoding called utf-8. Obviously, this guess is wrong in your case.

The solution is to give python a little hint as to how to do the mapping from bytes to letters (i.e., unicode characters). You've already come close with the line

conn.text_factory = str

However (based on your response in the comments above), since you are using python 3, str is the default text factory, so that line will do nothing new for you (see the docs).

What happens behind the scenes with this line is that python tries to convert the bytes returned by the query using the str function, kind of like:

your_string = str(the_bytes, 'utf-8') # actually uses `conn.text_factory`, not `str`

...but you want a different encoding where 'utf-8' is. Since you can't change the default encoding of the str function, you will have to mimic it some other way. You can use a one-off nameless function called a lambda for this:

conn.text_factory = lambda x: str(x, 'latin1')

Now when the database is handing the bytes to python, python will try to map them to letters using the 'latin1' scheme instead of the 'utf-8' scheme. Of course, I don't know if latin1 is the correct encoding of your data. Realistically, you will have to try a handful of encodings to find the right one. I would try the following first:

  • 'iso-8859-1'
  • 'utf-16'
  • 'utf-32'
  • 'latin1'

You can find a more complete list here.

Another option is to simply let the bytes coming out of the database remain as bytes. Whether this is a good idea for you depends on your application. You can do it by setting:

conn.text_factory = bytes

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...