How to recover symbols from utf-8

Question

Encoding in Python 2.7 is very hard to understand. Can someone explain how to get the symbols in this string?

Here is my unicode string:

my_str = u'MFADCINEMve000301119 FACTURE EFAD CIN\u2019troD+000000035165 EUR FACTURE EFAD CIN\u2019trop\xe9MA SAS 2019/10198'

I want to convert it so that "\u2019" and "\xe9" appear as the actual characters.

I already tried my_str.encode('utf-8'), but it gives me this:

'MFADCINEMve000301119 FACTURE EFAD CIN\xe2\x80\x99troD+000000035165 EUR FACTURE EFAD CIN\xe2\x80\x99trop\xc3\xa9MA SAS 2019/10198'

with other encoded symbols. I don't understand that; I just want to replace them with the ' and é symbols...

UPDATE:

[Image not preserved]

UPDATE 2:

Here is my code:

day = datetime.now().day
month = datetime.now().strftime("%b")
year = datetime.now().strftime("%Y")
filename = "ventes{0}{1}{2}.csv".format(day, month, year)

with io.open(filename, 'w', encoding='utf-8') as file_data:
    csvwriter = csv.writer(file_data, delimiter=',', quotechar="", quoting=csv.QUOTE_NONE)

    for line in res:
        csvwriter.writerow([x for x in line])  # The error below occurs here

file_data.seek(0)

out = base64.encodestring(file_data.read())

This raises the following error (the traceback is not very explicit):

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/openerp/http.py", line 546, in _handle_exception
    return super(JsonRequest, self)._handle_exception(exception)
  File "/usr/lib/python2.7/dist-packages/openerp/http.py", line 583, in dispatch
    result = self._call_function(**self.params)
  File "/usr/lib/python2.7/dist-packages/openerp/http.py", line 319, in _call_function
    return checked_call(self.db, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/openerp/service/model.py", line 118, in wrapper
    return f(dbname, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/openerp/http.py", line 316, in checked_call
    return self.endpoint(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/openerp/http.py", line 812, in __call__
    return self.method(*args, **kw)
  File "/usr/lib/python2.7/dist-packages/openerp/http.py", line 412, in response_wrap
    response = f(*args, **kw)
  File "/usr/lib/python2.7/dist-packages/openerp/addons/web/controllers/main.py", line 953, in call_button
    action = self._call_kw(model, method, args, {})
  File "/usr/lib/python2.7/dist-packages/openerp/addons/web/controllers/main.py", line 941, in _call_kw
    return getattr(request.registry.get(model), method)(request.cr, request.uid, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/openerp/api.py", line 268, in wrapper
    return old_api(self, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/openerp/api.py", line 399, in old_api
    result = method(recs, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/openerp/addons_eggs/adquat_export_CEGID/models/export_cegid.py", line 31, in validate
    move_ids = self._context.get('active_ids', [])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 136: ordinal not in range(128)

What's wrong with this code? Please help!

Answer 1

Score: 1

Python 2 by default displays string representations (repr()) as ASCII-only. Any character outside the ASCII range (0-127) is displayed as an escape code (\xnn or \unnnn). The character is only displayed correctly visually if you print the character, and then only if the terminal encoding and font support the character.

For example:

>>> s = u'\xe9'
>>> s             # This is a representation of the string useful for debugging.
u'\xe9'
>>> len(s)        # It is still only length 1.
1
>>> print(s)      # It displays correctly when printed.
é
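
The byte string in the question can be read the same way: the \xe2\x80\x99 and \xc3\xa9 sequences are only the debug representation of the UTF-8 bytes for ’ and é, not different characters. Here is a minimal sketch, assuming a shortened stand-in for the string in the question (the variable names are illustrative and do not come from the original post):

```python
# -*- coding: utf-8 -*-
import unicodedata

# Shortened stand-in for the string in the question (an assumption for brevity).
sample = u'CIN\u2019trop\xe9MA'

encoded = sample.encode('utf-8')   # Unicode text -> UTF-8 bytes
decoded = encoded.decode('utf-8')  # UTF-8 bytes -> the same Unicode text

# Each escape in the Unicode string names exactly one character.
names = [unicodedata.name(ch) for ch in (u'\u2019', u'\xe9')]

# If a plain ASCII apostrophe is wanted, replace the character itself;
# no re-encoding is involved.
plain_quote = sample.replace(u'\u2019', u"'")

print(repr(encoded))
print(names)
print(decoded == sample)
```

Encoding and then decoding with the same codec returns the original text, so nothing is lost by calling encode('utf-8'); only the representation shown at the prompt changes.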

My terminal's encoding doesn't support all Unicode characters by default, so your other example doesn't print. The debug representation does, however:

>>> s = u'\u2019'
>>> s
u'\u2019'
>>> len(s)
1
>>> print(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\dev\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 0: character maps to <undefined>
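
Whether print succeeds therefore depends on the codec in use, and that can be checked without involving the terminal at all. The sketch below assumes a hypothetical helper named can_encode and uses cp437 only because it is the code page shown in the traceback above:

```python
# -*- coding: utf-8 -*-

def can_encode(text, encoding):
    """Return True if every character of text can be encoded with the codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Shortened stand-in for the string in the question (an assumption for brevity).
sample = u'CIN\u2019trop\xe9MA'

# errors='replace' substitutes '?' for unencodable characters instead of raising.
fallback = sample.encode('cp437', 'replace')

print(can_encode(u'\xe9', 'cp437'))    # é exists in cp437
print(can_encode(u'\u2019', 'cp437'))  # the right single quotation mark does not
print(can_encode(sample, 'utf-8'))     # UTF-8 can encode any Unicode character
```

Using the 'replace' error handler is lossy, so it is more of a diagnostic aid than a fix.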

If you write a Unicode string to a file, you have to encode it. Open a file with the encoding you want and write the Unicode string. It's best to use UTF-8 as the encoding, as it supports all Unicode characters. Use io.open. It is compatible with Python 3 (which you should switch to ASAP) and supports the encoding parameter.

import io

my_str = u'MFADCINEMve000301119 FACTURE EFAD CIN\u2019troD+000000035165 EUR FACTURE EFAD CIN\u2019trop\xe9MA SAS 2019/10198'
with io.open('out.txt','w',encoding='utf8') as f:
    f.write(my_str)
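
To check that nothing is lost on disk, the file can be read back both as raw bytes and as decoded text. This is a sketch under stated assumptions: it writes to a temporary directory rather than out.txt, and uses a shortened sample string.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Shortened stand-in for the string in the question (an assumption for brevity).
my_str = u'CIN\u2019trop\xe9MA'

# Hypothetical location: a fresh temporary directory, so no real file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# Text mode with an explicit encoding: io.open encodes on write.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(my_str)

# Binary mode shows the bytes that were actually stored.
with io.open(path, 'rb') as f:
    raw = f.read()

# Text mode with the same encoding decodes them back to Unicode.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(repr(raw))
print(text == my_str)
```

The stored bytes are the same ones that my_str.encode('utf-8') returns, which is why the output in the question was already correct.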

Note you have to view the file in an editor that supports UTF-8. For example, on my terminal with its default cp437 encoding it looks like:

C:\>type out.txt
MFADCINEMve000301119 FACTURE EFAD CINΓÇÖtroD+000000035165 EUR FACTURE EFAD CINΓÇÖtrop├⌐MA SAS 2019/10198

But if I change the encoding to cp65001 (UTF-8):

C:\>chcp 65001
Active code page: 65001

C:\>type out.txt
MFADCINEMve000301119 FACTURE EFAD CIN’troD+000000035165 EUR FACTURE EFAD CIN’tropéMA SAS 2019/10198
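
The effect of viewing UTF-8 bytes through the wrong code page can be reproduced in Python by decoding them with cp437. This is a minimal sketch with illustrative variable names and a shortened sample string:

```python
# -*- coding: utf-8 -*-

# Shortened stand-in for the string in the question (an assumption for brevity).
original = u'CIN\u2019trop\xe9MA'

utf8_bytes = original.encode('utf-8')

# Decoding with the wrong codec turns each byte into an unrelated character.
misread = utf8_bytes.decode('cp437')

# The bytes are unchanged, so reversing the wrong step restores the text.
recovered = misread.encode('cp437').decode('utf-8')

print(repr(misread))
print(recovered == original)
```

The file itself is fine in both cases. Only the interpretation of its bytes differs between the two code pages.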

Answer 2

Score: -1

You just need to do print(my_str.encode('utf-8'))

This will give you the output:

> MFADCINEMve000301119 FACTURE EFAD CIN’troD+000000035165 EUR FACTURE EFAD CIN’tropéMA SAS 2019/10198

huangapple
  • Posted on January 4, 2020 at 00:06:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59581715.html