Using Traditional Chinese with AWS DynamoDB
Question
I have a mobile app that stores data in DynamoDB tables. A group of users in Taiwan attempted to store their names in the database, but when the data is stored it becomes garbled. I have researched this and see that it is because DynamoDB uses UTF-8 encoding while Traditional Chinese uses the Big5 text encoding. How do I set up DynamoDB so that it will store and recall the proper characters?
Answer 1
Score: 1
So you start with a string in your head. It's a sequence of Unicode characters. There's no inherent byte encoding to the characters. The same string could be encoded into bytes in a variety of ways. Big5 is one. UTF-8 is another.
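To make that concrete, here's a small Python sketch (the name 陳大文 is just a hypothetical example) showing one string producing two different byte sequences, both of which decode back to the same text:

```python
# The same Unicode string can be encoded into bytes in different ways.
text = "陳大文"  # a hypothetical Traditional Chinese name

big5_bytes = text.encode("big5")
utf8_bytes = text.encode("utf-8")

print(big5_bytes)  # different byte sequences...
print(utf8_bytes)

# ...but each decodes back to the identical string
assert big5_bytes.decode("big5") == text
assert utf8_bytes.decode("utf-8") == text
```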
When you say that Traditional Chinese uses Big5, that's not entirely true. It may be commonly encoded in Big5, but it could be in UTF-8 instead, and UTF-8 has this cool property that it can encode all Unicode character sequences. That's why it's become the standard encoding for situations where you don't want to optimize for one character set.
So your challenge is to make sure you carefully control the characters and encodings so that you're sending UTF-8 sequences to DynamoDB. The standard SDKs will do this correctly as long as you're creating the strings as ordinary native strings in them.
You also have to make sure you're not confusing yourself when you look at the data. If you look at UTF-8 bytes but interpret them as Big5, it's going to look like gibberish, or vice versa.
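You can reproduce that mojibake effect directly. In this sketch (again using a hypothetical name), the bytes are perfectly fine UTF-8; they only look corrupted when misread as Big5:

```python
# Bytes are only meaningful relative to an encoding.
name = "陳大文"  # hypothetical name
utf8_bytes = name.encode("utf-8")

# Misreading UTF-8 bytes as Big5 yields garbage (some byte pairs
# aren't even valid Big5, hence errors="replace").
garbled = utf8_bytes.decode("big5", errors="replace")
print(garbled)          # looks like gibberish
assert garbled != name  # the text only *looks* corrupted

# Reading the same bytes with the correct codec recovers the string.
assert utf8_bytes.decode("utf-8") == name
```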
You don't say how they're loading the data. If they're starting with a file, that could be the source of the problem. You'd want to read the file while telling your language it's Big5-encoded; then you'll have the string version, and you can send that and rely on the SDK to correctly translate it to UTF-8 on the wire.
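For example, in Python the fix is to declare the encoding when opening the file. This sketch simulates the hypothetical Big5 input file and reads it back as a proper string:

```python
import os
import tempfile

# Suppose the users' names arrived in a Big5-encoded text file
# (hypothetical scenario). Simulate one on disk:
name = "陳大文"
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "wb") as f:
    f.write(name.encode("big5"))

# Read it while declaring the encoding, so you get a real str
# object rather than misdecoded bytes.
with open(path, encoding="big5") as f:
    decoded = f.read()

assert decoded == name
# `decoded` is now a plain Unicode string; hand that to the AWS SDK
# and it will serialize it as UTF-8 on the wire.
```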
I remember when I first learned this stuff it was all kind of confusing. The thing to remember is that a capital A exists as an idea (and is a defined character in Unicode), and there are lots of mechanisms you could use to put that letter into ones and zeros on disk. Each of those ways is an encoding. ASCII is popular, EBCDIC was a contender from the past, and UTF-16 is yet another contender now. Traditional Chinese is a character set (a set of characters), and you can encode each of those characters in a bunch of ways too. It's just a question of how you map characters to bits and bytes and back again.
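That "one abstract letter, many byte layouts" idea is easy to see in code. A minimal sketch (using Python's built-in codecs; `cp500` is one EBCDIC code page):

```python
# One abstract character, several byte representations:
letter = "A"
print(letter.encode("ascii"))      # b'A'      -> 0x41
print(letter.encode("cp500"))      # 0xC1 in EBCDIC (code page 500)
print(letter.encode("utf-16-le"))  # b'A\x00'  -> 0x41 0x00
```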
Comments