Character Issues

Question

Back Story

I retrieve strings from a database, alter some of the text in those strings, and then upload them back to the database, replacing the originals. After looking at the front end that displays those strings, I noticed the character issues. I no longer have the original strings, but I do have the updated strings.

The Issue

These strings contain characters from other languages, and they no longer display correctly. I looked at the code points, and it appears that an original character, which was one code point, is now two different code points.

"Je?ro^me" // 8 code points: 74, 101, 63, 114, 111, 94, 109, 101
"Jéróme"   // 6 code points: 74, 233, 114, 243, 109, 101

The question

How do I get "Je?ro^me" back to "Jéróme"?

Things that I have tried

  1. Used Notepad++ to convert the encoding to or from UTF-8, ANSI, and Windows-1252.
  2. Created a Map that looks for things like e? and converts them to é.

Issues with the two attempts to solve the problem

a. Attempt 1: the issue still existed after trying the different conversions.

b. Attempt 2 has two problems:

  1. I don't know all of the potential sequences (e?, o^, etc.) to look for. There are over 20,000 files that may cover many languages.
  2. What if I have a sentence that legitimately ends in e?

Things I researched to gain a better understanding of the issue

  1. https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
  2. https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
  3. https://www.w3.org/International/questions/qa-what-is-encoding
  4. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

MCVE

import java.util.HashMap;
import java.util.Map;

/**
 * https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
 * https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
 * https://www.w3.org/International/questions/qa-what-is-encoding
 * https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
 * @author sedri
 */
public class App {

    static String outputString;

    public static void main(String[] args) {
        // My approach to fix the issue:
        // Use a map to replace each damaged sequence with the correct character.
        // The output looks good, but I would need to include all special characters for many languages.
        // What if I have a sentence like: How old are thee?
        Map<String, String> map = new HashMap<>();
        map.put("e?", "é");
        map.put("o^", "ó");

        final String string = "Je?ro^me";
        final String accentString = "Jéróme";
        outputString = string;
        map.forEach((t, u) -> {
            if (outputString.contains(t)) {
                outputString = outputString.replace(t, u);
            }
        });
        System.out.println("Fixed output: " + outputString);
        System.out.println("");
        // End of my attempt at a solution.

        // Print the code points of the damaged string.
        System.out.println("code points: " + string.codePoints().count());
        for (int i = 0; i < string.length(); i++) {
            System.out.println(string.charAt(i) + ": " + Character.codePointAt(string, i));
        }
        System.out.println("");

        // Print the code points of the string as it should look.
        System.out.println("code points: " + accentString.codePoints().count());
        for (int i = 0; i < accentString.length(); i++) {
            System.out.println(accentString.charAt(i) + ": " + Character.codePointAt(accentString, i));
        }
        System.out.println("");

        // Print the code points of the map-repaired string.
        System.out.println("code points: " + outputString.codePoints().count());
        for (int i = 0; i < outputString.length(); i++) {
            System.out.println(outputString.charAt(i) + ": " + Character.codePointAt(outputString, i));
        }
        System.out.println("");
    }
}

Answer 1

Score: 2

The fact that one of your code points is 63 (a question mark) means that you won't be able to reliably revert that data to the original format. The ? can represent many different characters that weren't properly decoded, which means you've lost vital information for restoring the original characters.
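
To illustrate why the ? is unrecoverable, here is a minimal sketch. It assumes, purely for demonstration, that the damage came from encoding text into a charset that cannot represent the character (US-ASCII here); the question does not say which step actually caused it. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class LossyEncodingDemo {

    // Encode to US-ASCII and decode again. String.getBytes(Charset) silently
    // substitutes the charset's replacement byte ('?' for US-ASCII) for every
    // character the charset cannot represent.
    public static String roundTripAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String acute = "\u00e9"; // é
        String grave = "\u00e8"; // è

        System.out.println(roundTripAscii(acute)); // prints ?
        System.out.println(roundTripAscii(grave)); // prints ?

        // Two different originals collapse to the same output,
        // so nothing in the result says which one it was.
        System.out.println(roundTripAscii(acute).equals(roundTripAscii(grave))); // prints true
    }
}
```

Once two distinct characters map to the same code point 63, no later conversion can tell them apart.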

What you need to do is establish the correct encoding to use when you read from your database in the first place. Since you haven't posted the code where you read these strings, I can't tell you exactly how or where to do that.
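
As a rough sketch of what "establishing the encoding" means at the byte level: the example below uses an in-memory byte array as a stand-in for whatever the database returns, and assumes the stored bytes are UTF-8. Both are assumptions, since the reading code is not shown. It also shows that decoding with the wrong charset (ISO-8859-1) is reversible as long as no byte was replaced with ?.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Decode raw bytes with an explicitly chosen charset instead of
    // relying on the platform default.
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Undo a wrong ISO-8859-1 decode of UTF-8 bytes. This works because
    // ISO-8859-1 maps every byte value to exactly one character and back.
    public static String repairLatin1Mojibake(String wronglyDecoded) {
        byte[] raw = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "J\u00e9r\u00f3me"; // Jéróme
        // Stand-in for bytes read from the database (assumed UTF-8).
        byte[] stored = original.getBytes(StandardCharsets.UTF_8);

        String right = decodeUtf8(stored);
        String wrong = new String(stored, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // prints true
        System.out.println(wrong.length());         // prints 8: each accented letter became two characters
        System.out.println(repairLatin1Mojibake(wrong).equals(original)); // prints true
    }
}
```

The contrast with the ? case is the point: a wrong decode that keeps every byte can be undone, while a substitution with ? cannot.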

Hopefully the data in the DB itself hasn't already been corrupted by bad character encoding, or else you've already lost the information you need.

You might be able to partially repair such damage by doing things like replacing "o^" with "ó", but if, say, both "è" and "é" turn into "e?", you can never be sure which was which.
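
A partial repair along those lines might look like the sketch below. The replacement table is hypothetical and contains only the one sequence treated above as unambiguous; anything still containing ? is flagged for manual review rather than guessed at.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartialRepairDemo {

    // Hypothetical table: only sequences believed to have a single possible original.
    static final Map<String, String> UNAMBIGUOUS = new LinkedHashMap<>();
    static {
        UNAMBIGUOUS.put("o^", "\u00f3"); // ó
    }

    // Apply every unambiguous replacement; leave everything else untouched.
    public static String repair(String damaged) {
        String result = damaged;
        for (Map.Entry<String, String> entry : UNAMBIGUOUS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // A remaining '?' may be a lost character or a real question mark,
    // so the string needs a human to look at it.
    public static boolean needsReview(String s) {
        return s.indexOf('?') >= 0;
    }

    public static void main(String[] args) {
        String repaired = repair("Je?ro^me");
        System.out.println(repaired);              // prints Je?róme
        System.out.println(needsReview(repaired)); // prints true
    }
}
```

A sentence such as "How old are thee?" would also be flagged, which is intended: the code cannot know whether that ? is punctuation or damage.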

huangapple
  • Published on September 3, 2020, 01:45:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63710979.html