September 3, 2020 01:45:49

Character Issues

Question

Back Story

I retrieve strings from a database, alter some of the text in those strings, and then upload them back to the database, replacing the originals. After looking at the front-end that displays those strings, I noticed the character issues. I no longer have the original strings, but I do have the updated strings.

The Issue

These strings contain characters from other languages, and those characters no longer display correctly. I looked at the code points, and it appears that each original character, which was one code point, is now two different code points.

"Je?ro^me" // 8 code points: 74, 101, 63, 114, 111, 94, 109, 101
"Jéróme"   // 6 code points: 74, 233, 114, 243, 109, 101

The question

How do I get "Je?ro^me" back to "Jéróme"?

Things that I have tried

a. Used Notepad++ to convert the encoding to or from UTF-8, ANSI, and Windows-1252.
b. Created a Map that looks for sequences like e? and converts them to é.

Issues with the two attempts to solve the problem

a. The issue still existed after trying different conversions.

b. Two issues here:

I don't know all of the potential sequences (e?, o^, etc.) to look for. There are over 20,000 files that may cover many languages.
What if a sentence legitimately ends in "e?", such as "How old are thee?"
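
The second concern can be reproduced in a few lines. This is only a minimal sketch: the class `NaiveReplaceDemo` and its `repair` helper are hypothetical names introduced for illustration, and the two map entries are the ones from the attempt described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NaiveReplaceDemo {

    // Hypothetical helper: applies every "damaged -> repaired" pair to the input.
    static String repair(String input, Map<String, String> replacements) {
        String result = input;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> replacements = new LinkedHashMap<>();
        replacements.put("e?", "\u00E9"); // e? -> é
        replacements.put("o^", "\u00F3"); // o^ -> ó

        // The damaged name is repaired as intended.
        System.out.println(repair("Je?ro^me", replacements));

        // A genuine question mark after an "e" is rewritten as well,
        // because the replacement cannot tell damage from real punctuation.
        System.out.println(repair("How old are thee?", replacements));
    }
}
```

The second call turns the final "e?" of an ordinary English sentence into "é", which is exactly the false positive described above.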

Things I researched to gain a better understanding of the issue

MCVE

import java.util.HashMap;
import java.util.Map;

/**
 * https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
 * https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
 * https://www.w3.org/International/questions/qa-what-is-encoding
 * https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
 * @author sedri
 */
public class App {

    static String outputString;

    public static void main(String[] args) {

        // My approach to fix the issue:
        // use a map to replace each damaged sequence with the correct character.
        // The output looks good, but I would need to include all special characters for many languages.
        // What if I have a sentence like: How old are thee?
        Map<String, String> map = new HashMap<>();
        map.put("e?", "é");
        map.put("o^", "ó");

        final String string = "Je?ro^me";
        final String accentString = "Jéróme";
        outputString = string;
        map.forEach((t, u) -> {
            if (outputString.contains(t)) {
                outputString = outputString.replace(t, u);
            }
        });
        System.out.println("Fixed output: " + outputString);
        System.out.println("");

        // End of my attempt at a solution.

        // Print the code points of the damaged string.
        System.out.println("code points: " + string.codePoints().count());
        for (int i = 0; i < string.length(); i++) {
            System.out.println(string.charAt(i) + ": " + Character.codePointAt(string, i));
        }
        System.out.println("");

        // Print the code points of the expected string.
        System.out.println("code points: " + accentString.codePoints().count());
        for (int i = 0; i < accentString.length(); i++) {
            System.out.println(accentString.charAt(i) + ": " + Character.codePointAt(accentString, i));
        }
        System.out.println("");

        // Print the code points of the string after the map-based repair.
        System.out.println("code points: " + outputString.codePoints().count());
        for (int i = 0; i < outputString.length(); i++) {
            System.out.println(outputString.charAt(i) + ": " + Character.codePointAt(outputString, i));
        }
        System.out.println("");
    }
}

Answer 1

Score: 2

The fact that one of your code points is 63 (a question mark) means that you won't be able to reliably revert that data to the original format. The ? can represent many different characters that weren't properly decoded, which means you've lost vital information for restoring the original characters.

What you need to do is establish the correct encoding to use when you read from your database in the first place. Since you haven't posted the code where you read these strings, I can't tell you exactly how or where to do that.
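
Since the code that reads from the database isn't shown, the following is only a sketch of the general idea: name the charset explicitly wherever bytes become a `String`. The class `ExplicitCharsetDemo`, its `decode` helper, and the assumption that the stored bytes are UTF-8 are all illustrative, not taken from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Hypothetical helper: decodes raw bytes with a charset chosen by the caller
    // rather than relying on the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Stand-in for bytes fetched from storage; assumed here to be UTF-8.
        byte[] stored = "J\u00E9r\u00F3me".getBytes(StandardCharsets.UTF_8);

        String right = decode(stored, StandardCharsets.UTF_8);
        String wrong = decode(stored, StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset restores the 6 original code points.
        System.out.println(right.codePoints().count());
        // Decoding with a mismatched charset yields one character per byte (8 here).
        System.out.println(wrong.codePoints().count());
    }
}
```

With a real database, the equivalent setting usually lives in the driver or connection configuration, which depends on the database and driver in use.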

Hopefully the data in the DB itself hasn't already been corrupted by bad character encoding, or else you've already lost the information you need.

You might be able to partially repair such damage by doing things like replacing "o^" with "ó", but if, say, both "è" and "é" turn into "e?", you can never be sure which was which.
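
To illustrate how two different characters can collapse into the same damaged text, here is a small sketch of one possible mechanism. It is an assumption for demonstration only, not a diagnosis of what actually happened to the data: the text is decomposed (NFD) so that the accent becomes a separate combining code point, and is then encoded to ISO-8859-1, which cannot represent that code point and substitutes "?". The class `LossyEncodingDemo` and its `roundTrip` helper are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class LossyEncodingDemo {

    // Decomposes the text (NFD), then round-trips it through ISO-8859-1.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for ISO-8859-1.
    static String roundTrip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        byte[] encoded = decomposed.getBytes(StandardCharsets.ISO_8859_1);
        return new String(encoded, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String grave = roundTrip("\u00E8"); // è
        String acute = roundTrip("\u00E9"); // é

        System.out.println(grave);
        System.out.println(acute);
        // Both inputs end up as the same two characters, so the
        // original accent cannot be recovered from the output alone.
        System.out.println(grave.equals(acute));
    }
}
```

Once both inputs have become "e?", nothing in the stored text records which accent was there, which is why any replacement table can only guess.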
