Java String: Treating/converting system native character encoding

Question

When accessing Windows system resources (related to audio), I found that Windows provides the description strings of those resources in its own charset, while Java treats them the way it treats all strings by default: as Unicode. So, instead of sensible text, I got a bunch of question marks:

????????? ???????? ???????

Using the String.codePointAt() method, I discovered that these question marks actually hide some text in Windows-1252 encoding, which of course I would like to see. And so my crusade to convert this string into something readable began.

Half a day later, after rummaging through Stack Overflow and Google for related topics, I had made some progress, but that only led to more questions. So, here's my code:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import javax.sound.sampled.AudioSystem;


public class Study_Encoding {
    
    //private static final Charset utf8Charset = Charset .forName ("UTF-8");
    private static final Charset win1251Charset = Charset .forName ("Windows-1251");
    private static final Charset win1252Charset = Charset .forName ("Windows-1252");
    
    public static void main(String[] args) {
        
        String str = AudioSystem .getMixerInfo () [0] .getName ();
        
        System .out .println ("Original string:");
        System .out .println (str + "\n");
        
        System .out .println ("Its code-points:");
        displayCodePointSequence (str);
        
        System .out .println ("Windows-1251-decoded byte array (wrong):");
        byte [] win1251ByteArr = str .getBytes (win1251Charset);
        displayByteSequence (win1251ByteArr);
        
        System .out .println ("Windows-1252-decoded byte array (right):");
        byte [] win1252ByteArr = str .getBytes (win1252Charset);
        displayByteSequence (win1252ByteArr);
        
        System .out .println ("Windows-1252-encoded string (wrong):");
        try {
            System .out .println (win1252Charset .newDecoder ()
                    .decode (ByteBuffer .wrap (win1252ByteArr)) .toString () + "\n");
        } catch (Exception e) {
            System .out .println ("ERROR:" + e .toString ());
        }
        
        System .out .println ("Windows-1251-encoded string (right):");
        try {
            System .out .println (win1251Charset .newDecoder ()
                    .decode (ByteBuffer .wrap (win1252ByteArr)) .toString () + "\n");
        } catch (Exception e) {
            System .out .println ("ERROR:" + e .toString ());
        }
    }
    
    private static void displayCodePointSequence (String str) {
        
        if (null == str) {
            System .out .println ("No string");
            return;
        }
        if (str .isEmpty ()) {
            System .out .println ("Empty string");
            return;
        }
        for (int k = 0; str .length () > k; ++k) {
            System .out .print (str .codePointAt (k) + " ");
        }
        System .out .println ("[" + str .length () + "]\n");
    }
    
    private static void displayByteSequence (byte [] byteArr) {
        
        if (null == byteArr) {
            System .out .println ("No array");
            return;
        }
        if (0 == byteArr .length) {
            System .out .println ("Empty array");
            return;
        }
        for (int k = 0; byteArr .length > k; ++k) {
            System .out .print ((((int) byteArr [k]) & 0xFF) + " ");
        }
        System .out .println ("[" + byteArr .length + "]\n");
    }
}

This program produces the following output (the last line is what I wanted to get all along):

Original string:
????????? ???????? ???????

Its code-points:
207 229 240 226 232 247 237 251 233 32 231 226 243 234 238 226 238 233 32 228 240 224 233 226 229 240 [26]

Windows-1251-decoded byte array (wrong):
63 63 63 63 63 63 63 63 63 32 63 63 63 63 63 63 63 63 32 63 63 63 63 63 63 63 [26]

Windows-1252-decoded byte array (right):
207 229 240 226 232 247 237 251 233 32 231 226 243 234 238 226 238 233 32 228 240 224 233 226 229 240 [26]

Windows-1252-encoded string (wrong):
????????? ???????? ???????

Windows-1251-encoded string (right):
Первичный звуковой драйвер

As anyone can see, the win1251 and win1252 encodings got mixed up for some reason. Also, I guess there is a way to make a Java program treat all strings as being in some native encoding (which I DO NOT WANT!!!), or at least to treat system-provided strings that way. So...

...my questions are:

  1. How do I convert a string? (Which I've solved, I guess.)
  2. What's going on? (With the mixed charsets and everything else.)
  3. How do I do it right? (String acquisition or, failing that, string conversion.)

EDIT:

It seems I have not made it clear: I'm not talking about the contents of text files, but about system-provided strings such as the names and descriptions of devices (physical and virtual), and maybe file and directory names. In the example above, the string "Первичный звуковой драйвер" should be something like "Default Audio Device" on an English Windows.

Answer 1

Score: 1

This is a convoluted question, but the basics are:

  1. There's no such thing as a string without an encoding. A C string is just a sequence of bytes whose encoding is a matter of convention (often ASCII or the platform's code page). Java strings natively use UTF-16.
  2. There's no perfect conversion between certain character sets. For instance, ASCII -> EBCDIC -> ASCII can corrupt a string, because these character sets do not always map 1:1 onto each other.
  3. To me, it seems the file contains data in one character set, and you want to convert it to Java's native form (UTF-16). This is very simple. You can use a FileInputStream to read the byte data, and a Reader to read String data. Hence you want your Reader to perform the conversion:
    https://docs.oracle.com/javase/8/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.nio.charset.Charset)
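
Point 2 can be checked directly in Java. The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes the JDK ships the windows-1251 charset (common JDK builds do). It shows that a Latin letter with code point 207, as seen in the question's output, has no byte in windows-1251, so String.getBytes silently substitutes '?' (63), which matches the run of 63s the question reports.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class LossyEncodingDemo {

    // True if every character of text has a byte representation in charset.
    public static boolean isRepresentable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        return encoder.canEncode(text);
    }

    // Unsigned value of the first byte that String.getBytes produces.
    public static int firstByte(String text, Charset charset) {
        return text.getBytes(charset)[0] & 0xFF;
    }

    public static void main(String[] args) {
        // Assumption: "windows-1251" is available in this JDK.
        Charset win1251 = Charset.forName("windows-1251");
        String latin = "\u00CF"; // code point 207, the first one in the question

        System.out.println(isRepresentable(latin, win1251));               // false
        System.out.println(firstByte(latin, win1251));                     // 63, i.e. '?'
        System.out.println(firstByte(latin, StandardCharsets.ISO_8859_1)); // 207
    }
}
```

Calling canEncode before converting is one way to detect a lossy conversion up front instead of discovering the question marks afterwards.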

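So basically, the code you are after is something like this (with "windows-1251" standing in for whichever charset the file actually uses):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// StandardCharsets only has constants for UTF-8, UTF-16 and friends, US-ASCII and ISO-8859-1;
// other charsets are looked up by name.
try (BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(myFile), Charset.forName("windows-1251"))))
{
   String line;
   while ((line = br.readLine()) != null)
   {
      // Do what you want with the string.
   }
}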

I will reiterate that, depending on the source and target character sets, the conversion may be imperfect and may lead to corruption.
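
For a string that is already in memory rather than in a file, as with the device names in the question, the same byte-level idea can be applied by hand. This is a hedged sketch, not a fix for the root cause: the class and method names are hypothetical, and it assumes each char of the garbled string is really one raw byte (value 0 to 255) of text in a known single-byte charset. It uses ISO-8859-1 for the first step because that charset maps chars 0 to 255 onto the same byte values, whereas Windows-1252 differs in the 0x80 to 0x9F range.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    // Turns each char back into the raw byte it came from, then decodes
    // those bytes with the charset the text was actually written in.
    // Assumes every char is <= 0xFF; larger chars would be replaced by '?'.
    public static String reinterpret(String garbled, Charset actual) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        // The first nine code points from the question's output.
        int[] codePoints = {207, 229, 240, 226, 232, 247, 237, 251, 233};
        String garbled = new String(codePoints, 0, codePoints.length);

        // Assumption: the real encoding is windows-1251, as the question found.
        String repaired = reinterpret(garbled, Charset.forName("windows-1251"));
        System.out.println(repaired);
    }
}
```

With these inputs the repaired string is the first word of the device name from the question, although whether the console displays Cyrillic correctly depends on its own encoding. The correct charset still has to be known or guessed (here windows-1251, matching a Russian Windows), so this remains a workaround.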

huangapple
  • Posted on August 27, 2020, 01:44:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63603043.html