A problem showing Unicode characters when running a Java artifact, although everything works when running in IntelliJ IDEA

Question

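The goal is to read from a database and write the records to a file.
When I run the code in IntelliJ IDEA, it writes Unicode characters to the file exactly as they appear in the database.
But when I build the artifact (JAR file) and run it on Windows, the output file shows question marks ('?') instead of the database content.
In other words, English letters and digits are written correctly, but non-English characters (e.g. Persian or Arabic) are not.

Relevant parts of the Java code: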

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("outputFile.txt", true), "cp1256"));

while (resultSet.next()) {

    try {
        singleRow = resultSet.getString("CODE") + "|"
                + resultSet.getString("ACTIVITY") + "|"
                + resultSet.getString("TEL") + "|"
                + resultSet.getString("ZIPCD") + "|"
                + resultSet.getString("ADDR");

    } catch (Exception e) {
        LogUtil.writeLog(Constants.LOG_ERROR, e.getMessage());
    }

    out.write(singleRow + System.getProperty("line.separator"));
}

Output file content when running in IntelliJ IDEA debug mode:

130143|Active|ابتداي بلوار ميرداماد،کوچه سوم پلاک پنج|524|35254410
190730|Active|خیابان زیتون، بین انوشه و زیبا پلاک یک|771|92542001

Output file content when running the corresponding JAR file:

130143|Active|35254410|524|??? ? ??? ??????? ????? ????
190730|Active|92542001|771|????? ??? ??????? ????? ??? ??

Could you please tell me what is wrong with the program?
# Answer 1

**Score**: 2

You must change your code as follows:

```java
// Write the file as UTF-8 instead of cp1256.
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("outputFile.txt", true), StandardCharsets.UTF_8));

while (resultSet.next()) {
    try {
        singleRow = resultSet.getString("CODE") + "|"
                + resultSet.getString("ACTIVITY") + "|"
                + resultSet.getString("TEL") + "|"
                + resultSet.getString("ZIPCD") + "|"
                + resultSet.getString("ADDR");
    } catch (Exception e) {
        LogUtil.writeLog(Constants.LOG_ERROR, e.getMessage());
    }

    // Round trip through UTF-8 bytes; for a well-formed String this yields the same text.
    byte[] bytes = singleRow.getBytes(StandardCharsets.UTF_8);
    String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8);
    out.write(utf8EncodedString + System.getProperty("line.separator"));
}
```

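
The writer setup above could also be sketched with `java.nio.file.Files`, which takes the charset explicitly and, inside try-with-resources, flushes and closes the writer (a step the snippet above does not show). This is only an illustrative sketch: the class name `Utf8RowWriter`, the method `appendRows`, the temporary file, and the sample row are assumptions made for the example, not part of the original program.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class Utf8RowWriter {
    // Appends each row as one UTF-8 encoded line; the file is created if it is missing.
    static void appendRows(Path file, List<String> rows) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String row : rows) {
                out.write(row);
                out.newLine(); // platform line separator, like System.getProperty("line.separator")
            }
        } // try-with-resources flushes and closes the writer here
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output location and sample row (the Persian word is written with escapes).
        Path file = Files.createTempFile("rows", ".txt");
        appendRows(file, Arrays.asList("190730|Active|\u062E\u06CC\u0627\u0628\u0627\u0646|771|92542001"));
        // Read back with the same charset to confirm the text survived.
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

Reading the file back with the same explicit charset is a quick way to check that the bytes on disk are what was intended, independent of how an editor chooses to display them.
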
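`String.getBytes()` (the no-argument overload) uses the platform's *default character set*. You can check the charset of your **environment** with:

    System.out.println("Charset.defaultCharset=" + Charset.defaultCharset());

When **running from IntelliJ**, the default character set is taken from the IntelliJ environment.
When **running from the JAR** file, the default character set is taken from the operating system (explained at the end).
Because your Windows and IntelliJ environments use different charsets, you get different output.

It is highly recommended to **explicitly** specify "ISO-8859-1", "US-ASCII", "UTF-8", or whatever character set you want when converting bytes to Strings or vice versa:

    singleRow.getBytes(StandardCharsets.UTF_8)

See [this link][1] for more information.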
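
As a rough illustration of the point about default versus explicit charsets, the sketch below encodes a short Persian sample both ways and reports whether the results agree. The class name `CharsetCheck` and the sample text are made up for this example; only standard JDK calls are used, and the printed values depend on how the JVM was started.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetCheck {
    // Hypothetical sample: six Persian letters, including U+06CC, written with escapes.
    static final String SAMPLE = "\u062E\u06CC\u0627\u0628\u0627\u0646";

    // Encodes with an explicit charset, so the bytes are the same on every machine.
    static byte[] explicitUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Encodes with whatever charset the JVM picked up from its environment.
    static byte[] platformDefault(String text) {
        return text.getBytes(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        boolean same = Arrays.equals(explicitUtf8(SAMPLE), platformDefault(SAMPLE));
        System.out.println("Default encoding matches UTF-8: " + same);

        // Decoding with the same explicit charset restores the original text.
        String roundTrip = new String(explicitUtf8(SAMPLE), StandardCharsets.UTF_8);
        System.out.println("Round trip intact: " + SAMPLE.equals(roundTrip));
    }
}
```

Running this once from the IDE and once from the JAR should make any difference between the two environments visible.
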
--------------------------------------------------------
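
***What are Windows-1252 and Windows-1256?***

> **Windows-1252**
>
> Windows-1252 or CP-1252 (code page 1252) is a **single-byte** (0-255) character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages, including Spanish, French, and German.
> The first 128 codes (0-127) are the same as standard ASCII. The remaining codes (128-255) are where the Windows code pages differ from one another.
>
> **Windows-1256**
>
> Windows-1256 is a code page used to write Arabic (and possibly some other languages that use the **Arabic** script, like **Persian** and Urdu) under Microsoft Windows.
> It also contains some **Windows-1252** Latin characters used for French, since this European language has some historic relevance in former French colonies in North Africa. This allows French and Arabic text to be intermixed when using **Windows-1256** without any need for code-page switching (however, upper-case letters with diacritics are not included).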
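
***What should I do when using Unicode (Persian) characters?***

Persian has some distinct characters that look alike, such as "ی" and "ي". Because ***Windows-1256*** has no U+06CC character, this encoding replaces "ی" (U+06CC) with "ي" (U+064A).

*For Persian, use UTF-8 instead of Windows-1256 to avoid encoding problems.*

*Keep in mind that Windows-1256 uses only 1 byte per character, while UTF-8 uses more (1 to 4 bytes).*

A comparison of these encodings is [here][2].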
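
The size and coverage differences described in this section can be checked with a small sketch like the one below. The class name `EncodingCompare` and its helper methods are hypothetical, and the code assumes the running JDK ships the `windows-1256` charset (full JDK builds normally do). It prints whether each letter fits rather than asserting the result in advance.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCompare {
    // Assumption: the windows-1256 charset is available in this JDK.
    static final Charset CP1256 = Charset.forName("windows-1256");

    // True if every character of the text has a code in the given charset.
    static boolean fits(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Number of bytes the text occupies in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Encodes and decodes with the same charset; characters the charset
    // cannot represent are replaced (with '?' for the charsets used here).
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String arabicYeh = "\u064A"; // ي
        String farsiYeh = "\u06CC";  // ی

        System.out.println("U+064A fits in windows-1256: " + fits(arabicYeh, CP1256));
        System.out.println("U+06CC fits in windows-1256: " + fits(farsiYeh, CP1256));
        System.out.println("U+064A bytes in windows-1256: " + byteLength(arabicYeh, CP1256));
        System.out.println("U+064A bytes in UTF-8:        " + byteLength(arabicYeh, StandardCharsets.UTF_8));
        System.out.println("U+06CC after a UTF-8 round trip survives: "
                + farsiYeh.equals(roundTrip(farsiYeh, StandardCharsets.UTF_8)));
    }
}
```

A round trip through a charset that lacks a character is one way question marks can end up in an output file, which is why checking `fits` before choosing a single-byte encoding can be useful.
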
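
***How to change the Windows default character set?***

Currently, **Windows-1252** is the **default** encoding used by Microsoft Windows systems in most Western countries.
To change the default character set of Microsoft Windows to a suitable one, follow [this guide][3].
If you change it to **Persian** as shown below, your default charset becomes **Windows-1256**.

[![enter image description here][4]][4]

***How to change the character set of specific software (some used for programming)?***

You must change the encoding settings of each program according to its own instructions.

**1- For Notepad++**  
[![enter image description here][5]][5]

**2- For an XML file or field**
[![enter image description here][6]][6]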
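
**3- For IntelliJ files**

Open the desired file for editing.
From the main menu, select File | File Encoding, or click the file encoding on the status bar.
Select the desired encoding from the popup.

[![enter image description here][7]][7]

If an icon is displayed next to the selected encoding, it means that this encoding might change the file contents. In this case, IntelliJ IDEA opens a dialog where you can decide what to do with the file: choose Reload to load the file from disk into the editor and apply the encoding change to the editor only, or choose Convert to overwrite the file with the encoding of your choice.

**4- IntelliJ console output encoding**

IntelliJ IDEA creates files using the IDE encoding defined on the File Encodings page of the Settings / Preferences dialog (Ctrl+Alt+S). You can use either the system default or select from the list of available encodings. By default, this encoding affects console output. If you want the console output encoding to differ from the global IDE setting, configure the corresponding JVM option:

1. On the Help menu, click Edit Custom VM Options.
2. Add the `-Dconsole.encoding` option and set its value to the required encoding. For example: **-Dconsole.encoding=UTF-8**
3. Restart IntelliJ IDEA.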

[1]: https://stackoverflow.com/questions/12659417/why-does-javas-string-getbytes-uses-iso-8859-1/12659567
[2]: https://bizbrains.com/blog/encoding-101-part-2-windows-1252-vs-utf-8/
[3]: https://www.digitalcitizen.life/changing-display-language-used-non-unicode-programs
[4]: https://i.stack.imgur.com/q4773.png
[5]: https://i.stack.imgur.com/4GcyM.png
[6]: https://i.stack.imgur.com/avtZS.png
[7]: https://i.stack.imgur.com/cg07P.png

huangapple
  • Published on 2020-09-26 20:49:17
  • When reposting, please keep this link: https://go.coder-hub.com/64077873.html