When I create a file with Java 8 using the Shift-JIS charset, some characters are replaced with '?'


Question


I have a problem when I create a file using the Shift-JIS charset.

This is an example of the text that I want to write to a txt file:

>繰戻_日経選挙システム保守2019年1月10日~;[2019年度更新]横浜第1DCコロケ―ション(2ラック)

With the Shift-JIS charset, the file contains two '?' characters in place of ~ and ―:

>繰戻_日経選挙システム保守2019年1月10日?;[2019年度更新]横浜第1DCコロケ?ション(2ラック)

With the UTF-8 charset, the file content is correct:

>繰戻_日経選挙システム保守2019年1月10日~;[2019年度更新]横浜第1DCコロケ―ション(2ラック)

This is my code:

package it.grupposervizi.easy.ef.etl.elaboration;

import com.nimbusds.jose.util.StandardCharset;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.FileUtils;

public class TestShiftJIS {

  private static final String TEXT = "繰戻_日経選挙システム保守2019年1月10日~;[2019年度更新]横浜第1DCコロケ―ション(2ラック)";
  private static final String DIRECTORY = "C:\\temp\\japan\\";
  private static final String SHIFT_JIS = "Shift-JIS";
  private static final String UTF_8 = StandardCharset.UTF_8.name();
  private static final String EXTENSION = ".txt";

  public static void main(String[] args) {

    final List<String> charsets = Arrays.asList(SHIFT_JIS, UTF_8);
    charsets.forEach(c -> {
      final String fName = DIRECTORY + c + EXTENSION;
      File file = new File(fName);
      try {
        FileUtils.writeStringToFile(file, TEXT, Charset.forName(c));
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    });

    System.out.println("End Test");
  }
}

Do you have any idea why these two characters are not included in the Shift-JIS charset?

Answer 1

Score: 1




You are trying to save a file in an encoding that is uncommon (different from the default one). Try changing the encoding used for the characters.
More about encoding: https://en.wikipedia.org/wiki/Character_encoding

Try using: `Charset.forName("CP943C")`

Answer 2

Score: 0


@JosefZ has basically already given the answer: Shift-JIS does not support ~ (U+FF5E) and ― (U+2015).

This can be verified using Charset.newEncoder().canEncode(char):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class ShiftJisTest {
    public static void main(String[] args) {
        // 繰戻_日経選挙システム保守2019年1月10日~;[2019年度更新]横浜第1DCコロケ―ション(2ラック)
        String s = "\u7e70\u623b\u005f\u65e5\u7d4c\u9078\u6319\u30b7\u30b9\u30c6\u30e0\u4fdd\u5b88\u0032\u0030\u0031\u0039\u5e74\u0031\u6708\u0031\u0030\u65e5\uff5e\u003b\u005b\u0032\u0030\u0031\u0039\u5e74\u5ea6\u66f4\u65b0\u005d\u6a2a\u6d5c\u7b2c\uff11\u0044\u0043\u30b3\u30ed\u30b1\u2015\u30b7\u30e7\u30f3\uff08\uff12\u30e9\u30c3\u30af\uff09";
        Charset charset = Charset.forName("Shift-JIS");
        // One encoder is enough for all canEncode(char) checks
        CharsetEncoder encoder = charset.newEncoder();
        // Print every character that the charset cannot encode
        for (char c : s.toCharArray()) {
            if (!encoder.canEncode(c)) {
                System.out.printf("%s (U+%04X)%n", c, (int) c);
            }
        }

        try {
            // A fresh encoder reports unmappable characters instead of replacing them
            charset.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            // java.nio.charset.UnmappableCharacterException: Input length = 1
            e.printStackTrace();
        }
    }
}

The reason you are seeing ? is that Apache Commons IO's FileUtils.writeStringToFile(File, String, Charset) internally uses String.getBytes(Charset), whose documentation says:
> [...] This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.

And the CharsetEncoder documentation says:
> [...] The replacement is initially set to the encoder's default replacement, which often (but not always) has the initial value { (byte)'?' }
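
To make the difference between the two behaviours concrete, here is a minimal sketch that encodes the same text once through String.getBytes(Charset) (silent replacement) and once through a CharsetEncoder explicitly configured with CodingErrorAction.REPORT (fails on unmappable characters). The class name StrictEncodeSketch and the helper names encodeLenient and encodeStrict are made up for this illustration; the sketch uses the canonical charset name "Shift_JIS" and assumes that charset is available in your JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class StrictEncodeSketch {

    /** Lenient path: unmappable characters are replaced with the charset's replacement bytes. */
    static byte[] encodeLenient(String text, Charset charset) {
        return text.getBytes(charset);
    }

    /** Strict path: throws CharacterCodingException instead of substituting anything. */
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        String text = "10\u65e5\uff5e"; // "10日~", ends with U+FF5E

        byte[] lenient = encodeLenient(text, shiftJis);
        int last = lenient[lenient.length - 1] & 0xff;
        System.out.println("last byte (lenient): 0x" + Integer.toHexString(last));

        try {
            encodeStrict(text, shiftJis);
            System.out.println("strict encoding succeeded");
        } catch (CharacterCodingException e) {
            System.out.println("strict encoding failed: " + e);
        }
    }
}
```

With the strict variant, the bytes could then be written with java.nio.file.Files.write, so that a lossy conversion surfaces as an exception rather than as a '?' in the output file.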

Answer 3

Score: 0


As @Marcono1234 answered, the Shift-JIS mapping in Java does not support ~ (U+FF5E) and ― (U+2015). To map these code points into a Shift-JIS-style encoding, you have to use Charset.forName("windows-31j") or Charset.forName("MS932") rather than Charset.forName("Shift-JIS").
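
A quick way to check this on your own JDK is to ask each charset whether it can encode the two problem characters. The sketch below is only an illustration: the class name CharsetCoverageSketch and the helper canEncodeAll are hypothetical, and it assumes the listed charsets are present in your runtime (it skips any that are not).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CharsetCoverageSketch {

    /** Returns true if every character of text can be encoded in the named charset. */
    static boolean canEncodeAll(String charsetName, String text) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // U+FF5E (fullwidth tilde) and U+2015 (horizontal bar), the two characters from the question
        String problemChars = "\uff5e\u2015";

        for (String name : new String[] {"Shift_JIS", "windows-31j", "MS932"}) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            System.out.println(name + ": canEncode = " + canEncodeAll(name, problemChars));
        }
    }
}
```

Keep in mind that windows-31j is Microsoft's variant of Shift-JIS, so whichever program reads the file should decode it with the same variant; a reader that applies a different Shift-JIS mapping may turn those bytes into different Unicode characters.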

Posted by huangapple on 2020-09-02 21:19:58
Original link: https://go.coder-hub.com/63706439.html