The readChar() method displays Japanese characters

Question

I'm trying to write code that picks a word from a file according to an index entered by the user, but the readChar() method of the RandomAccessFile class is returning Japanese characters. I must admit that it's not the first time I've seen this on my Lenovo laptop: some installation wizards also show normal characters mixed with Japanese characters. Do you think it comes from the laptop or rather from the code?

This is the code:

package com.project;

import java.io.*;
import java.util.StringTokenizer;

public class Main {

    public static void main(String[] args) throws IOException {
        int N, i=0;
        char C;
        char[] charArray = new char[100];
        String fileLocation = "file.txt";
        BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
        do {
            System.out.println("enter the index of the word");
            N = Integer.parseInt(buffer.readLine());
            if (N!=0) {
                RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
                do {
                    word.seek((2*(N-1))+i);
                    C = word.readChar();
                    charArray[i] = C;
                    i++;
                }while(charArray[i-1] != ' ');
                System.out.println("the word of index " + N + " is: " );
                for (char carTemp : charArray )
                System.out.print(carTemp);
                System.out.print("\n");

            }
        }while(N!=0);
        buffer.close();
    }
}

I get this output:

瑯潕啰灰灥敲牃䍡慳獥攨⠩⤍ഊੴ瑯潌䱯潷睥敲牃䍡慳獥攨⠩⤍ഊ੣捯潭浣捡慴琨⡓却瑲物楮湧朩⤍ഊ੣捨桡慲牁䅴琨⡩楮湴琩⤍ഊੳ獵畢扳獴瑲物楮湧木⠠⁳獴瑡慲牴琠⁩楮湤摥數砬Ⱐ⁥敮湤搠⁩楮湤摥數砩⤍ഊੴ瑲物業洨⠩Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 100 out of bounds for length 100
	at Main.main(Main.java:21)

Answer 1

Score: 1

char is 16 bits, i.e. 2 bytes.

seek seeks to a byte boundary.

If the file contains chars then they are at even offsets: 0, 2, 4...

The expression (2*(N-1))+i is even if and only if i is even; if i is odd, you are sure to land in the middle of a char, and thus read garbage.

i starts at zero, but you increment by 1, i.e., half a character.

Your seek argument should probably be (2*(N-1+i)).
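
The offset arithmetic above can be sketched as follows. This is only an illustration, assuming the file really holds 2-byte chars (here written with writeChars into a temporary file); the class name EvenOffsetDemo and the helpers readWord and writeSample are made up for the example and are not part of the question's code.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

public class EvenOffsetDemo {

    // Reads chars starting at the 1-based char index n until a space or end of file.
    // Char number k (0-based) starts at byte offset 2*k, so every seek lands on an even offset.
    public static String readWord(File file, int n) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            StringBuilder sb = new StringBuilder();
            // Stop as soon as fewer than two bytes remain at the computed offset.
            for (int i = 0; 2L * (n - 1 + i) + 1 < raf.length(); i++) {
                raf.seek(2L * (n - 1 + i));
                char c = raf.readChar();
                if (c == ' ') {
                    break;
                }
                sb.append(c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: creates a temporary file storing each char as 2 bytes, high byte first.
    public static File writeSample(String text) {
        try {
            File file = File.createTempFile("words", ".bin");
            file.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                raf.writeChars(text);
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = writeSample("hello world");
        System.out.println(readWord(file, 1)); // word starting at char 1
        System.out.println(readWord(file, 7)); // word starting at char 7
    }
}
```

Unlike the question's loop, this sketch also stops at the end of the file, so a last word without a trailing space does not run past the data.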


Alternative explanation: your file does not contain chars at all; for example, you created an ASCII file in which a character is a single byte.

In that case, the error is attempting to read ASCII (an obsolete character encoding) with readChar, which always consumes two bytes and combines them into a single char.

But if the file contains ASCII, the purpose of multiplying by 2 in the seek argument is obscure. It apparently serves no useful purpose.
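
To see what this alternative would look like in practice, here is a small sketch that feeds two ASCII bytes to readChar. The input "to" and the class name AsciiAsCharDemo are assumptions chosen for illustration; the point is only that two single-byte characters get fused into one 16-bit value, which tends to land in the CJK range and would explain output like the one in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class AsciiAsCharDemo {

    // Reads one char the way readChar() does: two bytes, high byte first.
    public static char firstCharOf(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readChar();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] ascii = "to".getBytes(StandardCharsets.US_ASCII); // bytes 0x74 and 0x6F
        char c = firstCharOf(ascii);
        // 't' and 'o' are fused into the single code unit 0x746F
        System.out.printf("U+%04X%n", (int) c);
        // Reports which Unicode block that code unit belongs to
        System.out.println(Character.UnicodeBlock.of(c));
    }
}
```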

Answer 2

Score: 1

There are many things wrong, all of which have to do with fundamental misconceptions.

First off: a file on your disk (never mind the File interface in Java, or any other programming language; the file itself) does not and cannot store text. Ever. It stores bytes: raw data quantified in bits, which are organized into groups of 8 called bytes. (That holds for every machine that's been relevant for decades, although historically there have been other ways to do it.)

Text is an abstraction; an interpretation of some particular sequence of byte values. It depends - fundamentally and unavoidably - on an encoding. Because this isn't a blog, I'll spare you the history lesson here, but suffice to say that Java's char type does not simply store a character of text. It stores an unsigned two-byte value, which may represent a character of text. Because there are more characters of text in Unicode than two bytes can represent, sometimes two adjacent chars in an array are required to represent a character of text. (And, of course, there is probably code out there that abuses the char type simply because someone wanted an unsigned equivalent of short. I may even have written some myself. That era is a blur for me.)
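
A short sketch of the "two adjacent chars" case, using only standard java.lang methods. The code point U+1F600 is just an arbitrary example of a character above U+FFFF, and the class and helper names are made up for illustration.

```java
public class CharUnitsDemo {

    // Number of Java chars (16-bit code units) needed to store one Unicode code point.
    public static int charsNeeded(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // one code point above U+FFFF

        System.out.println(charsNeeded('A'));                            // 1
        System.out.println(emoji.length());                              // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));     // 1
        System.out.println(Character.isHighSurrogate(emoji.charAt(0)));  // true
    }
}
```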

Anyway, the point is: .readChar() reads two bytes from your file and stores them as a single char within your char[], and the resulting numeric value is not going to be anything like the one you wanted - unless your file happens to be encoded using the same encoding that Java uses natively, called UTF-16 (and, since readChar() takes the high byte first, in big-endian byte order).

You cannot properly read and interpret the file without knowing the file encoding. Full stop. You can at best delude yourself into believing that you can read it. You also cannot have "random access" to a text file - i.e., indexing according to a number of characters of text - unless the encoding in question is constant width. (Otherwise, of course, you can't just calculate the distance-in-bytes into the file where a given character of text is; it depends on how many bytes the previous characters took up, which depends on which characters they are.) Many text encodings are not constant width. One of the most popular, which frankly is the sane default recommendation for most tasks these days, is not. In which case you are simply out of luck for the problem you describe.
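
To make the constant-width point concrete, the sketch below uses UTF-8 as an example of a variable-width encoding (the answer does not name the encoding it has in mind, so that choice is an assumption). The helper utf8Offset is hypothetical and assumes text without surrogate pairs.

```java
import java.nio.charset.StandardCharsets;

public class VariableWidthDemo {

    // Byte offset at which the char at charIndex starts once the text is encoded as UTF-8.
    // It depends on every preceding character, so it cannot be computed from the index alone.
    public static int utf8Offset(String text, int charIndex) {
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', e with acute accent, euro sign, 'b': 1, 2, 3 and 1 bytes in UTF-8
        String text = "a\u00E9\u20ACb";
        for (int i = 0; i < text.length(); i++) {
            System.out.println("char " + i + " starts at UTF-8 byte " + utf8Offset(text, i));
        }
        // In UTF-16 every char of this text takes exactly 2 bytes.
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length);
    }
}
```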

At any rate, once you know the encoding of your file, the expected way to retrieve a character of text from a file in Java is to use one of the Reader classes, such as InputStreamReader:

> An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.

(Here, charset simply means an instance of the class that Java uses to represent text encodings.)
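
As a rough sketch of the Reader-based approach, the example below decodes a byte stream with an explicitly chosen charset and picks a word by its position. Note that it indexes by word number rather than by character offset as in the question, and that the sample data, the class name ReaderDemo and the helper nthWord are assumptions for illustration. For a real file, a FileInputStream would take the place of the ByteArrayInputStream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {

    // Decodes the stream with the given charset and returns the n-th (1-based)
    // whitespace-separated word, or null if there are fewer than n words.
    public static String nthWord(InputStream in, Charset charset, int n) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int seen = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty() && ++seen == n) {
                        return word;
                    }
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "toUpperCase() toLowerCase()\nconcat(String)"
                .getBytes(StandardCharsets.UTF_8);
        // The charset passed here must match the one the bytes were written with.
        System.out.println(nthWord(new ByteArrayInputStream(data), StandardCharsets.UTF_8, 3));
    }
}
```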

You may be able to fudge your problem description a little bit: seek to a byte offset, and then grab the text characters starting at that offset. However, there is no guarantee that the "text characters starting at that offset" make any sense, or in fact can be decoded at all. If the offset happens to be in the middle of a multi-byte encoding for a character, the remaining part isn't necessarily valid encoded text.
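
The failure mode described here can be reproduced in a few lines. This sketch assumes UTF-8 data and uses the String constructor, which substitutes U+FFFD for bytes it cannot decode; the class name MidCharDemo and the helper decodeFrom are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class MidCharDemo {

    // Decodes the bytes from the given offset to the end as UTF-8.
    public static String decodeFrom(byte[] bytes, int offset) {
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // e-acute, 't', e-acute encodes to the bytes C3 A9 74 C3 A9
        byte[] bytes = "\u00E9t\u00E9".getBytes(StandardCharsets.UTF_8);

        // Offset 0 is a character boundary, so the text decodes cleanly.
        System.out.println(decodeFrom(bytes, 0).equals("\u00E9t\u00E9")); // true

        // Offset 1 is the second byte of the first character: it cannot be
        // decoded on its own and is replaced by U+FFFD.
        System.out.println(decodeFrom(bytes, 1).charAt(0) == '\uFFFD'); // true
    }
}
```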

Answer 3

Score: 0

I changed the encoding of the file to UTF-16 and modified the program so that it displays the right indexes, i.e. those that mark the beginning of each word. Now it works fine. Thank you, guys.

import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException {
        int N, i = 0, j = 0;
        char C;
        char[] charArray = new char[100];
        String fileLocation = "file.txt";
        BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));

        // First pass: print the 1-based char index at which each word starts,
        // i.e. the position just after every space or newline.
        DataInputStream in = new DataInputStream(new FileInputStream(fileLocation));
        boolean EOF = false;
        do {
            try {
                j++;
                C = in.readChar();
                if ((C == ' ') || (C == '\n')) {
                    System.out.print((j + 1) + "\t");
                }
            } catch (IOException e) {
                // readChar() throws EOFException once the file is exhausted
                EOF = true;
            }
        } while (!EOF);
        in.close();
        System.out.println("\n");

        do {
            System.out.println("enter the index of the word");
            N = Integer.parseInt(buffer.readLine());
            if (N != 0) {
                RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
                do {
                    // Each char takes 2 bytes, so char number N-1+i starts at byte 2*(N-1+i)
                    word.seek(2 * (N - 1 + i));
                    C = word.readChar();
                    charArray[i] = C;
                    i++;
                } while (charArray[i - 1] != ' ' && charArray[i - 1] != '\n');
                word.close();
                System.out.print("the word of index " + N + " is: ");
                for (char carTemp : charArray)
                    System.out.print(carTemp);
                System.out.print("\n");
                // Reset the state for the next query
                i = 0;
                charArray = new char[100];
            }
        } while (N != 0);
        buffer.close();
    }
}

huangapple
  • Posted on September 21, 2020, 06:50:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63984370.html