Character reader (like FileReader) with dynamic repositioning based on character offset

Question

I am looking for something similar to RandomAccessFile that would allow positioning inside a big file based on a character (not byte) offset, and reading from there. FileReader and most of the implementations I have come across do not have a seek-like method such as the one provided by RandomAccessFile.

Does any such reader exist?

Answer 1

Score: 2

tl;dr No such Reader exists, because it can't easily be made to work with every encoding.

If the encoding you use is a fixed-width encoding (such as ISO-8859-*, a Windows codepage, ASCII, UCS-2, ...) then this could work, since you'd simply have to multiply the character offset by a constant that depends on the encoding (usually 1; 2 for UCS-2) to get the byte offset.
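The offset arithmetic for the fixed-width case is small enough to sketch. The class and method names below are made up for illustration, and the caller is assumed to already know the width of the encoding; nothing here detects it.

```java
final class FixedWidthOffsets {

    /**
     * Converts a character offset into a byte offset for a fixed-width encoding.
     * bytesPerChar is supplied by the caller (assumption): 1 for ASCII or
     * ISO-8859-1, 2 for UCS-2.
     */
    static long byteOffset(long charOffset, int bytesPerChar) {
        if (charOffset < 0 || bytesPerChar <= 0) {
            throw new IllegalArgumentException("invalid character offset or character width");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(charOffset, (long) bytesPerChar);
    }

    public static void main(String[] args) {
        System.out.println(byteOffset(1000, 1)); // single-byte encoding: prints 1000
        System.out.println(byteOffset(1000, 2)); // UCS-2: prints 2000
    }
}
```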

In fact you can easily emulate this yourself by repositioning the underlying byte source (for example with RandomAccessFile.seek()) and decoding from the new position. Make sure not to keep a BufferedReader across seeks, since its buffer would still hold data from the old position.
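One way this emulation might look, as a minimal sketch rather than a finished class: seek the RandomAccessFile to the computed byte offset, then wrap its channel in a fresh InputStreamReader. The names `FixedWidthSeekDemo`, `readerAt`, `readChars` and `demo` are hypothetical, and the demo assumes ISO-8859-1 (one byte per character).

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FixedWidthSeekDemo {

    /** Seeks to charOffset * bytesPerChar and returns a fresh decoder starting there. */
    static Reader readerAt(RandomAccessFile file, Charset charset,
                           int bytesPerChar, long charOffset) throws IOException {
        file.seek(Math.multiplyExact(charOffset, (long) bytesPerChar));
        // A new reader per seek: the reader buffers internally, so an old one
        // would still hold bytes from the previous position.
        return new InputStreamReader(Channels.newInputStream(file.getChannel()), charset);
    }

    /** Reads up to count chars, looping because read() may return fewer than asked. */
    static String readChars(Reader reader, int count) throws IOException {
        char[] buffer = new char[count];
        int total = 0;
        while (total < count) {
            int n = reader.read(buffer, total, count - total);
            if (n < 0) break; // end of file
            total += n;
        }
        return new String(buffer, 0, total);
    }

    /** Writes a small temp file, jumps to character 10 and reads 3 characters. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("seek-demo", ".txt");
            try {
                Files.write(tmp, "0123456789abcdef".getBytes(StandardCharsets.ISO_8859_1));
                try (RandomAccessFile file = new RandomAccessFile(tmp.toFile(), "r")) {
                    return readChars(readerAt(file, StandardCharsets.ISO_8859_1, 1, 10), 3);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints abc
    }
}
```

The reader is deliberately not closed on its own, because closing it would also close the channel and with it the RandomAccessFile; the file is closed once, by the try-with-resources block.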

But several very popular encodings (and a few less popular ones) are variable-width, meaning that different characters can be represented by different numbers of bytes. UTF-8 and UTF-16 are well-known examples, but others like Shift-JIS have this property as well.
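To make the variable width concrete, the sketch below encodes four sample characters and reports how many bytes each takes. The helper name `encodedLength` is made up for illustration; UTF-16LE is used for the UTF-16 column so that no byte-order mark is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class VariableWidthDemo {

    /** Number of bytes the given text occupies in the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // U+0061 'a', U+00E9 e-acute, U+20AC euro sign, U+1F600 (an emoji outside the BMP)
        String[] samples = {"a", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0),
                    encodedLength(s, StandardCharsets.UTF_8),
                    encodedLength(s, StandardCharsets.UTF_16LE));
        }
    }
}
```

In UTF-8 the four samples take 1, 2, 3 and 4 bytes respectively, so the byte position of the N-th character depends on every character before it.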

For variable-width encodings, creating such a stream without any indexing or prior knowledge is not possible. One could implement seek() by just reading and discarding the desired number of characters, but that wouldn't have the benefit of a real seek, since you'd still have to read the "skipped" bytes from disk to know how far to go.
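The read-and-discard fallback could be sketched as follows. The names `LinearSeekDemo`, `openAtChar` and `demo` are hypothetical. Note one assumption baked into Java's Reader API: offsets count `char` values (UTF-16 code units), so a character outside the Basic Multilingual Plane counts as two.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LinearSeekDemo {

    /**
     * Opens the file and discards charOffset chars. The cost is O(charOffset):
     * every skipped byte is still read and decoded, so this is not a real seek.
     */
    static BufferedReader openAtChar(Path path, Charset charset, long charOffset)
            throws IOException {
        BufferedReader reader = Files.newBufferedReader(path, charset);
        long remaining = charOffset;
        while (remaining > 0) {
            long skipped = reader.skip(remaining); // may skip fewer than requested
            if (skipped <= 0) {
                if (reader.read() < 0) break;      // end of file before the offset
                skipped = 1;
            }
            remaining -= skipped;
        }
        return reader;
    }

    /** Writes a UTF-8 temp file with multi-byte characters and reads from char 8. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("linear-seek", ".txt");
            try {
                // h, e-acute, l, l, o, space, euro sign, space, w, o-umlaut, r, l, d
                String text = "h\u00e9llo \u20ac w\u00f6rld";
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                try (BufferedReader reader = openAtChar(tmp, StandardCharsets.UTF_8, 8)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Character 8 sits at byte 11 in this file, because the two earlier non-ASCII characters take 2 and 3 bytes; only decoding the prefix (or consulting a prebuilt index) reveals that.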

huangapple
  • Posted on 2020-10-21 15:43:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64458880.html