Buffered vs Unbuffered. How does a buffer actually work?
Question

How does a buffer actually optimize the process of reading/writing?

Every time we read a byte we access the file. I read that a buffer reduces the number of accesses to the file. The question is: how? In the Buffered section of the picture, when we load bytes from the file into the buffer, we access the file just like in the Unbuffered section of the picture, so where is the optimization?

I mean... the buffer must access the file every time it reads a byte, so even if the data in the buffer is read faster, this will not improve performance in the process of reading. What am I missing?
Answer 1

Score: 5
The fundamental misconception is to assume that a file is read byte by byte. Most storage devices, including hard drives and solid-state discs, organize the data in blocks. Likewise, network protocols transfer data in packets rather than single bytes.
This affects how the controller hardware and low-level software (drivers and operating system) work. Often, it is not even possible to transfer a single byte on this level. So, requesting the read of a single byte ends up reading one block and ignoring everything but one byte. Even worse, writing a single byte may imply reading an entire block, changing one byte of it, and writing the block back to the device. For network transfers, sending a packet with a payload of only one byte implies using 99% of the bandwidth for metadata rather than actual payload.
Note that sometimes, an immediate response is needed or a write is required to be definitely completed at some point, e.g. for safety. That’s why unbuffered I/O exists at all. But for most ordinary use cases, you want to transfer a sequence of bytes anyway and it should be transferred in chunks of a size suitable to the underlying hardware.
Note that even if the underlying system injects buffering on its own, or the hardware truly transfers single bytes, performing 100 operating system calls to transfer a single byte each is still significantly slower than performing a single operating system call telling it to transfer 100 bytes at once.
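To make the call-count point concrete, here is a small Java sketch (not from the answer; the file name is just a placeholder) contrasting one-byte-per-call reads with bulk reads on a plain FileInputStream:

```java
import java.io.FileInputStream;
import java.io.IOException;

public class SingleByteVsBulk {
    public static void main(String[] args) throws IOException {
        // Unbuffered, one byte per call: every read() can end up as its own
        // operating system call, even though the device reads whole blocks.
        try (FileInputStream in = new FileInputStream("data.bin")) {
            int b;
            while ((b = in.read()) != -1) {
                // process one byte at a time (slow)
            }
        }

        // Bulk read: one call asks the operating system for up to 8192 bytes,
        // so the per-call overhead is paid once per chunk, not once per byte.
        try (FileInputStream in = new FileInputStream("data.bin")) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                // process chunk[0 .. n-1] at once
            }
        }
    }
}
```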
But you should not consider the buffer to be something between the file and your program, as suggested in your picture. You should consider the buffer to be part of your program. Just like you would not consider a String object to be something between your program and a source of characters, but rather a natural way to process such items. E.g. when you use the [bulk read method of InputStream](https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/io/InputStream.html#read(byte%5B%5D,int,int)) (e.g. of a FileInputStream) with a sufficiently large target array, there is no need to wrap the input stream in a BufferedInputStream; it would not improve the performance. You should just stay away from the [single byte read method](https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/io/InputStream.html#read()) as much as possible.
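For illustration, a plain copy loop whose buffer is an ordinary array owned by the program — a hedged sketch with made-up file names — needs no BufferedInputStream or BufferedOutputStream, because every read and write already moves a whole chunk:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class CopyWithOwnBuffer {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("in.bin");
             FileOutputStream out = new FileOutputStream("out.bin")) {
            byte[] buffer = new byte[64 * 1024]; // the buffer is part of the program
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // bulk write of what was bulk-read
            }
        }
    }
}
```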
As another practical example, when you use an InputStreamReader, it will already read the bytes into a buffer (so no additional BufferedInputStream is needed) and the internally used CharsetDecoder will operate on that buffer, writing the resulting characters into a target char buffer. When you use, e.g. [Scanner](https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/util/Scanner.html), the pattern matching operations will work on that target char buffer of a charset decoding operation (when the source is an InputStream or ByteChannel). Then, when delivering match results as strings, they will be created by another bulk copy operation from the char buffer. So processing data in chunks is already the norm, not the exception.
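As a small illustrative sketch (the file name and its contents are assumed), a Scanner can consume a raw FileInputStream directly; the charset decoding and pattern matching already work on internal buffers, so an extra BufferedInputStream would not add anything:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScanWithoutExtraBuffering {
    public static void main(String[] args) throws IOException {
        // Scanner decodes and matches on its own internal buffers.
        try (Scanner scanner = new Scanner(new FileInputStream("numbers.txt"), StandardCharsets.UTF_8)) {
            long sum = 0;
            while (scanner.hasNextLong()) {
                sum += scanner.nextLong(); // matching works on Scanner's char buffer
            }
            System.out.println("sum = " + sum);
        }
    }
}
```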
This has been incorporated into the NIO design. So, instead of supporting a [single byte read method](https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/io/InputStream.html#read()) and fixing it by providing a buffering decorator, as the InputStream API does, NIO’s ByteChannel subtypes only offer methods using [application managed buffers](https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/nio/ByteBuffer.html).
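A minimal sketch of that style, with a placeholder path: the caller allocates the ByteBuffer and hands it to the channel, so the chunked transfer is explicit in the API rather than hidden behind a decorator:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelReadSketch {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Path.of("data.bin"), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(8192); // application managed buffer
            while (channel.read(buffer) != -1) {
                buffer.flip();              // switch from filling to draining
                while (buffer.hasRemaining()) {
                    byte b = buffer.get();  // process the chunk that was read
                }
                buffer.clear();             // ready for the next bulk read
            }
        }
    }
}
```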
So we could say, buffering is not improving the performance, it is the natural way of transferring and processing data. Rather, not buffering is degrading the performance by requiring a translation from the natural bulk data operations to single item operations.
Answer 2

Score: 2
Basically, for reading, if you request 1 byte the buffer will read 1000 bytes and return you the first byte; for the next 999 single-byte reads it will not read anything from the file but use its internal buffer in RAM. Only after you have read all 1000 bytes will it actually read another 1000 bytes from the actual file.
The same applies to writing, but in reverse: if you write 1 byte it is buffered, and only once you have written 1000 bytes may they be written to the file.
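A toy sketch of that refill logic (this is not the real BufferedInputStream code, and 1000 is used only because the answer uses it):

```java
import java.io.IOException;
import java.io.InputStream;

// Toy illustration of the refill idea described above; real implementations
// (e.g. BufferedInputStream) are more careful and use a larger default size.
class TinyBufferedInput {
    private final InputStream source;
    private final byte[] buffer = new byte[1000];
    private int position = 0; // next byte to hand out
    private int limit = 0;    // number of valid bytes currently in the buffer

    TinyBufferedInput(InputStream source) {
        this.source = source;
    }

    int read() throws IOException {
        if (position == limit) {                          // buffer exhausted: go to the file once
            int n = source.read(buffer, 0, buffer.length);
            if (n == -1) {
                return -1;                                // end of stream
            }
            limit = n;
            position = 0;
        }
        return buffer[position++] & 0xFF;                 // the following reads stay in RAM
    }
}
```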
Note that the choice of buffer size changes performance quite a bit; see e.g. https://stackoverflow.com/a/237495/2442804 for further details on respecting the file system block size, available RAM, etc.
Answer 3

Score: 2
As stated in your picture, buffered file contents are saved in memory, while an unbuffered file is not read until it is streamed to the program.
A File is only a representation of a path. Here is an excerpt from the File Javadoc:
> An abstract representation of file and directory pathnames.
Meanwhile, a buffer such as ByteBuffer takes content from the file (depending on the buffer type, direct or non-direct) and allocates it in memory, e.g. on the heap. From the ByteBuffer Javadoc:
> The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
It actually depends on the conditions: if the file is accessed repeatedly, then buffered is the faster solution. But if the file is larger than main memory and is only accessed once, unbuffered seems to be the better solution.
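For reference, the two allocation styles discussed in the quoted Javadoc look like this in code (the size is arbitrary):

```java
import java.nio.ByteBuffer;

public class BufferAllocationKinds {
    public static void main(String[] args) {
        // Heap (non-direct) buffer: backed by a byte[] inside the Java heap.
        ByteBuffer heap = ByteBuffer.allocate(8192);

        // Direct buffer: memory outside the garbage-collected heap, intended for
        // large, long-lived buffers used with native I/O (see the Javadoc above).
        ByteBuffer direct = ByteBuffer.allocateDirect(8192);

        System.out.println(heap.isDirect());   // false
        System.out.println(direct.isDirect()); // true
    }
}
```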