How to read/write random chunks from large binary files properly?

Question

I am writing a library for working with binary files, specifically "The Log Information Standard (LIS) 79 Subset", which has various types of records with entries of various data types. Each entry may be a single value, an array, or a more complex structure. File size may vary from 3-5 KB to several GB.

Goal: read and modify any part of the file of any size.

What has been tried:

  • The first implementation simply read the whole file, then wrote every record and data type to an appropriate class instance along with its buffer. It was perfectly fine for small files, but extremely slow and RAM hungry when the file size was bigger than 1 GB.
  • Then I tried to stop using buffers completely: read the file and store only the offset, size and type of each component. This approach worked well for reading, but as I understood after some research, there is no way to insert data at an arbitrary position in a file.

So the question is how to properly handle such data without overusing RAM?

Answer 1

Score: 1

I have no knowledge of LIS files, so this will be about binary files in general.

Many binary file formats have some type of index, as well as the actual data entries themselves. Reading the file then consists of scanning through the index until you find what you are looking for, and then jumping to the offset specified in the index. The index might be defined in a chunk at the beginning of the file, or as a linked list spread throughout the file. The actual format will likely be much more complicated, but this can be useful as a simplified mental model.

If you know the format, you could simply use a BinaryReader to read values and jump around in the file accordingly, probably with some kind of state machine to keep track of what you are currently reading.
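
As a minimal sketch of the "jump around" part: the helper below seeks to a known offset and reads a single value there. The names `RandomAccessDemo` and `ReadInt32At` are made up for illustration, and the layout (a 4-byte integer at a known offset) is an assumption, not anything taken from LIS. Note that BinaryReader always reads little-endian, so a format that stores numbers in a different byte order would need a conversion step.

```csharp
using System;
using System.IO;
using System.Text;

public static class RandomAccessDemo
{
    // Reads a 4-byte little-endian integer stored at the given byte offset.
    // Nothing before or after that value has to be parsed first.
    public static int ReadInt32At(Stream stream, long offset)
    {
        // Seeking is cheap on a seekable stream such as a FileStream.
        stream.Seek(offset, SeekOrigin.Begin);

        // leaveOpen: true so disposing the reader does not close the caller's stream.
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);
        return reader.ReadInt32();
    }
}
```

With a FileStream opened on a multi-gigabyte file, a call like `RandomAccessDemo.ReadInt32At(stream, someOffset)` only touches the region around that offset, which is why storing offsets instead of buffers works well for reading.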

> insert data in the random position of the file

This is really difficult to do well. You will have to choose between wasting space, moving data around, and fragmentation. Databases spend a lot of effort trying to find a happy medium between these extremes.

But if you are working with an existing format, the choice has been made for you. If the format is not designed for cheap insertions, you will likely need to move the vast majority of the data in the file, which essentially requires rewriting the entire thing. If you are lucky, the format might allow appending data cheaply.
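
To make "rewrite the entire thing" concrete, here is a rough sketch of inserting bytes in the middle of a file by streaming it into a temporary file. The class name `FileInsert`, the method names, and the `.tmp` naming are all hypothetical, and it assumes .NET Core 3.0 or later for the `File.Move` overload with `overwrite`. Memory use stays at one copy buffer, but every byte of the file is still read and written once.

```csharp
using System;
using System.IO;

public static class FileInsert
{
    // Inserts `data` at byte `offset` of the file at `path` by copying the file
    // into a temporary file and then replacing the original with it.
    public static void InsertBytes(string path, long offset, byte[] data)
    {
        string tempPath = path + ".tmp"; // hypothetical temp-file naming

        using (var source = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            if (offset < 0 || offset > source.Length)
                throw new ArgumentOutOfRangeException(nameof(offset));

            using (var target = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
            {
                CopyBytes(source, target, offset);   // bytes before the insertion point
                target.Write(data, 0, data.Length);  // the inserted bytes
                source.CopyTo(target);               // everything after the insertion point
            }
        }

        // Both streams are closed here, so the original can be replaced.
        File.Move(tempPath, path, overwrite: true);
    }

    // Copies exactly `count` bytes from source to target using a fixed-size buffer.
    private static void CopyBytes(Stream source, Stream target, long count)
    {
        var buffer = new byte[81920];
        while (count > 0)
        {
            int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (read == 0) throw new EndOfStreamException();
            target.Write(buffer, 0, read);
            count -= read;
        }
    }
}
```

Any offsets you have stored for data located after the insertion point become stale after such a rewrite and would have to be shifted by `data.Length`.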

If the format is not designed for cheap modification, you will most likely need to convert it to some format that is cheap to modify. If you can keep it all in memory, that will likely simplify things.

You could also just parse the index into an in-memory structure and keep any updates in memory until it is time to write the data back to disk. An imaginary format could look something like this. The key here is that you only read the minimum amount of data needed from disk, and that additions or modifications are done in memory. Note that this is for illustrative purposes only.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    public class Index
    {
        private readonly Dictionary<string, IEntry> entries = new();

        public IEnumerable<string> List => entries.Keys;
        public byte[] Read(string key) => entries[key].Read();
        public void UpdateOrAdd(string key, byte[] data) => entries[key] = new MemoryEntry(data);

        public static Index Load(Stream source)
        {
            var br = new BinaryReader(source);
            var numEntries = br.ReadInt32();
            var result = new Index();

            for (int i = 0; i < numEntries; i++)
            {
                var key = br.ReadString();
                var length = br.ReadInt32();

                // Note: mixing index information and data like this will make it
                // easy to read/append, but slower to load.
                // The offset is kept as a long so that files over 2 GB still work.
                var offset = br.BaseStream.Position;
                result.entries[key] = new FileEntry(source, offset, length);
                br.BaseStream.Position += length; // skip the data, only index it
            }
            return result;
        }

        // The destination must be a different stream than the one the index was
        // loaded from, since unmodified entries are still read from that stream.
        public void Save(Stream destination)
        {
            var bw = new BinaryWriter(destination);
            bw.Write(entries.Count);
            var list = entries.ToList();
            foreach (var (key, value) in list)
            {
                bw.Write(key);
                bw.Write(value.Length);
                value.CopyTo(bw.BaseStream);
            }
            bw.Flush();
        }
    }

    public interface IEntry
    {
        public void CopyTo(Stream destination);
        public byte[] Read();
        public int Length { get; }
    }

    // An entry that was added or modified and so far only lives in memory.
    public class MemoryEntry : IEntry
    {
        private readonly byte[] data;
        public MemoryEntry(byte[] data) => this.data = data;
        public void CopyTo(Stream destination) => destination.Write(data, 0, data.Length);
        public byte[] Read() => data;
        public int Length => data.Length;
    }

    // An unmodified entry that is read from the source file only on demand.
    public class FileEntry : IEntry
    {
        private readonly Stream fileStream;
        private readonly long offset;
        private readonly int length;

        public FileEntry(Stream fileStream, long offset, int length)
        {
            this.fileStream = fileStream;
            this.offset = offset;
            this.length = length;
        }

        public void CopyTo(Stream destination)
        {
            // Stream.CopyTo(Stream, int) takes a buffer size, not a byte count,
            // and would copy everything up to the end of the file. Copy exactly
            // `length` bytes in chunks instead.
            fileStream.Position = offset;
            var buffer = new byte[Math.Min(length, 81920)];
            var remaining = length;
            while (remaining > 0)
            {
                var read = fileStream.Read(buffer, 0, Math.Min(buffer.Length, remaining));
                if (read == 0) throw new InvalidOperationException("Invalid binary format");
                destination.Write(buffer, 0, read);
                remaining -= read;
            }
        }

        public byte[] Read()
        {
            fileStream.Position = offset;
            var result = new byte[length];
            // Stream.Read may return fewer bytes than requested, so loop until done.
            var total = 0;
            while (total < length)
            {
                var read = fileStream.Read(result, total, length - total);
                if (read == 0) throw new InvalidOperationException("Invalid binary format");
                total += read;
            }
            return result;
        }

        public int Length => length;
    }

huangapple
  • Posted on June 27, 2023, 20:27:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76564866.html