What types/functions should I use to track a file position for random access?

Question

I recommend using the fseek and fgetc functions to navigate and read characters from the file. Converting between long int and size_t is safe as long as the value fits in both types, which holds for any file smaller than 2 GiB even where long is only 32 bits. To jump forward or backward an arbitrary number of characters, you can combine fseek with offsets computed from the pattern's length and indices.

Here's a brief summary of your options:

  • fseek and fgetc: This is a reasonable choice. Converting between long int and size_t is safe whenever the offset fits in a long, so check the range before converting.

  • fsetpos: It is more portable, as you mentioned, but it can only return to a position previously saved with fgetpos, so it does not support arbitrary jumps in the file.

  • Binary Stream: Opening the file as a binary stream is a valid approach and can help with compatibility. You can still perform the necessary calculations for jumping while using binary streams.

  • File Descriptor: Using file descriptors gives you more control, but you give up FILE's buffering and would have to buffer reads yourself. Converting between size_t and off_t needs a range check, because off_t is a signed type, so it is less straightforward than using streams.
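The fsetpos limitation in the list above can be seen in a short sketch. This is only an illustration, not code from the thread: `peek_nth` is a hypothetical helper name, and it assumes a stream that supports repositioning. It saves the current position with fgetpos, reads ahead, then returns with fsetpos. Note that the fpos_t value is opaque, so there is no portable way to add an offset to it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: return the character n positions ahead of the
 * current position (n = 0 is the next character) without moving the
 * stream. Returns EOF if the file ends first or repositioning fails. */
static int peek_nth(FILE *fp, int n)
{
    fpos_t saved;   /* opaque: can be stored and restored, not computed on */
    int c = EOF;

    assert(fp != NULL);
    if (fgetpos(fp, &saved) != 0)
        return EOF;
    for (int i = 0; i <= n; i++) {
        c = fgetc(fp);
        if (c == EOF)
            break;
    }
    /* fsetpos only accepts a value obtained earlier from fgetpos. */
    if (fsetpos(fp, &saved) != 0)
        return EOF;
    return c;
}
```

This is fine for "remember where I was" bookkeeping, but a Boyer-Moore shift needs a computed offset, which is what fseek provides.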

Overall, using fseek and fgetc with proper casting and calculations seems like a reasonable choice for implementing the Boyer-Moore algorithm for file input. It strikes a balance between simplicity and functionality.
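As a rough sketch of the fseek plus fgetc approach, the right-to-left comparison step might look like the code below. The helper names `byte_at` and `matches_backward` are made up for illustration, and the sketch assumes the file was opened in binary mode ("rb") so that every offset is a plain byte count. It is not a full Boyer-Moore implementation: the shift tables are left out.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: return the byte at offset `pos` of a stream
 * opened in binary mode, or EOF if the seek or the read fails. */
static int byte_at(FILE *fp, long pos)
{
    assert(fp != NULL);
    if (pos < 0 || fseek(fp, pos, SEEK_SET) != 0)
        return EOF;
    return fgetc(fp);
}

/* Hypothetical helper: compare `pat` (length m) against the text whose
 * last byte sits at offset `end`, scanning right to left.
 * Returns 1 on a full match, 0 on a mismatch or a failed read. */
static int matches_backward(FILE *fp, long end, const char *pat, long m)
{
    for (long i = 0; i < m; i++) {
        int c = byte_at(fp, end - i);
        if (c == EOF || c != (unsigned char)pat[m - 1 - i])
            return 0;
    }
    return 1;
}
```

Seeking before every byte is simple but not fast. Reading a window of the file into a buffer with fread and scanning that buffer backward would likely perform better, at the cost of more bookkeeping.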

English:

I'm trying to implement a simplified Boyer-Moore string search algorithm that reads its input text from a file. The algorithm requires that I start at a given file position and read its characters backwards, periodically jumping forward a precomputed number of characters. The jumps are computed based on the pattern's length and indices, so I was storing them as type size_t. What function should I use to read file characters at specific positions, and what type should I use to store these positions? I'm new to C, but these are the options I've considered:

Fseek

I could use fseek and getc to jump around the file, but this uses a long int as its character index. I don't know if it's safe to cast between this and a size_t, and regardless, the GNU C manual recommends against fseeking text streams for portability reasons.

Fsetpos

This is supposed to be more portable, but I don't think I can use this to jump forward or backward an arbitrary number of characters.

Binary Stream

I could get around the fseek compatibility issue by opening the file as a binary stream. But I don't know if this could cause other compatibility issues when dealing with pattern/input text, and anyways, this doesn't solve the issue of casting between long int and size_t.

File Descriptor

I could use file descriptors instead of streams. But then I need to cast between size_t and off_t, and I don't know how safe that is. I would also give up FILE's buffering, which I'm not sure is advisable. I could try to roll my own buffering, or maybe use an alternate library, but this seems like a massive pain.

My first implementation passed the input text as a command line argument, so it didn't deal with file IO at all. But I don't think this would scale well for large text inputs, and the more I've read about file IO the more stuck I feel. What do you suggest?

Answer 1

Score: 3

size_t ⇔ long conversions

If long is large enough for a file offset, and if your size_t value represents a file offset, then there's no problem with converting between these two. (And no need for an explicit cast.)
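If you would rather verify the "large enough" condition than assume it, a range check before the conversion is cheap. The following is a minimal sketch under that assumption. `size_to_long` and `skip_forward` are hypothetical helper names, not part of any library.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: convert a size_t offset to long only if it fits.
 * Returns 0 on success, -1 if the value cannot be represented. */
static int size_to_long(size_t n, long *out)
{
    assert(out != NULL);
    if (n > (size_t)LONG_MAX)
        return -1;
    *out = (long)n;
    return 0;
}

/* Hypothetical helper: move the stream forward by n bytes from the
 * current position. Returns 0 on success, -1 on failure. */
static int skip_forward(FILE *fp, size_t n)
{
    long off;

    if (size_to_long(n, &off) != 0)
        return -1;
    return fseek(fp, off, SEEK_CUR) == 0 ? 0 : -1;
}
```

The explicit cast in `size_to_long` is not required by the language. It is there only to make the narrowing visible to readers and to quiet compiler warnings.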

Portability

So is long actually large enough for a file offset? Not always. On Windows, long is well known to be only 32 bits, its minimum permitted size, even in 64-bit programs. So there could be portability issues if you plan on handling files of 2 GiB or more through the fseek interface, because those offsets do not fit in a 32-bit long. You should have no problems with smaller files.
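If files of 2 GiB or more do matter, one possible workaround is a thin wrapper over the platform's wide-offset calls. These are not mentioned in the answer above, so treat this as a sketch under assumptions: fseeko and ftello are POSIX functions, _fseeki64 and _ftelli64 come from the Microsoft C runtime, and `seek64` and `tell64` are hypothetical wrapper names.

```c
/* Feature macros must appear before the first #include to take effect. */
#define _POSIX_C_SOURCE 200809L  /* expose fseeko/ftello on POSIX systems */
#define _FILE_OFFSET_BITS 64     /* request a 64-bit off_t on 32-bit glibc */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#ifdef _WIN32
/* Microsoft C runtime: 64-bit offsets even though long is 32 bits. */
static int seek64(FILE *fp, int64_t off, int whence)
{
    assert(fp != NULL);
    return _fseeki64(fp, off, whence);
}
static int64_t tell64(FILE *fp)
{
    return _ftelli64(fp);
}
#else
#include <sys/types.h>
/* POSIX: fseeko and ftello use off_t instead of long. */
static int seek64(FILE *fp, int64_t off, int whence)
{
    assert(fp != NULL);
    return fseeko(fp, (off_t)off, whence);
}
static int64_t tell64(FILE *fp)
{
    return (int64_t)ftello(fp);
}
#endif
```

With a wrapper like this, the search code can keep its positions in int64_t and only range-check the size_t shift values when adding them.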

Jumping forward or backward an arbitrary number of characters

The CRLF line endings used in Windows will bite you here, no matter what interface you use.

It's very likely you can work around this problem. It depends on your definition of "character", and it might depend on how precise the jump needs to be. You haven't provided enough information for us to help you.

huangapple
  • Posted on June 13, 2023 at 02:17:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76459308.html