English:
What types/functions should I use to track a file position for random access?
Question
I recommend using the fseek and fgetc functions to navigate and read characters from the file. Converting between long int and size_t is safe in this context as long as the value fits in the destination type. To jump forward or backward an arbitrary number of characters, you can combine fseek with offsets computed from the pattern's length and indices.
Here's a brief summary of your options:
- Fseek and Fgetc: This is a reasonable choice, and converting between long int and size_t should generally be safe in practice, provided the offsets fit in a long.
- Fsetpos: While it's more portable, as you mentioned, it does not allow arbitrary jumps in the file: it can only return to a position previously saved by fgetpos.
- Binary Stream: Opening the file as a binary stream is a valid approach and can help with compatibility. You can still perform the necessary calculations for jumping while using binary streams.
- File Descriptor: Using file descriptors gives you more control but means giving up FILE's buffering, so you may have to handle buffering yourself. Converting between size_t and off_t can be done with appropriate range checks, but it's less straightforward than using streams.
Overall, using fseek and fgetc with checked conversions and offset calculations seems like a reasonable choice for implementing the Boyer-Moore algorithm for file input. It strikes a balance between simplicity and functionality.
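As a rough illustration of the fseek and fgetc combination, here is a minimal sketch of a helper that fetches a single byte at a given offset. The name read_byte_at is hypothetical, and the sketch assumes the stream was opened in binary mode ("rb") so that offsets are plain byte counts.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from the original post): return the byte at
   offset `pos` of a stream assumed to be opened in binary mode, or EOF
   if the offset is negative, the seek fails, or the read fails. */
int read_byte_at(FILE *fp, long pos)
{
    assert(fp != NULL);
    if (pos < 0 || fseek(fp, pos, SEEK_SET) != 0)
        return EOF;
    return fgetc(fp);
}
```

A caller could walk backwards from a position by calling read_byte_at(fp, pos) with decreasing values of pos, and jump forward by adding the precomputed shift to pos.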
English:
I'm trying to implement a simplified Boyer-Moore string search algorithm that reads its input text from a file. The algorithm requires that I start at a given file position and read its characters backwards, periodically jumping forward a precomputed number of characters. The jumps are computed based on the pattern's length and indices, so I was storing them as type size_t. What function should I use to read file characters at specific positions, and what type should I use to store these positions? I'm new to C, but these are the options I've considered:
Fseek
I could use fseek and getc to jump around the file, but this uses a long int as its character index. I don't know if it's safe to cast between this and a size_t, and regardless, the GNU C manual recommends against fseeking text streams for portability reasons.
Fsetpos
This is supposed to be more portable, but I don't think I can use this to jump forward or backward an arbitrary number of characters.
Binary Stream
I could get around the fseek compatibility issue by opening the file as a binary stream. But I don't know if this could cause other compatibility issues when dealing with pattern/input text, and anyways, this doesn't solve the issue of casting between long int and size_t.
File Descriptor
I could use file descriptors instead of streams. But then I need to cast between size_t and off_t, and I don't know how safe that is. I would also give up FILE's buffering, which I'm not sure is advisable. I could try to roll my own buffering, or maybe use an alternate library, but this seems like a massive pain.
My first implementation passed the input text as a command line argument, so it didn't deal with file IO at all. But I don't think this would scale well for large text inputs, and the more I've read about file IO the more stuck I feel. What do you suggest?
Answer 1
Score: 3
size_t ⇔ long conversions
If long is large enough for a file offset, and if your size_t value represents a file offset, then there's no problem with converting between these two. (And no need for an explicit cast.)
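The "if long is large enough" condition can be checked at run time. The following is a minimal sketch of range-checked conversions in both directions; the helper names size_to_long and long_to_size are hypothetical and are not part of any standard library.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store n in *out and return 0 if n fits in a long;
   otherwise return -1 and leave *out untouched. */
int size_to_long(size_t n, long *out)
{
    assert(out != NULL);
    if (n > (size_t)LONG_MAX)
        return -1;
    *out = (long)n;
    return 0;
}

/* Hypothetical helper: the reverse direction. Negative values (such as
   the -1L that ftell returns on failure) are rejected. */
int long_to_size(long v, size_t *out)
{
    assert(out != NULL);
    if (v < 0 || (unsigned long)v > SIZE_MAX)
        return -1;
    *out = (size_t)v;
    return 0;
}
```

Checking once at the boundary, for example right before calling fseek or right after calling ftell, keeps the rest of the search code free to work in size_t.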
Portability
So is long actually large enough for a file offset? On Windows, long is well known to be just 32 bits, its minimum size, even in 64-bit programs. So there could be portability issues if you plan on handling files with a size of 2 GiB or greater while using the fseek interface. You should have no problems with smaller files.
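If files of 2 GiB or more do matter, one possible workaround is to wrap the platform-specific 64-bit seek functions. The sketch below is an assumption on my part rather than something the answer proposes: it uses fseeko and ftello from POSIX and _fseeki64 and _ftelli64 from the Microsoft C runtime, and the wrapper names seek64 and tell64 are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* exposes fseeko/ftello on POSIX systems */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical wrapper: seek using a 64-bit offset.
   Note: on 32-bit POSIX builds off_t may still be 32 bits unless the
   program is compiled with -D_FILE_OFFSET_BITS=64. */
int seek64(FILE *fp, int64_t offset, int whence)
{
    assert(fp != NULL);
#ifdef _WIN32
    return _fseeki64(fp, offset, whence);
#else
    return fseeko(fp, (off_t)offset, whence);
#endif
}

/* Hypothetical wrapper: report the current position as a 64-bit value,
   or -1 on failure. */
int64_t tell64(FILE *fp)
{
    assert(fp != NULL);
#ifdef _WIN32
    return _ftelli64(fp);
#else
    return (int64_t)ftello(fp);
#endif
}
```

With wrappers like these, positions can be stored as int64_t throughout, and the 32-bit long on Windows stops being a limit.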
Jumping forward or backward an arbitrary number of characters
The CRLF line endings used in Windows will bite you here, no matter what interface you use.
It's very likely you can work around this problem. It depends on your definition of "character", and it might depend on how precise the jump needs to be. You haven't provided enough information for us to help you.
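One possible workaround, assuming a "character" means a single byte and the pattern is matched against the raw file contents, is to open the file in binary mode so that a CRLF pair simply counts as two bytes. The sketch below shows the right-to-left comparison step of such a search; match_backward is a hypothetical name, and seeking once per byte is chosen for clarity rather than speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: compare pat[0..m-1], right to left, against the m
   bytes of fp that end at offset `last` (inclusive). fp is assumed to be
   opened in binary mode.
   Returns 1 on a full match, 0 on a mismatch (the pattern index is
   stored in *mismatch), and -1 on a bad range or an I/O error. */
int match_backward(FILE *fp, long last, const char *pat, size_t m,
                   size_t *mismatch)
{
    assert(fp != NULL && pat != NULL && mismatch != NULL);
    if (m == 0 || last < 0 || (size_t)last + 1 < m)
        return -1;
    for (size_t i = m; i-- > 0; ) {
        /* File offset that lines up with pattern index i. */
        long pos = last - (long)(m - 1 - i);
        if (fseek(fp, pos, SEEK_SET) != 0)
            return -1;
        int c = fgetc(fp);
        if (c == EOF)
            return -1;
        if ((unsigned char)pat[i] != (unsigned char)c) {
            *mismatch = i;
            return 0;
        }
    }
    return 1;
}
```

On a mismatch, the caller would use the reported index and the mismatching byte to look up the precomputed shift and then advance `last` by that amount.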