Sparse files are huge with io.Copy()

Question

I want to copy files from one place to another and the problem is I deal with a lot of sparse files.

Is there any (easy) way of copying sparse files without becoming huge at the destination?

My basic code:

out, err := os.Create(bricks[0] + "/" + fileName)
in, err := os.Open(event.Name)
io.Copy(out, in)

Answer 1

Score: 8

Some background theory

Note that io.Copy() pipes raw bytes, which is understandable once you consider that it pipes data from an io.Reader to an io.Writer, which provide Read([]byte) and Write([]byte), respectively.
As such, io.Copy() is able to deal with absolutely any source providing
bytes and absolutely any sink consuming them.

On the other hand, the location of the holes in a file is "side-channel" information which "classic" syscalls such as read(2) hide from their users: a read over a hole simply returns zero bytes.
io.Copy() is not able to convey such side-channel information in any way.

In other words, file sparseness was originally just a way to store data efficiently behind the user's back.

So, no, there's no way io.Copy() could deal with sparse files by itself.

What to do about it

You'd need to go one level deeper and implement all this using the syscall package and some manual tinkering.

To work with holes, you should use the SEEK_HOLE and SEEK_DATA special values for the lseek(2) syscall, which, while formally non-standard, are supported by all major platforms.

Unfortunately, as of Go 1.8.1, support for those "whence" values is present
neither in the stock syscall package
nor in the golang.org/x/sys tree.

But fear not, there are two easy steps:

  1. First, the stock syscall.Seek() is actually mapped to lseek(2)
    on the relevant platforms.

  2. Next, you'd need to figure out the correct values for SEEK_HOLE and
    SEEK_DATA for the platforms you need to support.

    > Note that they are free to be different between different platforms!

    Say, on my Linux system I can simply run

     $ grep -E 'SEEK_(HOLE|DATA)' </usr/include/unistd.h
     #  define SEEK_DATA     3       /* Seek to next data.  */
     #  define SEEK_HOLE     4       /* Seek to next hole.  */

    ...to figure out the values for these symbols.

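Now, say, you create a Linux-specific file in your package
containing something like

//go:build linux
// +build linux

package sparse // hypothetical name; use your own package's name

// lseek(2) "whence" values, as found in the Linux headers.
const (
    SEEK_DATA = 3 // seek to the next region containing data
    SEEK_HOLE = 4 // seek to the next hole
)

and then use these values with syscall.Seek().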

The file descriptor to pass to syscall.Seek() and friends
can be obtained from an opened file using the Fd() method
of os.File values.

The pattern to use when reading is to detect regions containing data and read the data from them; see this for one example.

Note that this deals with reading sparse files; if you want to actually transfer them as sparse, that is, keeping this property, the situation is more complicated: it appears to be even less portable, so some research and experimentation is due.

On Linux, it appears you could try to use fallocate(2) with
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE to try to punch a hole at the
end of the file you're writing to; if that legitimately fails
(with syscall.EOPNOTSUPP), you just shovel as many zeroed blocks to the destination file as are covered by the hole you're reading, in the hope
that the OS will do the right thing and convert them to a hole by itself.

Note that some filesystems do not support holes at all, even as a concept.
One example is the filesystems in the FAT family.
What I'm leading you to is that the inability to create a sparse file might
actually be a property of the target filesystem in your case.

You might find Go issue #13548 "archive/tar: add support for writing tar containing sparse files" to be of interest.


One more note: you might also consider checking whether the destination directory resides on the same filesystem as the source file and, if it does, using syscall.Rename() (on POSIX systems)
or os.Rename() to just move the file to a different directory without
actually copying its data.

Answer 2

Score: 0

You don't need to resort to syscalls.

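package main

import (
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Create("/tmp/sparse.dat")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	f.Write([]byte("start"))
	// Seeking past the end before writing leaves a hole in between.
	f.Seek(1024*1024*10, io.SeekStart)
	f.Write([]byte("end"))
}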

Then you'll see:

$ ls -l /tmp/sparse.dat
-rw-rw-r-- 1 soren soren 10485763 Jun 25 14:29 /tmp/sparse.dat
$ du /tmp/sparse.dat
8	/tmp/sparse.dat

It's true that you can't use io.Copy as is. Instead, you need to implement an alternative to io.Copy which reads a chunk from src and checks whether it is all '\0'. If it is, just call dst.Seek(int64(len(chunk)), io.SeekCurrent) to skip past that part in dst. That particular implementation is left as an exercise for the reader.

huangapple
  • Posted on March 27, 2017, 06:19:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/43035271.html