How can I compare two files in golang?


Question

With Python I can do the following:

equals = filecmp.cmp(file_old, file_new)

Is there a built-in function that does this in Go? I googled it, but without success.

I could use a hash function from the hash/crc32 package, but that is more work than the Python code above.

Answer 1

Score: 13

To complete @captncraig's answer: if you want to know whether two paths refer to the same file, you can use the SameFile(fi1, fi2 FileInfo) function from the os package.

> SameFile reports whether fi1 and fi2 describe the same file. For example, on Unix this means that the device and inode fields of the two underlying structures are identical;

Otherwise, if you want to check the files' contents, here is a solution that checks the two files line by line, which avoids loading the entire files into memory.

First try: https://play.golang.org/p/NlQZRrW1dT


EDIT: Read in chunks of bytes, and fail fast if the files do not have the same size. https://play.golang.org/p/YyYWuCRJXV

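import (
	"bytes"
	"io"
	"log"
	"os"
)

const chunkSize = 64000

// deepCompare reports whether file1 and file2 have identical contents.
func deepCompare(file1, file2 string) bool {
	// Check file size ...

	f1, err := os.Open(file1)
	if err != nil {
		log.Fatal(err)
	}
	defer f1.Close()

	f2, err := os.Open(file2)
	if err != nil {
		log.Fatal(err)
	}
	defer f2.Close()

	// Allocate the buffers once and reuse them for every chunk.
	b1 := make([]byte, chunkSize)
	b2 := make([]byte, chunkSize)

	for {
		// io.ReadFull fills the whole buffer unless the file ends first, so a
		// short read cannot make identical files look different.
		n1, err1 := io.ReadFull(f1, b1)
		n2, err2 := io.ReadFull(f2, b2)

		// io.EOF means nothing was read; io.ErrUnexpectedEOF means the file
		// ended partway through the chunk. Anything else is a real error.
		end1 := err1 == io.EOF || err1 == io.ErrUnexpectedEOF
		end2 := err2 == io.EOF || err2 == io.ErrUnexpectedEOF
		if (err1 != nil && !end1) || (err2 != nil && !end2) {
			log.Fatal(err1, err2)
		}

		// Compare only the bytes that were actually read.
		if !bytes.Equal(b1[:n1], b2[:n2]) {
			return false
		}

		if end1 || end2 {
			return end1 && end2
		}
	}
}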

Answer 2

Score: 11

I am not sure that function does what you think it does. From the docs,

> Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal.

Your call compares only the os.stat signature, which includes just:

  1. File type (taken from the file mode)
  2. Modification time
  3. Size

You can get all three of these in Go from the os.Stat function. Matching signatures really only suggest that the two paths are literally the same file, symlinks to the same file, or a copy of that file.

If you want to go deeper, you can open both files and compare their contents (the Python version reads 8k at a time).

You could use a CRC or MD5 to hash both files, but if the files differ near the beginning of a long file, you want to stop early. I would recommend reading some number of bytes at a time from each reader and comparing the chunks with bytes.Compare.

Answer 3

Score: 9

How about using bytes.Equal?

package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
)

func main() {
	// Per comment, better to not read an entire file into memory;
	// this is simply a trivial example.
	// os.ReadFile replaces the deprecated ioutil.ReadFile (Go 1.16+).
	f1, err1 := os.ReadFile("lines1.txt")
	if err1 != nil {
		log.Fatal(err1)
	}

	f2, err2 := os.ReadFile("lines2.txt")
	if err2 != nil {
		log.Fatal(err2)
	}

	fmt.Println(bytes.Equal(f1, f2)) // Per comment, this is significantly more performant.
}

Answer 4

Score: 1

You can use a package like equalfile.

Main API:

func CompareFile(path1, path2 string) (bool, error)

Godoc: https://godoc.org/github.com/udhos/equalfile

Example:

package main

import (
	"fmt"
	"os"

	"github.com/udhos/equalfile"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Printf("usage: equal file1 file2\n")
		os.Exit(2)
	}

	file1 := os.Args[1]
	file2 := os.Args[2]

	equal, err := equalfile.CompareFile(file1, file2)
	if err != nil {
		fmt.Printf("equal: error: %v\n", err)
		os.Exit(3)
	}

	if equal {
		fmt.Println("equal: files match")
		os.Exit(0)
	}

	fmt.Println("equal: files differ")
	os.Exit(1)
}

Answer 5

Score: 1

After checking the existing answers, I whipped up a simple package for comparing arbitrary (finite) io.Readers and files, as a convenience: https://github.com/hlubek/readercomp

Example:

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/hlubek/readercomp"
)

func main() {
	result, err := readercomp.FilesEqual(os.Args[1], os.Args[2])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result)
}

Answer 6

Score: 0

Here's an io.Reader I whipped up. You can call _, err := io.Copy(ioutil.Discard, newCompareReader(a, b)) to get an error if the two streams don't have equal contents. This implementation is optimized for performance by limiting unnecessary data copying.

package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

type compareReader struct {
	a    io.Reader
	b    io.Reader
	bBuf []byte // need buffer for comparing B's data with one that was read from A
}

func newCompareReader(a, b io.Reader) io.Reader {
	return &compareReader{
		a: a,
		b: b,
	}
}

func (c *compareReader) Read(p []byte) (int, error) {
	if c.bBuf == nil {
		// assuming p's len() stays the same, so we can optimize for both of their buffer
		// sizes to be equal
		c.bBuf = make([]byte, len(p))
	}

	// read only as much data as we can fit in both p and bBuf
	// (min is a builtin since Go 1.21; older versions need a small helper)
	readA, errA := c.a.Read(p[0:min(len(p), len(c.bBuf))])
	if readA > 0 {
		// bBuf is guaranteed to have at least readA space
		if _, errB := io.ReadFull(c.b, c.bBuf[0:readA]); errB != nil { // docs: "EOF only if no bytes were read"
			if errB == io.ErrUnexpectedEOF {
				return readA, errors.New("compareReader: A had more data than B")
			} else {
				return readA, fmt.Errorf("compareReader: read error from B: %w", errB)
			}
		}

		if !bytes.Equal(p[0:readA], c.bBuf[0:readA]) {
			return readA, errors.New("compareReader: bytes not equal")
		}
	}

	if errA == io.EOF {
		// in happy case expecting EOF from B as well. might be extraneous call b/c we might've
		// got it already from the ReadFull call above, but it's easier to check here
		readB, errB := c.b.Read(c.bBuf)
		if readB > 0 {
			return readA, errors.New("compareReader: B had more data than A")
		}

		if errB != io.EOF {
			return readA, fmt.Errorf("compareReader: got EOF from A but not from B: %w", errB)
		}
	}

	return readA, errA
}

Answer 7

Score: 0

> The standard way is to stat them and use os.SameFile.
>
> -- https://groups.google.com/g/golang-nuts/c/G-5D6agvz2Q/m/2jV_6j6LBgAJ

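os.SameFile is only loosely comparable to Python's filecmp.cmp(f1, f2) (i.e. shallow=True): both rely on the file info obtained by stat rather than on the file contents. The difference is that SameFile checks whether the two infos describe the same underlying file, while the shallow filecmp check compares file type, size, and modification time.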

> func SameFile(fi1, fi2 FileInfo) bool
>
> SameFile reports whether fi1 and fi2 describe the same file. For example, on Unix this means that the device and inode fields of the two underlying structures are identical; on other systems the decision may be based on the path names. SameFile only applies to results returned by this package's Stat. It returns false in other cases.

But if you actually want to compare the files' contents, you'll have to do it yourself.

Answer 8

Score: 0

This does a piece-by-piece comparison of the two files, quitting as soon as it knows the two files are different. It only needs standard library functions.

It improves on an earlier answer by using io.ReadFull() to handle the short-read problem raised by mat007 and christopher. It also avoids reallocating the buffers.

package util

import (
	"bytes"
	"io"
	"os"
)

// Decide if two files have the same contents or not.
// chunkSize is the size of the blocks to scan by; pass 0 to get a sensible default.
// *Follows* symlinks.
//
// May return an error if something else goes wrong; in this case, you should ignore the value of 'same'.
//
// derived from https://stackoverflow.com/a/30038571
// under CC-BY-SA-4.0 by several contributors
func FileCmp(file1, file2 string, chunkSize int) (same bool, err error) {

	if chunkSize == 0 {
		chunkSize = 4 * 1024
	}

	// shortcuts: check file metadata
	stat1, err := os.Stat(file1)
	if err != nil {
		return false, err
	}

	stat2, err := os.Stat(file2)
	if err != nil {
		return false, err
	}

	// are the inputs literally the same file?
	if os.SameFile(stat1, stat2) {
		return true, nil
	}

	// do inputs at least have the same size?
	if stat1.Size() != stat2.Size() {
		return false, nil
	}

	// long way: compare contents
	f1, err := os.Open(file1)
	if err != nil {
		return false, err
	}
	defer f1.Close()

	f2, err := os.Open(file2)
	if err != nil {
		return false, err
	}
	defer f2.Close()

	b1 := make([]byte, chunkSize)
	b2 := make([]byte, chunkSize)
	for {
		n1, err1 := io.ReadFull(f1, b1)
		n2, err2 := io.ReadFull(f2, b2)

		// https://pkg.go.dev/io#Reader
		// > Callers should always process the n > 0 bytes returned
		// > before considering the error err. Doing so correctly
		// > handles I/O errors that happen after reading some bytes
		// > and also both of the allowed EOF behaviors.

		if !bytes.Equal(b1[:n1], b2[:n2]) {
			return false, nil
		}

		if (err1 == io.EOF && err2 == io.EOF) || (err1 == io.ErrUnexpectedEOF && err2 == io.ErrUnexpectedEOF) {
			return true, nil
		}

		// some other error, like a dropped network connection or a bad transfer
		if err1 != nil {
			return false, err1
		}
		if err2 != nil {
			return false, err2
		}
	}
}

It surprised me that this wasn't anywhere in the standard library.

Answer 9

Score: -1

Something like this should do the trick, and should be memory-efficient compared to the other answers. I looked at github.com/udhos/equalfile and it seemed like overkill to me. Before you call compare() here, you should make two os.Stat() calls and compare the file sizes as an early-out fast path.

The reason to use this implementation over the other answers is that you don't want to hold the entirety of both files in memory if you don't have to. You can read a chunk from A and from B, compare them, and then continue with the next chunk, one buffer-load from each file at a time, until you are done. You just have to be careful, because you may read 50 bytes from A and then 60 bytes from B if a read comes back short for some reason.

This implementation assumes a Read() call will not return N > 0 (some bytes read) together with an error != nil. This is how os.File behaves, but other implementations of Read, such as net.TCPConn, may behave differently.

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
)

var errNotSame = errors.New("File contents are different")

func compare(p1, p2 string) error {
	var (
		buf1 [8192]byte
		buf2 [8192]byte
	)

	fh1, err := os.Open(p1)
	if err != nil {
		return err
	}
	defer fh1.Close()

	fh2, err := os.Open(p2)
	if err != nil {
		return err
	}
	defer fh2.Close()

	for {
		n1, err1 := fh1.Read(buf1[:])
		n2, err2 := fh2.Read(buf2[:])

		if err1 == io.EOF && err2 == io.EOF {
			// files are the same!
			return nil
		}
		if err1 == io.EOF || err2 == io.EOF {
			return errNotSame
		}
		if err1 != nil {
			return err1
		}
		if err2 != nil {
			return err2
		}

		// short read on n1
		for n1 < n2 {
			more, err := fh1.Read(buf1[n1:n2])
			if err == io.EOF {
				return errNotSame
			}
			if err != nil {
				return err
			}
			n1 += more
		}
		// short read on n2
		for n2 < n1 {
			more, err := fh2.Read(buf2[n2:n1])
			if err == io.EOF {
				return errNotSame
			}
			if err != nil {
				return err
			}
			n2 += more
		}
		if n1 != n2 {
			// should never happen
			return fmt.Errorf("file compare reads out of sync: %d != %d", n1, n2)
		}

		if !bytes.Equal(buf1[:n1], buf2[:n2]) {
			return errNotSame
		}
	}
}

huangapple
  • Published on April 8, 2015, 10:52:26
  • If you repost this article, please keep the link to it: https://go.coder-hub.com/29505089.html