Issue with docx file comparison using md5 hash

Question

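import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import com.google.common.io.Files;
import java.io.File;
import java.io.IOException;

// Compares two files byte for byte via the MD5 digests of their raw contents (Guava).
public boolean compareFiles(File newFileInput, File oldFileInput) throws IOException {
    HashCode newFile = Files.asByteSource(newFileInput).hash(Hashing.md5());
    HashCode oldFile = Files.asByteSource(oldFileInput).hash(Hashing.md5());
    System.out.println("HashCode New File : " + newFile + "\nHashCode Old File : " + oldFile);
    return newFile.equals(oldFile);
}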

I used the code above to compute the hash codes of two different docx files in order to compare their content, style, etc. Even though the content and style are the same, the hash codes come out different.

Is there any way to compare docx files for content and style?

Answer 1

Score: 3

Determining whether two files of a complex type are roughly equal is always very tricky, if not impossible. A DOCX file contains much more than just some text and whether it is bold or not.

There are ways to make a document look exactly the same with different properties, and a lot of metadata is also saved in it (the author, for example). So it is not just a technical problem but more of a philosophical one. Let me give you an example:

You are expected to compare two cars and say whether they are the same or not. There are some obvious cases where they are objectively different, like a heavy lorry and a small city EV. But what if they are of the same type, but of a different color? Or same type, same color, but a different amount of fuel in the tank?

The same goes for DOCX. Same text, but different colors? Same content, but different authors? Same … but different …?

Maybe you can share more information about what you are trying to achieve; otherwise I doubt we can help you much further.

If you really need to compare two DOCX files somehow (or files of any other type of similar complexity), find a library that can parse them and build the comparison logic yourself. However, one might spend years doing so without a satisfactory result.
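As a minimal sketch of what "build the logic yourself" could look like, the following uses only the JDK to compare the text runs of two documents while ignoring formatting and metadata. It assumes the main part lives at `word/document.xml` (the usual location, though the package relationships formally decide this) and treats "equal" as "same concatenated `<w:t>` text", which is only one of many possible definitions. The class and method names are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: compares only the text runs of two DOCX files.
final class DocxTextCompare {

    // WordprocessingML main namespace, used by the <w:t> text elements.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Concatenates the text of every <w:t> element in word/document.xml.
    // Paragraph breaks, styles and metadata are deliberately ignored.
    static String visibleText(File docx) throws Exception {
        try (ZipFile zip = new ZipFile(docx)) {
            // Assumption: the main document part is stored under this name.
            ZipEntry main = zip.getEntry("word/document.xml");
            if (main == null) {
                throw new IOException("no word/document.xml in " + docx);
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(main)) {
                Document xml = factory.newDocumentBuilder().parse(in);
                NodeList runs = xml.getElementsByTagNameNS(W_NS, "t");
                StringBuilder text = new StringBuilder();
                for (int i = 0; i < runs.getLength(); i++) {
                    text.append(runs.item(i).getTextContent());
                }
                return text.toString();
            }
        }
    }

    // True when both files contain the same text, regardless of formatting.
    static boolean sameText(File a, File b) throws Exception {
        return visibleText(a).equals(visibleText(b));
    }
}
```

With this definition, a document whose text is split across several runs or partly bold still counts as equal to a plain one, which may or may not be what you want. Comparing styles as well would mean inspecting the run and paragraph properties too, and that is where the effort grows quickly.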

If you are more into dirty, hacky solutions, use a library to render the pages of each document to images and compare those images. This will ensure the pages look the same. However, depending on your definition of equality, that doesn't have to mean they are the same documents.
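The comparison half of that idea can be sketched with the JDK alone. The sketch below assumes each page has already been rendered to a `BufferedImage` by some rendering library (not shown, and not named in this answer); `PageImageCompare` and `samePixels` are hypothetical names.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: exact pixel comparison of two rendered pages.
final class PageImageCompare {

    // True when both images have the same dimensions and identical ARGB pixels.
    static boolean samePixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return false;
        }
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

An exact match like this is strict: differences in fonts, anti-aliasing or renderer version can change pixels without any visible change, so a per-pixel tolerance might be needed in practice.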

If you can choose another file format, it might be a good idea to do so. However, there will still be some tricky parts. Even Markdown (the language we use to format Q&A on this site) cannot be compared byte for byte.

This
**weird**
post

will render the same as

This **weird** post

into

This weird post.

Answer 2

Score: 0

There were multiple suggestions and answers to my question, thanks for that.
The reason for the mismatch between the docx files lies in the metadata: every time a doc/docx file is created, its timestamps change. I tried setting the timestamps (accessed, modified and created) of both files to the same values before comparing, but that didn't work. The reason is that, apart from these timestamps, there is a piece of metadata called Zip Modify Date, which isn't visible in the file properties. I found this timestamp to be one of the reasons for the hash mismatch. The base64-encoded strings were also different because of the zip timestamp.
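The effect can be reproduced without Word at all. The following is a small sketch, using only the JDK, that builds two in-memory zip archives with identical entry content but different entry modification times and prints their MD5 digests. The entry name, the XML string and the two timestamps are arbitrary values chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: same entry content, different zip modify date.
final class ZipTimestampDemo {

    // Builds a one-entry zip archive in memory; entryTimeMillis becomes
    // the entry's last-modification time stored in the zip headers.
    static byte[] zipOf(String entryName, String content, long entryTimeMillis) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            ZipEntry entry = new ZipEntry(entryName);
            entry.setTime(entryTimeMillis);
            zip.putNextEntry(entry);
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Lowercase hex MD5 of the given bytes.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        StringBuilder hex = new StringBuilder();
        for (byte b : MessageDigest.getInstance("MD5").digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document>same content</w:document>";
        byte[] first = zipOf("word/document.xml", xml, 1_600_000_000_000L);
        byte[] second = zipOf("word/document.xml", xml, 1_600_000_600_000L); // ten minutes later
        System.out.println("first : " + md5Hex(first));
        System.out.println("second: " + md5Hex(second));
    }
}
```

The two digests differ even though the entry content is identical, because the modification time is written into the archive's headers and therefore into the bytes being hashed.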

So, the options I had for the comparison were:

  1. Convert the docx file to an XML file.
  2. Treat the docx file as a zip archive, unzip it, and iterate through all the XML files,
    computing and comparing their hash codes (as suggested in one of the answers).

Option 2 was fine, but it required a lot of iteration, and unzipping would create a lot of folders.
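For what it's worth, option 2 does not have to extract anything to disk. Here is a rough sketch, using only `java.util.zip` from the JDK, that reads each entry straight from the archive and hashes its uncompressed bytes, so the zip-level timestamps never enter the digest. The class and method names are hypothetical, and this is not the approach I ended up using.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: per-entry content hashes of a docx (zip) archive.
final class DocxEntryHashes {

    // Maps each entry name to the MD5 of its uncompressed bytes.
    // Entry timestamps and other zip header fields are not hashed.
    static Map<String, String> entryHashes(File docx) throws IOException, NoSuchAlgorithmException {
        Map<String, String> hashes = new TreeMap<>();
        try (ZipFile zip = new ZipFile(docx)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                try (InputStream in = zip.getInputStream(entry)) {
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        md5.update(buffer, 0, read);
                    }
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : md5.digest()) {
                    hex.append(String.format("%02x", b));
                }
                hashes.put(entry.getName(), hex.toString());
            }
        }
        return hashes;
    }

    // True when both archives hold the same entry names with the same content.
    static boolean sameEntries(File a, File b) throws IOException, NoSuchAlgorithmException {
        return entryHashes(a).equals(entryHashes(b));
    }
}
```

Two documents saved at different times may still differ in metadata parts such as `docProps/core.xml`, which stores created and modified dates inside the XML itself, so in practice the map might need to be filtered down to the parts that matter before comparing.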

Option 1 was straightforward: I used an external library, docx4j, to convert the docx to XML, and then I could match the hash codes. It worked.

https://stackoverflow.com/questions/20364845/convert-docx-to-xml-file

I had to try different options because I was looking for the simplest, least complex way to compare the content and styles of Word documents.

huangapple
  • Posted on 2020-10-20 15:10:24
  • When reposting, please keep this link: https://go.coder-hub.com/64440179.html