What is the best way to compare two documents in Java, without added complexity and with precise results?

Question

I have two Word documents that I am trying to compare in Java.
I tried using

an MD5 hash code

HashCode newFile = Files.asByteSource(newFileInput).hash(Hashing.md5());
HashCode oldFile = Files.asByteSource(oldFileInput).hash(Hashing.md5());

and also

boolean isEqual = FileUtils.contentEquals(oldFile , newFile);

Even though the contents are the same (I compared them using online tools and Beyond Compare),
both methods above still report a MISMATCH.

Is there a solution, or a way to compare any file type using an API in Java?
I need to do a deep comparison of two Word files, covering spaces, fonts, content, and so on.

Expected result: both files should match.

Answer 1

Score: 2

Even if both of your documents look the same, or even contain the same formatted content, a slight change such as the last-modified date will make a byte-level comparison fail. JSON documents are easier to compare, but Word documents are binary (a .docx file is a ZIP archive of XML parts), so the smallest change can alter the file's bytes completely.

So you have to do it the hard way: find a library that reads the content of the Word files, then compare the content of both files specifically.
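
As a rough illustration of comparing content rather than bytes, here is a minimal sketch that uses only JDK classes in place of a dedicated Word library. It assumes the files are .docx (ZIP archives whose main text sits in word/document.xml) and Java 9 or later for readAllBytes. The class name DocxText and the fixture helper minimalDocx are made up for this example. It compares text only, with no paragraph separators, so fonts and other formatting would still need a real library or a comparison of the underlying XML.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class DocxText {
    // Namespace of WordprocessingML elements such as <w:t> (text runs).
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the concatenated text of all <w:t> elements in word/document.xml.
    static String extractText(byte[] docxBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                byte[] xml = zip.readAllBytes(); // reads to the end of the current entry
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations so untrusted files cannot pull in external entities.
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                NodeList runs = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml))
                        .getElementsByTagNameNS(W_NS, "t");
                StringBuilder text = new StringBuilder();
                for (int i = 0; i < runs.getLength(); i++) {
                    text.append(runs.item(i).getTextContent());
                }
                return text.toString();
            }
            throw new IllegalArgumentException("No word/document.xml entry found");
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not read document", e);
        }
    }

    // Hypothetical fixture: a ZIP holding only word/document.xml (not a complete .docx).
    static byte[] minimalDocx(String text, long entryTimeMillis) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p><w:r><w:t>"
                + text + "</w:t></w:r></w:p></w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            ZipEntry entry = new ZipEntry("word/document.xml");
            entry.setTime(entryTimeMillis); // stored in the ZIP headers
            zip.putNextEntry(entry);
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] first = minimalDocx("Hello world", 1_600_000_000_000L);
        byte[] second = minimalDocx("Hello world", 1_600_000_100_000L);
        System.out.println("Raw bytes equal: " + Arrays.equals(first, second));
        System.out.println("Text equal: " + extractText(first).equals(extractText(second)));
    }
}
```

For real files, the byte arrays could come from Files.readAllBytes on each document path.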

Answer 2

Score: 0

There were multiple suggestions and answers to my question; thanks for those. The mismatch between the docx files comes from the metadata: every time a doc/docx file is created, its timestamps change. I tried setting the accessed, modified, and created timestamps of both files to the same values before comparing, but that didn't work. The reason is that, apart from these timestamps, there is a piece of metadata called the Zip Modify Date, which isn't visible in the file properties. I found this timestamp to be one of the reasons for the hash code mismatch. The Base64-encoded strings also differed because of the zip timestamp.
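
The effect of that ZIP timestamp can be reproduced with the JDK alone. The sketch below is not taken from the original post: it builds two in-memory ZIP archives with identical entry content but different entry modification times, then hashes both with MD5. The class and method names (ZipTimestampDemo, zipOf, md5) and the sample timestamps are invented for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipTimestampDemo {

    // Builds a one-entry ZIP in memory; entryTimeMillis is written into the ZIP headers.
    static byte[] zipOf(String entryName, String content, long entryTimeMillis) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            ZipEntry entry = new ZipEntry(entryName);
            entry.setTime(entryTimeMillis);
            zip.putNextEntry(entry);
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // MD5 digest of the given bytes, using the JDK's MessageDigest.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc>Hello</doc>";
        // Same entry name and content, timestamps 100 seconds apart.
        byte[] first = zipOf("word/document.xml", xml, 1_600_000_000_000L);
        byte[] second = zipOf("word/document.xml", xml, 1_600_000_100_000L);
        System.out.println("Same MD5: " + Arrays.equals(md5(first), md5(second)));
    }
}
```

Because the entry time is part of the archive's bytes, any whole-file hash or byte comparison sees two different files even though the entry content is identical.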

So the options I had for the comparison were:

1. Convert the docx file to an XML file.
2. Treat the docx file as a zip archive, unzip it, iterate through all the XML files, compute their hash codes, and compare the hash codes (as suggested in one of the answers).

Option 2 was good, but it required a lot of iteration, and unzipping would create a lot of folders.

Option 1 was straightforward: I tried it with an external library, docx4j, which converted the docx to XML, and then the hash codes matched. It worked.

Convert DOCX to XML file

I had to try different options because I was looking for the simplest, least complex way to compare the content and styles of Word documents.
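
For option 2, the unzipping does not have to touch the disk. The following is a sketch under stated assumptions, not the approach used in this answer: it reads each archive in memory with java.util.zip and compares entry names and entry bytes, so ZIP header timestamps are never part of the comparison. It assumes Java 9 or later for readAllBytes. The names ZipContentCompare, readEntries, sameContent, and zipOf, along with the ignoredNames parameter, are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipContentCompare {

    // Reads every non-directory entry into a name -> bytes map, skipping ignored names.
    static Map<String, byte[]> readEntries(byte[] zipBytes, Set<String> ignoredNames) {
        Map<String, byte[]> entries = new TreeMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && !ignoredNames.contains(entry.getName())) {
                    entries.put(entry.getName(), zip.readAllBytes());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    // True when both archives hold the same entry names with byte-identical content.
    static boolean sameContent(byte[] firstZip, byte[] secondZip, Set<String> ignoredNames) {
        Map<String, byte[]> first = readEntries(firstZip, ignoredNames);
        Map<String, byte[]> second = readEntries(secondZip, ignoredNames);
        if (!first.keySet().equals(second.keySet())) {
            return false;
        }
        for (Map.Entry<String, byte[]> entry : first.entrySet()) {
            if (!Arrays.equals(entry.getValue(), second.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical fixture: builds a ZIP from name -> text pairs with one shared entry time.
    static byte[] zipOf(Map<String, String> parts, long entryTimeMillis) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> part : parts.entrySet()) {
                ZipEntry entry = new ZipEntry(part.getKey());
                entry.setTime(entryTimeMillis);
                zip.putNextEntry(entry);
                zip.write(part.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> parts = new TreeMap<>();
        parts.put("word/document.xml", "<doc>Hello</doc>");
        byte[] first = zipOf(parts, 1_600_000_000_000L);
        byte[] second = zipOf(parts, 1_600_000_100_000L);
        System.out.println("Raw bytes equal: " + Arrays.equals(first, second));
        System.out.println("Entry content equal: "
                + sameContent(first, second, Collections.emptySet()));
    }
}
```

With real documents the byte arrays would come from Files.readAllBytes. Word usually records created and modified dates inside docProps/core.xml, so that entry may need to go into ignoredNames; that is worth verifying against your own files.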

Answer 3

Score: -1

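Read the files into strings and then compare the strings. You can convert a file to a string using a StringBuilder or the toString method.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadFileAsString {
    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader buf = new BufferedReader(
                new InputStreamReader(new FileInputStream("manifest.mf"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = buf.readLine()) != null) {
                sb.append(line).append("\n");
            }
        }

        String fileAsString = sb.toString();
        System.out.println("Contents : " + fileAsString);

        String one = "fdgfhdgkifgh";
        String two = "fdgfhdgkifgh";

        // equals() compares the characters; == only compares references, so it is not needed
        if (one.equals(two)) {
            // the strings match
        } else {
            // the strings differ
        }
    }
}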

Answer 4

Score: -3

In Java, you can compare two Strings with string1.equals(string2).

So in your case, you need the line newFile.equals(oldFile).

huangapple
  • Posted on October 16, 2020, 13:21:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64383310.html