Is there a way to check for duplicate lines within a file using Java?

Question

I'm trying to read each line of an .inp file and write every non-duplicate line to a new file. The problem with the code I have so far is that every line is written to the output file, whether or not it duplicates an earlier line. I'm using a Scanner object to read the file and a BufferedReader/FileWriter object to write the output file.

How do I avoid writing the duplicates?

String book = reader.nextLine();
boolean duplicate = false;

while (reader.hasNext() == true) {
    try {
        duplicate = reader.hasNext(book);

        if (duplicate == true) {
            book = reader.nextLine();
        } else {
            writer.write(book + "\n");
            book = reader.nextLine();
        }
    } catch (NoSuchElementException ex) {
        break;
    }
}

Answer 1

Score: 1

Depending on the situation:

  • If the duplicate lines are sequential, keep the previous line in a variable and compare each new line against it.
  • If the duplicate lines are not sequential and there are relatively (*) few short lines, store the lines you have already processed in a HashSet and, for each new line, check whether the set already contains() it.
  • If the duplicate lines are not sequential and there are relatively (*) few but long lines, store a hash (e.g. SHA-1) of each line in the HashSet instead of the complete line, and compare against that.
  • If the duplicate lines are not sequential and there are a lot of long lines, combine the techniques described above with some form of persistent database or data store.

(*) Relative to available memory
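
The first two cases in the list above can be sketched roughly as follows. This is a minimal illustration, not the asker's code: the class name `LineDedup` and both method names are made up for this example, and it reads with a `BufferedReader` rather than the `Scanner` used in the question. It relies on `Set.add()` returning `false` when the element is already present, which combines the contains() check and the insertion into one call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, named for illustration only.
class LineDedup {

    // Case 1: duplicates are sequential. Only the previous line is remembered.
    static void dropConsecutiveDuplicates(BufferedReader in, Writer out) throws IOException {
        String previous = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.equals(previous)) {
                out.write(line);
                out.write('\n');
            }
            previous = line;
        }
    }

    // Case 2: duplicates may appear anywhere. Every distinct line is kept in memory.
    static void dropAllDuplicates(BufferedReader in, Writer out) throws IOException {
        Set<String> seen = new HashSet<>();
        String line;
        while ((line = in.readLine()) != null) {
            // add() returns false if the line was already in the set
            if (seen.add(line)) {
                out.write(line);
                out.write('\n');
            }
        }
    }
}
```

To run this against real files, one option is to open them with `Files.newBufferedReader(path)` and `Files.newBufferedWriter(path)` inside a try-with-resources block and pass both to one of the methods; the paths themselves are whatever your .inp and output files are called.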

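For the long-lines case, a sketch of remembering a SHA-1 digest per line instead of the line itself might look like this. The class name `DigestSeenSet` and the method `firstOccurrence` are hypothetical, and encoding the digest as a Base64 string is just one convenient choice of set key. Two caveats: this only saves memory when lines are longer than the stored digest, and two different lines with the same digest would be treated as duplicates, which is extremely unlikely for ordinary text but not impossible.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, named for illustration only.
class DigestSeenSet {

    // Holds one short digest string per distinct line, not the line itself.
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true the first time a line is seen, false for any later repeat.
    boolean firstOccurrence(String line) {
        return seenDigests.add(sha1Key(line));
    }

    // Computes the SHA-1 digest of the line and encodes it as a Base64 string.
    private static String sha1Key(String line) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(line.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // Not expected in practice: SHA-1 is a standard algorithm on Java platforms.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

In a read loop, you would write a line to the output only when `firstOccurrence(line)` returns `true`.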
huangapple
  • Posted on September 8, 2020, at 11:41:51
  • When reposting, be sure to keep the link to this article: https://go.coder-hub.com/63786804.html