Is there a way to check for duplicate lines within a file using Java?
Question
I'm trying to read each line of an .inp file and write every non-duplicate line to a new file. The problem with my code so far is that every line ends up in the output file, whether or not it duplicates an earlier line. I'm using a Scanner object to read the file and a BufferedWriter/FileWriter object to write the output file.
How do I avoid writing the duplicates?
String book = reader.nextLine();
boolean duplicate = false;
while (reader.hasNext() == true) {
    try {
        duplicate = reader.hasNext(book);
        if (duplicate == true) {
            book = reader.nextLine();
        } else {
            writer.write(book + "\n");
            book = reader.nextLine();
        }
    } catch (NoSuchElementException ex) {
        break;
    }
}
Answer 1
Score: 1
Depending on the situation:
- If the duplicate lines are sequential, keep the previous line in a variable and compare each new line against it.
- If the duplicate lines are not sequential and there are relatively <sup>(*)</sup> few short lines, store the lines you have already processed in a `HashSet` and, for each new line, check whether the set already `contains()` it.
- If the duplicate lines are not sequential and there are relatively <sup>(*)</sup> few but long lines, store a hash (e.g. SHA-1) of each line in the `HashSet` instead of the complete line, and compare against that.
- If the duplicate lines are not sequential and there are a lot of long lines, combine the techniques described above with some form of persistent database or data store.

<sup><sup>(*)</sup> Relative to available memory</sup>
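
For context on why the code in the question writes every line: `Scanner.hasNext(String)` checks whether the next token matches the given string interpreted as a regular-expression pattern, so it never looks at lines that were already read. The first three cases above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `LineDedup` and its method names are made up for this example, and `main` assumes the whole input file fits in memory and is UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; the names are illustrative and not from the original post.
class LineDedup {

    // Case 1: duplicates are sequential, so remembering the previous line is enough.
    static List<String> dropConsecutive(List<String> lines) {
        List<String> out = new ArrayList<>();
        String previous = null;
        for (String line : lines) {
            if (!line.equals(previous)) {
                out.add(line);
            }
            previous = line;
        }
        return out;
    }

    // Case 2: duplicates can appear anywhere; remember every line seen so far.
    static List<String> dropAll(List<String> lines) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (!seen.contains(line)) {
                seen.add(line);
                out.add(line);
            }
        }
        return out;
    }

    // Case 3: long lines; store a SHA-1 digest of each line instead of the line itself.
    // Two different lines with the same digest would be treated as duplicates,
    // which is extremely unlikely for ordinary text but not impossible.
    static List<String> dropAllByDigest(List<String> lines) {
        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
        Set<String> seenDigests = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            // digest(byte[]) hashes the input and resets the MessageDigest for reuse.
            byte[] digest = sha1.digest(line.getBytes(StandardCharsets.UTF_8));
            String key = Base64.getEncoder().encodeToString(digest);
            if (!seenDigests.contains(key)) {
                seenDigests.add(key);
                out.add(line);
            }
        }
        return out;
    }

    // Assumed usage: java LineDedup <input file> <output file>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java LineDedup <input> <output>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), dropAll(lines), StandardCharsets.UTF_8);
    }
}
```

All three methods keep the first occurrence of each line and preserve the original order. For the fourth case (many long lines), the in-memory `HashSet` would need to be replaced by some persistent store, which is outside the scope of this sketch.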