Remove repeated content Java

Question

I have this text, and I need to filter out the repeated lines and words.
I don't know whether there's a better way than what I'm doing.

    00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
    00:00:00,413|03:50:25,600|ISDB|PERFEITAMENTE. EU
    00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
    00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
    00:00:01,315|00:00:02,218|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
    00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
    00:00:02,218|00:00:02,398|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
    00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
    00:00:02,398|00:00:02,759|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
    00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
    00:00:02,759|00:00:03,274|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
    00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
    00:00:03,274|00:00:04,357|ISDB|BOBAS PARA AMIGOS E AO INV?
    00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A?
    00:00:04,357|00:00:05,259|ISDB|BOBAS PARA AMIGOS E AO INV?
    00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
    00:00:05,259|00:00:05,414|ISDB|DISSO TROUXERAM ISSO A? ELES
    00:00:05,414|00:00:05,775|ISDB|DISSO TROUXERAM ISSO A? ELES
    00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
    00:00:05,775|00:00:06,677|ISDB|DISSO TROUXERAM ISSO A? ELES
    00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
    00:00:06,677|00:00:06,858|ISDB|DISSO TROUXERAM ISSO A? ELES
    00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
    00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
    00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT?QUE
    00:00:07,914|00:00:07,916|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
    00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT?QUE EU
    00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
    00:00:08,997|00:00:09,178|ISDB|PAREDE, PARECE AT?QUE EU GOSTO

I'm using the following code to put these lines in a HashSet so that they aren't repeated.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.util.HashSet;
    import java.util.Set;

    public class Testecc {
        public static void main(String args[]) throws Exception {
            String filePath = "C://teste//teste1.txt";
            String input = null;
            // Buffered reader
            BufferedReader br = new BufferedReader(new FileReader(filePath));
            // Note: this outer loop reads and discards the first two lines of the file
            while ((input = br.readLine()) != null) {
                input = br.readLine();
                // FileWriter (creating the file)
                FileWriter writer = new FileWriter("C://teste//teste.txt");
                // HashSet to eliminate duplicates
                Set<String> set = new HashSet<>();
                String line;
                // Adding lines to the HashSet
                while ((line = br.readLine()) != null) {
                    String line1 = line.substring(0, 31); // timestamps and "ISDB|" prefix
                    String line2 = line.substring(31);    // message text
                    System.out.println(line);
                    // set.add returns false when the message was already seen
                    if (set.add(line2)) {
                        writer.append(line1 + line2 + "\n");
                    }
                }
                writer.flush();
                System.out.println("Pronto!");
            }
        }
    }

With this I removed the duplicated lines like this:

    00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
    00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
    00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
    00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
    00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
    00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INVS
    00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A�.
    00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A�. ELES
    00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
    00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
    00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
    00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
    00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT QUE
    00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT QUE EU
    00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT QUE EU GOSTO

But I also need to remove the repeated words.

I'm really out of ideas.

How can I do that?

Answer 1

Score: 1

Use a map that holds the lines grouped by a key. The key would be the beginning of the message, starting from the words you are interested in: say, the first 5 letters, or the first word as in the code below. Add each line to the map, and if a line is longer than the one previously stored under the same key, replace it.

    // Requires java.io.BufferedReader, java.io.FileReader,
    // java.util.LinkedHashMap and java.util.Map; filepath is the input file path.
    try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {
        final Map<String, String> map = new LinkedHashMap<>();
        br.lines().forEach(line -> {
            // Message text: everything after the last pipe
            String message = line.substring(line.lastIndexOf("|") + 1);
            if (message.isEmpty()) {
                return;
            }
            // Group lines by the first word of the message
            String key = message.split(" ")[0];
            if (map.get(key) == null) {
                map.put(key, line);
            } else if (map.get(key).length() < line.length()) {
                // Keep the longer line; remove + put moves the entry to the end
                map.remove(key);
                map.put(key, line);
            }
        });
        map.forEach((k, v) -> System.out.println(v));
    }

The above code will give you the following output.

    00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
    00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
    00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
    00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
    00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
    00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
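
The description above also mentions keying on the first few letters of the message rather than on its first word. Here is a minimal, self-contained sketch of that variant. The class name `PrefixDedup`, the helper `dedupe`, and the prefix length of 5 are assumptions made for illustration, not part of the original answer. It uses `Map.merge`, which keeps each key at the position where it first appeared instead of moving it to the end.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PrefixDedup {
    // Hypothetical helper: for each message prefix, keeps the longest line seen.
    static List<String> dedupe(List<String> lines, int prefixLength) {
        Map<String, String> longest = new LinkedHashMap<>();
        for (String line : lines) {
            // Message text: everything after the last pipe
            String message = line.substring(line.lastIndexOf('|') + 1);
            if (message.isEmpty()) {
                continue;
            }
            // Key: the first prefixLength characters (or the whole message if shorter)
            String key = message.substring(0, Math.min(prefixLength, message.length()));
            // Store the longer of the two lines; on a tie the earlier line stays
            longest.merge(key, line, (oldLine, newLine) ->
                    newLine.length() > oldLine.length() ? newLine : oldLine);
        }
        return new ArrayList<>(longest.values());
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "00:00:01,315|00:00:02,218|ISDB|BOBAS PARA",
                "00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS",
                "00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS");
        dedupe(sample, 5).forEach(System.out::println);
    }
}
```

A short prefix can group unrelated messages that happen to start with the same letters, so the prefix length would need tuning against the real data.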

Answer 2

Score: 0

You could use the final post-pipe portion of each log line as a key, then insert each line into a LinkedHashMap, to remove duplicates:

    String filePath = "C:/log.txt";
    Map<String, String> logMap = new LinkedHashMap<>();
    // try-with-resources closes the reader when done
    try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
        String input;
        while ((input = br.readLine()) != null) {
            // Key: the text after the last pipe
            String key = input.replaceAll("^.*\\|", "");
            logMap.put(key, input);
        }
    }
    // Now print out the map minus duplicates
    for (String line : logMap.values()) {
        System.out.println(line);
    }

Instead of printing to the console, you could just as easily write the filtered log out to another file. Note that this approach retains the last occurring line of each duplicate, placed at the position where that message first appeared.
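
Writing the result to a file could look roughly like the sketch below. The class name `LogDedupToFile`, the helper `dedupe`, and both file paths are placeholders chosen for illustration. It assumes the input is UTF-8; `Files.readAllLines` and `Files.write` also have overloads that take a `Charset` if the file uses a different encoding.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LogDedupToFile {
    // Hypothetical helper: one entry per distinct message; the last occurrence wins.
    static Map<String, String> dedupe(List<String> lines) {
        Map<String, String> logMap = new LinkedHashMap<>();
        for (String line : lines) {
            // Key: the text after the last pipe (the whole line if there is no pipe)
            String key = line.substring(line.lastIndexOf('|') + 1);
            logMap.put(key, line);
        }
        return logMap;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder paths, not taken from the original post
        Path in = Path.of("log.txt");
        Path out = Path.of("log-filtered.txt");
        // Files.write emits one line per element of the collection
        Files.write(out, dedupe(Files.readAllLines(in)).values());
    }
}
```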

huangapple
  • Posted on March 15, 2020, 20:11:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/60692709.html