Remove repeated content in Java

Question

I have this text, and I need to filter out the repeated lines and words.
I don't know if there's a better way than what I'm doing.

00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:00,413|03:50:25,600|ISDB|PERFEITAMENTE. EU
00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
00:00:02,218|00:00:02,398|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
00:00:02,398|00:00:02,759|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
00:00:02,759|00:00:03,274|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:03,274|00:00:04,357|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A?
00:00:04,357|00:00:05,259|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,259|00:00:05,414|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,414|00:00:05,775|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
00:00:05,775|00:00:06,677|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
00:00:06,677|00:00:06,858|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT?QUE
00:00:07,914|00:00:07,916|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT?QUE EU
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
00:00:08,997|00:00:09,178|ISDB|PAREDE, PARECE AT?QUE EU GOSTO

I'm using the following code to put these lines in a HashSet so they aren't repeated.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashSet;
import java.util.Set;

public class Testecc {
   public static void main(String args[]) throws Exception {
      String filePath = "C://teste//teste1.txt";
      String input = null;
      // Buffered reader
      BufferedReader br = new BufferedReader(new FileReader(filePath));
      // note: the two readLine() calls below consume the first two lines of the file
      while ((input = br.readLine()) != null) {
         input = br.readLine();

         // FileWriter (creating the file)
         FileWriter writer = new FileWriter("C://teste//teste.txt");
         // HashSet to eliminate duplicates
         Set<String> set = new HashSet<>();
         String line;
         // adding lines to the HashSet
         while ((line = br.readLine()) != null) {
            // the first 31 characters are the two timestamps and the "ISDB|" tag
            String line1 = line.substring(0, 31);
            String line2 = line.substring(31);
            System.out.println(line);
            if (set.add(line2)) {
               writer.append(line1 + line2 + "\n");
            }
         }
         writer.flush();
         System.out.println("Pronto!");
      }
   }
}

With this, I removed the duplicated lines, like this:

00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV�S
00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A�.
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A�. ELES
00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT� QUE
00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT� QUE EU
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT� QUE EU GOSTO

But I also need to remove the repeated words.

I'm really out of ideas.

How can I do that?

Answer 1

Score: 1

Use a map that holds lines grouped by a key. The key would be the beginning of the message (the part after the last pipe), for example its first five letters; the code below uses the first word. Add each line to the map, and if a line is longer than the one previously stored under the same key, replace it.

try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {

  final Map<String, String> map = new LinkedHashMap<>();

  br.lines().forEach(line -> {
        String message = line.substring(line.lastIndexOf("|") + 1);
        if (message.isEmpty()) {
          return;
        }
        String key = message.split(" ")[0];
        if (map.get(key) == null) {
          map.put(key, line);
        } else if (map.get(key).length() < line.length()) {
          map.remove(key);
          map.put(key, line);
        }
      }
  );

  map.forEach((k, v) -> System.out.println(v));
}

The above code will give you the following output.

00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
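
For experimenting without a file, the same idea can be wrapped in a small method that works on a list of lines. This is only a sketch: the class name LongestPerKey and the method name longestPerFirstWord are made up for illustration, and the message is trimmed before splitting (an addition not present in the snippet above) so that a message consisting only of spaces is skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LongestPerKey {

    /** For each first word of the message, keeps the longest line seen so far. */
    static List<String> longestPerFirstWord(List<String> lines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            // the message is everything after the last pipe
            String message = line.substring(line.lastIndexOf('|') + 1).trim();
            if (message.isEmpty()) {
                continue;
            }
            String key = message.split(" ")[0];
            String previous = map.get(key);
            if (previous == null) {
                map.put(key, line);
            } else if (previous.length() < line.length()) {
                // remove before put so the replacement moves to the end of the map
                map.remove(key);
                map.put(key, line);
            }
        }
        return new ArrayList<>(map.values());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "00:00:01,315|00:00:02,218|ISDB|BOBAS PARA",
            "00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS",
            "00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A?",
            "00:00:04,357|00:00:05,259|ISDB|BOBAS PARA AMIGOS");
        longestPerFirstWord(sample).forEach(System.out::println);
    }
}
```

One caveat with keying on the first word: two unrelated captions that happen to start with the same word anywhere in the file end up under the same key, and only the longer one survives.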

Answer 2

Score: 0

You could use the final post-pipe portion of each log line as a key, then insert each line into a LinkedHashMap, to remove duplicates:

String filePath = "C:/log.txt";
BufferedReader br = new BufferedReader(new FileReader(filePath));
String input;
Map<String, String> logMap = new LinkedHashMap<>();
while ((input = br.readLine()) != null) {
    // key is the message after the last pipe
    String key = input.replaceAll("^.*\\|", "");
    logMap.put(key, input);
}

// Now print out the map minus duplicates
for (String line : logMap.values()) {
    System.out.println(line);
}

Instead of printing to the console, you could just as easily write the filtered log out to another file. Note that this approach would retain the last occurring line of each duplicate.
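
Writing the result to a file could look roughly like the sketch below. The names LogDedup, dedupeByMessage and filterFile are hypothetical, the input and output paths are taken from the command line as placeholders, and ISO-8859-1 is only an assumed encoding (chosen because it decodes any byte sequence without throwing); use whatever charset the log actually has.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LogDedup {

    /** One line per distinct message; the last line seen for a message wins. */
    static List<String> dedupeByMessage(List<String> lines) {
        Map<String, String> logMap = new LinkedHashMap<>();
        for (String line : lines) {
            // strip everything up to and including the last pipe
            String key = line.replaceAll("^.*\\|", "");
            // put() overwrites the value but keeps the key's original position
            logMap.put(key, line);
        }
        return new ArrayList<>(logMap.values());
    }

    /** Reads source, filters it, and writes the result to target. */
    static void filterFile(Path source, Path target) throws IOException {
        // assumed charset, see the note above
        List<String> lines = Files.readAllLines(source, StandardCharsets.ISO_8859_1);
        Files.write(target, dedupeByMessage(lines), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input log, args[1] = filtered output (placeholders)
        filterFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Because a LinkedHashMap keeps each key at the position where it was first inserted, the output order follows the first occurrence of each message, while the timestamps come from its last occurrence.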

huangapple
  • Posted on March 15, 2020, 20:11:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/60692709.html