Count the frequency of each string in a query individually

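Question

I want to search a file named a.java for a query. If my query is "String name", I want to get the frequency of each string in the query from the text file individually: first count the frequency of "String", then the frequency of "name", and then add the two frequencies together. How can I implement this in Java?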

public class Tf2 {
    Integer k;
    int totalword = 0;
    int totalfile, containwordfile = 0;
    Map<String, Integer> documentToCount = new HashMap<>();
    File file = new File("H:/java");
    File[] files = file.listFiles();
    public void Count(String word) {
       File[] files = file.listFiles();
        Integer count = 0;
        for (File f : files) {
            BufferedReader br = null;
            try {
                br = new BufferedReader(new FileReader(f));
                count = documentToCount.get(word);

                documentToCount.clear();

                String line;
                while ((line = br.readLine()) != null) {
                    String term[] = line.trim().replaceAll("[^a-zA-Z0-9 ]", " ").toLowerCase().split(" ");

                    for (String terms : term) {
                        totalword++;
                        if (count == null) {
                            count = 0;
                        }
                        if (documentToCount.containsKey(word)) {
                            count = documentToCount.get(word);
                            documentToCount.put(terms, count + 1);
                        } else {
                            documentToCount.put(terms, 1);
                        }
                    }
                }
              k = documentToCount.get(word);

                if (documentToCount.get(word) != null) {
                    containwordfile++;
                    System.out.println("" + k);
                }

            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Tf2 ob = new Tf2();
        String query = "String name";
        ob.Count(query);
    }
}

I tried this with a HashMap, but it cannot count the frequency of each query word individually.

Answer 1

Score: 1

Here is an example using Collections.frequency to get the count of a string in a file:

public void Count(String word) {
    File f = new File("/your/path/text.txt");
    List<String> list = new ArrayList<>();
    if (f.exists() && f.isFile()) {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Collect every space-separated token of the line
                String[] arr = line.split(" ");
                for (String str : arr) {
                    list.add(str);
                }
            }
            System.out.println("Frequency = " + Collections.frequency(list, word));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Here is another example using the Java Streams API; it also works for a multi-file search inside a directory:

public class Test {

    public static void main(String[] args) {
        File file = new File("C:/path/to/your/files/");
        String targetWord = "stringtofind";
        long numOccurrences = 0;

        if (file.isFile() && file.getName().endsWith(".txt")) {
            numOccurrences = getLineStreamFromFile(file)
                    .flatMap(str -> Arrays.stream(str.split("\\s")))
                    .filter(str -> str.equals(targetWord))
                    .count();
        } else if (file.isDirectory()) {
            numOccurrences = Arrays.stream(file.listFiles(pathname -> pathname.toString().endsWith(".txt")))
                    .flatMap(Test::getLineStreamFromFile)
                    .flatMap(str -> Arrays.stream(str.split("\\s")))
                    .filter(str -> str.equals(targetWord))
                    .count();
        }

        System.out.println(numOccurrences);
    }

    public static Stream<String> getLineStreamFromFile(File file) {
        try {
            return Files.lines(file.toPath());
        } catch (IOException e) {
            e.printStackTrace();
        }
        return Stream.empty();
    }
}

Also, you can break the input string into individual words and loop to get the occurrence count for each.
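
A minimal sketch of that last suggestion, assuming the file has already been read into a list of tokens (as in the first example). The class name QueryWordCount and the helper names countEach and total are made up for illustration; they are not part of the answer above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only
class QueryWordCount {

    // Counts each whitespace-separated query word separately among the given tokens.
    static Map<String, Integer> countEach(List<String> tokens, String query) {
        // LinkedHashMap keeps the query words in the order they were typed
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : query.trim().split("\\s+")) {
            counts.put(word, Collections.frequency(tokens, word));
        }
        return counts;
    }

    // Adds up the per-word counts.
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Stand-in for the tokens read from a file
        List<String> tokens = Arrays.asList("String name = other ; String id ;".split("\\s+"));
        Map<String, Integer> counts = countEach(tokens, "String name");
        System.out.println(counts + " total=" + total(counts)); // prints {String=2, name=1} total=3
    }
}
```

Matching here is exact and case-sensitive, so a token such as "name;" would not count as "name" unless punctuation is stripped first.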

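Answer 2

Score: 0

You're over-complicating things. If all you need to do is count occurrences, you don't need a HashMap or anything like it. You only need to iterate over the text in the document and count how many times you find your search string.

Basically, your workflow would be:

  1. Initialize a counter to 0.
  2. Read the text.
  3. Iterate over the text, looking for the search string.
  4. When the search string is found, increment the counter.
  5. When you finish iterating over the text, print the counter.

If you have a very long text, you could do this line by line or otherwise batch your reads.

Here is a simple example. Let's say I have a file and I'm looking for the word "dog".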

// 1. Initialize the counter to 0
int count = 0;

// 2. Read the text
Path path = ...; // path to my input file
String text = Files.readString(path, StandardCharsets.US_ASCII);

// 3-4. Find instances of the string in the text
String searchString = "dog";

int lastIndex = 0;
while (lastIndex != -1) {
  lastIndex = text.indexOf(searchString, lastIndex); // returns -1 if searchString is not found
  if (lastIndex != -1) {
    count++; // increment the counter
    lastIndex += searchString.length(); // advance the index past this match
  }
}

// 5. Print the counter
System.out.println("Found " + count + " instances of " + searchString);

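In your specific example, you would read the contents of the a.java file, then find the number of instances of 'String', followed by the number of instances of 'name'. You can sum them whenever you like. So you'd repeat steps 3 and 4 for each word you're searching for, and then add up all of the counts at the end.

The easiest way, of course, is to wrap steps 3 and 4 in a method that returns the count.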

int countOccurrences(String searchString, String text) {
  int count = 0;
  int lastIndex = 0;
  while (lastIndex != -1) {
    lastIndex = text.indexOf(searchString, lastIndex);
    if (lastIndex != -1) {
      count++;
      lastIndex += searchString.length();
    }
  }
  return count;
}

// Call:
int nameCount = countOccurrences("name", text);
int stringCount = countOccurrences("String", text);

System.out.println("Counted " + nameCount + " instances of 'name' and " + stringCount + " instances of 'String', for a total of " + (nameCount + stringCount));

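(Whether you call toLowerCase() on the text depends on whether you need case-sensitive matches.)

Of course, if you only want 'name' and not 'lastName', you'll need to start considering things like word boundaries (the regex word-boundary anchor \b is useful here). When parsing printed text, you'll also need to consider words hyphenated across line endings. But it sounds like your use case is simply counting instances of individual words that happen to be given to you in a space-delimited string.

If you actually just want instances of String name as a single phrase, just use the first workflow.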
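
For the word-boundary case mentioned above, here is a rough sketch using java.util.regex. The class and method names (WholeWordCount, countWholeWord) and the sample text are assumptions made for illustration, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only
class WholeWordCount {

    // Counts occurrences of `word` that are delimited by word boundaries (\b),
    // so a match embedded in a longer identifier is skipped.
    static int countWholeWord(String word, String text) {
        // Pattern.quote treats the search word literally, even if it contains regex metacharacters
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "String name; String username; name = username;";
        System.out.println(countWholeWord("name", text));   // prints 2 ("username" is not counted)
        System.out.println(countWholeWord("String", text)); // prints 2
    }
}
```

On the same sample text, the indexOf loop would report 4 for "name", because it also matches inside "username". Pass Pattern.CASE_INSENSITIVE as a second argument to Pattern.compile if matching should ignore case.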

Answer 3

Score: 0

You could use a map with the words as keys and the counts as values:

  public static void main(String[] args) {
    String corpus =
        "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world";
    String query = "edited Wikipedia volunteers";

    Map<String, Integer> word2count = new HashMap<>();
    for (String word : corpus.split(" ")) {
      if (!word2count.containsKey(word))
        word2count.put(word, 0);
      word2count.put(word, word2count.get(word) + 1);
    }

    for (String q : query.split(" "))
      System.out.println(q + ": " + word2count.get(q));
  }
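Answer 4

Score: 0

> If I have a file that contains the line "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world" and I want to search for the query "edited Wikipedia volunteers", then my program should first count the frequency of "edited" in the text file, then the frequency of "Wikipedia", then the frequency of "volunteers", and finally sum all the frequencies. Can I solve it by using a HashMap?

You can do it as follows: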

import java.util.HashMap;
import java.util.Map;

public class Main {
    public static void main(String[] args) {
        // The given string
        String str = "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world.";

        // The query string
        String query = "edited Wikipedia volunteers";

        // Split the given string and the query string on whitespace
        String[] strArr = str.split("\\s+");
        String[] queryArr = query.split("\\s+");

        // Map to hold the frequency of each query word in the string
        Map<String, Integer> map = new HashMap<>();

        for (String q : queryArr) {
            for (String s : strArr) {
                if (q.equals(s)) {
                    map.put(q, map.getOrDefault(q, 0) + 1);
                }
            }
        }

        // Display the map
        System.out.println(map);

        // Get the sum of all frequencies
        int sumFrequencies = map.values().stream().mapToInt(Integer::intValue).sum();

        System.out.println("Sum of frequencies: " + sumFrequencies);
    }
}

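Output:

{edited=1, Wikipedia=1, volunteers=1}
Sum of frequencies: 3

Check the documentation of Map#getOrDefault (https://docs.oracle.com/javase/8/docs/api/java/util/Map.html#getOrDefault-java.lang.Object-V-) to learn more about it.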

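Update

In the original answer, I used the Java Stream API to get the sum of the values. Here is an alternative way of doing it:

// Get the sum of all frequencies
int sumFrequencies = 0;
for (int value : map.values()) {
    sumFrequencies += value;
}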

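Your other question is:

> If I have multiple files in a folder, how can I know how many times this query occurs in which file?

You can create a Map<String, Map<String, Integer>> in which the key is the name of the file and the value (i.e. Map<String, Integer>) is the frequency map for that file. I've already shown above the algorithm to create this frequency map. All you have to do is loop through the list of files and populate this map (Map<String, Map<String, Integer>>).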
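
A possible sketch of that per-file map, reusing the nested-loop counting from above. The names PerFileFrequency, frequencies and frequenciesByFile are invented for illustration. The sketch also assumes the files are UTF-8 text, that only the top level of the directory is scanned, and that the directory is passed as the first command-line argument.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, for illustration only
class PerFileFrequency {

    // Builds the frequency map of the query words for one file's text.
    static Map<String, Integer> frequencies(String text, String query) {
        Map<String, Integer> map = new HashMap<>();
        for (String q : query.trim().split("\\s+")) {
            for (String s : text.split("\\s+")) {
                if (q.equals(s)) {
                    map.put(q, map.getOrDefault(q, 0) + 1);
                }
            }
        }
        return map;
    }

    // Maps the name of each regular file directly inside `dir` to its frequency map.
    static Map<String, Map<String, Integer>> frequenciesByFile(Path dir, String query) throws IOException {
        List<Path> files;
        // Files.list holds an open directory handle, so close the stream when done
        try (Stream<Path> paths = Files.list(dir)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, Map<String, Integer>> result = new HashMap<>();
        for (Path p : files) {
            // Assumption: the files are text and UTF-8 is an acceptable decoding
            String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            result.put(p.getFileName().toString(), frequencies(text, query));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the current directory when no argument is given
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(frequenciesByFile(dir, "edited Wikipedia volunteers"));
    }
}
```

A file in which none of the query words occur maps to an empty inner map, and summing an inner map's values gives the per-file total.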


huangapple
  • Published on August 16, 2020, 22:57:56
  • Source: https://go.coder-hub.com/63438344.html