为什么在使用多个线程来统计大文件的单词频率时会出现答案的变化?

huangapple go评论79阅读模式
英文:

Why there is variation in answer while using multiple threads to count word frequencies of a large file?

问题

以下是您要翻译的内容:

My objective is to count the frequencies of each word while reading a large file using multiple threads.
我的目标是在使用多线程读取大文件时统计每个单词的频率。

I am implementing Runnable interface to achieve multi-threading. But while executing the program I'm not getting the correct answer every time. Sometimes, it is giving correct output and sometimes not. But using Callable interface instead of Runnable, the program executes correctly without any error.
我正在实现Runnable接口来实现多线程。但在执行程序时,我并不是每次都得到正确的答案。有时候,它会给出正确的输出,有时候不会。但是使用Callable接口而不是Runnable,程序会正确执行,没有任何错误。

This is the main class:
这是主类:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFrequencyRunnableTest {

public static void main(String[] args) throws IOException {
    long startTime = System.currentTimeMillis();
    String filePath = "C:/Users/Mukesh Kumar/Desktop/data.txt";
    WordFrequencyRunnableTest runnableTest = new WordFrequencyRunnableTest();
    Map<String, Integer> wordFrequencies = runnableTest.parseLines(filePath);
    runnableTest.printResult(wordFrequencies);
    long elapsedTime = System.currentTimeMillis() - startTime;
    System.out.println("Total execution time in millis: " + elapsedTime);
}

public Map<String, Integer> parseLines(String filePath) throws IOException {
    Map<String, Integer> wordFrequencies = new HashMap<>();
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath))) {
        String eachLine = bufferedReader.readLine();
        while (eachLine != null) {
            List<String> linesForEachThread = new ArrayList<>();
            while (linesForEachThread.size() != 100 && eachLine != null) {
                linesForEachThread.add(eachLine);
                eachLine = bufferedReader.readLine();
            }
            WordFrequencyUsingRunnable task = new WordFrequencyUsingRunnable(linesForEachThread, wordFrequencies);
            Thread thread = new Thread(task);
            thread.start();
        }
    }
    return wordFrequencies;
}

public void printResult(Map<String, Integer> wordFrequencies) {
    wordFrequencies.forEach((key, value) -> System.out.println(key + " " + value));
}

}

And this is the logic class:
这是逻辑类:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordFrequencyUsingRunnable implements Runnable {

private final List<String> linesForEachThread;
private final Map<String, Integer> wordFrequencies;

public WordFrequencyUsingRunnable(List<String> linesForEachThread, Map<String, Integer> wordFrequencies) {
    this.linesForEachThread = linesForEachThread;
    this.wordFrequencies = wordFrequencies;
}

@Override
public void run() {
    List<String> currentThreadLines = new ArrayList<>(linesForEachThread);
    for (String eachLine : currentThreadLines) {
        String[] eachLineWords = eachLine.toLowerCase().split("([,.\\s]+)");
        synchronized (wordFrequencies) {
            for (String eachWord : eachLineWords) {
                if (wordFrequencies.containsKey(eachWord)) {
                    wordFrequencies.replace(eachWord, wordFrequencies.get(eachWord) + 1);
                }
                wordFrequencies.putIfAbsent(eachWord, 1);
            }
        }
    }
}

}

I am hoping for good responses and thanking in advance for the help.
我希望能得到良好的回应,并提前感谢您的帮助。

英文:

My objective is to count the frequencies of each word while reading a large file using multiple threads.
I am implementing Runnable interface to achieve multi-threading. But while executing the program I'm not getting the correct answer every time. Sometimes, it is giving correct output and sometimes not. But using Callable interface instead of Runnable, the program executes correctly without any error.

This is the main class:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFrequencyRunnableTest {

    public static void main(String[] args) throws IOException {
        long startTime = System.currentTimeMillis();
        String filePath = &quot;C:/Users/Mukesh Kumar/Desktop/data.txt&quot;;
        WordFrequencyRunnableTest runnableTest = new WordFrequencyRunnableTest();
        Map&lt;String, Integer&gt; wordFrequencies = runnableTest.parseLines(filePath);
        runnableTest.printResult(wordFrequencies);
        long elapsedTime = System.currentTimeMillis() - startTime;
        System.out.println(&quot;Total execution time in millis: &quot; + elapsedTime);
    }

    public Map&lt;String, Integer&gt; parseLines(String filePath) throws IOException {
        Map&lt;String, Integer&gt; wordFrequencies = new HashMap&lt;&gt;();
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath))) {
            String eachLine = bufferedReader.readLine();
            while (eachLine != null) {
                List&lt;String&gt; linesForEachThread = new ArrayList&lt;&gt;();
                while (linesForEachThread.size() != 100 &amp;&amp; eachLine != null) {
                    linesForEachThread.add(eachLine);
                    eachLine = bufferedReader.readLine();
                }
                WordFrequencyUsingRunnable task = new WordFrequencyUsingRunnable(linesForEachThread, wordFrequencies);
                Thread thread = new Thread(task);
                thread.start();
            }
        }
        return wordFrequencies;
    }

    public void printResult(Map&lt;String, Integer&gt; wordFrequencies) {
        wordFrequencies.forEach((key, value) -&gt; System.out.println(key + &quot; &quot; + value));
    }
}

And this is the logic class:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordFrequencyUsingRunnable implements Runnable {

    private final List&lt;String&gt; linesForEachThread;
    private final Map&lt;String, Integer&gt; wordFrequencies;

    public WordFrequencyUsingRunnable(List&lt;String&gt; linesForEachThread, Map&lt;String, Integer&gt; wordFrequencies) {
        this.linesForEachThread = linesForEachThread;
        this.wordFrequencies = wordFrequencies;
    }

    @Override
    public void run() {
        List&lt;String&gt; currentThreadLines = new ArrayList&lt;&gt;(linesForEachThread);
        for (String eachLine : currentThreadLines) {
            String[] eachLineWords = eachLine.toLowerCase().split(&quot;([,.\\s]+)&quot;);
            synchronized (wordFrequencies) {
                for (String eachWord : eachLineWords) {
                    if (wordFrequencies.containsKey(eachWord)) {
                        wordFrequencies.replace(eachWord, wordFrequencies.get(eachWord) + 1);
                    }
                    wordFrequencies.putIfAbsent(eachWord, 1);
                }
            }
        }
    }
}

I am hoping for good responses and thanking in advance for the help.

答案1

得分: 2

你应该在打印结果之前等待所有线程关闭。

public class WordFrequencyRunnableTest {

    List<Thread> threads = new ArrayList<>();
    public static void main(String[] args) throws IOException {
        ...
        ...
        Map<String, Integer> wordFrequencies = runnableTest.parseLines(filePath);
        for(Thread thread: threads)
        {
           thread.join();
        }
        runnableTest.printResult(wordFrequencies);
        ...
        ...
    }

    public Map<String, Integer> parseLines(String filePath) throws IOException {
        Map<String, Integer> wordFrequencies = new HashMap<>();
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath))) {
            String eachLine = bufferedReader.readLine();
            while (eachLine != null) {
                List<String> linesForEachThread = new ArrayList<>();
                while (linesForEachThread.size() != 100 && eachLine != null) {
                    linesForEachThread.add(eachLine);
                    eachLine = bufferedReader.readLine();
                }
                WordFrequencyUsingRunnable task = new WordFrequencyUsingRunnable(linesForEachThread, wordFrequencies);
                Thread thread = new Thread(task);
                thread.start();
                threads.add(thread); // Add thread to the list.
            }
        }
        return wordFrequencies;
    }
}

PS - 你可以使用 ConcurrentHashMap<String, AtomicInteger> 来避免对哈希表的访问进行同步。这样程序将运行得更快。

英文:

You should wait for all threads to close before printing the results.

public class WordFrequencyRunnableTest {

    List&lt;Thread&gt; threads = new ArrayList&lt;&gt;();
    public static void main(String[] args) throws IOException {
        ...
        ...
        Map&lt;String, Integer&gt; wordFrequencies = runnableTest.parseLines(filePath);
        for(Thread thread: threads)
        {
           thread.join();
        }
        runnableTest.printResult(wordFrequencies);
        ...
        ...
    }

    public Map&lt;String, Integer&gt; parseLines(String filePath) throws IOException {
        Map&lt;String, Integer&gt; wordFrequencies = new HashMap&lt;&gt;();
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath))) {
            String eachLine = bufferedReader.readLine();
            while (eachLine != null) {
                List&lt;String&gt; linesForEachThread = new ArrayList&lt;&gt;();
                while (linesForEachThread.size() != 100 &amp;&amp; eachLine != null) {
                    linesForEachThread.add(eachLine);
                    eachLine = bufferedReader.readLine();
                }
                WordFrequencyUsingRunnable task = new WordFrequencyUsingRunnable(linesForEachThread, wordFrequencies);
                Thread thread = new Thread(task);
                thread.start();
                threads.add(thread); // Add thread to the list.
            }
        }
        return wordFrequencies;
    }
}

PS - You can use ConcurrentHashMap&lt;String, AtomicInteger&gt; to avoid having to synchronize access to the hashmap. The program will run faster that way.

huangapple
  • 本文由 发表于 2020年8月4日 19:50:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/63246273.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定