问题

我一直在尝试在EMR中使用Java运行从https://spark.apache.org/examples.html 找到的Pi估算和wordCount示例。

Pi估算工作正常，所以我认为一切都设置正确了。但是我在执行wordCount时遇到了以下错误：

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: 输入路径不存在：hdfs://XXX/user/hadoop/input.txt

在运行以下命令之前，我已经从s3下载了我的input.txt和jar文件：

spark-submit --class "wordCount" --master local[4] Spark05-1.1.jar input.txt

这是我的wordCount代码：

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public final class wordCount {

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("JD Word Counter");

        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        JavaRDD<String> textFile = sparkContext.textFile(args[0]);
        JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.saveAsTextFile("result.txt");
    }
}

我做错了什么吗？

英文:

I've been trying to run the Pi Estimation & the wordCount example found on https://spark.apache.org/examples.html in Java with EMR

The Pi estimation works fine so i assumed everything was set up properly.
But i get this error with the wordCount:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://XXX/user/hadoop/input.txt

I've downloaded my input.txt & my jar from s3 before running this command:

spark-submit --class "wordCount" --master local[4] Spark05-1.1.jar input.txt

here's my wordCount code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public final class wordCount {

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf().setMaster(&quot;local&quot;).setAppName(&quot;JD Word Counter&quot;);

        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);



        JavaRDD&lt;String&gt; textFile = sparkContext.textFile(args[0]);
        JavaPairRDD&lt;String, Integer&gt; counts = textFile
                .flatMap(s -&gt; Arrays.asList(s.split(&quot; &quot;)).iterator())
                .mapToPair(word -&gt; new Tuple2&lt;&gt;(word, 1))
                .reduceByKey((a, b) -&gt; a + b);
        counts.saveAsTextFile(&quot;result.txt&quot;);


    }
}

Am i doing anything wrong?

答案1

得分: 0

如果您没有将您的input.txt加载到hdfs上，请尝试将其放入hdfs中后再试。

或者，尝试使用完整路径和前缀'file'，例如) file://{YOUR_FILE_PATH}。
我相信这是因为spark配置中的'fs.defaultFS'是'hdfs'。

英文:

If you didn't load your input.txt on hdfs, please try after put it into the hdfs.

Or, try with full path with prefix 'file' e.g) file://{YOUR_FILE_PATH}.
I believe it because 'fs.defaultFS' from spark config is 'hdfs'.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Java + Spark在EMR上的wordCount

问题

答案1

如何在Java中使用Redisson客户端更新条目的存活时间？

每个Spring Batch作业调用都会打开一个新的数据库连接池吗？

Bean of type 'org.springframework.security.oauth2.client.registration.ClientRegistrationRepository' that could not be found. – Spring Security

如何获取调用静态方法的类？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论