Java+Spark wordCount with EMR
Question
I've been trying to run the Pi estimation and the wordCount examples found on https://spark.apache.org/examples.html in Java on EMR.
The Pi estimation works fine, so I assumed everything was set up properly.
But I get this error with the wordCount:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://XXX/user/hadoop/input.txt
I've downloaded my input.txt and my jar from S3 before running this command:
spark-submit --class "wordCount" --master local[4] Spark05-1.1.jar input.txt
Here's my wordCount code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public final class wordCount {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("JD Word Counter");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        JavaRDD<String> textFile = sparkContext.textFile(args[0]);
        JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.saveAsTextFile("result.txt");
    }
}
Am I doing anything wrong?
Answer 1
Score: 0
If you didn't load your input.txt into HDFS, please try again after putting it into HDFS.
Or, try the full path with the 'file' prefix, e.g. file://{YOUR_FILE_PATH}.
I believe that's because 'fs.defaultFS' in the Spark config is 'hdfs'.
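A minimal sketch of both fixes, assuming you run them on the EMR master node where input.txt was downloaded (the file:// path below is hypothetical; substitute the absolute path of your local copy):

# Option 1: put the file into HDFS so hdfs://.../user/hadoop/input.txt exists
hadoop fs -put input.txt /user/hadoop/input.txt
spark-submit --class "wordCount" --master local[4] Spark05-1.1.jar input.txt

# Option 2: point Spark at the local file explicitly with a file:// URI
spark-submit --class "wordCount" --master local[4] Spark05-1.1.jar file:///home/hadoop/input.txt

Note that counts.saveAsTextFile("result.txt") resolves the same way, so the output directory will land in HDFS by default as well.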