spark-submit does not find class (even though the class is contained in the jar)
Question
I am building a very simple HelloWorld Spark job, in Java with Gradle:
package com.example;

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
My Gradle config is very straightforward:
def sparkVersion = "2.4.6"
def hadoopVersion = "2.7.3"
dependencies {
    compile "org.apache.spark:spark-core_2.11:$sparkVersion"
    compile "org.apache.spark:spark-sql_2.11:$sparkVersion"
    compile 'org.slf4j:slf4j-simple:1.7.9'
    compile "org.apache.hadoop:hadoop-aws:$hadoopVersion"
    compile "org.apache.hadoop:hadoop-common:$hadoopVersion"
    testCompile group: 'junit', name: 'junit', version: '4.12'
}
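(As an aside: spark-submit already provides the Spark classes on the classpath at runtime, so spark-core and spark-sql are often declared compile-only to keep them out of the fat jar. A minimal sketch reusing the version variables above; this is a common convention, not necessarily the fix for the issue here:)

dependencies {
    // provided by spark-submit at runtime; compileOnly keeps them off runtimeClasspath and thus out of the fat jar
    compileOnly "org.apache.spark:spark-core_2.11:$sparkVersion"
    compileOnly "org.apache.spark:spark-sql_2.11:$sparkVersion"
}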
I also made sure to build a fat jar that includes all the dependencies, like SBT assembly does in Scala:
jar {
    zip64 = true
    from {
        configurations.runtimeClasspath.collect { it.isDirectory() ? it : zipTree(it) }
    }
}
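(For comparison, the Shadow plugin mentioned in the answer below is the usual Gradle counterpart of SBT assembly; a minimal sketch, where the plugin version is an assumption:)

plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '6.1.0'   // assumed version
}

shadowJar {
    zip64 = true
    manifest {
        // optional here, since the main class is passed to spark-submit via --class
        attributes 'Main-Class': 'com.example.HelloWorld'
    }
}

The fat jar is then produced with ./gradlew shadowJar rather than the plain jar task.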
The build works well and my class appears in the jar:
jar tvf build/libs/output.jar | grep -i hello
com/example/HelloWorld.class
However, when running the spark-submit job:
spark-submit --class 'com.example.HelloWorld' --master=local build/libs/output.jar
All I am getting is debug logs:
20/09/21 13:07:46 WARN Utils: Your hostname, example.local resolves to a loopback address: 127.0.0.1; using 192.168.43.208 instead (on interface en0)
20/09/21 13:07:46 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/09/21 13:07:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.spark.deploy.SparkSubmit$$anon$2).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
My local Spark correctly reports Scala 2.11 and Spark 2.4.6, built for Hadoop 2.7.3.
I also tested with a more complex Spark job, but the output logs are the same.
The code does, however, run fine in IntelliJ IDEA (with the option "Include dependencies with 'Provided' scope" ticked).
Am I missing something? Thank you very much.
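(One thing worth checking in this situation: spark-submit accepts a --verbose flag that prints the parsed arguments, including the main class and the classpath elements, which helps confirm how the jar and class are being resolved:)

spark-submit --verbose --class 'com.example.HelloWorld' --master=local build/libs/output.jar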
Answer 1

Score: 0
The problem could have come from zip64 = true or from the fat jar generation (although the shadowJar plugin did not fix this either).

I decided to go with Maven instead: the maven-assembly-plugin for the fat jar generation, the maven-compiler-plugin to include only the files related to the Spark job I want to build, and finally the maven-jar-plugin to avoid building a jar containing all the Spark jobs (one job per jar).
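A minimal pom.xml fragment along those lines could look as follows; the plugin versions, the main class, and the include pattern are illustrative assumptions, not the author's actual configuration:

<build>
  <plugins>
    <!-- compile only the sources belonging to this particular job (illustrative include pattern) -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.8.1</version>
      <configuration>
        <includes>
          <include>com/example/HelloWorld.java</include>
        </includes>
      </configuration>
    </plugin>
    <!-- build the fat jar with all runtime dependencies and a Main-Class manifest entry -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>3.3.0</version>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
          <manifest>
            <mainClass>com.example.HelloWorld</mainClass>
          </manifest>
        </archive>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

The assembly plugin then produces a *-jar-with-dependencies.jar under target/, which is the file to pass to spark-submit.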
Comments