spark-submit does not find class (even though the class is contained in the jar)

Question

I am building a very simple HelloWorld Spark job in Java with Gradle:

package com.example;

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

My Gradle config is very straightforward:

def sparkVersion = "2.4.6"
def hadoopVersion = "2.7.3"

dependencies {
    compile "org.apache.spark:spark-core_2.11:$sparkVersion"
    compile "org.apache.spark:spark-sql_2.11:$sparkVersion"
    compile 'org.slf4j:slf4j-simple:1.7.9'
    compile "org.apache.hadoop:hadoop-aws:$hadoopVersion"
    compile "org.apache.hadoop:hadoop-common:$hadoopVersion"
    testCompile group: 'junit', name: 'junit', version: '4.12'
}
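
Since the jar is launched through spark-submit, the Spark and Hadoop artifacts are already on the runtime classpath; a common convention (shown here only as a sketch, not necessarily the fix for this problem) is to declare them as compileOnly, Gradle's rough equivalent of Maven's "provided" scope, so they stay out of the fat jar:

dependencies {
    // Provided by the spark-submit runtime, so keep them out of the fat jar.
    compileOnly "org.apache.spark:spark-core_2.11:$sparkVersion"
    compileOnly "org.apache.spark:spark-sql_2.11:$sparkVersion"
    compileOnly "org.apache.hadoop:hadoop-aws:$hadoopVersion"
    compileOnly "org.apache.hadoop:hadoop-common:$hadoopVersion"

    // Application-level dependencies keep a normal configuration.
    implementation 'org.slf4j:slf4j-simple:1.7.9'
    testImplementation 'junit:junit:4.12'
}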

I also made sure to build a fat jar that includes all the dependencies, the way SBT assembly does in Scala:

jar {
    zip64 = true
    from {
        configurations.runtimeClasspath.collect { it.isDirectory() ? it : zipTree(it) }
    }
}
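
For reference, the Shadow plugin (mentioned in the answer below) is the more usual Gradle counterpart to SBT assembly; a minimal sketch, with the plugin version and main class as illustrative values, would be:

plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '6.1.0'
}

shadowJar {
    zip64 = true
    // Merge META-INF/services entries instead of letting one dependency's copy win.
    mergeServiceFiles()
    manifest {
        attributes 'Main-Class': 'com.example.HelloWorld'
    }
}

The fat jar is then produced by the shadowJar task and lands in build/libs/, with an -all classifier by default.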

The build works well and my class appears in the jar:

jar tvf build/libs/output.jar | grep -i hello
com/example/HelloWorld.class
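
One additional check worth trying is to run the class straight from the fat jar with a plain JVM, bypassing spark-submit entirely:

java -cp build/libs/output.jar com.example.HelloWorld

If this prints Hello World!, the class itself is intact and loadable, and the issue lies between spark-submit and the jar.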

However, when running the spark-submit job:

spark-submit --class 'com.example.HelloWorld' --master=local build/libs/output.jar

All I am getting are these warning logs, with no "Hello World!" output:

20/09/21 13:07:46 WARN Utils: Your hostname, example.local resolves to a loopback address: 127.0.0.1; using 192.168.43.208 instead (on interface en0)
20/09/21 13:07:46 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/09/21 13:07:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.spark.deploy.SparkSubmit$$anon$2).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
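
Rerunning with spark-submit's --verbose flag makes it print the parsed arguments (main class, primary resource, classpath entries) before launching, which helps confirm what is actually being submitted:

spark-submit --verbose --class 'com.example.HelloWorld' --master=local build/libs/output.jar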

My local Spark installation correctly reports Scala 2.11 and Spark 2.4.6, built for Hadoop 2.7.3.
I also tested with a more complex Spark job, but the output logs are the same.

The code does, however, run fine in IntelliJ IDEA (with the option "Include dependencies with 'Provided' scope" ticked).

Am I missing something? Thank you very much.

Answer 1

Score: 0

The problem could have come from zip64 = true or from the fat jar generation (although the shadowJar plugin did not fix this either).

I decided to go with Maven instead and use the maven-assembly-plugin for the fat jar generation, the maven-compiler-plugin to include only the files related to the Spark job I want to build, and finally the maven-jar-plugin to avoid building a jar containing all the Spark jobs (one job per jar).
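
A minimal sketch of the maven-assembly-plugin part of that setup (the plugin version and main class below are illustrative, not taken from the actual build):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>3.3.0</version>
  <configuration>
    <!-- Pre-defined descriptor that unpacks every runtime dependency into the jar -->
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>com.example.HelloWorld</mainClass>
      </manifest>
    </archive>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>

The *-jar-with-dependencies.jar produced by mvn package is then the artifact passed to spark-submit.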
