July 29, 2020, 16:14:53

java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2 for Spark 3.0.0

Question

Brief

How can I process data successfully with PySpark 3.0.0 from a plain pip installation, or at least load data, without downgrading Spark?

When I try to load Parquet or CSV datasets, I get the exception shown under Exception Message below. The Spark session initializes fine, but loading a dataset fails.

Some Information

Java: openjdk 11
Python: 3.8.5
Mode: local mode
Operating System: Ubuntu 16.04.6 LTS
Notes:
1. I executed python3.8 -m pip install pyspark to install Spark.
2. When I inspected spark-sql_2.12-3.0.0.jar (located under the Python site-packages path, i.e., ~/.local/lib/python3.8/site-packages/pyspark/jars in my case), I found no v2 package under spark.sql.sources; the closest match is an interface called DataSourceRegister in the same package.
3. The most similar question I found on Stack Overflow is https://stackoverflow.com/questions/61362841/pyspark-structured-streaming-kafka-error-caused-by-java-lang-classnotfoundex where downgrading Spark is the recommendation throughout the page.

Exception Message

Py4JJavaError: An error occurred while calling o94.csv.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/DataSourceV2
	at java.base/java.lang.ClassLoader.defineClass1(Native Method)
	at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
	at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174)
	at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800)
	at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698)
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621)
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:575)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(ServiceLoader.java:1209)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1220)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1264)
	at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1299)
	at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1384)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:43)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
	at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
	at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
	at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
	at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:644)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:728)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
	... 45 more
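
The trace passes through java.util.ServiceLoader inside DataSource.lookupDataSource, which is the point where Spark enumerates the data source providers that jars on the classpath register. The following is a minimal sketch for listing those registrations, assuming providers are declared in the service file `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`. The function name `list_registered_sources` and the directory argument are hypothetical, chosen only for illustration.

```python
import zipfile
from pathlib import Path

# Service file that jars use to register Spark data source providers.
SERVICE_ENTRY = "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"


def list_registered_sources(jar_dir):
    """Map each jar in jar_dir to the provider class names it registers."""
    found = {}
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if SERVICE_ENTRY not in zf.namelist():
                    continue
                text = zf.read(SERVICE_ENTRY).decode("utf-8", errors="replace")
        except zipfile.BadZipFile:
            continue  # skip files that are not valid jar archives
        providers = []
        for line in text.splitlines():
            name = line.split("#", 1)[0].strip()  # drop comments and blank lines
            if name:
                providers.append(name)
        found[jar.name] = providers
    return found
```

Running it against the jars directory mentioned in note 2, and against any extra jar directories passed to Spark, shows which jars contribute providers. Any provider that does not ship with Spark itself is a candidate for the class that fails to load.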

Answer 1

Score: 1

I had the same problem with Spark 3 and finally figured out the cause: I was including a custom jar that relied on the old DataSource V2 API.

The solution was to remove the custom jar, after which Spark began working properly.
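
To find which jar depends on the old API, one option is to search the compiled classes for the missing class name. This is a rough heuristic sketch, not part of the original answer: it assumes that a class referring to `org.apache.spark.sql.sources.v2.DataSourceV2` stores that name in slash-separated form in its bytes, and the function name `jars_using_old_api` is hypothetical.

```python
import zipfile
from pathlib import Path

# Internal (slash-separated) name of the class reported missing in the trace.
OLD_API = b"org/apache/spark/sql/sources/v2/DataSourceV2"


def jars_using_old_api(jar_dir, needle=OLD_API):
    """Return names of jars with a .class file whose bytes mention needle."""
    hits = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                for entry in zf.namelist():
                    if entry.endswith(".class") and needle in zf.read(entry):
                        hits.append(jar.name)
                        break  # one match is enough to flag this jar
        except zipfile.BadZipFile:
            continue  # skip files that are not valid jar archives
    return hits
```

Because this is a plain substring match, it also flags classes that reference longer names starting with the same prefix, so treat the result as a list of jars to inspect rather than a verdict.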

Answer 2

Score: 0

For now, I have found a workaround that lets me manipulate data through Spark's Python API.

Workaround

# clone a specific branch 
git clone -b branch-3.0 --single-branch https://github.com/apache/spark.git
## alternatively, try the following command
## git clone --branch v3.0.0 https://github.com/apache/spark.git

# build a Spark distribution
cd spark
./dev/make-distribution.sh --name spark3.0.1 --pip --r --tgz -e -PR -Phive -Phive-thriftserver -Pmesos -Pyarn -Dhadoop.version=3.0.0 -DskipTests -Pkubernetes
## after changing the value of SPARK_HOME in `.bashrc_profile`
source ~/.bashrc_profile

# download the additional jars needed into the directory
cd ${SPARK_HOME}/assembly/target/scala-2.12/jars
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.0.0/hadoop-aws-3.0.0.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.828/aws-java-sdk-bundle-1.11.828.jar
cd ${SPARK_HOME}

# add related configurations for Spark
cp ${SPARK_HOME}/conf/spark-defaults.conf.template ${SPARK_HOME}/conf/spark-defaults.conf
## add required or desired parameters to `spark-defaults.conf`
## in my case, I edited the configuration file with `vi`

# launch an interactive shell
pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
      /_/

Using Python version 3.8.5 (default, Jul 24 2020 05:43:01)
SparkSession available as 'spark'.
## after launching, I can read parquet and csv files without the exception

2
After setting up everything mentioned above, add ${SPARK_HOME}/python to the PYTHONPATH environment variable, then remember to source the file that sets it (I added it to .bashrc_profile).

from pyspark import SparkConf
from pyspark.sql import SparkSession

# note: `sc` here is a SparkConf, not a SparkContext
sc = SparkConf()
threads_max = 512
connection_max = 600
sc.set("spark.driver.memory", "10g")
sc.set('spark.hadoop.fs.s3a.threads.max', threads_max)
sc.set('spark.hadoop.fs.s3a.connection.maximum', connection_max)
sc.set('spark.hadoop.fs.s3a.aws.credentials.provider',
       'com.amazonaws.auth.EnvironmentVariableCredentialsProvider')
sc.set('spark.driver.maxResultSize', 0)
spark = SparkSession.builder.appName("cest-la-vie")\
    .master("local[*]").config(conf=sc).getOrCreate()
## after launching, I can read parquet and csv files without the exception
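
The PYTHONPATH step described before this snippet can also be done from inside Python, which avoids depending on which shell profile was sourced. The sketch below is one possible way to do it and is not taken from the original answer: the function name `add_spark_python_to_path` is hypothetical, and it assumes the distribution bundles Py4J as a `py4j-*-src.zip` file under `python/lib`.

```python
import glob
import os
import sys


def add_spark_python_to_path(spark_home):
    """Prepend Spark's Python sources (and bundled Py4J zip) to sys.path."""
    python_dir = os.path.join(spark_home, "python")
    # Assumption: the distribution bundles Py4J as python/lib/py4j-*-src.zip.
    candidates = [python_dir] + sorted(
        glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
    )
    # Only add entries that are not on sys.path yet, keeping their order.
    added = [p for p in candidates if p not in sys.path]
    sys.path[:0] = added
    return added
```

It would be called as `add_spark_python_to_path(os.environ["SPARK_HOME"])` before the `from pyspark import SparkConf` line, so that the import resolves to the build under SPARK_HOME.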

Notes

I also tried to make the PySpark built from source pip-installable, but the upload to TestPyPI got stuck, apparently because of the file size. The goal was to have the pyspark package present under the site-packages directory. These are the steps I attempted:

cd ${SPARK_HOME}/python
# Step 1
python3.8 -m pip install --user --upgrade setuptools wheel
# Step 2
python3.8 setup.py sdist bdist_wheel ## /opt/spark/python
# Step 3
python3.8 -m pip install --user --upgrade twine
# Step 4
python3.8 -m twine upload --repository testpypi dist/*
## I had already registered a TestPyPI account and obtained a token
Uploading pyspark-3.0.1.dev0-py2.py3-none-any.whl

## stuck here
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 345M/345M [00:49<00:00, 7.33MB/s]
Received "503: first byte timeout" Package upload appears to have failed.  Retry 1 of 5
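
If the only goal is to get the package under site-packages, an upload may not be needed at all, since pip can install a wheel straight from a local path. The following is a hedged sketch of that alternative rather than something the original answer tried: the function name `local_wheel_install_cmd` is hypothetical, and it assumes Step 2 left a `pyspark-*.whl` file in the `dist` directory.

```python
import glob
import os
import sys


def local_wheel_install_cmd(dist_dir, python=sys.executable):
    """Build a pip command that installs the newest pyspark wheel in dist_dir."""
    wheels = glob.glob(os.path.join(dist_dir, "pyspark-*.whl"))
    if not wheels:
        raise FileNotFoundError(f"no pyspark wheel found in {dist_dir}")
    # Pick the most recently modified wheel if several builds are present.
    newest = max(wheels, key=os.path.getmtime)
    return [python, "-m", "pip", "install", "--user", newest]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, with `dist_dir` pointing at `${SPARK_HOME}/python/dist`.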

Answer 3

Score: 0

I was using a standalone installation of Spark 3.1.1.

I have tried a lot of things.

I have excluded a lot of jar files.

After a lot of suffering, I decided to delete my Spark installation and install (unpack) a fresh one.

I don't know why... but it's working.
