java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2 for Spark 3.0.0


Question


Brief

How can I process data with PySpark 3.0.0 from a plain pip installation, or at least load data, without downgrading Spark?

When I tried to load Parquet and CSV datasets, I got the exception shown under Exception Message below. The Spark session initializes fine, but loading a dataset fails.

Some Information

  • Java: openjdk 11
  • Python: 3.8.5
  • Mode: local mode
  • Operating System: Ubuntu 16.04.6 LTS
  • Notes:
    1. I installed Spark by running python3.8 -m pip install pyspark.
    2. When I inspected spark-sql_2.12-3.0.0.jar (under the Python site-packages path, i.e., ~/.local/lib/python3.8/site-packages/pyspark/jars in my case), I found no v2 package under org.apache.spark.sql.sources; the closest match is an interface called DataSourceRegister in the same package.
    3. The most similar question I found on Stack Overflow is https://stackoverflow.com/questions/61362841/pyspark-structured-streaming-kafka-error-caused-by-java-lang-classnotfoundex where every suggestion on the page is to downgrade Spark.
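
The jar inspection described in note 2 can be repeated with a few lines of Python. This is only a sketch: the helper name list_package_entries is made up for illustration, and it relies on nothing beyond the standard zipfile module (a jar is a zip archive).

```python
import zipfile


def list_package_entries(jar_path, package):
    """Return the sorted names of entries directly inside `package` in a jar.

    `package` uses slashes, e.g. "org/apache/spark/sql/sources".
    Sub-packages are reported once, with a trailing slash.
    """
    prefix = package.rstrip("/") + "/"
    found = set()
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            # Skip entries outside the package and the directory entry itself.
            if not name.startswith(prefix) or name == prefix:
                continue
            rest = name[len(prefix):]
            head, sep, _ = rest.partition("/")
            # A remaining slash means the entry lives in a sub-package.
            found.add(head + "/" if sep else head)
    return sorted(found)
```

Calling it with the path to spark-sql_2.12-3.0.0.jar and "org/apache/spark/sql/sources" should list DataSourceRegister.class and, if the observation in note 2 holds, no v2/ entry.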

Exception Message

Py4JJavaError: An error occurred while calling o94.csv.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/DataSourceV2
	at java.base/java.lang.ClassLoader.defineClass1(Native Method)
	at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
	at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174)
	at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800)
	at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698)
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621)
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:575)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(ServiceLoader.java:1209)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1220)
	at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1264)
	at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1299)
	at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1384)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:43)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
	at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
	at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
	at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
	at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:644)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:728)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
	... 45 more

Answer 1

Score: 1

I had the same problem with Spark 3 and finally figured out the cause: I was including a custom jar that relied on the old DataSource V2 API.

The solution was to remove the custom jar; after that, Spark worked properly.
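
The stack trace in the question fails inside java.util.ServiceLoader, called from DataSource.lookupDataSource, which is consistent with this answer: some jar on the classpath registers a data source provider that was compiled against the old API. A rough way to find candidate jars is to list which ones ship a DataSourceRegister service file. The function name below is hypothetical, and the script only reads file names and service entries; it does not prove which provider is broken.

```python
import os
import zipfile

# Service file that ServiceLoader reads to discover data source providers.
SERVICE_FILE = "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"


def find_datasource_providers(jars_dir):
    """Map each jar in `jars_dir` to the data source providers it registers."""
    providers = {}
    for file_name in sorted(os.listdir(jars_dir)):
        if not file_name.endswith(".jar"):
            continue
        with zipfile.ZipFile(os.path.join(jars_dir, file_name)) as jar:
            if SERVICE_FILE not in jar.namelist():
                continue
            text = jar.read(SERVICE_FILE).decode("utf-8")
        # Service files list one class name per line; '#' starts a comment.
        names = [line.split("#", 1)[0].strip() for line in text.splitlines()]
        providers[file_name] = [name for name in names if name]
    return providers
```

Spark's own jars (for example spark-sql) will appear in the result as well; the jars worth a closer look are the ones that did not come with the Spark distribution.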

Answer 2

Score: 0

I have since found a workaround that lets me manipulate data through Spark's Python APIs.

Workaround

1

# clone a specific branch 
git clone -b branch-3.0 --single-branch https://github.com/apache/spark.git
## alternatively, try the following command
## git clone --branch v3.0.0 https://github.com/apache/spark.git

# build a Spark distribution
cd spark
./dev/make-distribution.sh --name spark3.0.1 --pip --r --tgz -e -PR -Phive -Phive-thriftserver -Pmesos -Pyarn -Dhadoop.version=3.0.0 -DskipTests -Pkubernetes
## after updating the value of SPARK_HOME in `.bashrc_profile`, reload it
source ~/.bashrc_profile

# download the additional jars needed into the jars directory
cd ${SPARK_HOME}/assembly/target/scala-2.12/jars
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.0.0/hadoop-aws-3.0.0.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.828/aws-java-sdk-bundle-1.11.828.jar
cd ${SPARK_HOME}

# add related configurations for Spark
cp ${SPARK_HOME}/conf/spark-defaults.conf.template ${SPARK_HOME}/conf/spark-defaults.conf
## add required or desired parameters to `spark-defaults.conf`
## in my case, I edited the configuration file with `vi`

# launch an interactive shell
pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
      /_/

Using Python version 3.8.5 (default, Jul 24 2020 05:43:01)
SparkSession available as 'spark'.
## after launching, I can read parquet and csv files without the exception

2
After setting up everything above, add ${SPARK_HOME}/python to the PYTHONPATH environment variable, then remember to source the file that sets it (I added it to .bashrc_profile).

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Named `conf` rather than `sc`, which conventionally refers to a SparkContext
conf = SparkConf()
threads_max = 512
connection_max = 600
conf.set("spark.driver.memory", "10g")
# S3A settings; these rely on the hadoop-aws and aws-java-sdk-bundle jars downloaded in step 1
conf.set("spark.hadoop.fs.s3a.threads.max", threads_max)
conf.set("spark.hadoop.fs.s3a.connection.maximum", connection_max)
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider",
         "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
# 0 removes the limit on the total size of results collected to the driver
conf.set("spark.driver.maxResultSize", 0)
spark = (
    SparkSession.builder.appName("cest-la-vie")
    .master("local[*]")
    .config(conf=conf)
    .getOrCreate()
)
## after launching, I can read parquet and csv files without the exception
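
Since a pip-installed pyspark may still sit in site-packages next to the source build on PYTHONPATH, it can help to check which copy the interpreter would actually import. A small sketch using only the standard library; locate_package is an illustrative name, not part of PySpark.

```python
import importlib.util
import os


def locate_package(name):
    """Return the directory a package would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    # Built-in modules and namespace packages have no file location to report.
    if spec is None or not spec.has_location:
        return None
    return os.path.dirname(os.path.abspath(spec.origin))


# Expected to point under ${SPARK_HOME}/python if PYTHONPATH takes effect.
print(locate_package("pyspark"))
```

If the printed path is under site-packages instead of ${SPARK_HOME}/python, the pip copy is shadowing the source build.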

Notes

I also tried to make the source-built PySpark pip-installable, because I wanted the pyspark package to be present under the site-packages directory, but I got stuck uploading it to TestPyPI, apparently because of the file size. These are the steps I attempted:

cd ${SPARK_HOME}/python
# Step 1
python3.8 -m pip install --user --upgrade setuptools wheel
# Step 2
python3.8 setup.py sdist bdist_wheel ## /opt/spark/python
# Step 3
python3.8 -m pip install --user --upgrade twine
# Step 4
python3.8 -m twine upload --repository testpypi dist/*
## I had registered a TestPyPI account and obtained a token
Uploading pyspark-3.0.1.dev0-py2.py3-none-any.whl

## stuck here
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 345M/345M [00:49<00:00, 7.33MB/s]
Received "503: first byte timeout" Package upload appears to have failed.  Retry 1 of 5

Answer 3

Score: 0

I was using a standalone installation of Spark 3.1.1.

I have tried a lot of things.

I have excluded a lot of jar files.

After a lot of suffering, I decided to delete my Spark installation and install (unpack) a fresh one.

I don't know why... but it's working.
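
One possible explanation, which this answer does not confirm, is that the old installation had jars from more than one Spark version mixed together. A quick check is to group the Spark jars in a directory by the versions encoded in their file names. The helper below is a sketch that assumes the usual spark-<module>_<scala>-<version>.jar naming; suffixes such as -SNAPSHOT or -tests are ignored when grouping.

```python
import os
import re
from collections import defaultdict

# Matches names such as spark-sql_2.12-3.0.0.jar or spark-tags_2.12-3.0.0-tests.jar
SPARK_JAR = re.compile(
    r"^spark-(?P<module>[a-z0-9-]+)_(?P<scala>\d+\.\d+)"
    r"-(?P<version>\d+(?:\.\d+)+)(?P<suffix>-[A-Za-z0-9.-]+)?\.jar$"
)


def spark_jar_versions(jars_dir):
    """Group Spark jar file names in `jars_dir` by (Spark version, Scala version)."""
    groups = defaultdict(list)
    for file_name in sorted(os.listdir(jars_dir)):
        match = SPARK_JAR.match(file_name)
        if match:
            groups[(match["version"], match["scala"])].append(file_name)
    return dict(groups)
```

A clean installation should yield a single key; more than one key suggests leftover jars from another Spark or Scala version.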

huangapple
  • Posted on July 29, 2020, 16:14:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63149228.html