pyspark dataframe to tfrecords not working

Question
pyspark 3.2.0
I've downloaded the spark-tensorflow-connector.jar file from http://spark.apache.org/third-party-projects.html.
After creating the Spark session with the jar file added:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('stc-test') \
    .config('spark.jars', 'spark-tensorflow-connector-1.0.0-s_2.11.jar') \
    .getOrCreate()
Then, when I try to write it to tfrecords following the example from the spark-tensorflow-connector documentation:
train_pdf.write.format('tfrecords').option('writeLocality', 'local').save("/tfrecords")
I get the following error:
Py4JJavaError: An error occurred while calling o152.save.
: java.lang.ClassNotFoundException:
Failed to find data source: tfrecords. Please find packages at
http://spark.apache.org/third-party-projects.html
Caused by: java.lang.ClassNotFoundException: tfrecords.DefaultSource
Can someone help me understand why? I am new to configuring PySpark, so I am not even sure whether the library was actually added to the SparkSession.
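(Editor's aside, not part of the original question: a minimal way to see whether the jar setting was picked up at all is to read the configuration back from the running session. Note that spark.jars only takes effect when the JVM is first created, so it has no effect if getOrCreate() returns an already-existing session.)

# Sketch: read back the jar-related settings the session actually holds.
# 'spark.jars' must be set before the session/JVM is created to have any effect.
conf = spark.sparkContext.getConf()
print(conf.get('spark.jars', 'not set'))
print(conf.get('spark.jars.packages', 'not set'))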
Answer 1
Score: 1
I think this is because you are trying to run this old package, built for Scala 2.11, with PySpark 3.2.0.
You could switch to this repository and try the libraries built for Spark 3.2.0:
https://github.com/linkedin/spark-tfrecord
https://central.sonatype.com/search?q=spark-tfrecord&smo=true
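A minimal sketch of what the suggested switch might look like; the Maven coordinates, version, and option names are assumptions that should be verified against the linkedin/spark-tfrecord README for your Spark/Scala build:

from pyspark.sql import SparkSession

# Pull the connector from Maven instead of pointing at a local jar.
# The artifact coordinates and version below are assumptions -- check the
# spark-tfrecord README for the build matching your Spark and Scala versions.
spark = SparkSession.builder \
    .appName('stc-test') \
    .config('spark.jars.packages',
            'com.linkedin.sparktfrecord:spark-tfrecord_2.12:0.3.4') \
    .getOrCreate()

df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'label'])

# spark-tfrecord registers the short format name 'tfrecord' (singular),
# unlike the old spark-tensorflow-connector's 'tfrecords'.
df.write.format('tfrecord') \
    .option('recordType', 'Example') \
    .mode('overwrite') \
    .save('/tmp/tfrecords')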
Comments