pyspark dataframe to tfrecords not working

Question

I'm using pyspark 3.2.0.

I downloaded the spark-tensorflow-connector jar file from http://spark.apache.org/third-party-projects.html and created a Spark session with the jar added:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder\
                .appName('stc-test')\
                .config('spark.jars', 'spark-tensorflow-connector-1.0.0-s_2.11.jar')\
                .getOrCreate()

Then, when I try to write the DataFrame in tfrecords format, following the example from the tensorflow-spark-connector documentation:

    train_pdf.write.format('tfrecords').option('writeLocality', 'local').save("/tfrecords")

I get the following error:

    Py4JJavaError: An error occurred while calling o152.save.
    : java.lang.ClassNotFoundException:
    Failed to find data source: tfrecords. Please find packages at
    http://spark.apache.org/third-party-projects.html Caused by: java.lang.ClassNotFoundException: tfrecords.DefaultSource

Can someone help me understand why? I am new to configuring pyspark, so I am not even sure whether the library has been added to the SparkSession.

Answer 1

Score: 1

I think this is because you are trying to run an old package built for Scala 2.11 with pyspark 3.2.0, which is built for Scala 2.12 by default.
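One way to see the mismatch described here is to read the Scala binary version encoded in the jar's file name and compare it with the Scala version of the Spark build. The sketch below assumes the jar follows the common `_<scala version>` naming convention and that the pyspark 3.2.0 installation is a Scala 2.12 build; the helper names `scala_suffix` and `matches_spark` are hypothetical and not part of any library.

```python
import re
from typing import Optional

# Assumption: the pyspark 3.2.0 distribution in use is built for Scala 2.12.
SPARK_SCALA_VERSION = "2.12"


def scala_suffix(jar_name: str) -> Optional[str]:
    """Return the Scala binary version encoded in a jar name, or None if absent."""
    match = re.search(r"_(\d+\.\d+)", jar_name)
    return match.group(1) if match else None


def matches_spark(jar_name: str, spark_scala: str = SPARK_SCALA_VERSION) -> bool:
    """True when the jar's Scala suffix equals the Spark build's Scala version."""
    return scala_suffix(jar_name) == spark_scala


jar = "spark-tensorflow-connector-1.0.0-s_2.11.jar"
print(scala_suffix(jar), matches_spark(jar))  # prints: 2.11 False
```

The Scala version of an actual installation is shown in the banner printed by `spark-submit --version`, which is worth checking rather than relying on the assumed constant.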

You could switch to the following repository and try a library built for Spark 3.2.0:

https://github.com/linkedin/spark-tfrecord

https://central.sonatype.com/search?q=spark-tfrecord&smo=true
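As a rough illustration of that switch, the sketch below builds a Maven coordinate for spark-tfrecord and passes it to `spark.jars.packages` so Spark downloads the package when the session starts. The group id, the Scala suffix, and especially the version number are assumptions that should be checked against the Maven Central listing linked above; the helper names are hypothetical. The `tfrecord` format name and the `recordType` option follow the spark-tfrecord README.

```python
# Assumptions: verify all three values on Maven Central before use.
GROUP_ID = "com.linkedin.sparktfrecord"
SCALA_VERSION = "2.12"     # must match the Scala version of the Spark build
PACKAGE_VERSION = "0.4.0"  # placeholder; pick the release built for Spark 3.2


def maven_coordinate(scala_version: str = SCALA_VERSION,
                     package_version: str = PACKAGE_VERSION) -> str:
    """Build the group:artifact:version string expected by spark.jars.packages."""
    return f"{GROUP_ID}:spark-tfrecord_{scala_version}:{package_version}"


def build_session(coordinate: str):
    """Create a SparkSession that pulls the package from Maven at startup."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    return (
        SparkSession.builder
        .appName("stc-test")
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )


def write_tfrecords(df, path: str) -> None:
    """Write a DataFrame as TFRecord files of tf.train.Example records."""
    # spark-tfrecord registers the format name 'tfrecord' (singular).
    df.write.format("tfrecord").option("recordType", "Example").save(path)


print(maven_coordinate())
```

With pyspark installed, `spark = build_session(maven_coordinate())` followed by `write_tfrecords(train_pdf, "/tfrecords")` would mirror the call in the question. The package setting has to be in place before the first session is created in the process, because `getOrCreate()` returns an already running session unchanged.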

huangapple
  • Posted on 2023-05-06 21:23:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76189129.html