Pyspark use DocumentAssembler on array<string>
Question
I am trying to use DocumentAssembler for an array of strings. The documentation says: "The DocumentAssembler can read either a String column or an Array[String]." But when I do a simple example:
data = spark.createDataFrame([[["Spark NLP is an open-source text processing library."]]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)
result.select("document").show(truncate=False)
I am getting an error:
AnalysisException: [CANNOT_UP_CAST_DATATYPE] Cannot up cast input from "ARRAY<STRING>" to "STRING".
The type path of the target object is:
- root class: "java.lang.String"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object.
Maybe I don't understand something?
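A quick way to see where the up-cast error comes from is to print the column's schema: the extra set of brackets makes "text" an array<string> column rather than a plain string column. A minimal check, assuming the same spark session as above:
data = spark.createDataFrame([[["Spark NLP is an open-source text processing library."]]]).toDF("text")
data.printSchema()
# root
#  |-- text: array (nullable = true)
#  |    |-- element: string (containsNull = true)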
Answer 1
Score: 0
I think you just added an extra [] around the input. This works:
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)
result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[{document, 0, 51, Spark NLP is an open-source text processing library., {sentence -> 0}, []}]|
+----------------------------------------------------------------------------------------------+
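If the column really does need to stay an array of strings, one possible workaround (a sketch, not part of the answer above, assuming PySpark and Spark NLP are already set up) is to flatten the array into a plain string column first, for example with explode, and then run DocumentAssembler on the result:
from pyspark.sql import functions as F
from sparknlp.base import DocumentAssembler

# hypothetical array<string> input column
data = spark.createDataFrame([[["First sentence.", "Second sentence."]]]).toDF("texts")

# explode turns each array element into its own row with a plain string column
flat = data.select(F.explode("texts").alias("text"))

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
documentAssembler.transform(flat).select("document").show(truncate=False)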