How to cast all columns of Spark Dataset to String in Java without withColumn?

Question

I've tried the solution using withColumn specified here:

https://stackoverflow.com/questions/49826020/how-to-cast-all-columns-of-spark-dataset-to-string-using-java

However, that solution performs poorly with a very large number of columns (1k-6k): the job runs for more than 6 hours and then gets aborted.

As an alternative, I am trying to cast with map as shown below, but it fails with an error:

MapFunction<Column, Column> mapFunction = (c) -> {
    return c.cast("string");
};

dataset = dataset.map(mapFunction, Encoders.bean(Column.class));

Error with the above snippet:

The method map(Function1<Row,U>, Encoder<U>) in the type Dataset<Row> is not applicable for the arguments (MapFunction<Column,Column>, Encoder<Column>)

Import used:

import org.apache.spark.api.java.function.MapFunction;

Answer 1

Score: 0

Are you sure you mean 1k-6k columns, and not rows?

In any case, I cast columns generically like this (Scala):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType
import spark.implicits._

val df = Seq((1, 2), (2, 3), (3, 4)).toDF("a", "b")

// Build one cast expression per column, then apply them all in a single select.
val cols = for {
  a <- df.columns
} yield col(a).cast(StringType)

df.select(cols : _*)

Answer 2

Score: 0

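For anyone looking for this, I found the solution below:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;
import scala.collection.JavaConversions;

// Build one cast expression per column and apply them all in a single select.
String[] strColNameArray = dataset.columns();
List<Column> colNames = new ArrayList<>();
for (String strColName : strColNameArray) {
    colNames.add(new Column(strColName).cast("string"));
}
dataset = dataset.select(JavaConversions.asScalaBuffer(colNames));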
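
As a possible variation on the single-select idea above, the cast expressions could also be built as plain SQL strings and passed to Spark's `Dataset.selectExpr(String...)`, which avoids the Scala collection conversion. The sketch below covers only the string-building step, so it runs without Spark on the classpath. The class and method names (`StringCastExprs`, `quote`, `castAllToString`) are hypothetical helpers made up for this illustration, not part of Spark, and the approach has not been measured against a 1k-6k column dataset.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Spark): builds CAST expressions for selectExpr. */
final class StringCastExprs {

    /** Wraps a column name in backticks, doubling any backtick inside the name. */
    static String quote(String columnName) {
        return "`" + columnName.replace("`", "``") + "`";
    }

    /** Returns one "CAST(`c` AS STRING) AS `c`" expression per column, in input order. */
    static String[] castAllToString(String[] columnNames) {
        return Arrays.stream(columnNames)
                .map(StringCastExprs::quote)
                .map(q -> "CAST(" + q + " AS STRING) AS " + q)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // In a Spark job the names would come from dataset.columns();
        // these two are made-up sample names.
        String[] exprs = castAllToString(new String[] {"id", "unit price"});
        for (String expr : exprs) {
            System.out.println(expr);
        }
    }
}
```

In a Spark job, the resulting array would presumably be used as `dataset = dataset.selectExpr(StringCastExprs.castAllToString(dataset.columns()));`, which, like the select-based answers, applies all casts in one projection rather than one withColumn call per column.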

Posted by huangapple on 2020-10-22 23:36:10
Source: https://go.coder-hub.com/64485668.html