Spark Java: Escape dot in column names for vector assembler

Question

I have a Dataset in which some column names contain dots. The problem arises with VectorAssembler: it does not seem to handle such names, and none of the ways I tried to escape the dots made any difference.

String[] expincols = newfilenameavgpeaks.columns();

VectorAssembler assemblerexp = new VectorAssembler()
                    .setInputCols(expincols)
                    .setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);

I have wrapped every element of expincols in "`", "``", "```", "````", "'", '"', etc., but nothing worked. I also tried the same on the column names of newfilenameavgpeaks itself, still with no effect. Any ideas how to escape the dots?

Answer 1

Score: 0

If the dataset contains a column a.b, you can still use df.col("`a.b`") to select a column with a . in its name. This works because Dataset.col tries to resolve the column name and can handle the backticks.

VectorAssembler.transform, however, takes the schema of the supplied dataset and uses this StructType to look up the column names in VectorAssembler.transformSchema. The apply method of StructType contains no logic for handling backticks and throws an IllegalArgumentException if a name does not exactly match a field name.

Therefore, the only option is to rename the columns before supplying them to the VectorAssembler:

Dataset<Row> newfilenameavgpeaks = ...

// Rename every column, replacing each dot with an underscore.
for (String col : newfilenameavgpeaks.columns()) {
    newfilenameavgpeaks = newfilenameavgpeaks
            .withColumnRenamed(col, col.replace('.', '_'));
}

// The renamed columns no longer contain dots and can be passed to the assembler.
VectorAssembler assemblerexp = new VectorAssembler()
    .setInputCols(newfilenameavgpeaks.columns())
    .setOutputCol("intensity");

Dataset<Row> filenameoutput = assemblerexp.transform(newfilenameavgpeaks);
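One thing to watch with the rename approach: if the dataset already has a column whose name equals the underscore version of another one (say both peak.1 and peak_1), replacing dots blindly would produce two columns with the same name. The following is a minimal sketch in plain Java, with no Spark dependency, of how the new names could be computed up front so that they stay unique. The class ColumnNameSanitizer, the method buildRenameMap, and the sample column names are hypothetical and not part of Spark or of the original answer.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Spark: computes dot-free, unique column names.
final class ColumnNameSanitizer {

    /** Maps every original column name to a dot-free replacement, in column order. */
    static Map<String, String> buildRenameMap(String[] columns) {
        Set<String> taken = new HashSet<>();
        // Names without dots are kept unchanged, so reserve them first.
        for (String name : columns) {
            if (!name.contains(".")) {
                taken.add(name);
            }
        }
        Map<String, String> renames = new LinkedHashMap<>();
        for (String name : columns) {
            if (!name.contains(".")) {
                renames.put(name, name);
                continue;
            }
            String base = name.replace('.', '_');
            String candidate = base;
            int suffix = 1;
            // Append a numeric suffix until the candidate is not already in use.
            while (!taken.add(candidate)) {
                candidate = base + "_" + suffix++;
            }
            renames.put(name, candidate);
        }
        return renames;
    }

    public static void main(String[] args) {
        // Made-up column names, for illustration only.
        String[] columns = {"peak.1", "peak_1", "mz"};
        buildRenameMap(columns).forEach(
                (from, to) -> System.out.println(from + " -> " + to));
    }
}
```

Each entry of the resulting map could then be applied with withColumnRenamed(from, to) in the same kind of loop as above, before the columns are handed to the VectorAssembler.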

huangapple
  • Posted on September 26, 2020 at 14:26:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64074564.html