Sample data after grouping it based on a column in a Spark dataset

Question

I have a Spark dataset in the following format:

    Id dataset

    id1,[name1,name2,name3...name100]
    id2,[name1,name2,name3,name4....name100000]
    id3,[name1,name2,name3.....name1000]

I want to get a random sample of 50% of the total names for each id.
I know Spark has a sample function, but can I pass a percentage to it based on each row of my dataset? The count of names will be different for each row.
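
For reference, Dataset.sample takes a single global fraction, so the sampling rate cannot vary from row to row; a minimal illustration, assuming a Dataset<Row> variable named df:

    // sample(fraction) draws roughly the given fraction of rows across the whole
    // Dataset; there is no per-row fraction parameter.
    Dataset<Row> half = df.sample(0.5);

    // A stratified variant, df.stat().sampleBy("id", fractionsPerId, seed), can use
    // a different fraction per id value (fractionsPerId being a hypothetical
    // Map<String, Double>), but it still keeps or drops whole rows rather than
    // individual names inside a row's array.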

This is what I have tried:

    WindowSpec window = Window.partitionBy(col("id")).orderBy(functions.rand());

    idDataset.select(col("name"), functions.rank().over(window).alias("rank"))
             .filter(functions.col("rank").leq(0.5))
             .drop("rank");

The 0.5 here means that I need to get random samples of 50% of the total names for each id.
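
As a note on this attempt: rank() produces integer ranks starting at 1, so comparing it to 0.5 would not keep half of the names. A sketch of the same idea using percent_rank(), which returns a value between 0.0 and 1.0 within each partition, assuming the dataset has one name per row per id as the attempted code implies:

    import static org.apache.spark.sql.functions.*;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    // percent_rank() is relative to the partition size, so "<= 0.5" keeps
    // roughly the first half of the randomly ordered names within each id.
    WindowSpec window = Window.partitionBy(col("id")).orderBy(rand());

    Dataset<Row> sampled = idDataset
            .withColumn("pct", percent_rank().over(window))
            .filter(col("pct").leq(0.5))
            .drop("pct");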

Answer 1

Score: 1

One option is to use the built-in SQL functions.

You can use transform to assign a random value between 0.0 and 1.0 to each name, and then filter out all names whose assigned random number is larger than your threshold (0.5 in this example).

    import static org.apache.spark.sql.functions.*;

    df.withColumn("sampled",
           expr("filter(transform(names, n -> (n, rand())), n -> n.col2 <= 0.5).n"))
      .show(false);

Output:

+---+------------------------------------------+----------------------------+
|id |names                                     |sampled                     |
+---+------------------------------------------+----------------------------+
|id1|[name1, name2, name3, name4]              |[name3]                     |
|id2|[name1, name2, name3, name4, name5, name6]|[name1, name2, name5, name6]|
|id3|[name1, name2, name3]                     |[name1, name2]              |
+---+------------------------------------------+----------------------------+
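
For completeness, here is a self-contained sketch showing how the expression above might be run end to end. The column names id and names, the hard-coded example rows, and the local Spark session are illustrative assumptions, and the sampled column will vary between runs because rand() is non-deterministic:

    import static org.apache.spark.sql.functions.*;

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class SampleNamesPerId {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("sample-names-per-id")
                    .master("local[*]")
                    .getOrCreate();

            // Assumed schema: one id per row plus an array of names.
            StructType schema = new StructType()
                    .add("id", DataTypes.StringType)
                    .add("names", DataTypes.createArrayType(DataTypes.StringType));

            List<Row> rows = Arrays.asList(
                    RowFactory.create("id1", Arrays.asList("name1", "name2", "name3", "name4")),
                    RowFactory.create("id2", Arrays.asList("name1", "name2", "name3", "name4", "name5", "name6")),
                    RowFactory.create("id3", Arrays.asList("name1", "name2", "name3")));

            Dataset<Row> df = spark.createDataFrame(rows, schema);

            // Pair every name with a random number, keep the pairs whose random value
            // is <= 0.5, and project the names back out of the surviving pairs.
            df.withColumn("sampled",
                    expr("filter(transform(names, n -> (n, rand())), n -> n.col2 <= 0.5).n"))
              .show(false);

            spark.stop();
        }
    }

In the expression, n.col2 refers to the random value (the second, unnamed field of the generated pair) and the trailing .n projects the original names back out of the filtered pairs.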
