Sample data after grouping it based on a column in spark dataset
Question
I have a Spark dataset in the following format:
Id dataset
id1,[name1,name2,name3...name100]
id2,[name1,name2,name3,name4....name100000]
id3,[name1,name2,name3.....name1000]
I want to get a random sample of 50% of the total names for each id.
I know Spark has a sample function, but can I pass a percentage to it based on each row of my dataset? The count of names is different for each row.
This is what I have tried:
WindowSpec window = Window.partitionBy(col("id")).orderBy(functions.rand());
idDataset.select(col("name"), functions.rank().over(window).alias("rank"))
    .filter(functions.col("rank").leq(0.5)).drop("rank");
The 0.5 here means that I need to get a random sample of 50% of the total names for each id.
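As a side note on the attempt above: rank() produces integer positions within each partition, so comparing it to 0.5 cannot keep half of the names. Below is a minimal sketch of the intended per-id 50% cut, assuming percent_rank() (which returns a fraction between 0 and 1 within each partition) and reusing idDataset and the column names from the snippet above.

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Sketch: percent_rank() yields a value in [0, 1] per partition, so keeping
// values <= 0.5 retains roughly half of the randomly ordered names per id.
WindowSpec window = Window.partitionBy(col("id")).orderBy(rand());
Dataset<Row> sampled = idDataset
        .withColumn("pct", percent_rank().over(window))
        .filter(col("pct").leq(0.5))
        .drop("pct");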
Answer 1
Score: 1
One option is to use the built-in SQL functions.
You can use transform to assign each name a random value between 0.0 and 1.0, and then use filter to drop all names that were assigned a random number larger than your threshold (0.5 in this example).
import static org.apache.spark.sql.functions.*;
df.withColumn("sampled",
        expr("filter(transform(names, n -> (n, rand())), n -> n.col2 <= 0.5).n"))
    .show(false);
Output:
+---+------------------------------------------+----------------------------+
|id |names |sampled |
+---+------------------------------------------+----------------------------+
|id1|[name1, name2, name3, name4] |[name3] |
|id2|[name1, name2, name3, name4, name5, name6]|[name1, name2, name5, name6]|
|id3|[name1, name2, name3] |[name1, name2] |
+---+------------------------------------------+----------------------------+
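The expression above assumes df already has an array column named names. If the data instead starts as one (id, name) row per name, as the window-based attempt in the question suggests, a minimal sketch of first collecting the names per id (reusing idDataset from the question) could look like this:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Collect each id's names into an array column, then apply the same sampling expression.
Dataset<Row> df = idDataset
        .groupBy("id")
        .agg(collect_list("name").alias("names"));

df.withColumn("sampled",
        expr("filter(transform(names, n -> (n, rand())), n -> n.col2 <= 0.5).n"))
    .show(false);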
Comments