How to subsample windows of a DataSet in Spark?
Question
Let's say I have a DataSet that looks like this:
Name | Grade
---------------
Josh | 94
Josh | 87
Amanda | 96
Karen | 78
Amanda | 90
Josh | 88
I would like to create a new DataSet where each name has 3 rows, with the additional rows (if any) sampled from the rows of the same name (so Karen, for example, will have three identical rows).
How do I do that without looping through each name?
Answer 1
Score: 1
Data preparation:
val df = Seq(("Josh", 94), ("Josh", 87), ("Amanda", 96), ("Karen", 78), ("Amanda", 90), ("Josh", 88)).toDF("Name", "Grade")
Perform the following step only if your data is skewed for some Name: add a random number and keep at most 3 rows per Name, chosen by that random ordering:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Attach a random value to each row, then keep at most 3 rows per Name
// by taking the first 3 row numbers within each Name's random ordering.
val df2 = df.withColumn("random", round(rand() * 10))
val windowSpec = Window.partitionBy("Name").orderBy("random")
val df3 = df2.withColumn("row_number", row_number().over(windowSpec))
  .filter($"row_number" <= 3)
Now, aggregate the grades for each Name and repeat the resulting list 3 times, so that every Name has at least 3 values. Then take the first 3 values and explode them back into rows:
df3.groupBy("Name").agg(collect_list("Grade") as "grade_list")
  .withColumn("temp_list", slice(flatten(array_repeat($"grade_list", 3)), 1, 3))
  .select($"Name", explode($"temp_list") as "Grade").show
Notes:
- Since the above code will have a maximum of 3 values in grade_list, duplicating it 3 times won't harm.
- In case you don't use the Window step, you can use a combination of when(size($"grade_list") === n, ...).otherwise(...) to avoid unnecessary duplication; see the sketch after these notes.
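One way to fill that in (a minimal sketch assuming the Window step is skipped, Spark 2.4+ for slice/flatten/array_repeat, and n = 3; padded is a hypothetical name):

// Pad only the names that have fewer than 3 rows; names with 3 or more rows are simply truncated to 3.
val padded = df.groupBy("Name").agg(collect_list("Grade") as "grade_list")
  .withColumn("temp_list",
    when(size($"grade_list") >= 3, slice($"grade_list", 1, 3))
      .otherwise(slice(flatten(array_repeat($"grade_list", 3)), 1, 3)))
  .select($"Name", explode($"temp_list") as "Grade")

padded.show()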