# Spark - how to get random unique rows

## Question

I need a way to get some number x of random rows from a Dataset, and those rows must be unique. I tried the `sample` method of the Dataset class, but it sometimes picks duplicate rows.

Dataset's `sample` method:

https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/Dataset.html#sample-boolean-double-


## Answer 1

**Score:** 2

The `sample` function with `withReplacement => false` always picks distinct rows, e.g. `df1.sample(false, 0.1).show()`.

> sample(boolean withReplacement, double fraction)

Consider the example below, where `withReplacement => true` produces duplicate rows (verifiable by grouping and counting), while `withReplacement => false` does not:

```scala
// Assumes spark-shell, where spark.implicits._ (toDF, $) is already in scope;
// in a standalone app, also add: import spark.implicits._
import org.apache.spark.sql.functions._

// 10,000-row DataFrame: col1 = 1..10000, col2 = 2 * col1
val df1 = (1 to 10000).zip((1 to 10000).map(_ * 2)).toDF("col1", "col2")

println("Sample count with replacement    : " + df1.sample(true, 0.1).count)
println("Sample count without replacement : " + df1.sample(false, 0.1).count)

// Rows picked more than once show up with count > 1
df1.sample(true, 0.1).groupBy($"col1", $"col2").count().filter($"count" > 1).show(5)
df1.sample(false, 0.1).groupBy($"col1", $"col2").count().filter($"count" > 1).show(5)
```

```
Sample count with replacement    : 978
Sample count without replacement : 973
+----+-----+-----+
|col1| col2|count|
+----+-----+-----+
|7464|14928|    2|
|6080|12160|    2|
|6695|13390|    2|
|3393| 6786|    2|
|2137| 4274|    2|
+----+-----+-----+
only showing top 5 rows

+----+----+-----+
|col1|col2|count|
+----+----+-----+
+----+----+-----+
```
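
As a side note (not part of the original answer): the Javadoc page linked in the question also documents a three-argument overload of `sample` that takes a seed, which makes the sample reproducible across runs. A minimal sketch:

```scala
// Reproducible sampling: passing a seed makes repeated runs return the same rows.
val reproducible = df1.sample(false, 0.1, seed = 42L)
reproducible.show(5)
```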





## Answer 2

**Score:** 1

You should use the `sample` function with `withReplacement` set to `false`; for example:

```scala
val sampledData = df.sample(withReplacement = false, fraction = 0.5)
```

However, this is NOT guaranteed to return exactly that fraction of the total row count of your Dataset. To get exactly x rows, take x entries from the sampled data after calling `sample`.
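
For example, a minimal sketch of that take-x step (assuming a DataFrame `df` and a target count `x`; the 1.2 oversampling factor is an arbitrary safety margin, not part of the original answer):

```scala
import org.apache.spark.sql.functions.rand

val x = 100                       // hypothetical target number of unique rows
val total = df.count()
// Oversample slightly so the sample very likely contains at least x rows
val fraction = math.min(1.0, x.toDouble / total * 1.2)

val exactlyX = df.sample(withReplacement = false, fraction = fraction)
  .orderBy(rand())                // shuffle the sampled rows
  .limit(x)                       // keep exactly x of them
```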

