Select random rows from the dataframe

Question

I have three dataframes that I have joined together to create a single dataframe.

from pyspark.sql.functions import monotonically_increasing_id, to_timestamp

df_1 = df_1.withColumn('idx', monotonically_increasing_id())
df_2 = df_2.withColumn('idx', monotonically_increasing_id())
df_3 = df_3.withColumn('DATE', to_timestamp('DATE')) \
           .withColumn('idx', monotonically_increasing_id())

merged_df = df_1.join(df_2, ['idx']).join(df_3, ['idx']).drop('idx')

I want to retrieve a random set of rows from it.

I don't sort the dataframes, but when I use limit:

random_choices_df = merged_df.limit(10)
random_choices_df.show()

the show function always displays the same items in the same order.

How can I retrieve a random set of rows? I thought limit was supposed to do that, but for some reason it preserves the order of the items. I am running on a single machine.

Answer 1

Score: 1

You could use the sample method to take a random sample of your df, then use limit to cap it at the number of rows you need. Without a seed, each run draws a different sample:

>>> spark.range(10000).sample(0.1).limit(10).show()
+---+
| id|
+---+
|  0|
|  4|
|  7|
| 13|
| 19|
| 36|
| 46|
| 82|
| 91|
|102|
+---+
>>> spark.range(10000).sample(0.1).limit(10).show()
+---+
| id|
+---+
|  0|
|  6|
| 10|
| 11|
| 24|
| 28|
| 37|
| 49|
| 50|
| 72|
+---+

huangapple
  • Published on 2023-06-02 05:24:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76385815.html