select random rows from the dataframe

Question
I have three dataframes that I have joined together to create a single dataframe.
df_1 = df_1.withColumn('idx', monotonically_increasing_id())
df_2 = df_2.withColumn('idx', monotonically_increasing_id())
df_3 = df_3.withColumn('DATE', to_timestamp('DATE')) \
.withColumn('idx', monotonically_increasing_id())
merged_df = df_1.join(df_2, ['idx']).join(df_3, ['idx']).drop('idx')
And I want to retrieve a set of random rows from it.
I don't do any sorting on the dataframes, but when I use limit,
random_choices_df = merged_df.limit(10)
random_choices_df.show()
the show function always displays the same items in the same order.
How do I retrieve a random set of rows? I thought limit was supposed to do that, but for some reason it preserves the order of the items. I am running on a single machine.
Answer 1

Score: 1
You could use the sample function to take a random sample from your df and then limit the number of rows you need.
>>> spark.range(10000).sample(0.1).limit(10).show()
+---+
| id|
+---+
| 0|
| 4|
| 7|
| 13|
| 19|
| 36|
| 46|
| 82|
| 91|
|102|
+---+
>>> spark.range(10000).sample(0.1).limit(10).show()
+---+
| id|
+---+
| 0|
| 6|
| 10|
| 11|
| 24|
| 28|
| 37|
| 49|
| 50|
| 72|
+---+
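For the merged_df from the question, the same pattern could look like the sketch below. The fraction and seed values are illustrative assumptions, not something given in the answer.

# Hypothetical application of sample + limit to the asker's merged_df.
# fraction=0.01 and seed=42 are assumed values chosen for illustration;
# omit the seed to get a different sample on each run.
random_choices_df = merged_df.sample(fraction=0.01, seed=42).limit(10)
random_choices_df.show()

Note that sample keeps roughly the given fraction of rows, so if the fraction is too small for the size of merged_df you may get back fewer than 10 rows; choosing a somewhat larger fraction and letting limit trim the result is a common way around that.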