How to imitate Pandas' index-based querying in Polars?

Question


Any idea how I can imitate the pandas code below using Polars? Polars doesn't have indexes like pandas does, so I couldn't figure out how to do it.

    import pandas as pd

    # Use c1 as the index so rows can be looked up by label
    df = pd.DataFrame(data=([21, 123], [132, 412], [23, 43]), columns=['c1', 'c2']).set_index("c1")
    print(df.loc[[23, 132]])

and it prints

          c2
    c1
    23    43
    132  412

The only Polars conversion I could figure out is

    import polars as pl

    df = pl.DataFrame(data=([21, 123], [132, 412], [23, 43]), schema=['c1', 'c2'])
    # Keep only the rows whose c1 value is in the list
    print(df.filter(pl.col("c1").is_in([23, 132])))

but it prints

    c1   c2
    132  412
    23   43

That's okay, but the rows are not in the order I gave. I passed [23, 132] and want the output rows in that same order, as in the pandas output.

I could use a sort() afterwards, but the real data I run this on has about 30 million rows, so I'm looking for something that's as fast as possible.

Answer 1

Score: 6

I suggest using a left join to accomplish this. This will maintain the order corresponding to your list of index values. (And it is quite performant.)

For example, let's start with this shuffled DataFrame.

    import polars as pl

    nbr_rows = 30_000_000
    df = pl.DataFrame({
        # Two independently shuffled integer columns (seeds 2 and 3)
        'c1': pl.arange(0, nbr_rows, eager=True).shuffle(2),
        'c2': pl.arange(0, nbr_rows, eager=True).shuffle(3),
    })
    df

    shape: (30000000, 2)
    ┌──────────┬──────────┐
    │ c1       ┆ c2       │
    │ ---      ┆ ---      │
    │ i64      ┆ i64      │
    ╞══════════╪══════════╡
    │ 4052015  ┆ 20642741 │
    │ 7787054  ┆ 17007051 │
    │ 20246150 ┆ 19445431 │
    │ 1309992  ┆ 6495751  │
    │ ...      ┆ ...      │
    │ 10371090 ┆ 4791782  │
    │ 26281644 ┆ 12350777 │
    │ 6740626  ┆ 24888572 │
    │ 22573405 ┆ 14885989 │
    └──────────┴──────────┘

And these index values:

    nbr_index_values = 10_000
    # The keys to look up, in the order we want the result rows
    s1 = pl.Series(name='c1', values=pl.arange(0, nbr_index_values, eager=True).shuffle())
    s1

    shape: (10000,)
    Series: 'c1' [i64]
    [
        1754
        6716
        3485
        7058
        7216
        1040
        1832
        3921
        1639
        6734
        5560
        7596
        ...
        4243
        4455
        894
        7806
        9291
        1883
        9947
        3309
        2030
        7731
        4706
        8528
        8426
    ]

We now perform a left join to obtain the rows corresponding to the index values. (Note that the list of index values is the left DataFrame in this join.)

    import time

    start = time.perf_counter()
    df2 = (
        s1.to_frame()      # the index values are the left frame
        .join(
            df,
            on='c1',
            how='left'
        )
    )
    print(time.perf_counter() - start)
    df2

    0.8427023889998964

    shape: (10000, 2)
    ┌──────┬──────────┐
    │ c1   ┆ c2       │
    │ ---  ┆ ---      │
    │ i64  ┆ i64      │
    ╞══════╪══════════╡
    │ 1754 ┆ 15734441 │
    │ 6716 ┆ 20631535 │
    │ 3485 ┆ 20199121 │
    │ 7058 ┆ 15881128 │
    │ ...  ┆ ...      │
    │ 7731 ┆ 19420197 │
    │ 4706 ┆ 16918008 │
    │ 8528 ┆ 5278904  │
    │ 8426 ┆ 18927935 │
    └──────┴──────────┘

Notice how the rows are in the same order as the index values. We can verify this:

    # series_equal is the method name in the Polars version used here;
    # later releases may have renamed it, so check the docs for your version.
    >>> s1.series_equal(df2.get_column('c1'), strict=True)
    True

And the performance is quite good. On my 32-core system, this takes less than a second.

huangapple
  • Posted on 2023-02-08 13:12:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75381589.html