Polars从字典值的有条件查找中的列

huangapple go评论83阅读模式
英文:

Polars column from conditioned look up of dictionary values

问题

我想在一个Polars DataFrame中根据列之间的关系将一个键映射到另一个Polars DF。这只是一个示例,完整的DF1和DF2要大得多(分别为250万和150万行)。

  1. DF1 = pl.DataFrame({
  2. 'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
  3. 'start': [14516,17380,17381,20177,22254,24357],
  4. 'end': [14534,17399,17399,20195,22274,24377]
  5. })
  6. DF2 = pl.DataFrame({
  7. 'key' : [1,2,3,4,5,6],
  8. 'chrom' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
  9. 'start': [14516,15377,17376,20177,22254, 24357],
  10. 'end': [14534,15403,17399,20195,22274,24377]})

我想要的是:

  1. DF1 = pl.DataFrame({
  2. 'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
  3. 'start': [14516,17380,17381,20177,22254,24357],
  4. 'end': [14534,17399,17399,20195,22274,24377],
  5. 'key': [1,3,3,4,5,6]
  6. })

chrom匹配chr并且DF1中的startend包含在DF2beginend中时,我想将DF2中的key赋给DF1

我首先尝试遍历DF1的行,查找DF2中的匹配条目:

  1. sz = len(DF1[:,0])
  2. for i in range(sz):
  3. DF1[i,"key"] = DF2.filter(
  4. (pl.col("chrom") == DF1[i,"chr"])\
  5. & (pl.col("begin") <= DF1[i,"start"])\
  6. & (pl.col("end") >= DF1[i,"end"])
  7. ).select('key')[0,0]

通过DF的行迭代非常慢。这需要大约10小时。

我还尝试使用np.array而不是直接进入DF。这样稍微快一些,但仍然非常慢。

我正在寻找一种使用原生Polar数据结构完成此操作的方法。由于我没有用于连接的键,因此“join”和“join_asof”策略似乎不起作用。

英文:

I want to map a key in one Polars DataFrame to another Polars DF base on the relationships between columns. This is just a sample, the full DF1 and DF2 is much larger (2.5 million and 1.5 million rows respectively.

  1. DF1 = = pl.DataFrame({
  2. &#39;chr&#39; : [&quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;,&quot;GL000008.2&quot;, &quot;GL000008.2&quot;],
  3. &#39;start&#39;: [14516,17380,17381,20177,22254,24357],
  4. &#39;end&#39;: [14534,17399,17399,20195,22274,24377]
  5. })
  6. DF2 = = pl.DataFrame({
  7. &#39;key&#39; : [1,2,3,4,5,6],
  8. &#39;chrom&#39; : [&quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;,&quot;GL000008.2&quot;, &quot;GL000008.2&quot;],
  9. &#39;start&#39;: [14516,15377,17376,20177,22254, 24357],
  10. &#39;end&#39;: [14534,15403,17399,20195,22274,24377]})

What I want is:

  1. DF1 = = pl.DataFrame({
  2. &#39;chr&#39; : [&quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;,&quot;GL000008.2&quot;, &quot;GL000008.2&quot;],
  3. &#39;start&#39;: [14516,17380,17381,20177,22254,24357],
  4. &#39;end&#39;: [14534,17399,17399,20195,22274,24377],
  5. &#39;key&#39;: [1,3,3,4,5,6]
  6. })

I'd like to assign the key from DF2 to DF1 when chrom matches chr and the start and end in DF1 are contained within the begin and end in DF2.

I first attempted to iterate through the rows of DF1, looking up the matching entry in DF2:

  1. sz = len(DF1[:,0])
  2. for i in range(sz):
  3. DF1[i,&quot;key&quot;] = DF2.filter(
  4. (pl.col(&quot;chrom&quot;) == DF1[i,&quot;chr&quot;])\
  5. &amp; (pl.col(&quot;begin&quot;) &lt;= DF1[i,&quot;start&quot;])\
  6. &amp; (pl.col(&quot;end&quot;) &gt;= DF1[i,&quot;end&quot;])
  7. ).select(&#39;key&#39;)[0,0]

Row iteration through a DF is incredibly slow. This takes about 10 hours.

I also tried using a np.array instead of directly into the df. thats a little faster, but still very slow.

I'm looking for a way to accomplish this using the native Polar's data structure. I don't have key to join on so the join and join_asof strategies don't seem to work.

答案1

得分: 0

join 和 filter 应该可以满足您的需求:

  1. (
  2. df1.join(df2, left_on="chr", right_on="chrom")
  3. .filter(
  4. (pl.col("start") >= pl.col("start_right"))
  5. & (pl.col("end") <= pl.col("end_right"))
  6. )
  7. .drop(["start_right", "end_right"])
  8. )
  1. shape: (6, 4)
  2. ┌────────────┬───────┬───────┬─────┐
  3. chr start end key
  4. --- --- --- ---
  5. str i64 i64 i64
  6. ╞════════════╪═══════╪═══════╪═════╡
  7. GL000008.2 14516 14534 1
  8. GL000008.2 17380 17399 3
  9. GL000008.2 17381 17399 3
  10. GL000008.2 20177 20195 4
  11. GL000008.2 22254 22274 5
  12. GL000008.2 24357 24377 6
  13. └────────────┴───────┴───────┴─────┘
英文:

join and filter should give you what you need:

  1. (
  2. df1.join(df2, left_on=&quot;chr&quot;, right_on=&quot;chrom&quot;)
  3. .filter(
  4. (pl.col(&quot;start&quot;) &gt;= pl.col(&quot;start_right&quot;))
  5. &amp; (pl.col(&quot;end&quot;) &lt;= pl.col(&quot;end_right&quot;))
  6. )
  7. .drop([&quot;start_right&quot;, &quot;end_right&quot;])
  8. )
  1. shape: (6, 4)
  2. ┌────────────┬───────┬───────┬─────┐
  3. chr start end key
  4. --- --- --- ---
  5. str i64 i64 i64
  6. ╞════════════╪═══════╪═══════╪═════╡
  7. GL000008.2 14516 14534 1
  8. GL000008.2 17380 17399 3
  9. GL000008.2 17381 17399 3
  10. GL000008.2 20177 20195 4
  11. GL000008.2 22254 22274 5
  12. GL000008.2 24357 24377 6
  13. └────────────┴───────┴───────┴─────┘

答案2

得分: 0

使用 join_asof 可能会提供一个高效的解决方案:

  1. (
  2. DF1
  3. .sort('start')
  4. .join_asof(
  5. DF2.sort('start'),
  6. by_left="chr",
  7. by_right="chrom",
  8. on="start",
  9. strategy="backward")
  10. .filter(
  11. pl.col('end') <= pl.col('end_right')
  12. )
  13. )
  1. 形状(6, 5)
  2. ┌────────────┬───────┬───────┬─────┬───────────┐
  3. chr start end key end_right
  4. --- --- --- --- ---
  5. str i64 i64 i64 i64
  6. ╞════════════╪═══════╪═══════╪═════╪═══════════╡
  7. GL000008.2 14516 14534 1 14534
  8. GL000008.2 17380 17399 3 17399
  9. GL000008.2 17381 17399 3 17399
  10. GL000008.2 20177 20195 4 20195
  11. GL000008.2 22254 22274 5 22274
  12. GL000008.2 24357 24377 6 24377
  13. └────────────┴───────┴───────┴─────┴───────────┘

注意:这假设 DF2 中的起止区间不重叠。

英文:

Using a join_asof might provide a performant solution:

  1. (
  2. DF1
  3. .sort(&#39;start&#39;)
  4. .join_asof(
  5. DF2.sort(&#39;start&#39;),
  6. by_left=&quot;chr&quot;,
  7. by_right=&quot;chrom&quot;,
  8. on=&quot;start&quot;,
  9. strategy=&quot;backward&quot;)
  10. .filter(
  11. pl.col(&#39;end&#39;) &lt;= pl.col(&#39;end_right&#39;)
  12. )
  13. )
  1. shape: (6, 5)
  2. ┌────────────┬───────┬───────┬─────┬───────────┐
  3. chr start end key end_right
  4. --- --- --- --- ---
  5. str i64 i64 i64 i64
  6. ╞════════════╪═══════╪═══════╪═════╪═══════════╡
  7. GL000008.2 14516 14534 1 14534
  8. GL000008.2 17380 17399 3 17399
  9. GL000008.2 17381 17399 3 17399
  10. GL000008.2 20177 20195 4 20195
  11. GL000008.2 22254 22274 5 22274
  12. GL000008.2 24357 24377 6 24377
  13. └────────────┴───────┴───────┴─────┴───────────┘

Note: this assumes that your start-end intervals in DF2 do not overlap.

huangapple
  • 本文由 发表于 2023年2月16日 04:55:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/75465358.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定