英文:
Polars column from conditioned look up of dictionary values
问题
我想在一个Polars DataFrame中根据列之间的关系将一个键映射到另一个Polars DF。这只是一个示例,完整的DF1和DF2要大得多(分别为250万和150万行)。
DF1 = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
'start': [14516,17380,17381,20177,22254,24357],
'end': [14534,17399,17399,20195,22274,24377]
})
DF2 = pl.DataFrame({
'key' : [1,2,3,4,5,6],
'chrom' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
'start': [14516,15377,17376,20177,22254, 24357],
'end': [14534,15403,17399,20195,22274,24377]})
我想要的是:
DF1 = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
'start': [14516,17380,17381,20177,22254,24357],
'end': [14534,17399,17399,20195,22274,24377],
'key': [1,3,3,4,5,6]
})
当chrom
匹配chr
并且DF1中的start
和end
包含在DF2的begin
和end
中时,我想将DF2中的key
赋给DF1。
我首先尝试遍历DF1的行,查找DF2中的匹配条目:
sz = len(DF1[:,0])
for i in range(sz):
DF1[i,"key"] = DF2.filter(
(pl.col("chrom") == DF1[i,"chr"])\
& (pl.col("begin") <= DF1[i,"start"])\
& (pl.col("end") >= DF1[i,"end"])
).select('key')[0,0]
通过DF的行迭代非常慢。这需要大约10小时。
我还尝试使用np.array而不是直接进入DF。这样稍微快一些,但仍然非常慢。
我正在寻找一种使用原生Polar数据结构完成此操作的方法。由于我没有用于连接的键,因此“join”和“join_asof”策略似乎不起作用。
英文:
I want to map a key in one Polars DataFrame to another Polars DF base on the relationships between columns. This is just a sample, the full DF1 and DF2 is much larger (2.5 million and 1.5 million rows respectively.
DF1 = = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
'start': [14516,17380,17381,20177,22254,24357],
'end': [14534,17399,17399,20195,22274,24377]
})
DF2 = = pl.DataFrame({
'key' : [1,2,3,4,5,6],
'chrom' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
'start': [14516,15377,17376,20177,22254, 24357],
'end': [14534,15403,17399,20195,22274,24377]})
What I want is:
DF1 = = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
'start': [14516,17380,17381,20177,22254,24357],
'end': [14534,17399,17399,20195,22274,24377],
'key': [1,3,3,4,5,6]
})
I'd like to assign the key
from DF2 to DF1 when chrom
matches chr
and the start
and end
in DF1 are contained within the begin
and end
in DF2.
I first attempted to iterate through the rows of DF1, looking up the matching entry in DF2:
sz = len(DF1[:,0])
for i in range(sz):
DF1[i,"key"] = DF2.filter(
(pl.col("chrom") == DF1[i,"chr"])\
& (pl.col("begin") <= DF1[i,"start"])\
& (pl.col("end") >= DF1[i,"end"])
).select('key')[0,0]
Row iteration through a DF is incredibly slow. This takes about 10 hours.
I also tried using a np.array instead of directly into the df. thats a little faster, but still very slow.
I'm looking for a way to accomplish this using the native Polar's data structure. I don't have key to join on so the join
and join_asof
strategies don't seem to work.
答案1
得分: 0
join 和 filter 应该可以满足您的需求:
(
df1.join(df2, left_on="chr", right_on="chrom")
.filter(
(pl.col("start") >= pl.col("start_right"))
& (pl.col("end") <= pl.col("end_right"))
)
.drop(["start_right", "end_right"])
)
shape: (6, 4)
┌────────────┬───────┬───────┬─────┐
│ chr ┆ start ┆ end ┆ key │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════╪═══════╪═════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1 │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3 │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3 │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4 │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5 │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6 │
└────────────┴───────┴───────┴─────┘
英文:
join and filter should give you what you need:
(
df1.join(df2, left_on="chr", right_on="chrom")
.filter(
(pl.col("start") >= pl.col("start_right"))
& (pl.col("end") <= pl.col("end_right"))
)
.drop(["start_right", "end_right"])
)
shape: (6, 4)
┌────────────┬───────┬───────┬─────┐
│ chr ┆ start ┆ end ┆ key │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════╪═══════╪═════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1 │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3 │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3 │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4 │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5 │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6 │
└────────────┴───────┴───────┴─────┘
答案2
得分: 0
使用 join_asof
可能会提供一个高效的解决方案:
(
DF1
.sort('start')
.join_asof(
DF2.sort('start'),
by_left="chr",
by_right="chrom",
on="start",
strategy="backward")
.filter(
pl.col('end') <= pl.col('end_right')
)
)
形状:(6, 5)
┌────────────┬───────┬───────┬─────┬───────────┐
│ chr ┆ start ┆ end ┆ key ┆ end_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════╪═══════╪═════╪═══════════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1 ┆ 14534 │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3 ┆ 17399 │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3 ┆ 17399 │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4 ┆ 20195 │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5 ┆ 22274 │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6 ┆ 24377 │
└────────────┴───────┴───────┴─────┴───────────┘
注意:这假设 DF2 中的起止区间不重叠。
英文:
Using a join_asof
might provide a performant solution:
(
DF1
.sort('start')
.join_asof(
DF2.sort('start'),
by_left="chr",
by_right="chrom",
on="start",
strategy="backward")
.filter(
pl.col('end') <= pl.col('end_right')
)
)
shape: (6, 5)
┌────────────┬───────┬───────┬─────┬───────────┐
│ chr ┆ start ┆ end ┆ key ┆ end_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════╪═══════╪═════╪═══════════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1 ┆ 14534 │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3 ┆ 17399 │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3 ┆ 17399 │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4 ┆ 20195 │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5 ┆ 22274 │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6 ┆ 24377 │
└────────────┴───────┴───────┴─────┴───────────┘
Note: this assumes that your start-end intervals in DF2 do not overlap.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论