2023年2月16日 04:55:59go评论83阅读模式

英文:

Polars column from conditioned look up of dictionary values

问题

我想在一个Polars DataFrame中根据列之间的关系将一个键映射到另一个Polars DF。这只是一个示例，完整的DF1和DF2要大得多（分别为250万和150万行）。

DF1 = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"], 
'start': [14516,17380,17381,20177,22254,24357], 
'end': [14534,17399,17399,20195,22274,24377]
})
DF2 = pl.DataFrame({ 
'key' : [1,2,3,4,5,6],
'chrom' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"], 
'start': [14516,15377,17376,20177,22254, 24357], 
'end': [14534,15403,17399,20195,22274,24377]})

我想要的是：

DF1 = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"], 
'start': [14516,17380,17381,20177,22254,24357], 
'end': [14534,17399,17399,20195,22274,24377],
'key': [1,3,3,4,5,6]
})

当chrom匹配chr并且DF1中的start和end包含在DF2的begin和end中时，我想将DF2中的key赋给DF1。

我首先尝试遍历DF1的行，查找DF2中的匹配条目：

sz = len(DF1[:,0])
for i in range(sz):
    DF1[i,"key"] = DF2.filter(
        (pl.col("chrom") == DF1[i,"chr"])\
        & (pl.col("begin") <= DF1[i,"start"])\
        & (pl.col("end") >= DF1[i,"end"])
        ).select('key')[0,0]

通过DF的行迭代非常慢。这需要大约10小时。

我还尝试使用np.array而不是直接进入DF。这样稍微快一些，但仍然非常慢。

我正在寻找一种使用原生Polar数据结构完成此操作的方法。由于我没有用于连接的键，因此“join”和“join_asof”策略似乎不起作用。

英文:

I want to map a key in one Polars DataFrame to another Polars DF base on the relationships between columns. This is just a sample, the full DF1 and DF2 is much larger (2.5 million and 1.5 million rows respectively.

DF1 = = pl.DataFrame({
&#39;chr&#39; : [&quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;,&quot;GL000008.2&quot;, &quot;GL000008.2&quot;], 
&#39;start&#39;: [14516,17380,17381,20177,22254,24357], 
&#39;end&#39;: [14534,17399,17399,20195,22274,24377]
})
DF2 = = pl.DataFrame({ 
&#39;key&#39; : [1,2,3,4,5,6],
&#39;chrom&#39; : [&quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;,&quot;GL000008.2&quot;, &quot;GL000008.2&quot;], 
&#39;start&#39;: [14516,15377,17376,20177,22254, 24357], 
&#39;end&#39;: [14534,15403,17399,20195,22274,24377]})

What I want is:

DF1 = = pl.DataFrame({
&#39;chr&#39; : [&quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;, &quot;GL000008.2&quot;,&quot;GL000008.2&quot;, &quot;GL000008.2&quot;], 
&#39;start&#39;: [14516,17380,17381,20177,22254,24357], 
&#39;end&#39;: [14534,17399,17399,20195,22274,24377],
&#39;key&#39;: [1,3,3,4,5,6]
})

I'd like to assign the key from DF2 to DF1 when chrom matches chr and the start and end in DF1 are contained within the begin and end in DF2.

I first attempted to iterate through the rows of DF1, looking up the matching entry in DF2:

sz = len(DF1[:,0])
for i in range(sz):
    DF1[i,&quot;key&quot;] = DF2.filter(
        (pl.col(&quot;chrom&quot;) == DF1[i,&quot;chr&quot;])\
        &amp; (pl.col(&quot;begin&quot;) &lt;= DF1[i,&quot;start&quot;])\
        &amp; (pl.col(&quot;end&quot;) &gt;= DF1[i,&quot;end&quot;])
        ).select(&#39;key&#39;)[0,0]

Row iteration through a DF is incredibly slow. This takes about 10 hours.

I also tried using a np.array instead of directly into the df. thats a little faster, but still very slow.

I'm looking for a way to accomplish this using the native Polar's data structure. I don't have key to join on so the join and join_asof strategies don't seem to work.

答案1

得分: 0

join 和 filter 应该可以满足您的需求：

(
    df1.join(df2, left_on="chr", right_on="chrom")
    .filter(
        (pl.col("start") >= pl.col("start_right"))
        & (pl.col("end") <= pl.col("end_right"))
    )
    .drop(["start_right", "end_right"])
)

shape: (6, 4)
┌────────────┬───────┬───────┬─────┐
│ chr        ┆ start ┆ end   ┆ key │
│ ---        ┆ ---   ┆ ---   ┆ --- │
│ str        ┆ i64   ┆ i64   ┆ i64 │
╞════════════╪═══════╪═══════╪═════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1   │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3   │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3   │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4   │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5   │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6   │
└────────────┴───────┴───────┴─────┘

英文:

join and filter should give you what you need:

(
    df1.join(df2, left_on=&quot;chr&quot;, right_on=&quot;chrom&quot;)
    .filter(
        (pl.col(&quot;start&quot;) &gt;= pl.col(&quot;start_right&quot;))
        &amp; (pl.col(&quot;end&quot;) &lt;= pl.col(&quot;end_right&quot;))
    )
    .drop([&quot;start_right&quot;, &quot;end_right&quot;])
)

shape: (6, 4)
┌────────────┬───────┬───────┬─────┐
│ chr        ┆ start ┆ end   ┆ key │
│ ---        ┆ ---   ┆ ---   ┆ --- │
│ str        ┆ i64   ┆ i64   ┆ i64 │
╞════════════╪═══════╪═══════╪═════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1   │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3   │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3   │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4   │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5   │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6   │
└────────────┴───────┴───────┴─────┘

答案2

得分: 0

使用 join_asof 可能会提供一个高效的解决方案：

(
    DF1
    .sort('start')
    .join_asof(
        DF2.sort('start'),
        by_left="chr",
        by_right="chrom",
        on="start",
        strategy="backward")
    .filter(
        pl.col('end') <= pl.col('end_right')
    )
)

形状：(6, 5)
┌────────────┬───────┬───────┬─────┬───────────┐
│ chr        ┆ start ┆ end   ┆ key ┆ end_right │
│ ---        ┆ ---   ┆ ---   ┆ --- ┆ ---       │
│ str        ┆ i64   ┆ i64   ┆ i64 ┆ i64       │
╞════════════╪═══════╪═══════╪═════╪═══════════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1   ┆ 14534     │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3   ┆ 17399     │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3   ┆ 17399     │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4   ┆ 20195     │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5   ┆ 22274     │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6   ┆ 24377     │
└────────────┴───────┴───────┴─────┴───────────┘

注意：这假设 DF2 中的起止区间不重叠。

英文:

Using a join_asof might provide a performant solution:

(
    DF1
    .sort(&#39;start&#39;)
    .join_asof(
        DF2.sort(&#39;start&#39;),
        by_left=&quot;chr&quot;,
        by_right=&quot;chrom&quot;,
        on=&quot;start&quot;,
        strategy=&quot;backward&quot;)
    .filter(
        pl.col(&#39;end&#39;) &lt;= pl.col(&#39;end_right&#39;)
    )
)

shape: (6, 5)
┌────────────┬───────┬───────┬─────┬───────────┐
│ chr        ┆ start ┆ end   ┆ key ┆ end_right │
│ ---        ┆ ---   ┆ ---   ┆ --- ┆ ---       │
│ str        ┆ i64   ┆ i64   ┆ i64 ┆ i64       │
╞════════════╪═══════╪═══════╪═════╪═══════════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1   ┆ 14534     │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3   ┆ 17399     │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3   ┆ 17399     │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4   ┆ 20195     │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5   ┆ 22274     │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6   ┆ 24377     │
└────────────┴───────┴───────┴─────┴───────────┘

Note: this assumes that your start-end intervals in DF2 do not overlap.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Polars从字典值的有条件查找中的列

问题

答案1

答案2

How can I use pandas.query() to check if a string exists in a list within the dataframe?

如何检查 pandas 数据框的列是否包含列表中的任何值，并返回该值。

尝试在GoLang中传递标签运行BQ查询，但出现恐慌错误。

Concatenate results of `apply` in pandas.

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。