May 11, 2023, 15:08:07

Fastest way to get a dictionary of counts of objects between 2 pandas series

Question

Let's say I have two equal-length lists: ss and ee. Their values satisfy ss[i] <= ee[i] and ss[i+1] >= ee[i] for all i (each interval starts no earlier than the previous one ends).
For example:
```python
ss = [0, 10, 20, 30]
ee = [3, 15, 23, 40]
vals = [0, 1, 2, 5, 7, 10, 11, 16, 21, 22, 23, 29, 31, 35, 45]
```

I want to return a dictionary of counts where the keys are the values of ss, and each value is the number of elements of vals that fall between ss[i] and its corresponding ee[i], inclusive.

For the first interval, the elements of vals between 0 and 3 (inclusive) are 0, 1, 2, so the key 0 would have a value of 3.

Here is the desired output for my examples: {0: 3, 10: 2, 20: 3, 30: 2}

Here is my latest attempt:

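```python
dCounts = {}
iv = 0  # position in vals; only moves forward, so vals must be sorted
for s1, s2 in zip(ss, ee):
    count = 0
    # skip values that lie before the start of this interval
    while iv < len(vals) and vals[iv] < s1:
        iv += 1
    # count values up to and including the end of this interval
    while iv < len(vals) and vals[iv] <= s2:
        iv += 1
        count += 1
    dCounts[s1] = count
```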

If len(ss) = n and len(vals) = m, then I think this runs in roughly O(m + n) time.

I can't think of a faster way to do it than that. I think I'm at the mercy of Python's indexing for large lists, though.

I've given this problem as simple Python for clarity, but I'm really working with pandas Series with datetime indices. I've been trying to leverage the speed I usually get out of pandas, but I can't seem to get anything fast enough to process my large (~200,000 elements) vectors in reasonable time.

I can't think of a good way to not have to interpret each time through the loop. I tried putting ss and ee into a data frame and using the .loc method on vals inside an apply function, but that performed worse than anything else I tried.

Answer 1

Score: 3

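I would use pandas.cut with an IntervalIndex and value_counts:

```python
import pandas as pd

# one closed interval [s, e] per pair; values outside every interval are not counted
out = pd.cut(vals, bins=pd.IntervalIndex([pd.Interval(s, e, closed='both')
                                          for s, e in zip(ss, ee)])
             ).value_counts()
```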

Output:

```
[0, 3]      3
[10, 15]    2
[20, 23]    3
[30, 40]    2
Name: count, dtype: int64
```
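
If a plain dictionary keyed by the interval start is needed, as in the question, the counts can be relabeled by each interval's left edge. The sketch below assumes the example data from the question and builds the bins with pd.IntervalIndex.from_arrays, an equivalent constructor that is not part of the answer itself. It has not been benchmarked on the ~200,000-element case.

```python
import pandas as pd

ss = [0, 10, 20, 30]
ee = [3, 15, 23, 40]
vals = [0, 1, 2, 5, 7, 10, 11, 16, 21, 22, 23, 29, 31, 35, 45]

# Assumption: the intervals do not overlap, which pd.cut requires for an IntervalIndex
bins = pd.IntervalIndex.from_arrays(ss, ee, closed="both")
counts = pd.cut(vals, bins=bins).value_counts()

# Relabel each interval by its left edge to get {start: count}
result = {interval.left: int(n) for interval, n in counts.items()}
print(result)
```

For the example data this gives the counts 3, 2, 3, 2 for the starts 0, 10, 20, 30. Intervals that contain no values keep a count of 0 here, because the categorical result lists every bin.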

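Or with numpy's searchsorted:

```python
import numpy as np

# ee must be sorted!
ss_arr = np.array(ss)
# index of the first interval whose end is >= each value
idx = np.searchsorted(ee, vals)
# drop values that lie beyond the last interval end
m = idx < len(ss)
# keep only values at or after the start of their candidate interval
m2 = np.array(vals)[m] >= ss_arr[idx[m]]
idx2, cnt = np.unique(idx[m][m2], return_counts=True)
out = dict(zip(ss_arr[idx2], cnt))
```

Output: {0: 3, 10: 2, 20: 3, 30: 2}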
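
A related variant, sketched here under the assumption that vals is already sorted (as it is in the question's example), searches vals for the interval bounds rather than the other way round. This is a different arrangement from the code above, not the answerer's method: two searchsorted calls give the first position at or after each start and the first position past each end, and their difference is the count. The function name count_in_intervals is made up for illustration.

```python
import numpy as np

def count_in_intervals(ss, ee, vals):
    """Count values v with ss[i] <= v <= ee[i] for each i. Assumes vals is sorted."""
    vals = np.asarray(vals)
    lo = np.searchsorted(vals, ss, side="left")   # first index with vals >= ss[i]
    hi = np.searchsorted(vals, ee, side="right")  # first index with vals > ee[i]
    return dict(zip(ss, (hi - lo).tolist()))

ss = [0, 10, 20, 30]
ee = [3, 15, 23, 40]
vals = [0, 1, 2, 5, 7, 10, 11, 16, 21, 22, 23, 29, 31, 35, 45]
counts = count_in_intervals(ss, ee, vals)
print(counts)  # {0: 3, 10: 2, 20: 3, 30: 2}
```

Unlike the np.unique approach, intervals with no matching values show up with a count of 0. The same two calls also work on numpy datetime64 arrays, which is closer to the datetime-indexed data described in the question.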


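Answer 2

Score: 0

You can try this:

```python
# for each interval, count the elements of vals inside [ele1, ele2]; works for integer bounds only
count_list = [sum(i in range(ele1, ele2 + 1) for i in vals) for ele1, ele2 in zip(ss, ee)]
result = dict(zip(ss, count_list))
```

count_list is:

[3, 2, 3, 2]

and result is:

{0: 3, 10: 2, 20: 3, 30: 2}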
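
Because range only accepts integers, the comprehension above does not carry over to the datetime data mentioned in the question. A minimal sketch of the same brute-force idea using chained comparisons instead is shown below; the function name count_between and the sample timestamps are hypothetical, chosen only for illustration. It still checks every value against every interval, so it is not expected to be fast on large inputs.

```python
from datetime import datetime

def count_between(ss, ee, vals):
    """Brute-force count of values v with s <= v <= e for each (s, e) pair."""
    return {s: sum(s <= v <= e for v in vals) for s, e in zip(ss, ee)}

# Hypothetical timestamps, not taken from the question
ss = [datetime(2023, 5, 1), datetime(2023, 5, 10)]
ee = [datetime(2023, 5, 3), datetime(2023, 5, 15)]
vals = [datetime(2023, 5, d) for d in (1, 2, 5, 10, 11, 16)]

result = count_between(ss, ee, vals)
print(list(result.values()))  # [2, 2]
```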
