获取两个Pandas系列之间对象计数字典的最快方法

huangapple go评论156阅读模式
英文:

Fastest way to get a dictionary of counts of objects between 2 pandas series

问题

Here's the translated code portion:

假设我有两个相等长度的列表ss 和 ee每个列表包含值使得对于所有 iss[i] >= ee[i] 并且 ss[i+1] >= ee[i] 成立

例如
```python
ss = [0,10,20,30]
ee = [3,15,23,40]

vals = [0,1,2,5,7,10,11,16,21,22,23,29,31,35,45]

我想要返回一个字典,其中键是 ss 的值,值是落在 ss 和其对应值之间的 vals 的计数。

对于第一次迭代,值在 0 和 3(包括 3)之间的 vals 为 0,1,2,因此键为 0 的值为 3。

这是我的示例的期望输出:{0: 3, 10: 2, 20: 3, 30: 2}

这是我的最新尝试:

dCounts = {}
iv = 0
for i,e in enumerate(ss):
    count = 0
    s1 = e
    s2 = ee[i]
    while vals[iv] < s1:
        iv += 1
    while vals[iv] <= s2:
        iv += 1
        count += 1
    dCounts[s1] = count

如果 len(ss) = n,len(vals) = m,那么我认为这大致运行在 O(m + n) 的时间复杂度。

我无法想到比这更快的方法。对于大型列表,我认为我受限于 Python 的索引。

我以简单的 Python 形式提供了这个问题,以便更清晰,但实际上我正在使用具有日期时间索引的 Pandas 系列。我一直试图利用 Pandas 通常提供的速度,但似乎无法在合理的时间内处理我的大型(~200,000)向量。

我无法想到一个好的方法来避免在每次循环中解释每个时间。我尝试将 ss 和 ee 放入数据帧中,并在 apply 函数内使用 vals 的 .loc 方法,但这表现得比我尝试过的其他任何方法都差。


If you have any further questions or need additional assistance, please feel free to ask.

<details>
<summary>英文:</summary>

Lets say I have two equal length lists: ss and ee. Each contain values such that ss[i] &gt;= ee[i] and ss[i+1] &gt;= ee[i] is true for all i. 

For example:
```python
ss = [0,10,20,30]
ee = [3,15,23,40]

vals = [0,1,2,5,7,10,11,16,21,22,23,29,31,35,45]

I want to return a dictionary of counts where the keys are the values of ss, and the values are the counts of vals that fall between ss and its corresponding values.

for the first iteration, the values of vals between 0 and 3 (inclusive) are 0,1,2 so the key of 0 would have a value of 3.

Here is the desired output for my examples: {0: 3, 10: 2, 20: 3, 30: 2}

Here is my latest attempt:

dCounts = {}
iv = 0
for i,e in enumerate(ss):
    count = 0
    s1 = e
    s2 = ee[i]
    while vals[iv] &lt; s1:
        iv += 1
    while vals[iv] &lt;= s2:
        iv += 1
        count += 1
    dCounts[s1] = count

if len(ss) = n, and len(vals) = m, then I think this runs in roughly O(m + n) time.

I can't think of a faster way to do it than that. I think I'm at the mercy of python's indexing for large lists though.

I've given this problem as simple python for clarity, but I'm really working with pandas series with datetime indices. I've been trying to leverage the speed I usually get out of pandas but can't seem to get anything fast enough to parse my large (~200,000) vectors in reasonable time.

I can't think of a good way to not have to interpret each time through the loop. I tried putting ss and ee into a data frame and using the .loc method on vals inside an apply function, but that performed worse than anything else I tried.

答案1

得分: 3

I would use pandas.cut with an IntervalIndex and value_counts:

out = pd.cut(vals, bins=pd.IntervalIndex([pd.Interval(s, e, closed='both')
                                          for s, e in zip(ss, ee)])
             ).value_counts()

Output:

[0, 3]      3
[10, 15]    2
[20, 23]    3
[30, 40]    2
Name: count, dtype: int64

Or with [tag:numpy]'s searchsorted:

# ee must be sorted!
ss_arr = np.array(ss)
idx = np.searchsorted(ee, vals)

# remove 
m = idx < len(ss)
m2 = np.array(vals)[m] >= ss_arr[idx[m]]
idx2, cnt = np.unique(idx[m][m2], return_counts=True)
out = dict(zip(ss_arr[idx2], cnt))

Output: {0: 3, 10: 2, 20: 3, 30: 2}

英文:

I would use pandas.cut with an IntervalIndex and value_counts:

out = pd.cut(vals, bins=pd.IntervalIndex([pd.Interval(s, e, closed=&#39;both&#39;)
                                          for s, e in zip(ss, ee)])
             ).value_counts()

Output:

[0, 3]      3
[10, 15]    2
[20, 23]    3
[30, 40]    2
Name: count, dtype: int64

Or with [tag:numpy]'s searchsorted:

# ee must be sorted!
ss_arr = np.array(ss)
idx = np.searchsorted(ee, vals)

# remove 
m = idx&lt;len(ss)
m2 = np.array(vals)[m] &gt;= ss_arr[idx[m]]
idx2, cnt = np.unique(idx[m][m2], return_counts=True)
out = dict(zip(ss_arr[idx2], cnt))

Output: {0: 3, 10: 2, 20: 3, 30: 2}

答案2

得分: 0

以下是翻译好的部分:

你可以尝试这个

count_list = [sum(i in range(ele1,ele2+1) for i in vals) for ele1, ele2 in zip(ss, ee)]
result = dict(zip(ss, count_list))

count_list 是

[3, 2, 3, 2]

result 是

{0: 3, 10: 2, 20: 3, 30: 2}
英文:

You can try this well,

count_list = [sum(i in range(ele1,ele2+1) for i in vals) for ele1, ele2 in zip(ss, ee)]
result = dict(zip(ss, count_list))

the count_list is,

[3, 2, 3, 2]

result is ,

{0: 3, 10: 2, 20: 3, 30: 2}

huangapple
  • 本文由 发表于 2023年5月11日 15:08:07
  • 转载请务必保留本文链接:https://go.coder-hub.com/76224945.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定