英文:
Fastest way to get a dictionary of counts of objects between 2 pandas series
问题
Here's the translated code portion:
假设我有两个相等长度的列表:ss 和 ee。每个列表包含值,使得对于所有 i,ss[i] >= ee[i] 并且 ss[i+1] >= ee[i] 成立。
例如:
```python
ss = [0,10,20,30]
ee = [3,15,23,40]
vals = [0,1,2,5,7,10,11,16,21,22,23,29,31,35,45]
我想要返回一个字典,其中键是 ss 的值,值是落在 ss 和其对应值之间的 vals 的计数。
对于第一次迭代,值在 0 和 3(包括 3)之间的 vals 为 0,1,2,因此键为 0 的值为 3。
这是我的示例的期望输出:{0: 3, 10: 2, 20: 3, 30: 2}
这是我的最新尝试:
dCounts = {}
iv = 0
for i,e in enumerate(ss):
count = 0
s1 = e
s2 = ee[i]
while vals[iv] < s1:
iv += 1
while vals[iv] <= s2:
iv += 1
count += 1
dCounts[s1] = count
如果 len(ss) = n,len(vals) = m,那么我认为这大致运行在 O(m + n) 的时间复杂度。
我无法想到比这更快的方法。对于大型列表,我认为我受限于 Python 的索引。
我以简单的 Python 形式提供了这个问题,以便更清晰,但实际上我正在使用具有日期时间索引的 Pandas 系列。我一直试图利用 Pandas 通常提供的速度,但似乎无法在合理的时间内处理我的大型(~200,000)向量。
我无法想到一个好的方法来避免在每次循环中解释每个时间。我尝试将 ss 和 ee 放入数据帧中,并在 apply 函数内使用 vals 的 .loc 方法,但这表现得比我尝试过的其他任何方法都差。
If you have any further questions or need additional assistance, please feel free to ask.
<details>
<summary>英文:</summary>
Lets say I have two equal length lists: ss and ee. Each contain values such that ss[i] >= ee[i] and ss[i+1] >= ee[i] is true for all i.
For example:
```python
ss = [0,10,20,30]
ee = [3,15,23,40]
vals = [0,1,2,5,7,10,11,16,21,22,23,29,31,35,45]
I want to return a dictionary of counts where the keys are the values of ss, and the values are the counts of vals that fall between ss and its corresponding values.
for the first iteration, the values of vals between 0 and 3 (inclusive) are 0,1,2 so the key of 0 would have a value of 3.
Here is the desired output for my examples: {0: 3, 10: 2, 20: 3, 30: 2}
Here is my latest attempt:
dCounts = {}
iv = 0
for i,e in enumerate(ss):
count = 0
s1 = e
s2 = ee[i]
while vals[iv] < s1:
iv += 1
while vals[iv] <= s2:
iv += 1
count += 1
dCounts[s1] = count
if len(ss) = n, and len(vals) = m, then I think this runs in roughly O(m + n) time.
I can't think of a faster way to do it than that. I think I'm at the mercy of python's indexing for large lists though.
I've given this problem as simple python for clarity, but I'm really working with pandas series with datetime indices. I've been trying to leverage the speed I usually get out of pandas but can't seem to get anything fast enough to parse my large (~200,000) vectors in reasonable time.
I can't think of a good way to not have to interpret each time through the loop. I tried putting ss and ee into a data frame and using the .loc method on vals inside an apply function, but that performed worse than anything else I tried.
答案1
得分: 3
I would use pandas.cut
with an IntervalIndex
and value_counts
:
out = pd.cut(vals, bins=pd.IntervalIndex([pd.Interval(s, e, closed='both')
for s, e in zip(ss, ee)])
).value_counts()
Output:
[0, 3] 3
[10, 15] 2
[20, 23] 3
[30, 40] 2
Name: count, dtype: int64
Or with [tag:numpy]'s searchsorted
:
# ee must be sorted!
ss_arr = np.array(ss)
idx = np.searchsorted(ee, vals)
# remove
m = idx < len(ss)
m2 = np.array(vals)[m] >= ss_arr[idx[m]]
idx2, cnt = np.unique(idx[m][m2], return_counts=True)
out = dict(zip(ss_arr[idx2], cnt))
Output: {0: 3, 10: 2, 20: 3, 30: 2}
英文:
I would use pandas.cut
with an IntervalIndex
and value_counts
:
out = pd.cut(vals, bins=pd.IntervalIndex([pd.Interval(s, e, closed='both')
for s, e in zip(ss, ee)])
).value_counts()
Output:
[0, 3] 3
[10, 15] 2
[20, 23] 3
[30, 40] 2
Name: count, dtype: int64
Or with [tag:numpy]'s searchsorted
:
# ee must be sorted!
ss_arr = np.array(ss)
idx = np.searchsorted(ee, vals)
# remove
m = idx<len(ss)
m2 = np.array(vals)[m] >= ss_arr[idx[m]]
idx2, cnt = np.unique(idx[m][m2], return_counts=True)
out = dict(zip(ss_arr[idx2], cnt))
Output: {0: 3, 10: 2, 20: 3, 30: 2}
答案2
得分: 0
以下是翻译好的部分:
你可以尝试这个,
count_list = [sum(i in range(ele1,ele2+1) for i in vals) for ele1, ele2 in zip(ss, ee)]
result = dict(zip(ss, count_list))
count_list 是,
[3, 2, 3, 2]
result 是,
{0: 3, 10: 2, 20: 3, 30: 2}
英文:
You can try this well,
count_list = [sum(i in range(ele1,ele2+1) for i in vals) for ele1, ele2 in zip(ss, ee)]
result = dict(zip(ss, count_list))
the count_list is,
[3, 2, 3, 2]
result is ,
{0: 3, 10: 2, 20: 3, 30: 2}
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论