August 10, 2023, 22:41:35


How to drop (delete) consecutive values in a Dataframe

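Question

How can the bolded values be deleted?
One possible approach, which removes every value that is immediately followed by its consecutive successor:

import numpy as np
# Original list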
t = [74536, 74537, 74538, 74540, 74542, 74543, 74544, 74545, 74547, 74551, 74554, 74555, 74559, 74560, 74561, 74562, 74563, 74566, 74567, 74568, 74569, 74571, 74572, 74573, 74574, 74575, 74578, 74579, 74580, 74582, 74584, 74585, 74586, 74587, 74588, 74589, 74590, 74591, 74592, 74595, 74596, 74597, 74598, 74599, 74601, 74602, 74603, 74604, 74605, 74606, 74607, 74608, 74609, 74610, 74612, 74613, 74614, 74615, 74616, 74617, 74618, 74619, 74620, 74621, 74622, 74623, 74624, 74625, 74626, 74627, 74628, 74629, 74630, 74631, 74632, 74633, 74634, 74635, 74636, 74637, 74638, 74639, 74640, 74641, 74642, 74643, 74644, 74645, 74646, 74647, 74648, 74649, 74650, 74651, 74652, 74653, 74654, 74655, 74656, 74657, 74658, 74659, 74660, 74661, 74662, 74663, 74664, 74665, 74666, 74667, 74668, 74669, 74670, 74671, 74672, 74673, 74674, 74675, 74676, 74677, 74678, 74679, 74680, 74681, 74682, 74683, 74684, 74685, 74686, 74687, 74688, 74689, 74690, 74691, 74692, 74693, 74694, 74695, 74696, 74697, 74698, 74699, 74700, 74701, 74702, 74703, 74704, 74705, 74706, 74707, 74708, 74709, 74710, 74711, 74712, 74713, 74714, 74715, 74716, 74717, 74718, 74719, 74720, 74721, 74722, 74723, 74724, 74725, 74726, 74727, 74728, 74729, 74730, 74731, 74732, 74733, 74734, 74735, 74736, 74737, 74738, 74739, 74740, 74741, 74742, 74743, 74744, 74745, 74746, 74747, 74748, 74749, 74750, 74751, 74752, 74753, 74754, 74755, 74756, 74757, 74758, 74759, 74760, 74761, 74762, 74763, 74764, 74765, 74766, 74767, 74768, 74769, 74770, 74771, 74772, 74773, 74774, 74775, 74776, 74777, 74778, 74779, 74780, 74781, 74782, 74783, 74784, 74785, 74786, 74787, 74788, 74789, 74790, 74791, 74792, 74793, 74794, 74795, 74796, 74797, 74798, 74799, 74800, 74801, 74802, 74803, 74804, 74805, 74806, 74807, 74808, 74809, 74810, 74811, 74812, 74813, 74814, 74815, 74816, 74817, 74818, 74819, 74820, 74821, 74822, 74823, 74824, 74825, 74826, 74827, 74828, 74829, 74830, 74831, 74832, 74833, 74834, 74835, 74836, 74837, 74838, 74839, 74840, 74841, 74842]
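# Differences between neighbouring values
diff_t = np.diff(t)
# Positions whose next value is exactly one greater (a set makes the lookup below fast)
indexes_to_remove = set(np.where(diff_t == 1)[0].tolist())
# Keep only the values at the remaining positions
filtered_t = [t[i] for i in range(len(t)) if i not in indexes_to_remove]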

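How can this be made more efficient?
To speed things up, you can try replacing the loop with vectorized pandas operations. One possible sketch:

import pandas as pd
# Assumes the data is stored in a pandas DataFrame named df
# Helper column: True where a row's DSCD is exactly one greater than the previous row's
df['is_continuous'] = df['DSCD'].diff() == 1
# Index labels of the flagged rows
rows_to_remove = df[df['is_continuous']].index
# Drop those rows
df.drop(rows_to_remove, inplace=True)
# Drop the helper column
df.drop(columns=['is_continuous'], inplace=True)

This avoids an explicit Python loop. DSCD is the firm identifier in the question, so the column and the condition need adapting to the values you actually want to test.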

Original question:

I have a dataframe with a column that contains 0 values. I want to find those 0 values and, where they run all the way to the end, drop only the ones at the end and not those in the middle.

This is what the data in secondary_df looks like:

     DSCD        date    year  month     RI     RIu    RIu1    RIe  RIu Pct Return  RIe_Pct_Return
203  1316   1/29/2010  2010.0    1.0  66.39   66.30    6.21   6.39           -0.05          0.0255
275  1316   1/29/2016  2016.0    1.0  66.97  166.84    6.89   6.32         -0.0358          -0.059
131  1316   1/30/2004  2004.0    1.0  66.01   66.15    6.36   6.60           0.058         0.05106
191  1316   1/30/2009  2009.0    1.0  66.36    6.54  685.25   6.71          0.0055          0.0535
263  1316   1/30/2015  2015.0    1.0  66.43    6.94  114.14   6.33          -0.035           0.053
..    ...         ...     ...    ...    ...     ...     ...    ...             ...             ...
250  1316  12/31/2013  2013.0   12.0  99.98    5.24   59.91   5.07         0.01092           -0.05
262  1316  12/31/2014  2014.0   12.0  99.33   54.14   54.64  55.96          -0.001         0.02572
274  1316  12/31/2015  2015.0   12.0  55.32    5.89   15.19  54.34          -0.003         -0.0512
310  1316  12/31/2018  2018.0   12.0  55.56   55.23    5.40   5.49          -0.000        -0.05274
322  1316  12/31/2019  2019.0   12.0  55.39   55.98    5.69   5.88           0.004           0.039

This is what I have so far:

for DSCD in FirmReturnIndexValues['DSCD'].unique():
    secondary_df = FirmReturnIndexValues[FirmReturnIndexValues['DSCD'] == DSCD]
    t = secondary_df[(secondary_df['RIe Pct Return'].values == 0)].index.values.tolist()
    t.sort()
    if len(t) >= 1:
        print(np.diff(t))

For example, this is t:

[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658]

These are the indexes I get from my code. When I apply np.diff() to them I get the values below; the ones I wish to drop (delete) are shown in bold:

> [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 2 80 22 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1]

So I have two questions:

How can I delete the bolded ones?
The first for loop iterates over 8,000 DSCDs; is there any way to make it more efficient?

Another Example:
list t:

[74536, 74537, 74538, 74540, 74542, 74543, 74544, 74545, 74547, 74551, 74554, 74555, 74559, 74560, 74561, 74562, 74563, 74566, 74567, 74568, 74569, 74571, 74572, 74573, 74574, 74575, 74578, 74579, 74580, 74582, 74584, 74585, 74586, 74587, 74588, 74589, 74590, 74591, 74592, 74595, 74596, 74597, 74598, 74599, 74601, 74602, 74603, 74604, 74605, 74606, 74607, 74608, 74609, 74610, 74612, 74613, 74614, 74615, 74616, 74617, 74618, 74619, 74620, 74621, 74622, 74623, 74624, 74625, 74626, 74627, 74628, 74629, 74630, 74631, 74632, 74633, 74634, 74635, 74636, 74637, 74638, 74639, 74640, 74641, 74642, 74643, 74644, 74645, 74646, 74647, 74648, 74649, 74650, 74651, 74652, 74653, 74654, 74655, 74656, 74657, 74658, 74659, 74660, 74661, 74662, 74663, 74664, 74665, 74666, 74667, 74668, 74669, 74670, 74671, 74672, 74673, 74674, 74675, 74676, 74677, 74678, 74679, 74680, 74681, 74682, 74683, 74684, 74685, 74686, 74687, 74688, 74689, 74690, 74691, 74692, 74693, 74694, 74695, 74696, 74697, 74698, 74699, 74700, 74701, 74702, 74703, 74704, 74705, 74706, 74707, 74708, 74709, 74710, 74711, 74712, 74713, 74714, 74715, 74716, 74717, 74718, 74719, 74720, 74721, 74722, 74723, 74724, 74725, 74726, 74727, 74728, 74729, 74730, 74731, 74732, 74733, 74734, 74735, 74736, 74737, 74738, 74739, 74740, 74741, 74742, 74743, 74744, 74745, 74746, 74747, 74748, 74749, 74750, 74751, 74752, 74753, 74754, 74755, 74756, 74757, 74758, 74759, 74760, 74761, 74762, 74763, 74764, 74765, 74766, 74767, 74768, 74769, 74770, 74771, 74772, 74773, 74774, 74775, 74776, 74777, 74778, 74779, 74780, 74781, 74782, 74783, 74784, 74785, 74786, 74787, 74788, 74789, 74790, 74791, 74792, 74793, 74794, 74795, 74796, 74797, 74798, 74799, 74800, 74801, 74802, 74803, 74804, 74805, 74806, 74807, 74808, 74809, 74810, 74811, 74812, 74813, 74814, 74815, 74816, 74817, 74818, 74819, 74820, 74821, 74822, 74823, 74824, 74825, 74826, 74827, 74828, 74829, 74830, 74831, 74832, 74833, 74834, 74835, 74836, 74837, 74838, 74839, 74840, 74841, 74842]

result of np.diff():

> [1 1 2 2 1 1 1 2 4 3 1 4 1 1 1 1 3 1 1 1 2 1 1 1 1 3 1 1 2 2 1 1 1 1 1 1 1
1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

Hard one:

[6359, 6431, 6287, 6347, 6419, 6263, 6275, 6299, 6311, 6323, 6335, 6371, 6383, 6395, 6407, 6443, 6455, 6467, 6360, 6288, 6348, 6420, 6252, 6264, 6276, 6300, 6312, 6324, 6372, 6396, 6408, 6444, 6456, 6468, 6336, 6384, 6432, 6265, 6397, 6469, 6253, 6325, 6385, 6457, 6277, 6289, 6301, 6313, 6337, 6349, 6361, 6373, 6409, 6421, 6433, 6445, 6314, 6446, 6302, 6374, 6434, 6254, 6266, 6278, 6290, 6326, 6338, 6350, 6362, 6386, 6398, 6410, 6422, 6458, 6470, 6423, 6279, 6339, 6411, 6255, 6267, 6291, 6303, 6315, 6327, 6363, 6375, 6387, 6399, 6435, 6447, 6459, 6471, 6268, 6400, 6472, 6256, 6328, 6388, 6460, 6280, 6292, 6304, 6316, 6340, 6351, 6364, 6376, 6412, 6424, 6436, 6448, 6305, 6377, 6437, 6293, 6365, 6257, 6269, 6281, 6317, 6329, 6341, 6389, 6401, 6413, 6425, 6449, 6461, 6473, 6282, 6342, 6414, 6270, 6402, 6474, 6258, 6294, 6306, 6318, 6330, 6353, 6366, 6378, 6390, 6426, 6438, 6450, 6462, 6259, 6331, 6391, 6463, 6319, 6451, 6271, 6283, 6295, 6307, 6343, 6355, 6367, 6379, 6403, 6415, 6427, 6439, 6475, 6296, 6368, 6356, 6428, 6260, 6272, 6284, 6308, 6320, 6332, 6344, 6380, 6392, 6404, 6416, 6440, 6452, 6464, 6476, 6285, 6345, 6417, 6273, 6405, 6477, 6261, 6297, 6309, 6321, 6333, 6357, 6369, 6381, 6393, 6429, 6441, 6453, 6465, 6322, 6454, 6310, 6382, 6442, 6262, 6274, 6286, 6298, 6334, 6346, 6358, 6370, 6394, 6406, 6418, 6430, 6466, 6478]

After sorting the previous list, np.diff() gives:

> [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1]

By the same logic, the bolded ones need to be removed.

Answer 1

Score: 1


I think this is what you are looking for (Full example)

Pandas Version:

s = pd.Series([69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658])
s[:s.index[s.diff().ne(1)][-1]+1]
returns:
[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572]

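Logic explanation:
First, diff gets you to where you already were.

ne(1) keeps only the locations where the diff is not equal to 1, and we then take the indices of those points.

[-1] takes the last of those indices. Add 1 to it in the pandas version (2 in the NumPy version, because np.diff returns one element fewer and its positions are shifted by one) and then slice the original array.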
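
The three steps can be traced on a short series. This is only a minimal sketch: the values in `s` are made up for illustration (they are not the asker's data), and it assumes the default 0..n-1 index so that a label can be used as a position.

```python
import pandas as pd

# Illustrative values: a scattered part followed by a trailing consecutive run.
s = pd.Series([10, 11, 13, 20, 21, 22, 23])

# Step 1: difference to the previous element (the first entry is NaN).
d = s.diff()

# Step 2: labels where the difference is not 1 (NaN also counts as "not 1").
breaks = s.index[d.ne(1).to_numpy()]

# Step 3: the last such label is where the trailing run starts;
# slicing one past it keeps that first element and drops the rest of the run.
last_break = breaks[-1]
trimmed = s.iloc[:last_break + 1]

print(list(breaks))      # [0, 2, 3]
print(trimmed.tolist())  # [10, 11, 13, 20]
```

Whether the first element of the trailing run should be kept or dropped depends on which values were bolded in the question; change `+ 1` to `+ 0` to drop it as well.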

Numpy Version:

# stay in pandas Series/DataFrames and NumPy arrays as much as possible; avoid lists
for DSCD in FirmReturnIndexValues['DSCD'].unique():
    secondary_df = FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
    t = secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values
    t.sort()
    if len(t) >= 1:
        # get last index where diff is not 1
        ind = np.where(np.not_equal(np.diff(t), 1))[0][-1]
        # add 2 to index to remove the bolded numbers per your example
        result = t[:ind+2]
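
One thing to watch in the NumPy version: `[0][-1]` raises an IndexError when no diff differs from 1, for instance when every value in t is consecutive or t has a single element. A guarded variant might look like the sketch below. `trim_trailing_run` is a hypothetical helper name, and returning only the first element when the whole array is one consecutive run is an assumption chosen to mirror the `t[:ind+2]` slice, not something stated in the question.

```python
import numpy as np

def trim_trailing_run(t):
    """Drop the trailing consecutive run of t, keeping the run's first element.

    Hypothetical helper mirroring the t[:ind+2] slice from the NumPy version.
    """
    t = np.sort(np.asarray(t))
    if t.size < 2:
        # Nothing to trim in an empty or single-element array.
        return t
    # Positions i where t[i+1] - t[i] != 1, i.e. where a run is broken.
    breaks = np.flatnonzero(np.diff(t) != 1)
    if breaks.size == 0:
        # Assumption: the whole array is one run, so keep only its first element.
        return t[:1]
    return t[:breaks[-1] + 2]

print(trim_trailing_run([10, 11, 13, 20, 21, 22, 23]).tolist())  # [10, 11, 13, 20]
print(trim_trailing_run([5, 6, 7]).tolist())                     # [5]
```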

Answer Continued:

In general, here are some simple speed-ups for the code you have shown. It is unclear, though, exactly what your input df looks like and what result you want. If you can provide that, we can benchmark it.

# use the pandas comparison methods (eq, lt, gt, ge, le, ne) instead of ==, <, >, etc.
secondary_df = FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
t = secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values  # don't convert to a list if you do not need to
t.sort()  # sorts the NumPy array in place
if len(t) >= 1:
    print(np.diff(t))

These are quick wins. A refactor would likely give you structural speed-ups as well, but for that you need to provide a full working example and the expected output.
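
As one possible direction for such a refactor, the per-DSCD loop could be replaced by a grouped, reversed cumulative maximum. This is a sketch under assumptions rather than a tested solution for the asker's data: the sample frame is made up, the index is assumed to be unique, rows are assumed to be in chronological order within each DSCD, and trailing zeros are dropped entirely (including the first zero of the trailing run).

```python
import pandas as pd

# Made-up sample; column names follow the question, values are illustrative.
df = pd.DataFrame({
    "DSCD": [1316, 1316, 1316, 1316, 1316, 2000, 2000, 2000],
    "RIe Pct Return": [0.02, 0.0, 0.01, 0.0, 0.0, 0.0, 0.03, 0.0],
})

# Reverse the rows so that each firm's last row comes first.
rev = df.iloc[::-1]

# 1 where the return is non-zero, 0 where it is zero.
nonzero = rev["RIe Pct Return"].ne(0).astype(int)

# Within each firm, the cumulative max stays 0 until the first non-zero
# value (seen from the end) is reached, then stays 1.
keep = nonzero.groupby(rev["DSCD"]).cummax().iloc[::-1].astype(bool)

# Keep only rows at or before each firm's last non-zero return.
trimmed = df[keep]
print(trimmed.index.tolist())  # [0, 1, 2, 5, 6]
```

Zeros in the middle of a firm's series (row 1 here) and leading zeros (row 5) survive; only the zeros after the last non-zero return are removed.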

Answer 2

Score: 0


If I have understood your problem correctly, this code removes the need for a for loop and drops the unneeded data.

import numpy as np
import pandas as pd

# generate a dummy dataframe
df = pd.DataFrame({"col1": np.random.randint(-5, 5, 20)})
# index of the last 0 value
idx_end = df[df.col1 == 0]['col1'].index[-1]
# index of the 0 value that starts the last continuous run of 0 values
idx_start = df.iloc[:idx_end][df.iloc[:idx_end].col1 != 0]['col1'].index[-1] + 1
# drop data
df.drop(range(idx_start, idx_end + 1), inplace=True)
