如何在数据框中删除连续数值

huangapple go评论114阅读模式
英文:

How to drop (delete) consecutive values in a Dataframe

问题

  1. 如何删除粗体部分的值?
    您可以使用以下方法删除粗体部分的值:
import numpy as np

# 原始列表
t = [74536, 74537, 74538, 74540, 74542, 74543, 74544, 74545, 74547, 74551, 74554, 74555, 74559, 74560, 74561, 74562, 74563, 74566, 74567, 74568, 74569, 74571, 74572, 74573, 74574, 74575, 74578, 74579, 74580, 74582, 74584, 74585, 74586, 74587, 74588, 74589, 74590, 74591, 74592, 74595, 74596, 74597, 74598, 74599, 74601, 74602, 74603, 74604, 74605, 74606, 74607, 74608, 74609, 74610, 74612, 74613, 74614, 74615, 74616, 74617, 74618, 74619, 74620, 74621, 74622, 74623, 74624, 74625, 74626, 74627, 74628, 74629, 74630, 74631, 74632, 74633, 74634, 74635, 74636, 74637, 74638, 74639, 74640, 74641, 74642, 74643, 74644, 74645, 74646, 74647, 74648, 74649, 74650, 74651, 74652, 74653, 74654, 74655, 74656, 74657, 74658, 74659, 74660, 74661, 74662, 74663, 74664, 74665, 74666, 74667, 74668, 74669, 74670, 74671, 74672, 74673, 74674, 74675, 74676, 74677, 74678, 74679, 74680, 74681, 74682, 74683, 74684, 74685, 74686, 74687, 74688, 74689, 74690, 74691, 74692, 74693, 74694, 74695, 74696, 74697, 74698, 74699, 74700, 74701, 74702, 74703, 74704, 74705, 74706, 74707, 74708, 74709, 74710, 74711, 74712, 74713, 74714, 74715, 74716, 74717, 74718, 74719, 74720, 74721, 74722, 74723, 74724, 74725, 74726, 74727, 74728, 74729, 74730, 74731, 74732, 74733, 74734, 74735, 74736, 74737, 74738, 74739, 74740, 74741, 74742, 74743, 74744, 74745, 74746, 74747, 74748, 74749, 74750, 74751, 74752, 74753, 74754, 74755, 74756, 74757, 74758, 74759, 74760, 74761, 74762, 74763, 74764, 74765, 74766, 74767, 74768, 74769, 74770, 74771, 74772, 74773, 74774, 74775, 74776, 74777, 74778, 74779, 74780, 74781, 74782, 74783, 74784, 74785, 74786, 74787, 74788, 74789, 74790, 74791, 74792, 74793, 74794, 74795, 74796, 74797, 74798, 74799, 74800, 74801, 74802, 74803, 74804, 74805, 74806, 74807, 74808, 74809, 74810, 74811, 74812, 74813, 74814, 74815, 74816, 74817, 74818, 74819, 74820, 74821, 74822, 74823, 74824, 74825, 74826, 74827, 74828, 74829, 74830, 74831, 74832, 74833, 74834, 74835, 74836, 74837, 74838, 74839, 74840, 74841, 74842]

# 计算差分
diff_t = np.diff(t)

# 找到要删除的索引
indexes_to_remove = np.where(diff_t == 1)[0]

# 删除粗体部分的值
filtered_t = [t[i] for i in range(len(t)) if i not in indexes_to_remove]
  1. 如何提高效率?
    为了提高效率,您可以尝试使用 Pandas 的矢量化操作来替代循环,以减少计算时间。以下是一个可能的优化示例:
import pandas as pd

# 假设您的数据存储在一个名为 df 的 Pandas DataFrame 中

# 添加一列,表示当前行与上一行是否连续
df['is_continuous'] = df['DSCD'].diff() == 1

# 找到需要删除的连续行
rows_to_remove = df[df['is_continuous']].index

# 删除这些行
df.drop(rows_to_remove, inplace=True)

# 删除添加的列
df.drop(columns=['is_continuous'], inplace=True)

这种方法可以帮助您更有效地删除粗体部分的值,而不需要使用循环。

英文:

I have a dataframe with a column that has 0 Values. I wish to find those 0 values and check if till the end they are 0, drop only those at the end and not in the middle.

this is how the Data in secondary_df looks like:

      DSCD        date    year    month  RI    RIu   RIu1    RIe  
203  1316    1/29/2010  2010.0    1.0  66.39  66.30  6.21    6.39  \
275  1316    1/29/2016  2016.0    1.0  66.97  166.84  6.89   6.32   
131  1316    1/30/2004  2004.0    1.0  66.01  66.15   6.36   6.60   
191  1316    1/30/2009  2009.0    1.0  66.36  6.54  685.25   6.71   
263  1316    1/30/2015  2015.0    1.0  66.43  6.94  114.14   6.33   
..    ...       ...      ...      ...    ...   ...     ...     ...   
250  1316   12/31/2013  2013.0   12.0  99.98   5.24    59.91   5.07   
262  1316   12/31/2014  2014.0   12.0  99.33   54.14   54.64   55.96   
274  1316   12/31/2015  2015.0   12.0  55.32   5.89    15.19  54.34   
310  1316   12/31/2018  2018.0   12.0  55.56   55.23   5.40   5.49   
322  1316   12/31/2019  2019.0   12.0  55.39   55.98   5.69   5.88  
RIu Pct Return  RIe_Pct_Return   
203        -0.05        0.0255  \
275        -0.0358      -0.059   
131         0.058        0.05106   
191         0.0055       0.0535   
263        -0.035        0.053   
..               ...             ...   
250         0.01092       -0.05   
262        -0.001        0.02572   
274        -0.003       -0.0512   
310        -0.000       -0.05274   
322         0.004        0.039   

This is what I got so far.

for DSCD in FirmReturnIndexValues['DSCD'].unique():
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD']==DSCD]
t=secondary_df[(secondary_df['RIe Pct Return'].values == 0)].index.values.tolist()
t.sort()
if len(t)>=1:
print(np.diff((t)))

for example this part is t:

[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658]

this is the Indexes that I get from my code and when I use the np.diff() method I get this values and the values I wish to drop (delete) are being bolded:

> [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 2 80 22 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1]

so I have 2 Questions.

  1. how can I delete the bolded one's?
  2. the first for loop contains 8000 DSCD's is there anyway it can be more efficient?

Another Example:
list t:

[74536, 74537, 74538, 74540, 74542, 74543, 74544, 74545, 74547, 74551, 74554, 74555, 74559, 74560, 74561, 74562, 74563, 74566, 74567, 74568, 74569, 74571, 74572, 74573, 74574, 74575, 74578, 74579, 74580, 74582, 74584, 74585, 74586, 74587, 74588, 74589, 74590, 74591, 74592, 74595, 74596, 74597, 74598, 74599, 74601, 74602, 74603, 74604, 74605, 74606, 74607, 74608, 74609, 74610, 74612, 74613, 74614, 74615, 74616, 74617, 74618, 74619, 74620, 74621, 74622, 74623, 74624, 74625, 74626, 74627, 74628, 74629, 74630, 74631, 74632, 74633, 74634, 74635, 74636, 74637, 74638, 74639, 74640, 74641, 74642, 74643, 74644, 74645, 74646, 74647, 74648, 74649, 74650, 74651, 74652, 74653, 74654, 74655, 74656, 74657, 74658, 74659, 74660, 74661, 74662, 74663, 74664, 74665, 74666, 74667, 74668, 74669, 74670, 74671, 74672, 74673, 74674, 74675, 74676, 74677, 74678, 74679, 74680, 74681, 74682, 74683, 74684, 74685, 74686, 74687, 74688, 74689, 74690, 74691, 74692, 74693, 74694, 74695, 74696, 74697, 74698, 74699, 74700, 74701, 74702, 74703, 74704, 74705, 74706, 74707, 74708, 74709, 74710, 74711, 74712, 74713, 74714, 74715, 74716, 74717, 74718, 74719, 74720, 74721, 74722, 74723, 74724, 74725, 74726, 74727, 74728, 74729, 74730, 74731, 74732, 74733, 74734, 74735, 74736, 74737, 74738, 74739, 74740, 74741, 74742, 74743, 74744, 74745, 74746, 74747, 74748, 74749, 74750, 74751, 74752, 74753, 74754, 74755, 74756, 74757, 74758, 74759, 74760, 74761, 74762, 74763, 74764, 74765, 74766, 74767, 74768, 74769, 74770, 74771, 74772, 74773, 74774, 74775, 74776, 74777, 74778, 74779, 74780, 74781, 74782, 74783, 74784, 74785, 74786, 74787, 74788, 74789, 74790, 74791, 74792, 74793, 74794, 74795, 74796, 74797, 74798, 74799, 74800, 74801, 74802, 74803, 74804, 74805, 74806, 74807, 74808, 74809, 74810, 74811, 74812, 74813, 74814, 74815, 74816, 74817, 74818, 74819, 74820, 74821, 74822, 74823, 74824, 74825, 74826, 74827, 74828, 74829, 74830, 74831, 74832, 74833, 74834, 74835, 74836, 74837, 74838, 74839, 74840, 74841, 74842]

result of np.diff():

> [1 1 2 2 1 1 1 2 4 3 1 4 1 1 1 1 3 1 1 1 2 1 1 1 1 3 1 1 2 2 1 1 1 1 1 1 1
1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
]

Hard one:

[6359, 6431, 6287, 6347, 6419, 6263, 6275, 6299, 6311, 6323, 6335, 6371, 6383, 6395, 6407, 6443, 6455, 6467, 6360, 6288, 6348, 6420, 6252, 6264, 6276, 6300, 6312, 6324, 6372, 6396, 6408, 6444, 6456, 6468, 6336, 6384, 6432, 6265, 6397, 6469, 6253, 6325, 6385, 6457, 6277, 6289, 6301, 6313, 6337, 6349, 6361, 6373, 6409, 6421, 6433, 6445, 6314, 6446, 6302, 6374, 6434, 6254, 6266, 6278, 6290, 6326, 6338, 6350, 6362, 6386, 6398, 6410, 6422, 6458, 6470, 6423, 6279, 6339, 6411, 6255, 6267, 6291, 6303, 6315, 6327, 6363, 6375, 6387, 6399, 6435, 6447, 6459, 6471, 6268, 6400, 6472, 6256, 6328, 6388, 6460, 6280, 6292, 6304, 6316, 6340, 6351, 6364, 6376, 6412, 6424, 6436, 6448, 6305, 6377, 6437, 6293, 6365, 6257, 6269, 6281, 6317, 6329, 6341, 6389, 6401, 6413, 6425, 6449, 6461, 6473, 6282, 6342, 6414, 6270, 6402, 6474, 6258, 6294, 6306, 6318, 6330, 6353, 6366, 6378, 6390, 6426, 6438, 6450, 6462, 6259, 6331, 6391, 6463, 6319, 6451, 6271, 6283, 6295, 6307, 6343, 6355, 6367, 6379, 6403, 6415, 6427, 6439, 6475, 6296, 6368, 6356, 6428, 6260, 6272, 6284, 6308, 6320, 6332, 6344, 6380, 6392, 6404, 6416, 6440, 6452, 6464, 6476, 6285, 6345, 6417, 6273, 6405, 6477, 6261, 6297, 6309, 6321, 6333, 6357, 6369, 6381, 6393, 6429, 6441, 6453, 6465, 6322, 6454, 6310, 6382, 6442, 6262, 6274, 6286, 6298, 6334, 6346, 6358, 6370, 6394, 6406, 6418, 6430, 6466, 6478]

after sorting the previous list you'll get:

> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1
]

as the same logic, the bold ones need to be removed

答案1

得分: 1

我认为这是你正在寻找的内容(完整示例)

Pandas版本:

s = pd.Series([69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658])

s[:s.index
展开收缩
[-1]+1]
返回 [69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572] 逻辑解释 首先diff函数获取到了你想要的位置 ne(1) - 仅获取diff不等于1的位置然后我们返回这些点的索引 [-1] 获取最后一个索引再加上2然后切片原始数组 Numpy版本 ```python # 尽量保持在pandas系列/数据框和numpy数组中,避免使用列表 for DSCD in FirmReturnIndexValues['DSCD'].unique(): secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)] t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values t.sort() if len(t)>=1: # 获取最后一个diff不等于1的索引 ind = np.where(np.not_equal(np.diff(t), 1))[0][-1] # 在索引上加2,以移除你示例中的粗体数字 result = t[:ind+2]

答案继续:

总的来说,这里有一些简单的代码优化。但不清楚你的输入数据框结构以及期望的输出是什么。如果你能提供完整的工作示例和期望的输出,我们可以进行性能测试。

# 使用直接的numpy方法调用(eq,lt,gt,ge,le,ne),而不是==, <, >等等。
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
        t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values  # 如果不需要,不要转换为列表
        t.sort()  # 使用numpy数组排序
        if len(t)>=1:
            print(np.diff((t)))

这些是快速的优化方法。你可能会从重构中获得结构性的性能优化。但你需要提供一个完整的可运行示例和期望的输出。

英文:

I think this is what you are looking for (Full example)

Pandas Version:

s = pd.Series([69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658])
s[:s.index
展开收缩
[-1]+1] returns: [69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572]

Logic explanation.
First diff gets to where you were.

ne(1) - only get locations where diff is not equal to 1 and then we return the indices for those points.

[-1] get the last index add 2 and then slice the original array.

Numpy Version:

# stay in pandas series/df and numpy arrays as much as possible avoid lists
for DSCD in FirmReturnIndexValues[&#39;DSCD&#39;].unique():
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues[&#39;DSCD&#39;].eq(DSCD)]
t=secondary_df[secondary_df[&#39;RIe Pct Return&#39;].eq(0)].index.values
t.sort()
if len(t)&gt;=1:
# get last index where diff is not 1
ind = np.where(np.not_equal(np.diff(t), 1))[0][-1]
# add 2 to index to remove the bolded numbers per your example
result = t[:ind+2]

Answer Continued:

In general here are some simple speed ups for the code you have shown. But it is unclear what exactly you have as an input df and desire as and result. If you can provide that we can benchmark it.

# use direct numpy method calls (eq,lt,gt,ge,le,ne) instead of ==, &lt;, &gt;, etc..
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues[&#39;DSCD&#39;].eq(DSCD)]
t=secondary_df[secondary_df[&#39;RIe Pct Return&#39;].eq(0)].index.values  # don&#39;t convert to list if you do not need to
t.sort()  # use numpy array sort
if len(t)&gt;=1:
print(np.diff((t)))

These are quick wins. You likely have structural speed ups from a refactor. But you need to provide a full working example and expected output.

答案2

得分: 0

如果我正确理解您的问题,这段代码将取代编写for循环并删除不需要的数据。

# 生成一个虚拟数据框
df = pd.DataFrame({"col1": np.random.randint(-5, 5, 20)})

# 最后一个0值的索引
idx_end = df[df.col1 == 0]['col1'].index[-1]

# 获取开始最后一个连续的0值序列的0值的索引
idx_start = df.iloc[:idx_end][df.iloc[:idx_end].col1 != 0]['col1'].index[-1] + 1

# 删除数据
df.drop(range(idx_start, idx_end + 1), inplace=True)
英文:

If I have understood your problem correctly, this code will supersede the need to write a for loop and drop the unrequired data.

# generate a dummy dataframe
df = pd.DataFrame({&quot;col1&quot;: np.random.randint(-5,5,20)})
# index of the last 0 value
idx_end = df[df.col1 == 0][&#39;col1&#39;].index[-1]
# get the index of the 0 value that starts the last continuous sequence of 0 values
idx_start = df.iloc[:idx_end][df.iloc[:idx_end].col1 != 0][&#39;col1&#39;].index[-1] + 1
# drop data
df.drop(range(idx_start, idx_end+1), inplace=True)

huangapple
  • 本文由 发表于 2023年8月10日 22:41:35
  • 转载请务必保留本文链接:https://go.coder-hub.com/76876814.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定