英文:
How to drop (delete) consecutive values in a Dataframe
问题
- 如何删除粗体部分的值?
您可以使用以下方法删除粗体部分的值:
import numpy as np
# 原始列表
t = [74536, 74537, 74538, 74540, 74542, 74543, 74544, 74545, 74547, 74551, 74554, 74555, 74559, 74560, 74561, 74562, 74563, 74566, 74567, 74568, 74569, 74571, 74572, 74573, 74574, 74575, 74578, 74579, 74580, 74582, 74584, 74585, 74586, 74587, 74588, 74589, 74590, 74591, 74592, 74595, 74596, 74597, 74598, 74599, 74601, 74602, 74603, 74604, 74605, 74606, 74607, 74608, 74609, 74610, 74612, 74613, 74614, 74615, 74616, 74617, 74618, 74619, 74620, 74621, 74622, 74623, 74624, 74625, 74626, 74627, 74628, 74629, 74630, 74631, 74632, 74633, 74634, 74635, 74636, 74637, 74638, 74639, 74640, 74641, 74642, 74643, 74644, 74645, 74646, 74647, 74648, 74649, 74650, 74651, 74652, 74653, 74654, 74655, 74656, 74657, 74658, 74659, 74660, 74661, 74662, 74663, 74664, 74665, 74666, 74667, 74668, 74669, 74670, 74671, 74672, 74673, 74674, 74675, 74676, 74677, 74678, 74679, 74680, 74681, 74682, 74683, 74684, 74685, 74686, 74687, 74688, 74689, 74690, 74691, 74692, 74693, 74694, 74695, 74696, 74697, 74698, 74699, 74700, 74701, 74702, 74703, 74704, 74705, 74706, 74707, 74708, 74709, 74710, 74711, 74712, 74713, 74714, 74715, 74716, 74717, 74718, 74719, 74720, 74721, 74722, 74723, 74724, 74725, 74726, 74727, 74728, 74729, 74730, 74731, 74732, 74733, 74734, 74735, 74736, 74737, 74738, 74739, 74740, 74741, 74742, 74743, 74744, 74745, 74746, 74747, 74748, 74749, 74750, 74751, 74752, 74753, 74754, 74755, 74756, 74757, 74758, 74759, 74760, 74761, 74762, 74763, 74764, 74765, 74766, 74767, 74768, 74769, 74770, 74771, 74772, 74773, 74774, 74775, 74776, 74777, 74778, 74779, 74780, 74781, 74782, 74783, 74784, 74785, 74786, 74787, 74788, 74789, 74790, 74791, 74792, 74793, 74794, 74795, 74796, 74797, 74798, 74799, 74800, 74801, 74802, 74803, 74804, 74805, 74806, 74807, 74808, 74809, 74810, 74811, 74812, 74813, 74814, 74815, 74816, 74817, 74818, 74819, 74820, 74821, 74822, 74823, 74824, 74825, 74826, 74827, 74828, 74829, 74830, 74831, 74832, 74833, 74834, 74835, 74836, 74837, 74838, 74839, 74840, 74841, 74842]
# 计算差分
diff_t = np.diff(t)
# 找到要删除的索引
indexes_to_remove = np.where(diff_t == 1)[0]
# 删除粗体部分的值
filtered_t = [t[i] for i in range(len(t)) if i not in indexes_to_remove]
- 如何提高效率?
为了提高效率,您可以尝试使用 Pandas 的矢量化操作来替代循环,以减少计算时间。以下是一个可能的优化示例:
import pandas as pd
# 假设您的数据存储在一个名为 df 的 Pandas DataFrame 中
# 添加一列,表示当前行与上一行是否连续
df['is_continuous'] = df['DSCD'].diff() == 1
# 找到需要删除的连续行
rows_to_remove = df[df['is_continuous']].index
# 删除这些行
df.drop(rows_to_remove, inplace=True)
# 删除添加的列
df.drop(columns=['is_continuous'], inplace=True)
这种方法可以帮助您更有效地删除粗体部分的值,而不需要使用循环。
英文:
I have a dataframe with a column that has 0 Values. I wish to find those 0 values and check if till the end they are 0, drop only those at the end and not in the middle.
this is how the Data in secondary_df
looks like:
DSCD date year month RI RIu RIu1 RIe
203 1316 1/29/2010 2010.0 1.0 66.39 66.30 6.21 6.39 \
275 1316 1/29/2016 2016.0 1.0 66.97 166.84 6.89 6.32
131 1316 1/30/2004 2004.0 1.0 66.01 66.15 6.36 6.60
191 1316 1/30/2009 2009.0 1.0 66.36 6.54 685.25 6.71
263 1316 1/30/2015 2015.0 1.0 66.43 6.94 114.14 6.33
.. ... ... ... ... ... ... ... ...
250 1316 12/31/2013 2013.0 12.0 99.98 5.24 59.91 5.07
262 1316 12/31/2014 2014.0 12.0 99.33 54.14 54.64 55.96
274 1316 12/31/2015 2015.0 12.0 55.32 5.89 15.19 54.34
310 1316 12/31/2018 2018.0 12.0 55.56 55.23 5.40 5.49
322 1316 12/31/2019 2019.0 12.0 55.39 55.98 5.69 5.88
RIu Pct Return RIe_Pct_Return
203 -0.05 0.0255 \
275 -0.0358 -0.059
131 0.058 0.05106
191 0.0055 0.0535
263 -0.035 0.053
.. ... ...
250 0.01092 -0.05
262 -0.001 0.02572
274 -0.003 -0.0512
310 -0.000 -0.05274
322 0.004 0.039
This is what I got so far.
for DSCD in FirmReturnIndexValues['DSCD'].unique():
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD']==DSCD]
t=secondary_df[(secondary_df['RIe Pct Return'].values == 0)].index.values.tolist()
t.sort()
if len(t)>=1:
print(np.diff((t)))
for example this part is t:
[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658]
this is the Indexes that I get from my code and when I use the np.diff()
method I get this values and the values I wish to drop (delete) are being bolded:
> [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 2 80 22 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1]
so I have 2 Questions.
- how can I delete the bolded one's?
- the first for loop contains 8000 DSCD's is there anyway it can be more efficient?
Another Example:
list t:
[74536, 74537, 74538, 74540, 74542, 74543, 74544, 74545, 74547, 74551, 74554, 74555, 74559, 74560, 74561, 74562, 74563, 74566, 74567, 74568, 74569, 74571, 74572, 74573, 74574, 74575, 74578, 74579, 74580, 74582, 74584, 74585, 74586, 74587, 74588, 74589, 74590, 74591, 74592, 74595, 74596, 74597, 74598, 74599, 74601, 74602, 74603, 74604, 74605, 74606, 74607, 74608, 74609, 74610, 74612, 74613, 74614, 74615, 74616, 74617, 74618, 74619, 74620, 74621, 74622, 74623, 74624, 74625, 74626, 74627, 74628, 74629, 74630, 74631, 74632, 74633, 74634, 74635, 74636, 74637, 74638, 74639, 74640, 74641, 74642, 74643, 74644, 74645, 74646, 74647, 74648, 74649, 74650, 74651, 74652, 74653, 74654, 74655, 74656, 74657, 74658, 74659, 74660, 74661, 74662, 74663, 74664, 74665, 74666, 74667, 74668, 74669, 74670, 74671, 74672, 74673, 74674, 74675, 74676, 74677, 74678, 74679, 74680, 74681, 74682, 74683, 74684, 74685, 74686, 74687, 74688, 74689, 74690, 74691, 74692, 74693, 74694, 74695, 74696, 74697, 74698, 74699, 74700, 74701, 74702, 74703, 74704, 74705, 74706, 74707, 74708, 74709, 74710, 74711, 74712, 74713, 74714, 74715, 74716, 74717, 74718, 74719, 74720, 74721, 74722, 74723, 74724, 74725, 74726, 74727, 74728, 74729, 74730, 74731, 74732, 74733, 74734, 74735, 74736, 74737, 74738, 74739, 74740, 74741, 74742, 74743, 74744, 74745, 74746, 74747, 74748, 74749, 74750, 74751, 74752, 74753, 74754, 74755, 74756, 74757, 74758, 74759, 74760, 74761, 74762, 74763, 74764, 74765, 74766, 74767, 74768, 74769, 74770, 74771, 74772, 74773, 74774, 74775, 74776, 74777, 74778, 74779, 74780, 74781, 74782, 74783, 74784, 74785, 74786, 74787, 74788, 74789, 74790, 74791, 74792, 74793, 74794, 74795, 74796, 74797, 74798, 74799, 74800, 74801, 74802, 74803, 74804, 74805, 74806, 74807, 74808, 74809, 74810, 74811, 74812, 74813, 74814, 74815, 74816, 74817, 74818, 74819, 74820, 74821, 74822, 74823, 74824, 74825, 74826, 74827, 74828, 74829, 74830, 74831, 74832, 74833, 74834, 74835, 74836, 74837, 74838, 74839, 74840, 74841, 74842]
result of np.diff():
> [1 1 2 2 1 1 1 2 4 3 1 4 1 1 1 1 3 1 1 1 2 1 1 1 1 3 1 1 2 2 1 1 1 1 1 1 1
1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Hard one:
[6359, 6431, 6287, 6347, 6419, 6263, 6275, 6299, 6311, 6323, 6335, 6371, 6383, 6395, 6407, 6443, 6455, 6467, 6360, 6288, 6348, 6420, 6252, 6264, 6276, 6300, 6312, 6324, 6372, 6396, 6408, 6444, 6456, 6468, 6336, 6384, 6432, 6265, 6397, 6469, 6253, 6325, 6385, 6457, 6277, 6289, 6301, 6313, 6337, 6349, 6361, 6373, 6409, 6421, 6433, 6445, 6314, 6446, 6302, 6374, 6434, 6254, 6266, 6278, 6290, 6326, 6338, 6350, 6362, 6386, 6398, 6410, 6422, 6458, 6470, 6423, 6279, 6339, 6411, 6255, 6267, 6291, 6303, 6315, 6327, 6363, 6375, 6387, 6399, 6435, 6447, 6459, 6471, 6268, 6400, 6472, 6256, 6328, 6388, 6460, 6280, 6292, 6304, 6316, 6340, 6351, 6364, 6376, 6412, 6424, 6436, 6448, 6305, 6377, 6437, 6293, 6365, 6257, 6269, 6281, 6317, 6329, 6341, 6389, 6401, 6413, 6425, 6449, 6461, 6473, 6282, 6342, 6414, 6270, 6402, 6474, 6258, 6294, 6306, 6318, 6330, 6353, 6366, 6378, 6390, 6426, 6438, 6450, 6462, 6259, 6331, 6391, 6463, 6319, 6451, 6271, 6283, 6295, 6307, 6343, 6355, 6367, 6379, 6403, 6415, 6427, 6439, 6475, 6296, 6368, 6356, 6428, 6260, 6272, 6284, 6308, 6320, 6332, 6344, 6380, 6392, 6404, 6416, 6440, 6452, 6464, 6476, 6285, 6345, 6417, 6273, 6405, 6477, 6261, 6297, 6309, 6321, 6333, 6357, 6369, 6381, 6393, 6429, 6441, 6453, 6465, 6322, 6454, 6310, 6382, 6442, 6262, 6274, 6286, 6298, 6334, 6346, 6358, 6370, 6394, 6406, 6418, 6430, 6466, 6478]
after sorting the previous list you'll get:
> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1]
as the same logic, the bold ones need to be removed
答案1
得分: 1
我认为这是你正在寻找的内容(完整示例)
Pandas版本:
s = pd.Series([69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658])
s[:s.index展开收缩[-1]+1]
返回:
[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572]
逻辑解释。
首先,diff函数获取到了你想要的位置。
ne(1) - 仅获取diff不等于1的位置,然后我们返回这些点的索引。
[-1] 获取最后一个索引,再加上2,然后切片原始数组。
Numpy版本:
```python
# 尽量保持在pandas系列/数据框和numpy数组中,避免使用列表
for DSCD in FirmReturnIndexValues['DSCD'].unique():
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values
t.sort()
if len(t)>=1:
# 获取最后一个diff不等于1的索引
ind = np.where(np.not_equal(np.diff(t), 1))[0][-1]
# 在索引上加2,以移除你示例中的粗体数字
result = t[:ind+2]
答案继续:
总的来说,这里有一些简单的代码优化。但不清楚你的输入数据框结构以及期望的输出是什么。如果你能提供完整的工作示例和期望的输出,我们可以进行性能测试。
# 使用直接的numpy方法调用(eq,lt,gt,ge,le,ne),而不是==, <, >等等。
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values # 如果不需要,不要转换为列表
t.sort() # 使用numpy数组排序
if len(t)>=1:
print(np.diff((t)))
这些是快速的优化方法。你可能会从重构中获得结构性的性能优化。但你需要提供一个完整的可运行示例和期望的输出。
英文:
I think this is what you are looking for (Full example)
Pandas Version:
s = pd.Series([69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658])
s[:s.index展开收缩[-1]+1]
returns:
[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572]
Logic explanation.
First diff gets to where you were.
ne(1) - only get locations where diff is not equal to 1 and then we return the indices for those points.
[-1] get the last index add 2 and then slice the original array.
Numpy Version:
# stay in pandas series/df and numpy arrays as much as possible avoid lists
for DSCD in FirmReturnIndexValues['DSCD'].unique():
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values
t.sort()
if len(t)>=1:
# get last index where diff is not 1
ind = np.where(np.not_equal(np.diff(t), 1))[0][-1]
# add 2 to index to remove the bolded numbers per your example
result = t[:ind+2]
Answer Continued:
In general here are some simple speed ups for the code you have shown. But it is unclear what exactly you have as an input df and desire as and result. If you can provide that we can benchmark it.
# use direct numpy method calls (eq,lt,gt,ge,le,ne) instead of ==, <, >, etc..
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values # don't convert to list if you do not need to
t.sort() # use numpy array sort
if len(t)>=1:
print(np.diff((t)))
These are quick wins. You likely have structural speed ups from a refactor. But you need to provide a full working example and expected output.
答案2
得分: 0
如果我正确理解您的问题,这段代码将取代编写for循环并删除不需要的数据。
# 生成一个虚拟数据框
df = pd.DataFrame({"col1": np.random.randint(-5, 5, 20)})
# 最后一个0值的索引
idx_end = df[df.col1 == 0]['col1'].index[-1]
# 获取开始最后一个连续的0值序列的0值的索引
idx_start = df.iloc[:idx_end][df.iloc[:idx_end].col1 != 0]['col1'].index[-1] + 1
# 删除数据
df.drop(range(idx_start, idx_end + 1), inplace=True)
英文:
If I have understood your problem correctly, this code will supersede the need to write a for loop and drop the unrequired data.
# generate a dummy dataframe
df = pd.DataFrame({"col1": np.random.randint(-5,5,20)})
# index of the last 0 value
idx_end = df[df.col1 == 0]['col1'].index[-1]
# get the index of the 0 value that starts the last continuous sequence of 0 values
idx_start = df.iloc[:idx_end][df.iloc[:idx_end].col1 != 0]['col1'].index[-1] + 1
# drop data
df.drop(range(idx_start, idx_end+1), inplace=True)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论