Pandas: Loop within Data frame Rows and return the matched result
Question
Below is a DataFrame with two sets of values, A and B, in the 'ID' column.
For each row, I need to find the smaller values from the previous rows in the 'MT' and 'Price' columns. If a previous row has a price greater than the current price, the loop should break, and the same check should then be performed for the following rows.
The matched values need to be populated in separate columns, as shown in the sample output below.
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
        'Date': ['01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023', '01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023'],
        'MT': [2, 5, 10, 15, 20, 30, 2, 5, 10, 15, 20, 30],
        'Price': [300, 100.5, 200, 150.35, 250, 90, 300, 100.5, 200, 150.35, 250, 90]}
df = pd.DataFrame(data)
print(df)
Available Data frame:
*******************
ID Date MT Price
0 A 01-05-2023 2 300.00
1 A 02-05-2023 5 100.50
2 A 03-05-2023 10 200.00
3 A 04-05-2023 15 150.35
4 A 06-05-2023 20 250.00
5 A 07-05-2023 30 90.00
6 B 01-05-2023 2 300.00
7 B 02-05-2023 5 100.50
8 B 03-05-2023 10 200.00
9 B 04-05-2023 15 150.35
10 B 06-05-2023 20 250.00
11 B 07-05-2023 30 90.00
Output Required:
*******************
ID Date MT Price Matched_Price Date_Values
0 A 01-05-2023 2 300
1 A 02-05-2023 5 100.5
2 A 03-05-2023 10 200 100.5 02-05-2023
3 A 04-05-2023 15 150.35
4 A 06-05-2023 20 250 100.5,200,150.35 02-05-2023,03-05-2023,04-05-2023
5 A 07-05-2023 30 90
6 B 01-05-2023 2 300
7 B 02-05-2023 5 100.5
8 B 03-05-2023 10 200 100.5 02-05-2023
9 B 04-05-2023 15 150.35
10 B 06-05-2023 20 250 100.5,200,150.35 02-05-2023,03-05-2023,04-05-2023
11 B 07-05-2023 30 90
The code below correctly identifies all the smaller values in the column, but the second part of the condition still needs to be applied: if a previous row has a price greater than the current price, the loop should break and move on to the following rows.
for i, row in df.iterrows():
    dfa = df.iloc[:i, :]  # slice with rows above current
    dfa = dfa[(dfa.ID == row.ID) & (dfa.MT < row.MT) & (dfa.Price < row.Price)]  # matched rows
    df.loc[i, 'matched_Price'] = ','.join(map(str, dfa.Price))
    df.loc[i, 'matched_dates'] = ','.join(map(str, dfa.Date))
Answer 1
Score: 1
I usually prefer to do these things within the .apply() function rather than in a loop. It's not wrong to use a loop, but I find it can make things harder to deal with in the pandas world. A loop is also usually slower than apply or, better still, a vectorized solution.
Here is what I came up with:
import pandas as pd

df = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
      'Date': ['01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023', '01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023'],
      'MT': [2, 5, 10, 15, 20, 30, 2, 5, 10, 15, 20, 30],
      'Price': [300, 100.5, 200, 150.35, 250, 90, 300, 100.5, 200, 150.35, 250, 90]}
df = pd.DataFrame(df)

# Store in a copy so we can manipulate it and maintain the source
mod_df = df.copy()

# Shift gives us a series "pushed" along one entry. Allows comparison with the previous entry
mod_df["shift"] = df["Price"].shift()

# diff is where we will evaluate the dataframe. Where this diff is positive within
# a product ID is the rows you want to focus on.
mod_df["diff"] = mod_df["Price"] - mod_df["shift"]


# Function for doing the heavy lifting
def pull_values_w_price_less_than(row: pd.Series, column: str):
    # The case that the price decreased or stayed the same
    if row["diff"] <= 0:
        return pd.NA
    # Price increased (where we want to eval)
    else:
        # Entries relevant to this row
        relevant_df = mod_df[
            # Product ID is the same
            (mod_df["ID"] == row["ID"])
            # Price is less than this one
            & (mod_df["Price"] < row["Price"])
            # And it happened before this one
            & (mod_df["Date"] < row["Date"])
        ]
        # The matched values (be it Price or Date) determined by the passed arg
        # column placed in a list
        matched_values = relevant_df[column].to_list()
        # If nothing was matched (this happens at ID boundaries where the price
        # went down so we are in this region, but we don't want to return an
        # empty list)
        if len(matched_values) == 0:
            return pd.NA
        # Return the values to put in the row
        return matched_values


# Do this for the prices
mod_df["Matched_Price"] = mod_df.apply(
    pull_values_w_price_less_than, args=("Price",), axis=1
)
# And the dates
mod_df["Date_Values"] = mod_df.apply(
    pull_values_w_price_less_than, args=("Date",), axis=1
)

# Remove our helper columns
output = mod_df.drop(columns=["shift", "diff"])
# View the df (optional ofc)
output
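One caveat on the helper columns: a plain shift() runs over the whole column, so the first row of each ID is compared with the last row of the previous ID. The code above copes with that through its empty-match check, but a grouped difference avoids the cross-ID comparison altogether. Here is a minimal sketch of that alternative on a shortened, made-up sample (the six-row frame and the names global_diff and grouped_diff are illustrative only):

```python
import pandas as pd

# Shortened sample: three rows per ID, same price pattern as in the question
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B"],
    "Price": [300, 100.5, 200, 300, 100.5, 200],
})

# Global shift: the first B row is compared with the last A row
global_diff = df["Price"] - df["Price"].shift()

# Grouped diff: the first row of every ID has no predecessor, so it is NaN
grouped_diff = df.groupby("ID")["Price"].diff()

print(pd.DataFrame({"global": global_diff, "grouped": grouped_diff}))
```

With the grouped version, the first row of each ID gets NaN instead of a difference against an unrelated product.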
I also put the values into a list rather than a string separated by commas, but you can easily convert those columns using this function:
def list_to_comma_string(value_list) -> str:
    # If it was already null we can't change that
    if value_list is pd.NA:
        return pd.NA
    # list comprehension to change all the values to strings in the list
    joiner = [str(x) for x in value_list]
    # The most efficient way to combine strings in python
    return ','.join(joiner)


# Convert both columns
output['Matched_Price'] = output['Matched_Price'].apply(list_to_comma_string)
output['Date_Values'] = output['Date_Values'].apply(list_to_comma_string)
# Again view the df
output
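If you want the "break" from the question taken literally, another option is to walk backwards from each row within its ID and stop at the first earlier price that is higher. The sketch below is only one possible reading of the requirement. It assumes the rows are already ordered by MT/Date within each ID, that an equal price neither matches nor stops the scan, and that prices should be printed without a trailing ".0". The helper name scan_back is made up for illustration.

```python
import pandas as pd

data = {
    "ID": ["A"] * 6 + ["B"] * 6,
    "Date": ["01-05-2023", "02-05-2023", "03-05-2023",
             "04-05-2023", "06-05-2023", "07-05-2023"] * 2,
    "MT": [2, 5, 10, 15, 20, 30] * 2,
    "Price": [300, 100.5, 200, 150.35, 250, 90] * 2,
}
df = pd.DataFrame(data)


def scan_back(group: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: backward scan with an early break, for one ID."""
    prices = group["Price"].tolist()
    dates = group["Date"].tolist()
    matched_prices, matched_dates = [], []
    for i, current in enumerate(prices):
        hits = []
        # Walk from the row just above the current one back to the group start
        for j in range(i - 1, -1, -1):
            if prices[j] > current:
                break  # a higher earlier price ends the scan
            if prices[j] < current:
                hits.append(j)
        hits.reverse()  # restore chronological order
        matched_prices.append(",".join(f"{prices[j]:g}" for j in hits))
        matched_dates.append(",".join(dates[j] for j in hits))
    return group.assign(Matched_Price=matched_prices, Date_Values=matched_dates)


# Process each ID separately, then stitch the pieces back together
result = pd.concat([scan_back(g) for _, g in df.groupby("ID", sort=False)])
print(result)
```

Rows with no match get an empty string here rather than pd.NA, which mirrors the blank cells in the required output.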
Hope this helps!