Pandas:在数据框中循环行并返回匹配的结果。

huangapple go评论72阅读模式
英文:

Pandas: Loop within Data frame Rows and return the matched result

问题

以下是一个带有两组值 A 和 B 的数据框,存储在列 'ID' 中。需要在相同的 'MT' 和 'Price' 列中找到先前行中的最小值。如果存在大于当前价格值的价格值,循环应该中断,并对以下行执行此条件。匹配的条件值需要填充到不同的列中,如下所示的示例输出:

df = {'ID': ['A', 'A', 'A', 'A','A','A','B', 'B', 'B', 'B','B','B'],
        'Date': ['01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023','01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023'],
        'MT': [2, 5, 10, 15, 20, 30,2, 5, 10, 15, 20, 30],
        'Price': [300, 100.5, 200, 150.35, 250, 90,300, 100.5, 200, 150.35, 250, 90]}

df = pd.DataFrame(data)

print(df)

Available Data frame:
*******************
   ID        Date  MT   Price
0   A  01-05-2023   2  300.00
1   A  02-05-2023   5  100.50
2   A  03-05-2023  10  200.00
3   A  04-05-2023  15  150.35
4   A  06-05-2023  20  250.00
5   A  07-05-2023  30   90.00
6   B  01-05-2023   2  300.00
7   B  02-05-2023   5  100.50
8   B  03-05-2023  10  200.00
9   B  04-05-2023  15  150.35
10  B  06-05-2023  20  250.00
11  B  07-05-2023  30   90.00

Output Required:
*******************
	ID	Date	    MT	Price		Matched_Price		Date_Values
0	A	01-05-2023	2	300			
1	A	02-05-2023	5	100.5			
2	A	03-05-2023	10	200		   100.5				02-05-2023
3	A	04-05-2023	15	150.35			
4	A	06-05-2023	20	250		100.5,200,150.35		02-05-2023,03-05-2023,04-05-2023
5	A	07-05-2023	30	90			
6	B	01-05-2023	2	300			
7	B	02-05-2023	5	100.5						
8	B	03-05-2023	10	200			100.5		        02-05-2023
9	B	04-05-2023	15	150.35			
10	B	06-05-2023	20	250			100.5,200,150.35	02-05-2023,03-05-2023,04-05-2023		
11	B	07-05-2023	30	90	

下面提供的代码能够很好地识别列中的所有较小值,但条件的第二部分,即如果存在大于当前价格值的价格值,循环应该中断,并对以下行执行此条件,需要应用。

for i, row in df.iterrows():
dfa = df.iloc[:i, :] # 切片选择当前行以上的行
dfa = dfa[(dfa.ID==row.ID) & (dfa.MT < row.MT) & (dfa.Price < row.Price)] # 匹配的行
df.loc[i, '匹配价格'] = ','.join(map(str, dfa.Price))
df.loc[i, '匹配日期'] = ','.join(map(str, dfa.Date))

英文:

Below is a Data frame with two set of values A & B in column 'ID'.
Need to find the smallest value from previous rows in the same columns 'MT' & 'Price', if there is a greater price value than the current price value, the loop should break and perform this condition for the following rows.
The matched condition values are required to be populated in different columns, as sample output is shown below,

Python
df = {&#39;ID&#39;: [&#39;A&#39;, &#39;A&#39;, &#39;A&#39;, &#39;A&#39;,&#39;A&#39;,&#39;A&#39;,&#39;B&#39;, &#39;B&#39;, &#39;B&#39;, &#39;B&#39;,&#39;B&#39;,&#39;B&#39;],
&#39;Date&#39;: [&#39;01-05-2023&#39;,&#39;02-05-2023&#39;, &#39;03-05-2023&#39;, &#39;04-05-2023&#39;, &#39;06-05-2023&#39;,&#39;07-05-2023&#39;,&#39;01-05-2023&#39;,&#39;02-05-2023&#39;, &#39;03-05-2023&#39;, &#39;04-05-2023&#39;, &#39;06-05-2023&#39;,&#39;07-05-2023&#39;],
&#39;MT&#39;: [2, 5, 10, 15, 20, 30,2, 5, 10, 15, 20, 30],
&#39;Price&#39;: [300, 100.5, 200, 150.35, 250, 90,300, 100.5, 200, 150.35, 250, 90,]}
df = pd.DataFrame(data)
print(df)
Available Data frame:
*******************
ID        Date  MT   Price
0   A  01-05-2023   2  300.00
1   A  02-05-2023   5  100.50
2   A  03-05-2023  10  200.00
3   A  04-05-2023  15  150.35
4   A  06-05-2023  20  250.00
5   A  07-05-2023  30   90.00
6   B  01-05-2023   2  300.00
7   B  02-05-2023   5  100.50
8   B  03-05-2023  10  200.00
9   B  04-05-2023  15  150.35
10  B  06-05-2023  20  250.00
11  B  07-05-2023  30   90.00
Output Required:
*******************
ID	Date	    MT	Price		Matched_Price		Date_Values
0	A	01-05-2023	2	300			
1	A	02-05-2023	5	100.5			
2	A	03-05-2023	10	200		   100.5				02-05-2023
3	A	04-05-2023	15	150.35			
4	A	06-05-2023	20	250		100.5,200,150.35		02-05-2023,03-05-2023,04-05-2023
5	A	07-05-2023	30	90			
6	B	01-05-2023	2	300			
7	B	02-05-2023	5	100.5						
8	B	03-05-2023	10	200			100.5		        02-05-2023
9	B	04-05-2023	15	150.35			
10	B	06-05-2023	20	250			100.5,200,150.35	02-05-2023,03- 
05-2023,04-05-2023		
11	B	07-05-2023	30	90	

The below provided code works fine in identifying all the smaller values in the column, but the second part of the condition i.e) if there is a greater price value than the current price value, the loop should break and perform this condition for the following rows, needs to be applied.

for i, row in df.iterrows():
dfa = df.iloc[:i, :] # slice with rows above current
dfa = dfa[(dfa.ID==row.ID) & (dfa.MT < row.MT) & (dfa.Price< row.Price)] # matched rows
df.loc[i, 'matched_Price'] = ','.join(map(str, dfa.Price))
df.loc[i, 'matched_dates'] = ','.join(map(str, dfa.Date))

答案1

得分: 1

我通常更喜欢在.apply()函数中执行这些操作,而不是使用循环。在pandas中使用循环并没有错,但我发现它可能会使事情在pandas环境中变得更加复杂。通常,使用apply或其他更好的向量化解决方案速度也更快。

这是我想出的方法:

import pandas as pd

df = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
      'Date': ['01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023', '01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023'],
      'MT': [2, 5, 10, 15, 20, 30, 2, 5, 10, 15, 20, 30],
      'Price': [300, 100.5, 200, 150.35, 250, 90, 300, 100.5, 200, 150.35, 250, 90]}

df = pd.DataFrame(df)

# 在副本中存储以便我们可以操作它并保留源数据
mod_df = df.copy()

# Shift函数使我们得到了一系列被“推”了一个条目的数据,允许与前一个条目进行比较
mod_df["shift"] = df["Price"].shift()
# diff是我们将要评估的数据框,当这个diff在产品ID内是正数时,我们就关注这些行。
mod_df["diff"] = mod_df["Price"] - mod_df["shift"]


# 用于进行重要操作的函数
def pull_values_w_price_less_than(row: pd.Series, column: str):
    # 价格减少或保持不变的情况
    if row["diff"] <= 0:
        return pd.NA
    # 价格增加的情况(我们要评估的情况)
    else:
        # 与此行相关的条目
        relevant_df = mod_df[
            # 产品ID相同
            (mod_df["ID"] == row["ID"])
            # 价格低于此价格
            & (mod_df["Price"] < row["Price"])
            # 并且在此之前发生
            & (mod_df["Date"] < row["Date"])
        ]

        # 由传递的参数列决定的匹配值(无论是价格还是日期),放入一个列表中
        matched_values = relevant_df[column].to_list()

        # 如果没有匹配到任何值(这在ID边界处发生,价格下降,所以我们在这个区域,但我们不想返回一个空列表)
        if len(matched_values) == 0:
            return pd.NA
        # 返回要放入行中的值
        return matched_values

# 对价格进行操作
mod_df["Matched_Price"] = mod_df.apply(
    pull_values_w_price_less_than, args=("Price",), axis=1
)
# 对日期进行操作
mod_df["Date_Values"] = mod_df.apply(
    pull_values_w_price_less_than, args=("Date",), axis=1
)

# 删除我们的辅助列
output = mod_df.drop(columns=["shift", "diff"])

# 查看数据框(可选)
output

我还将值放入了一个列表中,而不是一个以逗号分隔的字符串,但你可以使用以下函数轻松将这些列转换为字符串:

def list_to_comma_string(value_list) -> str:
    # 如果它已经为空,我们不能改变它
    if value_list is pd.NA:
        return pd.NA
    
    # 使用列表理解将列表中的所有值更改为字符串
    joiner = [str(x) for x in value_list]

    # Python中组合字符串的最有效方式
    return ','.join(joiner)

# 转换两列
output['Matched_Price'] = output['Matched_Price'].apply(list_to_comma_string)
output['Date_Values'] = output['Date_Values'].apply(list_to_comma_string)

# 再次查看数据框
output

希望这对你有所帮助!

英文:

I usually prefer to do these things within the .apply() function rather than in a loop. It's not wrong to use a loop, but I find it can make things harder to deal with in pandas world. It is also usually slower than using apply or other better vectorized solutions.

Here is what I came up with:

import pandas as pd
df = {&#39;ID&#39;: [&#39;A&#39;, &#39;A&#39;, &#39;A&#39;, &#39;A&#39;,&#39;A&#39;,&#39;A&#39;,&#39;B&#39;, &#39;B&#39;, &#39;B&#39;, &#39;B&#39;,&#39;B&#39;,&#39;B&#39;],
&#39;Date&#39;: [&#39;01-05-2023&#39;,&#39;02-05-2023&#39;, &#39;03-05-2023&#39;, &#39;04-05-2023&#39;, &#39;06-05-2023&#39;,&#39;07-05-2023&#39;,&#39;01-05-2023&#39;,&#39;02-05-2023&#39;, &#39;03-05-2023&#39;, &#39;04-05-2023&#39;, &#39;06-05-2023&#39;,&#39;07-05-2023&#39;],
&#39;MT&#39;: [2, 5, 10, 15, 20, 30,2, 5, 10, 15, 20, 30],
&#39;Price&#39;: [300, 100.5, 200, 150.35, 250, 90,300, 100.5, 200, 150.35, 250, 90,]}
df = pd.DataFrame(df)
# Store in a copy so we can manipulate it and maintain the source
mod_df = df.copy()
# Shift gives us a series &quot;pushed&quot; along one entry. Allows comparison with the previous entry
mod_df[&quot;shift&quot;] = df[&quot;Price&quot;].shift()
# diff is where we will evaluate the dataframe. Where this diff is positive within
# a product ID is the rows you want to focus on.
mod_df[&quot;diff&quot;] = mod_df[&quot;Price&quot;] - mod_df[&quot;shift&quot;]
# Function for doing the heavy lifting
def pull_values_w_price_less_than(row: pd.Series, column: str):
# The case that the price decreased or stayed the same
if row[&quot;diff&quot;] &lt;= 0:
return pd.NA
# Price increased (where we want to eval)
else:
# Entries relevant to this row
relevant_df = mod_df[
# Product ID is the same
(mod_df[&quot;ID&quot;] == row[&quot;ID&quot;])
# Price is less than this one
&amp; (mod_df[&quot;Price&quot;] &lt; row[&quot;Price&quot;])
# And it happened before this one
&amp; (mod_df[&quot;Date&quot;] &lt; row[&quot;Date&quot;])
]
# The matched values (be it Price or Date) determined by the passed arg
# column placed in a list
matched_values = relevant_df[column].to_list()
# If nothing was matched (this happens at ID boundaries where the price
# went down so we are in this region, but we don&#39;t want to return an
# empty list)
if len(matched_values) == 0:
return pd.NA
# Return the values to put in the row
return matched_values
# Do this for the prices
mod_df[&quot;Matched_Price&quot;] = mod_df.apply(
pull_values_w_price_less_than, args=(&quot;Price&quot;,), axis=1
)
# And the dates
mod_df[&quot;Date_Values&quot;] = mod_df.apply(
pull_values_w_price_less_than, args=(&quot;Date&quot;,), axis=1
)
# Remove our helper columns
output = mod_df.drop(columns=[&quot;shift&quot;, &quot;diff&quot;])
# View the df (optional ofc)
output

I also put the values into a list rather than a string separated by commas, but you can easily convert those columns using this function:

def list_to_comma_string(value_list) -&gt; str:
# If it was already null we cant change that
if value_list is pd.NA:
return pd.NA
# list comprehension to change all the values to strings in the list
joiner = [str(x) for x in value_list]
# The most efficient way to combine strings in python
return &#39;,&#39;.join(joiner)
# Convert both columns
output[&#39;Matched_Price&#39;] = output[&#39;Matched_Price&#39;].apply(list_to_comma_string)
output[&#39;Date_Values&#39;] = output[&#39;Date_Values&#39;].apply(list_to_comma_string)
# Again view the df
output

Hope this helps!

huangapple
  • 本文由 发表于 2023年6月1日 01:39:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76376055.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定