Pandas:在数据框中循环行并返回匹配的结果。

huangapple go评论99阅读模式
英文:

Pandas: Loop within Data frame Rows and return the matched result

问题

以下是一个带有两组值 A 和 B 的数据框,存储在列 'ID' 中。需要在相同的 'MT' 和 'Price' 列中找到先前行中的最小值。如果存在大于当前价格值的价格值,循环应该中断,并对以下行执行此条件。匹配的条件值需要填充到不同的列中,如下所示的示例输出:

  1. df = {'ID': ['A', 'A', 'A', 'A','A','A','B', 'B', 'B', 'B','B','B'],
  2. 'Date': ['01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023','01-05-2023','02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023','07-05-2023'],
  3. 'MT': [2, 5, 10, 15, 20, 30,2, 5, 10, 15, 20, 30],
  4. 'Price': [300, 100.5, 200, 150.35, 250, 90,300, 100.5, 200, 150.35, 250, 90]}
  5. df = pd.DataFrame(data)
  6. print(df)
  7. Available Data frame:
  8. *******************
  9. ID Date MT Price
  10. 0 A 01-05-2023 2 300.00
  11. 1 A 02-05-2023 5 100.50
  12. 2 A 03-05-2023 10 200.00
  13. 3 A 04-05-2023 15 150.35
  14. 4 A 06-05-2023 20 250.00
  15. 5 A 07-05-2023 30 90.00
  16. 6 B 01-05-2023 2 300.00
  17. 7 B 02-05-2023 5 100.50
  18. 8 B 03-05-2023 10 200.00
  19. 9 B 04-05-2023 15 150.35
  20. 10 B 06-05-2023 20 250.00
  21. 11 B 07-05-2023 30 90.00
  22. Output Required:
  23. *******************
  24. ID Date MT Price Matched_Price Date_Values
  25. 0 A 01-05-2023 2 300
  26. 1 A 02-05-2023 5 100.5
  27. 2 A 03-05-2023 10 200 100.5 02-05-2023
  28. 3 A 04-05-2023 15 150.35
  29. 4 A 06-05-2023 20 250 100.5,200,150.35 02-05-2023,03-05-2023,04-05-2023
  30. 5 A 07-05-2023 30 90
  31. 6 B 01-05-2023 2 300
  32. 7 B 02-05-2023 5 100.5
  33. 8 B 03-05-2023 10 200 100.5 02-05-2023
  34. 9 B 04-05-2023 15 150.35
  35. 10 B 06-05-2023 20 250 100.5,200,150.35 02-05-2023,03-05-2023,04-05-2023
  36. 11 B 07-05-2023 30 90

下面提供的代码能够很好地识别列中的所有较小值,但条件的第二部分,即如果存在大于当前价格值的价格值,循环应该中断,并对以下行执行此条件,需要应用。

for i, row in df.iterrows():
dfa = df.iloc[:i, :] # 切片选择当前行以上的行
dfa = dfa[(dfa.ID==row.ID) & (dfa.MT < row.MT) & (dfa.Price < row.Price)] # 匹配的行
df.loc[i, '匹配价格'] = ','.join(map(str, dfa.Price))
df.loc[i, '匹配日期'] = ','.join(map(str, dfa.Date))

英文:

Below is a Data frame with two set of values A & B in column 'ID'.
Need to find the smallest value from previous rows in the same columns 'MT' & 'Price', if there is a greater price value than the current price value, the loop should break and perform this condition for the following rows.
The matched condition values are required to be populated in different columns, as sample output is shown below,

  1. Python
  2. df = {&#39;ID&#39;: [&#39;A&#39;, &#39;A&#39;, &#39;A&#39;, &#39;A&#39;,&#39;A&#39;,&#39;A&#39;,&#39;B&#39;, &#39;B&#39;, &#39;B&#39;, &#39;B&#39;,&#39;B&#39;,&#39;B&#39;],
  3. &#39;Date&#39;: [&#39;01-05-2023&#39;,&#39;02-05-2023&#39;, &#39;03-05-2023&#39;, &#39;04-05-2023&#39;, &#39;06-05-2023&#39;,&#39;07-05-2023&#39;,&#39;01-05-2023&#39;,&#39;02-05-2023&#39;, &#39;03-05-2023&#39;, &#39;04-05-2023&#39;, &#39;06-05-2023&#39;,&#39;07-05-2023&#39;],
  4. &#39;MT&#39;: [2, 5, 10, 15, 20, 30,2, 5, 10, 15, 20, 30],
  5. &#39;Price&#39;: [300, 100.5, 200, 150.35, 250, 90,300, 100.5, 200, 150.35, 250, 90,]}
  6. df = pd.DataFrame(data)
  7. print(df)
  8. Available Data frame:
  9. *******************
  10. ID Date MT Price
  11. 0 A 01-05-2023 2 300.00
  12. 1 A 02-05-2023 5 100.50
  13. 2 A 03-05-2023 10 200.00
  14. 3 A 04-05-2023 15 150.35
  15. 4 A 06-05-2023 20 250.00
  16. 5 A 07-05-2023 30 90.00
  17. 6 B 01-05-2023 2 300.00
  18. 7 B 02-05-2023 5 100.50
  19. 8 B 03-05-2023 10 200.00
  20. 9 B 04-05-2023 15 150.35
  21. 10 B 06-05-2023 20 250.00
  22. 11 B 07-05-2023 30 90.00
  23. Output Required:
  24. *******************
  25. ID Date MT Price Matched_Price Date_Values
  26. 0 A 01-05-2023 2 300
  27. 1 A 02-05-2023 5 100.5
  28. 2 A 03-05-2023 10 200 100.5 02-05-2023
  29. 3 A 04-05-2023 15 150.35
  30. 4 A 06-05-2023 20 250 100.5,200,150.35 02-05-2023,03-05-2023,04-05-2023
  31. 5 A 07-05-2023 30 90
  32. 6 B 01-05-2023 2 300
  33. 7 B 02-05-2023 5 100.5
  34. 8 B 03-05-2023 10 200 100.5 02-05-2023
  35. 9 B 04-05-2023 15 150.35
  36. 10 B 06-05-2023 20 250 100.5,200,150.35 02-05-2023,03-
  37. 05-2023,04-05-2023
  38. 11 B 07-05-2023 30 90

The below provided code works fine in identifying all the smaller values in the column, but the second part of the condition i.e) if there is a greater price value than the current price value, the loop should break and perform this condition for the following rows, needs to be applied.

for i, row in df.iterrows():
dfa = df.iloc[:i, :] # slice with rows above current
dfa = dfa[(dfa.ID==row.ID) & (dfa.MT < row.MT) & (dfa.Price< row.Price)] # matched rows
df.loc[i, 'matched_Price'] = ','.join(map(str, dfa.Price))
df.loc[i, 'matched_dates'] = ','.join(map(str, dfa.Date))

答案1

得分: 1

我通常更喜欢在.apply()函数中执行这些操作,而不是使用循环。在pandas中使用循环并没有错,但我发现它可能会使事情在pandas环境中变得更加复杂。通常,使用apply或其他更好的向量化解决方案速度也更快。

这是我想出的方法:

  1. import pandas as pd
  2. df = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
  3. 'Date': ['01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023', '01-05-2023', '02-05-2023', '03-05-2023', '04-05-2023', '06-05-2023', '07-05-2023'],
  4. 'MT': [2, 5, 10, 15, 20, 30, 2, 5, 10, 15, 20, 30],
  5. 'Price': [300, 100.5, 200, 150.35, 250, 90, 300, 100.5, 200, 150.35, 250, 90]}
  6. df = pd.DataFrame(df)
  7. # 在副本中存储以便我们可以操作它并保留源数据
  8. mod_df = df.copy()
  9. # Shift函数使我们得到了一系列被“推”了一个条目的数据,允许与前一个条目进行比较
  10. mod_df["shift"] = df["Price"].shift()
  11. # diff是我们将要评估的数据框,当这个diff在产品ID内是正数时,我们就关注这些行。
  12. mod_df["diff"] = mod_df["Price"] - mod_df["shift"]
  13. # 用于进行重要操作的函数
  14. def pull_values_w_price_less_than(row: pd.Series, column: str):
  15. # 价格减少或保持不变的情况
  16. if row["diff"] <= 0:
  17. return pd.NA
  18. # 价格增加的情况(我们要评估的情况)
  19. else:
  20. # 与此行相关的条目
  21. relevant_df = mod_df[
  22. # 产品ID相同
  23. (mod_df["ID"] == row["ID"])
  24. # 价格低于此价格
  25. & (mod_df["Price"] < row["Price"])
  26. # 并且在此之前发生
  27. & (mod_df["Date"] < row["Date"])
  28. ]
  29. # 由传递的参数列决定的匹配值(无论是价格还是日期),放入一个列表中
  30. matched_values = relevant_df[column].to_list()
  31. # 如果没有匹配到任何值(这在ID边界处发生,价格下降,所以我们在这个区域,但我们不想返回一个空列表)
  32. if len(matched_values) == 0:
  33. return pd.NA
  34. # 返回要放入行中的值
  35. return matched_values
  36. # 对价格进行操作
  37. mod_df["Matched_Price"] = mod_df.apply(
  38. pull_values_w_price_less_than, args=("Price",), axis=1
  39. )
  40. # 对日期进行操作
  41. mod_df["Date_Values"] = mod_df.apply(
  42. pull_values_w_price_less_than, args=("Date",), axis=1
  43. )
  44. # 删除我们的辅助列
  45. output = mod_df.drop(columns=["shift", "diff"])
  46. # 查看数据框(可选)
  47. output

我还将值放入了一个列表中,而不是一个以逗号分隔的字符串,但你可以使用以下函数轻松将这些列转换为字符串:

  1. def list_to_comma_string(value_list) -> str:
  2. # 如果它已经为空,我们不能改变它
  3. if value_list is pd.NA:
  4. return pd.NA
  5. # 使用列表理解将列表中的所有值更改为字符串
  6. joiner = [str(x) for x in value_list]
  7. # Python中组合字符串的最有效方式
  8. return ','.join(joiner)
  9. # 转换两列
  10. output['Matched_Price'] = output['Matched_Price'].apply(list_to_comma_string)
  11. output['Date_Values'] = output['Date_Values'].apply(list_to_comma_string)
  12. # 再次查看数据框
  13. output

希望这对你有所帮助!

英文:

I usually prefer to do these things within the .apply() function rather than in a loop. It's not wrong to use a loop, but I find it can make things harder to deal with in pandas world. It is also usually slower than using apply or other better vectorized solutions.

Here is what I came up with:

  1. import pandas as pd
  2. df = {&#39;ID&#39;: [&#39;A&#39;, &#39;A&#39;, &#39;A&#39;, &#39;A&#39;,&#39;A&#39;,&#39;A&#39;,&#39;B&#39;, &#39;B&#39;, &#39;B&#39;, &#39;B&#39;,&#39;B&#39;,&#39;B&#39;],
  3. &#39;Date&#39;: [&#39;01-05-2023&#39;,&#39;02-05-2023&#39;, &#39;03-05-2023&#39;, &#39;04-05-2023&#39;, &#39;06-05-2023&#39;,&#39;07-05-2023&#39;,&#39;01-05-2023&#39;,&#39;02-05-2023&#39;, &#39;03-05-2023&#39;, &#39;04-05-2023&#39;, &#39;06-05-2023&#39;,&#39;07-05-2023&#39;],
  4. &#39;MT&#39;: [2, 5, 10, 15, 20, 30,2, 5, 10, 15, 20, 30],
  5. &#39;Price&#39;: [300, 100.5, 200, 150.35, 250, 90,300, 100.5, 200, 150.35, 250, 90,]}
  6. df = pd.DataFrame(df)
  7. # Store in a copy so we can manipulate it and maintain the source
  8. mod_df = df.copy()
  9. # Shift gives us a series &quot;pushed&quot; along one entry. Allows comparison with the previous entry
  10. mod_df[&quot;shift&quot;] = df[&quot;Price&quot;].shift()
  11. # diff is where we will evaluate the dataframe. Where this diff is positive within
  12. # a product ID is the rows you want to focus on.
  13. mod_df[&quot;diff&quot;] = mod_df[&quot;Price&quot;] - mod_df[&quot;shift&quot;]
  14. # Function for doing the heavy lifting
  15. def pull_values_w_price_less_than(row: pd.Series, column: str):
  16. # The case that the price decreased or stayed the same
  17. if row[&quot;diff&quot;] &lt;= 0:
  18. return pd.NA
  19. # Price increased (where we want to eval)
  20. else:
  21. # Entries relevant to this row
  22. relevant_df = mod_df[
  23. # Product ID is the same
  24. (mod_df[&quot;ID&quot;] == row[&quot;ID&quot;])
  25. # Price is less than this one
  26. &amp; (mod_df[&quot;Price&quot;] &lt; row[&quot;Price&quot;])
  27. # And it happened before this one
  28. &amp; (mod_df[&quot;Date&quot;] &lt; row[&quot;Date&quot;])
  29. ]
  30. # The matched values (be it Price or Date) determined by the passed arg
  31. # column placed in a list
  32. matched_values = relevant_df[column].to_list()
  33. # If nothing was matched (this happens at ID boundaries where the price
  34. # went down so we are in this region, but we don&#39;t want to return an
  35. # empty list)
  36. if len(matched_values) == 0:
  37. return pd.NA
  38. # Return the values to put in the row
  39. return matched_values
  40. # Do this for the prices
  41. mod_df[&quot;Matched_Price&quot;] = mod_df.apply(
  42. pull_values_w_price_less_than, args=(&quot;Price&quot;,), axis=1
  43. )
  44. # And the dates
  45. mod_df[&quot;Date_Values&quot;] = mod_df.apply(
  46. pull_values_w_price_less_than, args=(&quot;Date&quot;,), axis=1
  47. )
  48. # Remove our helper columns
  49. output = mod_df.drop(columns=[&quot;shift&quot;, &quot;diff&quot;])
  50. # View the df (optional ofc)
  51. output

I also put the values into a list rather than a string separated by commas, but you can easily convert those columns using this function:

  1. def list_to_comma_string(value_list) -&gt; str:
  2. # If it was already null we cant change that
  3. if value_list is pd.NA:
  4. return pd.NA
  5. # list comprehension to change all the values to strings in the list
  6. joiner = [str(x) for x in value_list]
  7. # The most efficient way to combine strings in python
  8. return &#39;,&#39;.join(joiner)
  9. # Convert both columns
  10. output[&#39;Matched_Price&#39;] = output[&#39;Matched_Price&#39;].apply(list_to_comma_string)
  11. output[&#39;Date_Values&#39;] = output[&#39;Date_Values&#39;].apply(list_to_comma_string)
  12. # Again view the df
  13. output

Hope this helps!

huangapple
  • 本文由 发表于 2023年6月1日 01:39:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76376055.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定