How can vectorization be used for row-dependent functions in Pandas?

Question

Sorry if this has been asked before; I could only find approaches that work on previous rows, not on the rest of the DataFrame.

I'm trying to replace my iterative approach to a problem with a more pandas-friendly (and faster) version. The problem is as follows: I have two columns, "A" and "B", that represent players. At each time step, "A" and "B" take on arbitrary values. I want to add a third column whose value is either "A wins!" or "B wins!", based on the rows beneath the current row.

To determine when "A wins!" for a given row, I want to compare the value in column "A" at that row with each value in column "B" beneath that row. To determine when "B wins!", I do the same thing in the other direction: take the value in column "B" and compare it with each entry in column "A" beneath that row. Whichever player's value is the first to match a value in the other column is the winner. Here's an example:

Time   A   B  Winner
   1   2   4  A wins!
   2   3   5  B wins!
   3   5   2  A wins!
   4   6   5  None
   5   2  10  B wins!
   6  10   7  None

At time 1, A wins because "B" takes on the value 2 at time 3 before "A" can take on 4. At time 2, B wins because "A" takes on the value 5 in the row below (time 3) before "B" takes on the value 3. Times 3 and 5 work similarly, and at times 4 and 6 there is no winner because neither player takes on the other's value in later rounds.

Right now, I have a working solution that just uses df.iterrows(). My dataset is fairly large, so I would like to speed this up, but I can't think of any simple pandas functions because they usually operate on each row in isolation. My attempts with apply and map have not worked because of the dependence on other rows, so I'm looking for a solution that cuts down the run time without explicit iteration. Any and all help would be appreciated, thank you!

EDIT: Here's my working iterative solution. I feed a DataFrame into find_winner which calls find_winner_row on each row.

import numpy as np
import pandas as pd

def find_winner_row(df, row, result):
    A_val = df['A'].iloc[row]  # Player A
    B_val = df['B'].iloc[row]  # Player B
    # .iloc[row+1:] slices so that only the future values are considered
    potentials_B = np.where(df['A'].iloc[row+1:] == B_val)[0]
    potentials_A = np.where(df['B'].iloc[row+1:] == A_val)[0]
    # The logic below handles the case when there are no matching values.
    # The sentinel must exceed every possible offset, so it is based on the
    # number of rows (len(df)), not the number of columns (len(df.columns)).
    if potentials_B.size == 0:
        B_switch_time = len(df) + 1
    else:
        B_switch_time = potentials_B[0]
    if potentials_A.size == 0:
        A_switch_time = len(df) + 1
    else:
        A_switch_time = potentials_A[0]
    # Now, which is first?
    if B_switch_time < A_switch_time:
        result[row] = "B"
    elif B_switch_time > A_switch_time:
        result[row] = "A"
    else:
        result[row] = "None"

def find_winner(df):
    result = [None] * len(df)  # one slot per row
    for num in range(len(df)):
        find_winner_row(df, num, result)
    df['Winner'] = result
    return df

## So with the example above, we can run the following and see that we get the expected result
demo_df = pd.DataFrame([[2,4],[3,5],[5,2],[6,5],[2,10],[10,7]],columns=['A','B'])
find_winner(demo_df)

Answer 1

Score: 3

One possible option makes use of duplicates (see https://pandas.pydata.org/docs/user_guide/duplicates.html):

tmp = df.set_index("Time").stack()

winners = (
    tmp.duplicated(keep="last")
        .groupby(level=0, group_keys=False)
        .apply(lambda x: x.loc[x.eq(True)])
        .reset_index(level=1).query("~index.duplicated()")
        .reindex(df["Time"]).reset_index()["level_1"]
        .add(" wins!").rename("Winners")
)

out = df.join(winners)

Output:

print(out)

   Time   A   B  Winners
0     1   2   4  A wins!
1     2   3   5  B wins!
2     3   5   2  A wins!
3     4   6   5      NaN
4     5   2  10  B wins!
5     6  10   7      NaN

Answer 2

Score: 2

I started writing this before you added your code, but I figure it might still be helpful. I was able to write a single function that, following the logic you described, returns a string naming the winner for a given row index and DataFrame, with minimal (internal) iteration:

# Sample data
import pandas as pd
data = {"Time":[1,2,3,4,5,6],"A":[2,3,5,6,2,10],"B":[4,5,2,5,10,7],"Winner":["A","B","A","None","B","None"]}
df = pd.DataFrame(data)

def FindWinner(row_index, dataframe=df):
    # Record the initial value in each column (use the argument, not the global df)
    A_initial = dataframe.iloc[row_index]["A"]
    B_initial = dataframe.iloc[row_index]["B"]

    # Convert the data underneath this row into a pair of lists
    rowsUnderA = list(dataframe.iloc[row_index+1:]["A"])
    rowsUnderB = list(dataframe.iloc[row_index+1:]["B"])

    # Use .index() to find when the initial value next appears in the other list
    try: rowsUntilA_initial = rowsUnderB.index(A_initial)
    except ValueError: rowsUntilA_initial = "DOES_NOT_APPEAR"

    try: rowsUntilB_initial = rowsUnderA.index(B_initial)
    except ValueError: rowsUntilB_initial = "DOES_NOT_APPEAR"

    # Set win conditions --> first handle scenarios where one or both values do not appear
    if rowsUntilB_initial == "DOES_NOT_APPEAR" and rowsUntilA_initial == "DOES_NOT_APPEAR":
        return "No one wins :("

    elif rowsUntilB_initial == "DOES_NOT_APPEAR" and rowsUntilA_initial != "DOES_NOT_APPEAR":
        return "A wins!"

    elif rowsUntilB_initial != "DOES_NOT_APPEAR" and rowsUntilA_initial == "DOES_NOT_APPEAR":
        return "B wins!"

    # If A appears first, A wins ... and vice versa
    elif rowsUntilA_initial < rowsUntilB_initial: return "A wins!"

    elif rowsUntilB_initial < rowsUntilA_initial: return "B wins!"

    # What if they are the same? (Both values appear after the same number of rows.)
    else: return "... what happens if they're the same?"

Based on a quick test, this does return the expected output.

Using this function, it should be possible to map a new column, or equivalently to iterate once through each row and create a new column (which is what map would be doing anyway). I understand the objective here is to minimize iteration, but without referencing each row individually in some capacity, I'm not sure there's a way to compute and display the winners. The logic here and in your sample code seems similar, but I would be interested to know whether the differences affect run time. I don't have access to your dataset, so I can't determine that myself, but I figured it might be worth a try anyway.
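As a rough illustration of building the new column from a per-row function, here is a minimal sketch. The helper winner_at is a hypothetical, compact stand-in that follows the same rule as the function above (it is not the function itself), and it returns "A", "B", or None rather than the longer messages. It is still one Python-level call per row, so it is not expected to be faster than iterrows by much.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": [1, 2, 3, 4, 5, 6],
    "A": [2, 3, 5, 6, 2, 10],
    "B": [4, 5, 2, 5, 10, 7],
})


def winner_at(row_index, dataframe):
    """Hypothetical helper: winner for one row, or None if nobody wins."""
    a_val = dataframe["A"].iloc[row_index]
    b_val = dataframe["B"].iloc[row_index]
    # Values strictly below the current row, as plain lists.
    below_a = dataframe["A"].iloc[row_index + 1:].tolist()
    below_b = dataframe["B"].iloc[row_index + 1:].tolist()
    # Offset of the first match below this row; infinity when there is none.
    a_wait = below_b.index(a_val) if a_val in below_b else float("inf")
    b_wait = below_a.index(b_val) if b_val in below_a else float("inf")
    if a_wait < b_wait:
        return "A"
    if b_wait < a_wait:
        return "B"
    return None  # neither value reappears, or both match on the same row


# One call per row; this is a loop, just written compactly.
df["Winner"] = [winner_at(i, df) for i in range(len(df))]
print(df)
```

On the sample data this marks rows 1, 2, 3 and 5 (by Time) with a winner and leaves rows 4 and 6 empty.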

Answer 3

Score: 2

Assumptions:

I'm assuming "first to match" means first in time (i.e., from top down) but still beneath the row in question, and that your dataframe is already sorted ascending on the 'Time' column.

If so, maybe try something like this (not a pandas solution but MIGHT be faster...?):

Creating your dataframe:

import pandas as pd
data_df = pd.DataFrame([{'Time': 1, 'A': 2, 'B': 4},
                      {'Time': 2, 'A': 3, 'B': 5},
                      {'Time': 3, 'A': 5, 'B': 2},
                      {'Time': 4, 'A': 6, 'B': 5},
                      {'Time': 5, 'A': 2, 'B': 10},
                      {'Time': 6, 'A': 10, 'B': 7}])

Approach:

from collections import deque


data_dict = data_df[['A', 'B']].to_dict(orient='list')
data_dict['A'] = deque(data_dict['A'])
data_dict['B'] = deque(data_dict['B'])
winner_lst = []


def get_idx(value, col_lst):
    # Position of the first match in the remaining values, or a sentinel past the end
    return col_lst.index(value) if value in col_lst else len(col_lst) + 1


for _ in range(len(data_dict['A'])):
    # Remove the current row so that only the rows beneath it remain
    a_val = data_dict['A'].popleft()
    b_val = data_dict['B'].popleft()
    a_idx = get_idx(a_val, data_dict['B'])
    b_idx = get_idx(b_val, data_dict['A'])

    if a_idx > b_idx:
        winner_lst.append('B wins!')

    elif a_idx < b_idx:
        winner_lst.append('A wins!')

    else:
        winner_lst.append(None)


data_df['Winner'] = winner_lst
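The loop above rescans the remaining values for every row. A possible variation, under the same assumptions (rows sorted by Time, "first match" meaning the nearest row below), is to walk the rows from the bottom up while remembering the nearest row at which each value has been seen in each column. This is a sketch of a different technique (dictionary lookups instead of list scans), not the approach from this answer; the names next_in_a and next_in_b are made up for illustration.

```python
import math

import pandas as pd

data_df = pd.DataFrame({
    "Time": [1, 2, 3, 4, 5, 6],
    "A": [2, 3, 5, 6, 2, 10],
    "B": [4, 5, 2, 5, 10, 7],
})

a_vals = data_df["A"].tolist()
b_vals = data_df["B"].tolist()

# value -> nearest row index below the current row where the value occurs
next_in_a = {}
next_in_b = {}
winners = [None] * len(data_df)

# Walk bottom-up so the dictionaries only ever describe rows below row i.
for i in range(len(data_df) - 1, -1, -1):
    a_next = next_in_b.get(a_vals[i], math.inf)  # where B next equals A's value
    b_next = next_in_a.get(b_vals[i], math.inf)  # where A next equals B's value
    if a_next < b_next:
        winners[i] = "A wins!"
    elif b_next < a_next:
        winners[i] = "B wins!"
    # Record this row so that the rows above can find it.
    next_in_a[a_vals[i]] = i
    next_in_b[b_vals[i]] = i

data_df["Winner"] = winners
print(data_df)
```

Each row costs two dictionary lookups and two updates, so the whole pass is roughly linear in the number of rows, provided the values are hashable.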

Answer 4

Score: 0

winners = df.melt("Time", var_name="Winner").sort_values("Time")
winners = (
   winners[winners.duplicated("value", keep="last")]
     .drop_duplicates("Time")
     .drop(columns="value")
)

df.merge(winners, how="left")
   Time   A   B Winner
0     1   2   4      A
1     2   3   5      B
2     3   5   2      A
3     4   6   5    NaN
4     5   2  10      B
5     6  10   7    NaN
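
For comparison, the rule stated in the question can also be written directly with NumPy broadcasting, with no Python-level loop. The sketch below is an assumption-laden illustration rather than a tuned solution: it builds n-by-n boolean matrices, so memory grows quadratically with the number of rows and it may not suit a very large dataset. The names later, a_hit, b_hit, a_first and b_first are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time": [1, 2, 3, 4, 5, 6],
    "A": [2, 3, 5, 6, 2, 10],
    "B": [4, 5, 2, 5, 10, 7],
})

a = df["A"].to_numpy()
b = df["B"].to_numpy()
n = len(df)

# later[i, j] is True when row j lies strictly below row i.
later = np.triu(np.ones((n, n), dtype=bool), k=1)

# a_hit[i, j]: column B at row j equals column A at row i, with j below i.
# b_hit[i, j]: column A at row j equals column B at row i, with j below i.
a_hit = (a[:, None] == b[None, :]) & later
b_hit = (b[:, None] == a[None, :]) & later

# Index of the first hit in each row; n acts as the "never matched" sentinel.
a_first = np.where(a_hit.any(axis=1), a_hit.argmax(axis=1), n)
b_first = np.where(b_hit.any(axis=1), b_hit.argmax(axis=1), n)

# The earlier match wins; equal values (including "no match at all") give "None".
winner = np.select([a_first < b_first, b_first < a_first], ["A", "B"], default="None")
df["Winner"] = winner
print(df)
```

Processing the rows in chunks would keep the same idea while bounding the size of the boolean matrices.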

huangapple
  • Posted on June 9, 2023, 03:19:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76435071.html