2020-01-06 17:24:55

Pandas - Create new column where values are taken from other rows in the same dataframe

Question

One way to create a reply_to_sender column that shows the sender of the message being replied to is to merge the DataFrame with itself. The merge has to be a left merge (how='left'): the default inner merge would drop every message that is not a reply, because reply_to_id 0 matches no message_id. Rows without a match get NaN automatically, so no fillna call is needed.

import pandas as pd

# Sample DataFrame from the question
data = pd.DataFrame({
    'message_id': [1, 2, 3, 4, 5],
    'reply_to_id': [0, 1, 0, 2, 2],
    'sender': ['Roozbeh', 'Amir', 'Neda', 'Roozbeh', 'Neda']
})

# Left-merge the DataFrame with itself: each reply_to_id is matched against message_id,
# and rows with no match (reply_to_id == 0) are kept with NaN
data = data.merge(
    data[['message_id', 'sender']],
    left_on='reply_to_id',
    right_on='message_id',
    how='left',
    suffixes=('', '_reply_to'),
)

# Drop the duplicated key column and give the looked-up sender its final name
data = data.drop(columns='message_id_reply_to')
data = data.rename(columns={'sender_reply_to': 'reply_to_sender'})

print(data)

This produces the reply_to_sender column described in the example.
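As a hedged variation on the self-merge, the sketch below renames the lookup columns up front and asks pandas to check the assumption that message_id is unique. The names lookup and checked are illustrative only; validate and indicator are standard DataFrame.merge parameters.

```python
import pandas as pd

data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Lookup table: one row per message, renamed so that it joins on reply_to_id.
lookup = data[["message_id", "sender"]].rename(
    columns={"message_id": "reply_to_id", "sender": "reply_to_sender"}
)

# validate="many_to_one" raises a MergeError if the keys on the right are not unique.
# indicator=True adds a "_merge" column that marks unmatched rows as "left_only".
checked = data.merge(
    lookup, on="reply_to_id", how="left", validate="many_to_one", indicator=True
)
print(checked)
```

Rows marked left_only are the messages that are not replies (or that reply to an id missing from the data), which makes them easy to inspect before dropping the _merge column.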

I have a DataFrame like this:

   message_id  reply_to_id   sender
0           1            0  Roozbeh
1           2            1     Amir
2           3            0     Neda
3           4            2  Roozbeh
4           5            2     Neda

If a message was a reply to another message, reply_to_id shows the id of the message it replied to; otherwise it's 0. Now I want to create another column, reply_to_sender, that shows the name of the sender of the message being replied to (if the message wasn't a reply, it can show NaN).

The message_id column is unique, but reply_to_id and sender columns are obviously not.

I tried this:

data["reply_to_sender"] = data.loc[data["reply_to_id"] == data["message_id"]]["sender"]

But it obviously won't work, because it compares reply_to_id with message_id within each row rather than across rows. What I'm trying to do is to look at each row and then find the name of the sender from other rows. For the example above, the output needs to be like this:

   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
4           5            2     Neda            Amir

How can I do that?
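To make the failure of the attempt above concrete, here is a small sketch (the variable name mask is only illustrative) showing that the comparison is element-wise, so the selection is empty and the new column ends up entirely NaN:

```python
import pandas as pd

data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Element-wise comparison: row i's reply_to_id is compared with row i's own message_id.
mask = data["reply_to_id"] == data["message_id"]
print(mask.tolist())  # [False, False, False, False, False]

# The selection is empty, so index alignment leaves every row of the new column as NaN.
data["reply_to_sender"] = data.loc[mask, "sender"]
print(data)
```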

Answer 1

Score: 6

Use Series.map with a Series built from message_id (as the index) and sender (as the values):

df['reply_to_sender'] = df['reply_to_id'].map(df.set_index('message_id')['sender'])
print(df)
   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
4           5            2     Neda            Amir
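For readers who want to run the map approach end to end, here is a self-contained sketch with the sample data from the question. The intermediate name sender_by_id is illustrative, and the uniqueness check reflects the question's statement that message_id is unique.

```python
import pandas as pd

df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Mapping through a Series needs a unique index, so check the assumption first.
assert df["message_id"].is_unique

# Lookup table: index is message_id, values are the senders.
sender_by_id = df.set_index("message_id")["sender"]

# Each reply_to_id is looked up in that index; ids with no match (here 0) become NaN.
df["reply_to_sender"] = df["reply_to_id"].map(sender_by_id)
print(df)
```

Because the lookup is done once through the index, this stays a single vectorized step no matter how many rows the DataFrame has.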

Answer 2

Score: 0

First, let's see how you would do this yourself, by hand. Then we implement it in code.

If I give you a reply_to_id, you can tell me who this message was a reply to, by simply looking at the DataFrame, finding the row whose message_id is equal to that number, and then telling me the value in that row's sender column. This can be done like this, where the reply_to_id variable is the number I gave you:

data.loc[data["message_id"] == reply_to_id]["sender"]

Now this code returns a pandas.Series, but we didn't ask for a Series; we asked for a scalar value, namely the name of the sender. So we need to extract that value from the Series. If there's exactly one value in the Series (you need to check that), we can extract it with .values[0]. So the code becomes like this:

reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]

Now what would happen if I gave you a number that you didn't find in message_id? What would you do? You'd tell me that you found nothing. That translates to this:

reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]
else:
    return ""

There's one more thing we need to pay attention to. As you said, the values in reply_to_id can be zero. So we need to take care of that:

if reply_to_id != 0:
    reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
    if len(reply_to_sender_values) == 1:
        return reply_to_sender_values[0]
    else:
        return ""
else:
    return ""

As you can see, we've just built a function to do what you would do by hand. Let's give it a name:

def reply_to_sender(reply_to_id):
    if reply_to_id != 0:
        reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
        if len(reply_to_sender_values) == 1:
            return reply_to_sender_values[0]
        else:
            return ""
    else:
        return ""

All that's left is to apply this function to every value in the reply_to_id column of our DataFrame. Luckily, pandas has a method that does just that. And it's called, you guessed it, apply (here pandas.Series.apply, because a single column is a Series). Now it all comes together with this line of code:

data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(x))

One thing to note: I tested this code in a Jupyter Notebook, where data is a global variable that the function can see. If you run it from a script where data is not in the function's scope, pass the DataFrame to your reply_to_sender function explicitly. So the code changes to this:

def reply_to_sender(data, reply_to_id):
    if reply_to_id != 0:
        reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
        if len(reply_to_sender_values) == 1:
            return reply_to_sender_values[0]
        else:
            return ""
    else:
        return ""

data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(data, x))
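The apply-based approach can be tried out with the sketch below, which uses the sample data from the question. The function body is a flattened version of the one above with early returns (the name matches is illustrative); it still returns an empty string rather than NaN when there is nothing to look up.

```python
import pandas as pd

def reply_to_sender(data, reply_to_id):
    """Return the sender of message `reply_to_id`, or "" if there is none."""
    if reply_to_id == 0:
        return ""
    matches = data.loc[data["message_id"] == reply_to_id, "sender"].values
    # Exactly one match is expected because message_id is unique.
    return matches[0] if len(matches) == 1 else ""

data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(data, x))
print(data["reply_to_sender"].tolist())  # ['', 'Roozbeh', '', 'Amir', 'Amir']
```

Note that this scans the whole message_id column once per row, so the work grows roughly quadratically with the number of messages; a map through an indexed Series or a dictionary does each lookup directly.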

Answer 3

Score: 0

You can build a dictionary that maps each message_id to its sender:

mymap = {val: df.sender.loc[key] for key, val in df.message_id.to_dict().items()}

and then map reply_to_id through it:

df['reply_to_sender'] = df.reply_to_id.map(mymap)

This gives you

   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
4           5            2     Neda            Amir
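The same dictionary can arguably be built more directly by pairing the two columns, without going through the index. This is a sketch of that alternative, not the answerer's code:

```python
import pandas as pd

df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Pair each message_id with the sender on the same row.
mymap = dict(zip(df["message_id"], df["sender"]))
print(mymap)

# Keys missing from the dictionary (here 0) map to NaN.
df["reply_to_sender"] = df["reply_to_id"].map(mymap)
print(df)
```

If message_id were ever duplicated, the dictionary would silently keep the last sender seen for that id, so this relies on the ids being unique as stated in the question.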
