Pandas - Create new column where values are taken from other rows in the same dataframe

Question

One way to create the reply_to_sender column, which shows the sender of the message being replied to, is to merge the DataFrame with itself. Use a left merge so that messages that are not replies (reply_to_id of 0) are kept and get NaN:

import pandas as pd

# Sample DataFrame from the question
data = pd.DataFrame({
    'message_id': [1, 2, 3, 4, 5],
    'reply_to_id': [0, 1, 0, 2, 2],
    'sender': ['Roozbeh', 'Amir', 'Neda', 'Roozbeh', 'Neda']
})

# Left-merge the DataFrame with itself, matching reply_to_id against message_id.
# how='left' keeps rows with no match (reply_to_id == 0); they get NaN.
data = data.merge(
    data[['message_id', 'sender']],
    how='left',
    left_on='reply_to_id',
    right_on='message_id',
    suffixes=('', '_reply_to'),
)

# Drop the duplicated key column and rename the looked-up sender column
data = data.drop(columns='message_id_reply_to')
data = data.rename(columns={'sender_reply_to': 'reply_to_sender'})

print(data)

This produces the reply_to_sender column described in the example, with NaN for messages that are not replies.

I have a DataFrame like this:

   message_id  reply_to_id   sender
0           1            0  Roozbeh
1           2            1     Amir
2           3            0     Neda
3           4            2  Roozbeh
4           5            2     Neda

If a message is a reply to another message, reply_to_id holds the id of the message it replies to; otherwise it is 0. I want to create another column, reply_to_sender, that shows the sender of the message being replied to (NaN if the message is not a reply).

The message_id column is unique, but the reply_to_id and sender columns are not.

I tried this:

data["reply_to_sender"] = data.loc[data["reply_to_id"] == data["message_id"]]["sender"]

But this doesn't work, because it compares reply_to_id with message_id within the same row. What I want is to take each row's reply_to_id and find the sender in the row whose message_id matches it. For the example above, the output should look like this:

   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
4           5            2     Neda            Amir

How can I do that?

Answer 1

Score: 6

Use Series.map with a lookup Series that has message_id as its index and sender as its values:

df['reply_to_sender'] = df['reply_to_id'].map(df.set_index('message_id')['sender'])
print(df)
   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
4           5            2     Neda            Amir

Values of reply_to_id that don't appear in message_id (here 0) map to NaN.
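
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The sample data is copied from the question, and the intermediate variable name lookup is only illustrative. It assumes message_id is unique, as the question states.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Lookup Series: index = message_id, values = sender
lookup = df.set_index("message_id")["sender"]

# map() looks up each reply_to_id in the lookup's index;
# ids with no match (here 0) become NaN
df["reply_to_sender"] = df["reply_to_id"].map(lookup)

print(df)
```

Splitting out the lookup Series makes it easy to inspect the mapping before applying it.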

Answer 2

Score: 0

First, let's see how you would do this yourself, by hand. Then we implement it in code.

If I give you a reply_to_id, you can tell me who this message was a reply to, by simply looking at the DataFrame, finding the row whose message_id is equal to that number, and then telling me the value in that row's sender column. This can be done like this, where the reply_to_id variable is the number I gave you:

data.loc[data["message_id"] == reply_to_id]["sender"]

Now this code returns a pandas.Series, but we want a scalar: the name of the sender. So we need to extract that value from the Series. If the Series holds exactly one value (you need to check that), we can extract it with .values[0]. So the code becomes:

reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]

Now what would happen if I gave you a number that doesn't appear in message_id? You'd tell me that you found nothing. That translates to this:

reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]
else:
    return ""

There's one more thing to handle. As you said, reply_to_id can be 0, which means the message is not a reply. So we need to take care of that:

if(reply_to_id != 0):
    reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
    if len(reply_to_sender_values) == 1:
        return reply_to_sender_values[0]
    else:
        return ""
else:
    return ""

As you can see, we've just built a function to do what you would do by hand. Let's give it a name:

def reply_to_sender(reply_to_id):
    if(reply_to_id != 0):
        reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
        if len(reply_to_sender_values) == 1:
            return reply_to_sender_values[0]
        else:
            return ""
    else:
        return ""

All that's left is to apply this function to every value in the reply_to_id column of our DataFrame. Luckily, pandas has a method that does just that. Since data["reply_to_id"] is a Series, the method is pandas.Series.apply. Now it all comes together with this line of code:

data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(x))

One thing to note: I tested this code in a Jupyter Notebook, where the function reads data as a global variable. If you run it from a script, or simply want to avoid relying on a global, pass the DataFrame to reply_to_sender as an argument. The code then becomes:

def reply_to_sender(data, reply_to_id):
    if(reply_to_id != 0):
        reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
        if len(reply_to_sender_values) == 1:
            return reply_to_sender_values[0]
        else:
            return ""
    else:
        return ""

data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(data, x))
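
To try the whole thing in one go, here is a compact, self-contained sketch of the same function together with the sample data from the question. The early return for 0 and the docstring are my own restructuring and do not change what the function returns. Keep in mind that each call scans the whole DataFrame, so this can be slow on large data.

```python
import pandas as pd


def reply_to_sender(data, reply_to_id):
    """Return the sender of the message with the given id, or "" if there is none."""
    if reply_to_id == 0:
        # 0 means the message is not a reply
        return ""
    matches = data.loc[data["message_id"] == reply_to_id, "sender"].values
    # Only accept an unambiguous, single match
    return matches[0] if len(matches) == 1 else ""


# Sample data from the question
data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(data, x))
print(data)
```

Note that this version fills non-replies with an empty string, not the NaN asked for in the question.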

Answer 3

Score: 0

You can build a dictionary that maps each message_id to its sender:

mymap = {val: df.sender.loc[key] for key, val in df.message_id.to_dict().items()}

and then map reply_to_id through it:

df['reply_to_sender'] = df.reply_to_id.map(mymap)

This gives you

   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
4           5            2     Neda            Amir
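
The same dictionary can probably be built a little more directly. The sketch below is a variation and not part of the original answer: it pairs the two columns with dict(zip(...)) and uses the sample data from the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Pair each message_id with the sender on the same row
mymap = dict(zip(df["message_id"], df["sender"]))

# Keys missing from the dictionary (here 0) map to NaN
df["reply_to_sender"] = df["reply_to_id"].map(mymap)
print(df)
```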

huangapple
  • Posted on January 6, 2020, 17:24:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59609524.html