Pandas - Create new column where values are taken from other rows in the same dataframe
Question
I understand that you want to create a new column called reply_to_sender in your DataFrame that shows the name of the sender of the message being replied to. You can achieve this with the merge method in pandas. Here's the code to do it:
import pandas as pd
# Your DataFrame
data = pd.DataFrame({
    'message_id': [1, 2, 3, 4, 5],
    'reply_to_id': [0, 1, 0, 2, 2],
    'sender': ['Roozbeh', 'Amir', 'Neda', 'Roozbeh', 'Neda']
})
# Left-merge the DataFrame with itself so that non-reply rows (reply_to_id == 0) are kept
data = data.merge(data[['message_id', 'sender']], how='left', left_on='reply_to_id', right_on='message_id', suffixes=('', '_reply_to'))
data = data.drop(columns='message_id_reply_to').rename(columns={'sender_reply_to': 'reply_to_sender'})
# Rows that are not replies have no match, so reply_to_sender is already NaN for them
print(data)
This code will create the reply_to_sender column as you described in your example.
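The self-merge idea can also be sketched with a renamed lookup table, which avoids suffixes altogether. This is a minimal sketch on the sample data from the question; the names lookup and merged are illustrative, and validate="many_to_one" is an optional guard that assumes message_id is unique, as the question states.

```python
import pandas as pd

data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Lookup table: one row per message, with columns renamed so they do not
# clash with the columns of the left-hand frame.
lookup = data[["message_id", "sender"]].rename(
    columns={"message_id": "reply_to_id", "sender": "reply_to_sender"}
)

# how="left" keeps every message, including non-replies (reply_to_id == 0),
# which find no match and therefore get NaN.
# validate="many_to_one" raises an error if message_id is not unique.
merged = data.merge(lookup, on="reply_to_id", how="left", validate="many_to_one")
print(merged)
```

A left merge preserves the row order of the left-hand frame, so merged lines up with the original rows.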
I have a DataFrame like this:
   message_id  reply_to_id   sender
0           1            0  Roozbeh
1           2            1     Amir
2           3            0     Neda
3           4            2  Roozbeh
3           5            2     Neda
If the message was a reply to another message, reply_to_id shows the id of the message that it was replied to; otherwise it's 0. Now I want to create another column, reply_to_sender, that shows the name of the sender of the message that was replied to (if it wasn't a reply, it can show NaN).
The message_id column is unique, but the reply_to_id and sender columns are obviously not.
I tried this:
data["reply_to_sender"] = data.loc[data["reply_to_id"] == data["message_id"]]["sender"]
But it obviously won't work, because it compares reply_to_id with message_id within each row. What I'm trying to do is to look at each row and then find the name of the sender from other rows. For the example above, the output needs to be like this:
   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
3           5            2     Neda            Amir
How can I do that?
Answer 1
Score: 6
Use Series.map with a Series that has message_id as its index and sender as its values:
df['reply_to_sender'] = df['reply_to_id'].map(df.set_index('message_id')['sender'])
print(df)
   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
3           5            2     Neda            Amir
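For reference, here is a self-contained sketch of the same Series.map lookup on the sample data from the question. The intermediate name sender_by_id is only illustrative, and the uniqueness check reflects the question's statement that message_id is unique, which the lookup relies on.

```python
import pandas as pd

df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# The lookup needs a unique index, so fail early if message_id has duplicates.
assert df["message_id"].is_unique

# Series indexed by message_id whose values are the senders.
sender_by_id = df.set_index("message_id")["sender"]

# map() looks each reply_to_id up in that index; ids with no match
# (here 0, meaning "not a reply") become NaN.
df["reply_to_sender"] = df["reply_to_id"].map(sender_by_id)
print(df)
```

Because the lookup is a single vectorized operation, it avoids scanning the DataFrame once per row.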
Answer 2
Score: 0
First, let's see how you would do this yourself, by hand. Then we implement it in code.
If I give you a reply_to_id, you can tell me who this message was a reply to by simply looking at the DataFrame, finding the row whose message_id is equal to that number, and then telling me the value in that row's sender column. This can be done like this, where the reply_to_id variable is the number I gave you:
data.loc[data["message_id"] == reply_to_id]["sender"]
Now this code returns a pandas.Series, but we didn't ask for a Series; we asked for a scalar value, the name of the sender. So we need to extract that value from the Series. If there's only one value in the Series (you need to check that), we can extract it using pandas.Series.values[0]. So the code becomes this:
reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]
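The extraction step can be tried in isolation. This is a small sketch using the sample data from the question; the id 2 is just an example value picked for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

reply_to_id = 2  # example id, chosen for illustration

# Filtering rows and selecting one column gives a Series, not a scalar.
matches = data.loc[data["message_id"] == reply_to_id, "sender"]
print(type(matches).__name__)  # Series

# .values exposes the underlying array; [0] picks the single matching value.
if len(matches) == 1:
    print(matches.values[0])  # Amir
```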
Now what would happen if I gave you a number that you didn't find in message_id? What would you do? You'd tell me that you found nothing. That translates to this:
reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]
else:
    return ""
There's one more thing we need to pay attention to. As you said, the values in reply_to_id can be zero. So we need to take care of that:
if reply_to_id != 0:
    reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
    if len(reply_to_sender_values) == 1:
        return reply_to_sender_values[0]
    else:
        return ""
else:
    return ""
As you can see, we've just built a function to do what you would do by hand. Let's give it a name:
def reply_to_sender(reply_to_id):
    if reply_to_id != 0:
        reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
        if len(reply_to_sender_values) == 1:
            return reply_to_sender_values[0]
        else:
            return ""
    else:
        return ""
All that's left to do is to find a way to apply this function to all the rows in the reply_to_id column of our DataFrame. Luckily, pandas has a method that does just that. Since reply_to_id is a single column, the method is, you guessed it, pandas.Series.apply. Now it all comes together with this line of code:
data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(x))
One thing to note: I tested this code in a Jupyter Notebook, where data is a global variable. If the function cannot see data (for example, because it is defined in another module), you need to pass the DataFrame to your reply_to_sender function. So the code changes to this:
def reply_to_sender(data, reply_to_id):
    if reply_to_id != 0:
        reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
        if len(reply_to_sender_values) == 1:
            return reply_to_sender_values[0]
        else:
            return ""
    else:
        return ""
data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(data, x))
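Put together with the sample data from the question, the approach can be run end to end. This is a condensed sketch of the function above, not a different method; the only changes are the early return and the conditional expression.

```python
import pandas as pd


def reply_to_sender(data, reply_to_id):
    """Return the sender of message `reply_to_id`, or "" if there is none."""
    if reply_to_id == 0:  # 0 means the message is not a reply
        return ""
    values = data.loc[data["message_id"] == reply_to_id, "sender"].values
    # Exactly one match is expected because message_id is unique.
    return values[0] if len(values) == 1 else ""


data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Series.apply calls the function once per value in reply_to_id.
data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(data, x))
print(data)
```

Note that this version fills non-replies with an empty string rather than NaN, and that each call scans the whole message_id column, so it will likely be slower than a map or merge on large DataFrames.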
Answer 3
Score: 0
You can do
mymap = {val: df.sender.loc[key] for key, val in df.message_id.to_dict().items()}
and then
df['reply_to_sender'] = df.reply_to_id.map(mymap)
This gives you
   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
3           5            2     Neda            Amir
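The same dictionary can probably be built more directly by pairing the two columns. This is a minimal sketch on the sample data from the question; the name id_to_sender is illustrative, and it assumes message_id is unique, since later duplicates would silently overwrite earlier keys.

```python
import pandas as pd

df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Pair each message_id with its sender: {1: 'Roozbeh', 2: 'Amir', ...}
id_to_sender = dict(zip(df["message_id"], df["sender"]))

# map() with a plain dict turns keys that are missing from it (here 0) into NaN.
df["reply_to_sender"] = df["reply_to_id"].map(id_to_sender)
print(df)
```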