How to create a third column based on the first two columns in a pandas DataFrame?

Question


I have a DataFrame like the one below:

col-a col-b
abc 123
def 456
ghi 789

I have a string template, str = f"https://{val1}.{val2}", and I want to use it to create col-c like below:

col-a col-b col-c
abc 123 https://abc.123
def 456 https://def.456
ghi 789 https://ghi.789

The DataFrame is big, and I want to use np.where/np.select because I think .apply() will be slow. Even with .apply(), I am unable to combine the two column values into col-c. Could anyone help?

Answer 1

Score: 1


You can define a function that receives the two values you want to concatenate and returns the result:

import pandas as pd

# create example dataframe
df = pd.DataFrame({'column_1': ['abc', 'def', 'ghi'], 'column_2': [123, 456, 789]})

# define a function that builds the URL string from two values
def build_url(val1, val2):
    return f"https://{val1}.{val2}"

# apply the function to each row of the dataframe (axis=1 passes one row at a time)
df['new_column'] = df.apply(lambda row: build_url(row['column_1'], row['column_2']), axis=1)

print(df)

This produces the desired DataFrame:

  column_1  column_2       new_column
0      abc       123  https://abc.123
1      def       456  https://def.456
2      ghi       789  https://ghi.789
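
If the row-wise call to .apply() turns out to be a bottleneck, one alternative is to loop over the two columns in parallel with zip and build the strings in a list comprehension. The following is only a minimal sketch on the same example frame; it reuses the column names from this answer, and whether it is actually faster on your data is an assumption you should check by timing it.

```python
import pandas as pd

# Same example frame as in the answer above.
df = pd.DataFrame({'column_1': ['abc', 'def', 'ghi'], 'column_2': [123, 456, 789]})

# Pair up the values of the two columns and format one URL per pair,
# without calling df.apply(axis=1).
df['new_column'] = [
    f"https://{a}.{b}" for a, b in zip(df['column_1'], df['column_2'])
]

print(df)
```

This keeps the f-string template from the question intact, which can be convenient when the format is more involved than a simple join.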

Answer 2

Score: 1


Using your DataFrame as input:

import pandas as pd

df = pd.DataFrame(
    {
        'col-a': ['abc', 'def', 'ghi'],
        'col-b': [123, 456, 789],
    }
)

I timed one version that uses .apply() and another that uses string concatenation:

%timeit df['col-c'] = df.apply(lambda row : f"https://{row['col-a']}.{row['col-b']}", axis = 1)

499 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['col-c'] = "https://" + df['col-a'] + "." + df['col-b'].astype(str)

347 µs ± 1.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Edit: I repeated the DataFrame 10,000 times, giving 30,000 rows:

df = pd.concat([df] * 10000, axis = 0)

Now .apply() takes:

%timeit df['col-c'] = df.apply(lambda row : f"https://{row['col-a']}.{row['col-b']}", axis = 1)

195 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And string concatenation takes:

%timeit df['col-c'] = "https://" + df['col-a'] + "." + df['col-b'].astype(str)

11.3 ms ± 91.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

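So string concatenation looks like the faster approach: at 30,000 rows it is roughly 17 times faster than .apply() (11.3 ms versus 195 ms), while on the 3-row frame the difference is small.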
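
For reference, here is the concatenation approach as a self-contained script, together with an equivalent spelling that uses Series.str.cat. This is a sketch on the 3-row example only; the variable name alt is just for illustration, and the str.cat variant was not part of the timings above, so its speed is not measured here.

```python
import pandas as pd

df = pd.DataFrame({'col-a': ['abc', 'def', 'ghi'], 'col-b': [123, 456, 789]})

# Operator-based concatenation, as timed above.
# col-b holds integers, so it is converted to strings first.
df['col-c'] = "https://" + df['col-a'] + "." + df['col-b'].astype(str)

# Equivalent result using Series.str.cat with "." as the separator.
alt = "https://" + df['col-a'].str.cat(df['col-b'].astype(str), sep=".")

print(df)
print(alt.equals(df['col-c']))
```

Both versions operate on whole columns at once, so no np.where or np.select is needed: those functions choose between values based on a condition, and there is no condition in this task.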

huangapple
  • Posted on March 4, 2023, 00:10:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75629423.html