2023-03-04 00:10:49

How to create a third column based on the first two columns in a pandas dataframe?

Question

I have a dataframe like the one below:

col-a	col-b
abc	123
def	456
ghi	789

I have a string template, str = f"https://{val1}.{val2}", and I want to use it to create col-c as shown below:

col-a	col-b	col-c
abc	123	https://abc.123
def	456	https://def.456
ghi	789	https://ghi.789

The dataframe is large, and I would like to use np.where/np.select because I think .apply() will be slow. Even with .apply(), I am unable to combine the two column values into col-c. Could anyone help?

Answer 1

Score: 1

You can define a function that receives the two values you want to combine and returns the result:

import pandas as pd

# create example dataframe
df = pd.DataFrame({'column_1': ['abc', 'def', 'ghi'], 'column_2': [123, 456, 789]})

# define a function that builds the URL string from two values
def build_url(val1, val2):
    return f"https://{val1}.{val2}"

# apply the function to each row of the dataframe
df['new_column'] = df.apply(lambda row: build_url(row['column_1'], row['column_2']), axis=1)

print(df)

This produces the desired dataframe:

  column_1  column_2       new_column
0      abc       123  https://abc.123
1      def       456  https://def.456
2      ghi       789  https://ghi.789
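
If the row-wise .apply() call turns out to be a bottleneck, one possible alternative is to iterate over the two columns directly with zip, which avoids constructing a Series object for every row. This is only a minimal sketch under the same example data; the helper name build_url is illustrative and not part of pandas.

```python
import pandas as pd

# Same example data as in the answer above.
df = pd.DataFrame({'column_1': ['abc', 'def', 'ghi'], 'column_2': [123, 456, 789]})

def build_url(val1, val2):
    """Hypothetical helper: format two values into the URL string."""
    return f"https://{val1}.{val2}"

# Walk both columns in lockstep instead of materializing each row.
df['new_column'] = [build_url(a, b) for a, b in zip(df['column_1'], df['column_2'])]

print(df)
```

The list comprehension still runs in Python, so it is not truly vectorized, but it keeps the formatting logic in one reusable function.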

Answer 2

Score: 1

Using your dataframe as input:

import pandas as pd

df = pd.DataFrame(
    {
        'col-a': ['abc', 'def', 'ghi'],
        'col-b': [123, 456, 789],
    }
)

I timed one version that uses .apply() and another that uses vectorized string concatenation:

%timeit df['col-c'] = df.apply(lambda row : f"https://{row['col-a']}.{row['col-b']}", axis = 1)
499 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['col-c'] = "https://" + df['col-a'] + "." + df['col-b'].astype(str)
347 µs ± 1.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Edit: I repeated the dataframe 10,000 times to get 30,000 rows:

df = pd.concat([df] * 10000, axis = 0)

Now .apply() takes

%timeit df['col-c'] = df.apply(lambda row : f"https://{row['col-a']}.{row['col-b']}", axis = 1)
195 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And string concatenation takes

%timeit df['col-c'] = "https://" + df['col-a'] + "." + df['col-b'].astype(str)
11.3 ms ± 91.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

String concatenation appears to be the faster approach: on the 30,000-row dataframe it is roughly 17 times faster than .apply() (11.3 ms versus 195 ms).
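
The %timeit lines above only work inside IPython or Jupyter. For a plain Python script, the comparison could be reproduced roughly as follows with the standard timeit module. This is a sketch, not a rigorous benchmark: the function names with_apply and with_concat, the 1,000x repetition, and the run count are assumptions chosen for illustration, and the timings will vary by machine and pandas version. Note that np.where and np.select choose between values based on conditions, so they are not needed here; plain column arithmetic on strings already runs column-wise.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'col-a': ['abc', 'def', 'ghi'], 'col-b': [123, 456, 789]})

# Repeat the three rows to get a larger frame; ignore_index gives a clean 0..n-1 index.
big = pd.concat([df] * 1000, ignore_index=True)

def with_apply(frame):
    """Row-wise formatting via DataFrame.apply (hypothetical helper name)."""
    return frame.apply(lambda row: f"https://{row['col-a']}.{row['col-b']}", axis=1)

def with_concat(frame):
    """Column-wise string concatenation; the integer column is cast to str first."""
    return "https://" + frame['col-a'] + "." + frame['col-b'].astype(str)

# Sanity check: both approaches should build the same strings.
assert with_apply(big).tolist() == with_concat(big).tolist()

# Average wall-clock time over a handful of runs.
for fn in (with_apply, with_concat):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.2f} ms per call")

# Assign the result as the new column.
big['col-c'] = with_concat(big)
```

The .astype(str) cast matters: adding a string to an integer column directly would raise a TypeError.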
