How to concatenate contents with line breaks in a DataFrame
Question
I have three columns (col1, col2, col3) containing strings with line breaks. I want to concatenate col1, col2, and col3 into a new column (DesiredCol). Here is the code and dataset:
import pandas as pd
d = {'col1': ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"], 'col2': ["A\nB\nC", "A\nB\nC"], 'col3': ["YES\nNO\nYES", "NO\nNO\nYES"]}
df = pd.DataFrame(data=d)
I attempted to use a lambda function as follows, but it didn't give the desired column:
cols=['col1','col2','col3']
df['DesiredCol'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
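To see why the attempt falls short, it may help to look at a single row as plain strings. The sketch below assumes, as the answers do, that the goal is to pair the first lines of each cell, then the second lines, and so on. The names row, attempt, and paired are only illustrative.

```python
# One row of the example data, written as plain strings.
row = ["ABC1\nABC2\nABC3", "A\nB\nC", "YES\nNO\nYES"]

# What the attempted lambda does: glue the three whole cells together,
# leaving the embedded line breaks where they were.
attempt = " ".join(row)

# What is presumably wanted: split each cell into lines, pair the n-th
# lines across cells, and only then put the line breaks back.
paired = "\n".join(
    " ".join(parts) for parts in zip(*(cell.split("\n") for cell in row))
)

print(repr(attempt))  # 'ABC1\nABC2\nABC3 A\nB\nC YES\nNO\nYES'
print(repr(paired))   # 'ABC1 A YES\nABC2 B NO\nABC3 C YES'
```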
Answer 1
Score: 2
Here is one possible solution. The key is to use the built-in zip function to join the corresponding elements, and then to call apply on the dataframe to insert the desired newline characters:
A) Edit: Fastest Solution
> (Thanks to mozway.)
> For an understanding of what this code achieves, and how each operation changes the data step by step into the desired column output, see solution "B" below. However, for large dataframes, it is better to avoid applymap and instead combine map and zip in a list comprehension, as shown here.
df['DesiredCol'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
                    for x in zip(*(df[c] for c in cols))]
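The one-liner packs several steps together. As a rough illustration, here is the same logic unrolled for a single row using plain Python; the intermediate names (split_cells, grouped, lines, result) are made up for this sketch and do not appear in the answer.

```python
# One row of the example data, i.e. one value of x in the outer loop.
x = ("ABC1\nABC2\nABC3", "A\nB\nC", "YES\nNO\nYES")

# Step 1: split every cell into its lines.
split_cells = [y.splitlines() for y in x]

# Step 2: zip(*...) regroups the lists so that the n-th lines sit together.
grouped = list(zip(*split_cells))

# Step 3: join each group with a space to form one output line.
lines = list(map(" ".join, grouped))

# Step 4: join the output lines with the chosen line separator.
result = "\n\n".join(lines)

print(grouped[0])  # ('ABC1', 'A', 'YES')
print(lines)       # ['ABC1 A YES', 'ABC2 B NO', 'ABC3 C YES']
```

Note that zip stops at the shortest input, so if the cells in a row have different numbers of lines, the extra lines are dropped silently. itertools.zip_longest with a fillvalue is one option if padding is preferred.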
B) Original Solution
> Providing a breakdown of the dataframe operations. *Note: applymap is a simple way of applying a function to every element (or cell) of a dataframe and is shown here to illustrate the intermediate states of the dataframe, but there is usually a faster approach.
Full code:
Computationally more efficient solution:
> However, solution "A" above will be faster and is recommended. Using the transpositions as shown here is not the most efficient approach.
df = pd.DataFrame(data=d)
df2 = df.T.apply(
    lambda x: list(
        map(" ".join, list(zip(*[y.split("\n") for y in x.values])))
    )
).T
cols = df2.columns
df['col4'] = df2[cols].astype(str).apply('\n'.join, axis=1)
Full code for solution as used in explained steps below:
df2 = df.applymap(lambda s: s.split('\n'))
df3 = df2.T.apply(lambda x: list(zip(*x.values)))
df3 = df3.T.apply(lambda x: x.str.join(' '))
cols = df3.columns
df['col4'] = df3[cols].astype(str).apply('\n'.join, axis=1)
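One caveat when running this on a recent installation: as of pandas 2.1, DataFrame.applymap is deprecated in favour of DataFrame.map. A minimal sketch of a version-tolerant first step follows; the helper name split_cells is hypothetical and not part of the original answer.

```python
import pandas as pd

d = {
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
}
df = pd.DataFrame(data=d)


def split_cells(frame):
    """Split every cell of `frame` on newlines (hypothetical helper)."""
    # Prefer DataFrame.map where it exists; fall back to applymap otherwise.
    if hasattr(pd.DataFrame, "map"):
        return frame.map(lambda s: s.split("\n"))
    return frame.applymap(lambda s: s.split("\n"))


df2 = split_cells(df)
print(df2.loc[0, "col1"])  # ['ABC1', 'ABC2', 'ABC3']
```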
Breakdown of df operations
1) Use applymap* to split each cell in columns 1-3 on the newline character
import pandas as pd
d = {
    "col1": [
        "ABC1" + "\n" + "ABC2" + "\n" + "ABC3",
        "BBC1" + "\n" + "BBC2" + "\n" + "BBC3",
    ],
    "col2": ["A" + "\n" + "B" + "\n" + "C", "A" + "\n" + "B" + "\n" + "C"],
    "col3": [
        "YES" + "\n" + "NO" + "\n" + "YES",
        "NO" + "\n" + "NO" + "\n" + "YES",
    ],
}
df = pd.DataFrame(data=d)
df2 = df.applymap(lambda s: s.split("\n"))
df2
2) Then, apply zip to join the corresponding elements of the lists from each column
df3 = df2.T.apply(lambda x: list(zip(*x.values)))
df3 = df3.T.apply(lambda x: x.str.join(" "))
df3
3) Lastly, join the zipped, multi-column-concatenated lists into newline-delimited strings
cols = df3.columns
df["col4"] = df3[cols].astype(str).apply("\n".join, axis=1)
df
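which gives the final solution: for the first row, col4 holds 'ABC1 A YES\nABC2 B NO\nABC3 C YES', and for the second row, 'BBC1 A NO\nBBC2 B NO\nBBC3 C YES'.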
Answer 2
Score: 2
Using a list comprehension with zip:
df['out'] = [list(map(' '.join, zip(*(y.splitlines() for y in x))))
             for x in zip(df['col1'], df['col2'], df['col3'])]
col1 col2 col3 out
0 ABC1\nABC2\nABC3 A\nB\nC YES\nNO\nYES [ABC1 A YES, ABC2 B NO, ABC3 C YES]
1 BBC1\nBBC2\nBBC3 A\nB\nC NO\nNO\nYES [BBC1 A NO, BBC2 B NO, BBC3 C YES]
And for a double line break:
df['out'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
             for x in zip(df['col1'], df['col2'], df['col3'])]
col1 col2 col3 out
0 ABC1\nABC2\nABC3 A\nB\nC YES\nNO\nYES ABC1 A YES\n\nABC2 B NO\n\nABC3 C YES
1 BBC1\nBBC2\nBBC3 A\nB\nC NO\nNO\nYES BBC1 A NO\n\nBBC2 B NO\n\nBBC3 C YES
generalization to an arbitrary number of columns
cols = ['col1', 'col2', 'col3']
# for all columns
# cols = list(df)
df['DesiredCol'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
                    for x in zip(*(df[c] for c in cols))]
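If the same operation is needed in several places, the generalized comprehension could be wrapped in a small function. This is only a sketch: the function name concat_lines and its word_sep and line_sep parameters are hypothetical, introduced here so that both separators can be chosen by the caller.

```python
import pandas as pd


def concat_lines(df, cols, word_sep=" ", line_sep="\n"):
    """Return one string per row, pairing the n-th lines of the cells in `cols`."""
    return [
        line_sep.join(map(word_sep.join, zip(*(cell.splitlines() for cell in row))))
        for row in zip(*(df[c] for c in cols))
    ]


d = {
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
}
df = pd.DataFrame(data=d)

# Double line break between output lines, as in the answer above.
df["DesiredCol"] = concat_lines(df, ["col1", "col2", "col3"], line_sep="\n\n")
print(repr(df.loc[0, "DesiredCol"]))  # 'ABC1 A YES\n\nABC2 B NO\n\nABC3 C YES'
```

Because the columns are passed explicitly, running the assignment a second time does not feed DesiredCol back into the input.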
comparison of answers
difference in behavior
- John's and Mark's approaches use all columns, so if you repeat the operation, the output column is included in the input. To avoid this in the timing, I added to every approach a step that checks for and removes the output column before running the code. I also added an alternative approach in my answer to handle an arbitrary number of columns.
- Pravash's approach uses the same string to join the string chunks and the joined substrings, which I ignored for the timing, considering it a valid output.
timings for a varying number of rows
John's code is likely slower due to the use of applymap and of pandas to perform the transposition operations (constructing the intermediate DataFrames is expensive). The bottlenecks in Mark's approach are the use of explode and groupby.
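For readers who want to measure this on their own machine, a rough timing harness might look like the following. It is a sketch, not the benchmark used for this answer: the function names, the row counts, and the use of timeit are all assumptions, and only two of the approaches are included (the explode/groupby one is restricted to the input columns here).

```python
import timeit

import pandas as pd

cols = ["col1", "col2", "col3"]
base = pd.DataFrame({
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
})


def list_comp(df):
    # List comprehension with zip, single newline between output lines.
    return ["\n".join(map(" ".join, zip(*(y.splitlines() for y in x))))
            for x in zip(*(df[c] for c in cols))]


def explode_groupby(df):
    # explode/groupby approach, restricted to the input columns.
    o = df[cols].apply(lambda s: s.str.split("\n").explode()).apply(" ".join, axis=1)
    return o.groupby(level=0).agg("\n".join).tolist()


for n_rows in (100, 1_000):  # arbitrary sizes chosen for illustration
    df = pd.concat([base] * (n_rows // 2), ignore_index=True)
    assert list_comp(df) == explode_groupby(df)  # both give the same strings
    for func in (list_comp, explode_groupby):
        seconds = timeit.timeit(lambda: func(df), number=3) / 3
        print(f"{func.__name__:>16} rows={n_rows:>5}: {seconds:.5f} s per run")
```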
Answer 3
Score: 1
You can try the code below.
I used zip and split('\n') to build one tuple per line across the three columns.
df["DesiredCol"] = df.apply(lambda row: " ".join(value for col in zip(row["col1"].split("\n"), row["col2"].split("\n"), row["col3"].split("\n")) for value in col), axis=1)
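As written, the lambda joins every piece with a single space, so the line structure is flattened into one line per row. If separate lines are wanted, one possible adjustment is to use two different separators. The sketch below assumes a single newline is the desired line separator; the cols list is introduced only for illustration.

```python
import pandas as pd

d = {
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
}
df = pd.DataFrame(data=d)
cols = ["col1", "col2", "col3"]

df["DesiredCol"] = df.apply(
    lambda row: "\n".join(      # newline between output lines
        " ".join(parts)         # space between the pieces of one line
        for parts in zip(*(row[c].split("\n") for c in cols))
    ),
    axis=1,
)
print(repr(df.loc[0, "DesiredCol"]))  # 'ABC1 A YES\nABC2 B NO\nABC3 C YES'
```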
Answer 4
Score: 1
Here's another answer
o = df.apply(lambda x: x.str.split('\n').explode()).apply(lambda x: ' '.join(x), axis=1)
df['new col'] = o.groupby(o.index).apply(lambda x: '\n'.join(x)).reset_index(drop=True)
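As written, the snippet processes every column of df, so running it a second time would also pick up the new column. A hedged variant that restricts the work to a chosen list of columns is sketched below; the cols list, the add_new_col wrapper, and the use of groupby(level=0).agg are choices made for this sketch, not part of the original answer.

```python
import pandas as pd

d = {
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
}
df = pd.DataFrame(data=d)
cols = ["col1", "col2", "col3"]


def add_new_col(frame):
    # One row per line: split each selected column and explode it.
    exploded = frame[cols].apply(lambda s: s.str.split("\n").explode())
    # Join the pieces of each exploded row with a space.
    joined = exploded.apply(" ".join, axis=1)
    # Regroup by the original row label and join the lines with a newline.
    frame["new col"] = joined.groupby(level=0).agg("\n".join)
    return frame


df = add_new_col(df)
df = add_new_col(df)  # repeating is harmless: "new col" is not in cols
print(repr(df.loc[0, "new col"]))  # 'ABC1 A YES\nABC2 B NO\nABC3 C YES'
```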