How to concatenate contents with line breaks in a DataFrame

Question

I have three columns (col1, col2, col3) containing strings with line breaks. I want to concatenate col1, col2, and col3 into a new column (DesiredCol). Here is the code and dataset:

import pandas as pd

d = {'col1': ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"], 'col2': ["A\nB\nC", "A\nB\nC"], 'col3': ["YES\nNO\nYES", "NO\nNO\nYES"]}

df = pd.DataFrame(data=d)

I attempted to use a lambda function as follows, but it joins the whole cells end to end instead of combining them line by line, so it does not give the desired column:

cols=['col1','col2','col3']

df['DesiredCol'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
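
To make the gap concrete, here is a minimal sketch in plain Python (no pandas needed) of what the attempt above computes for the first row, next to a line-by-line result. The variable names are illustrative only, and a single newline between the output lines is an assumption; the answers below also show a double line break.

```python
# Cell values of the first row of df, one string per column.
row = ["ABC1\nABC2\nABC3", "A\nB\nC", "YES\nNO\nYES"]

# What ' '.join(row.values.astype(str)) does: whole cells are glued end to end.
naive = " ".join(row)
print(repr(naive))  # 'ABC1\nABC2\nABC3 A\nB\nC YES\nNO\nYES'

# Line-by-line: the first lines together, then the second lines, and so on.
wanted = "\n".join(
    " ".join(parts) for parts in zip(*(cell.split("\n") for cell in row))
)
print(repr(wanted))  # 'ABC1 A YES\nABC2 B NO\nABC3 C YES'
```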


Answer 1

Score: 2


Here is one possible solution. The key is to use the built-in zip function to pair up the corresponding lines of each column, and then to join the results with the desired newline characters:

A) Edit: Fastest Solution

> (Thanks to mozway.)
> For an understanding of what this code achieves, and how each operation changes the data step by step into the desired column output, see solution "B" below. However, for large dataframes, it is better to avoid the use of applymap and instead combine map and zip in a list comprehension as shown here.

df['DesiredCol'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
                    for x in zip(*(df[c] for c in cols))]
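
For readers who find the one-liner dense, the same per-row logic can be unpacked into a small helper. This is only a sketch: the name combine_lines and its parameters are hypothetical and do not appear in the answer.

```python
def combine_lines(cells, word_sep=" ", line_sep="\n\n"):
    """Join the i-th lines of every cell with word_sep, then join those with line_sep."""
    # Split each cell into its lines: one list of lines per column.
    per_column = [cell.splitlines() for cell in cells]
    # zip(*...) transposes: one tuple per line position, with a value from every column.
    per_line = zip(*per_column)
    # Join across columns first, then across line positions.
    return line_sep.join(word_sep.join(parts) for parts in per_line)


first_row = ("ABC1\nABC2\nABC3", "A\nB\nC", "YES\nNO\nYES")
print(repr(combine_lines(first_row)))
# 'ABC1 A YES\n\nABC2 B NO\n\nABC3 C YES'
```

With the df and cols from the question, the helper could be applied per row as [combine_lines(x) for x in zip(*(df[c] for c in cols))].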

B) Original Solution

> Providing a breakdown of the dataframe operations. *Note: applymap is a simple way of applying a function to every element (or cell) of a dataframe and is shown here to illustrate the intermediate states of the dataframe, but there is usually a faster approach. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)

Full code, computationally more efficient variant:

> However, solution "A" above will be faster and is recommended. Using the transpositions as shown here is not the most efficient approach.

df = pd.DataFrame(data=d)

df2 = df.T.apply(
    lambda x: list(
        map(" ".join, list(zip(*[y.split("\n") for y in x.values])))
    )
).T

cols = df2.columns
df['col4'] = df2[cols].astype(str).apply('\n'.join, axis=1)

Full code for the solution as used in the steps explained below:

df2 = df.applymap(lambda s: s.split('\n'))

df3 = df2.T.apply(lambda x: list(zip(*x.values)))

df3 = df3.T.apply(lambda x: x.str.join(' '))

cols = df3.columns
df['col4'] = df3[cols].astype(str).apply('\n'.join, axis=1)

Breakdown of df operations

1) Use applymap* to split each cell in columns 1-3 into a list of lines

import pandas as pd

d = {
    "col1": [
        "ABC1" + "\n" + "ABC2" + "\n" + "ABC3",
        "BBC1" + "\n" + "BBC2" + "\n" + "BBC3",
    ],
    "col2": ["A" + "\n" + "B" + "\n" + "C", "A" + "\n" + "B" + "\n" + "C"],
    "col3": [
        "YES" + "\n" + "NO" + "\n" + "YES",
        "NO" + "\n" + "NO" + "\n" + "YES",
    ],
}

df = pd.DataFrame(data=d)

df2 = df.applymap(lambda s: s.split("\n"))
df2


2) Then, apply zip to join the corresponding elements of the lists from each column

df3 = df2.T.apply(lambda x: list(zip(*x.values)))

df3 = df3.T.apply(lambda x: x.str.join(" "))
df3


3) Lastly, join the zipped, multi-column lists into newline-delimited strings


cols = df3.columns
df["col4"] = df3[cols].astype(str).apply("\n".join, axis=1)

df

which gives the final solution: col4 holds one output line per original line position, for example "ABC1 A YES\nABC2 B NO\nABC3 C YES" for the first row.
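
Because the screenshots of the intermediate frames are not reproduced here, the following sketch prints equivalent values for the three steps. It is a variant under stated assumptions rather than the exact code above: it uses Series.str.split per column instead of applymap (which newer pandas releases deprecate) and a row-wise apply instead of transposing.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
})
cols = ["col1", "col2", "col3"]

# Step 1: every cell becomes a list of its lines.
step1 = df[cols].apply(lambda s: s.str.split("\n"))
print(step1.loc[0, "col1"])  # ['ABC1', 'ABC2', 'ABC3']

# Step 2: per row, zip the lists and join each tuple with a space.
step2 = step1.apply(lambda row: [" ".join(t) for t in zip(*row)], axis=1)
print(step2[0])  # ['ABC1 A YES', 'ABC2 B NO', 'ABC3 C YES']

# Step 3: join the per-line strings with a newline.
df["col4"] = step2.str.join("\n")
print(repr(df.loc[0, "col4"]))  # 'ABC1 A YES\nABC2 B NO\nABC3 C YES'
```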

Answer 2

Score: 2


Using a list comprehension with zip:


df['out'] = [list(map(' '.join, zip(*(y.splitlines() for y in x))))
             for x in zip(df['col1'], df['col2'], df['col3'])]

               col1     col2          col3                                  out
0  ABC1\nABC2\nABC3  A\nB\nC  YES\nNO\nYES  [ABC1 A YES, ABC2 B NO, ABC3 C YES]
1  BBC1\nBBC2\nBBC3  A\nB\nC   NO\nNO\nYES   [BBC1 A NO, BBC2 B NO, BBC3 C YES]

And for a double line break:


df['out'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
             for x in zip(df['col1'], df['col2'], df['col3'])]

               col1     col2          col3                                    out
0  ABC1\nABC2\nABC3  A\nB\nC  YES\nNO\nYES  ABC1 A YES\n\nABC2 B NO\n\nABC3 C YES
1  BBC1\nBBC2\nBBC3  A\nB\nC   NO\nNO\nYES   BBC1 A NO\n\nBBC2 B NO\n\nBBC3 C YES

generalization to an arbitrary number of columns

cols = ['col1', 'col2', 'col3']
# for all columns
# cols = list(df)

df['DesiredCol'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
                    for x in zip(*(df[c] for c in cols))]
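
One caveat worth knowing: zip stops at the shortest input, so if one cell has fewer lines than the others, the extra lines of the longer cells are silently dropped. If that can happen in the data, a possible variant pads the short cells with empty strings via itertools.zip_longest. This is an assumption about how ragged cells should be handled, not part of the code above, and the helper name is hypothetical.

```python
from itertools import zip_longest


def combine_lines_padded(cells, word_sep=" ", line_sep="\n\n"):
    # Pad missing lines with "" so no line from a longer cell is lost.
    rows = zip_longest(*(cell.splitlines() for cell in cells), fillvalue="")
    # Skip empty pieces when joining so no stray separators appear.
    return line_sep.join(word_sep.join(p for p in parts if p) for parts in rows)


ragged = ("ABC1\nABC2\nABC3", "A\nB", "YES")
print(repr(combine_lines_padded(ragged)))
# 'ABC1 A YES\n\nABC2 B\n\nABC3'
```

Note that filtering out empty pieces also drops lines that were genuinely empty in a cell, which may or may not be what is wanted.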

comparison of answers

difference in behavior
  • John's and Mark's approaches use all columns, so if you repeat the operation, the previous output column is included in the input. To avoid this in the timing, I added to every approach a step that checks for and removes the output column before running the code. I also added an alternative approach to my answer to handle an arbitrary number of columns.
  • Pravash's approach uses the same string to join the string chunks and the joined substrings; I ignored this for the timing and considered it a valid output.

timings for a varying number of rows

John's code is likely slower due to the use of applymap and of pandas to perform the transposition operations (constructing the intermediate DataFrames is expensive). The bottlenecks in Mark's approach are the use of explode and groupby.
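
The timing plot is not reproduced here. As a rough way to run such a comparison yourself, the sketch below times the list comprehension against a plain row-wise apply. It does not reproduce the other answers' exact code; the frame size, the number of runs, and the function names are arbitrary choices, and the numbers will depend on the machine and pandas version.

```python
import timeit

import pandas as pd

cols = ["col1", "col2", "col3"]
base = pd.DataFrame({
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
})
# Repeat the two sample rows to get a larger frame (the size is arbitrary).
big = pd.concat([base] * 500, ignore_index=True)


def with_list_comprehension(df):
    return ["\n\n".join(map(" ".join, zip(*(y.splitlines() for y in x))))
            for x in zip(*(df[c] for c in cols))]


def with_row_apply(df):
    return df[cols].apply(
        lambda row: "\n\n".join(map(" ".join, zip(*(y.splitlines() for y in row)))),
        axis=1,
    ).tolist()


# Both functions should agree before their timings are compared.
same = with_list_comprehension(big) == with_row_apply(big)
print(same)

for func in (with_list_comprehension, with_row_apply):
    seconds = timeit.timeit(lambda: func(big), number=5)
    print(f"{func.__name__}: {seconds:.4f} s for 5 runs")
```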


Answer 3

Score: 1


You can try the code below.

I used zip and split('\n') to build one tuple per line position from the three columns of each row.

df["DesiredCol"] = df.apply(lambda row: " ".join(value for col in zip(row["col1"].split("\n"), row["col2"].split("\n"), row["col3"].split("\n")) for value in col), axis=1)
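
Note that this joins every piece with the same single space, so each row comes out as one flat line. The sketch below shows that behavior on one row in plain Python, together with a variant that keeps the lines separate; using "\n" as the outer separator is an assumption about the desired layout, and the variable names are illustrative.

```python
row = {"col1": "ABC1\nABC2\nABC3", "col2": "A\nB\nC", "col3": "YES\nNO\nYES"}
groups = list(zip(row["col1"].split("\n"),
                  row["col2"].split("\n"),
                  row["col3"].split("\n")))

# As in the code above: one separator for everything gives a single flat line.
flat = " ".join(value for group in groups for value in group)
print(repr(flat))  # 'ABC1 A YES ABC2 B NO ABC3 C YES'

# Variant: a space within each group, a newline between groups.
layered = "\n".join(" ".join(group) for group in groups)
print(repr(layered))  # 'ABC1 A YES\nABC2 B NO\nABC3 C YES'
```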

Answer 4

Score: 1


Here's another answer

o = df.apply(lambda x: x.str.split('\n').explode()).apply(lambda x: ' '.join(x), axis=1)
df['new col'] = o.groupby(o.index).apply(lambda x: '\n'.join(x)).reset_index(drop=True)
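
To see what the explode and groupby steps are doing, here is a step-by-step sketch of the same idea. It is a variant, not the exact code above: it restricts the operation to a hypothetical cols list and explodes all columns in one call, which assumes pandas 1.3 or later and cells with matching line counts.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
})
cols = ["col1", "col2", "col3"]

# Split each cell into a list of lines; the frame keeps its shape.
lists = df[cols].apply(lambda s: s.str.split("\n"))

# Explode all columns together: one output row per line, original index repeated.
long = lists.explode(cols)
print(long.index.tolist())  # [0, 0, 0, 1, 1, 1]

# Join across columns within each exploded row.
joined = long.apply(" ".join, axis=1)

# Collapse the repeated index labels back into one string per original row.
result = joined.groupby(level=0).agg("\n".join)
print(repr(result[0]))  # 'ABC1 A YES\nABC2 B NO\nABC3 C YES'
```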

huangapple
  • Posted on July 18, 2023, 13:49:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76709818.html