2023-07-18 13:49:04

How to concatenate contents with line breaks in a DataFrame

Question

I have three columns (col1, col2, col3) containing strings with line breaks. I want to concatenate col1, col2, and col3 into a new column (DesiredCol). Here is the code and dataset:

import pandas as pd
d = {'col1': ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"], 'col2': ["A\nB\nC", "A\nB\nC"], 'col3': ["YES\nNO\nYES", "NO\nNO\nYES"]}
df = pd.DataFrame(data=d)

I attempted to use a lambda function as follows, but it didn't give the desired column:

cols=['col1','col2','col3']
df['DesiredCol'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
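
To make the problem concrete, here is a small sketch of what the attempt above produces for the first row. It only reruns the code from the question on the sample data; the closing comment about the wanted layout is an assumption inferred from the answers, since the expected output is not reproduced on this page.

```python
import pandas as pd

d = {
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
}
df = pd.DataFrame(data=d)
cols = ["col1", "col2", "col3"]

# The attempt joins whole cells, so each cell keeps its own line breaks
# and the columns end up one after another instead of side by side.
attempt = df[cols].apply(lambda row: " ".join(row.values.astype(str)), axis=1)
print(repr(attempt.iloc[0]))

# Assumed target (inferred from the answers): the first lines of all three
# columns joined together, then the second lines, and so on.
```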

Answer 1

Score: 2

Here is one possible solution. The key is to use the built-in `zip` function to join the corresponding elements, then apply further methods on the dataframe to insert the desired newline characters.

A) Edit: Fastest Solution

> (Thanks to mozway.)
> For an understanding of what this code achieves, and how each operation changes the data step by step into the desired column output, see solution "B" below. However, for large dataframes, it is better to avoid the use of applymap and instead combine map and zip in a list comprehension as shown here.

df['DesiredCol'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
                    for x in zip(*(df[c] for c in cols))]
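
For anyone who wants to run this end to end, a self-contained sketch follows. It assumes `cols` is the list defined in the question, and it unrolls the comprehension into a loop purely for readability; the loop form is an illustration, not the code posted above.

```python
import pandas as pd

d = {
    "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
    "col2": ["A\nB\nC", "A\nB\nC"],
    "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
}
df = pd.DataFrame(data=d)
cols = ["col1", "col2", "col3"]  # assumed: the list from the question

result = []
for cells in zip(*(df[c] for c in cols)):          # one tuple of cells per row
    lines = (cell.splitlines() for cell in cells)  # split every cell into lines
    merged = map(" ".join, zip(*lines))            # pair up the i-th lines across columns
    result.append("\n\n".join(merged))             # double line break, as in solution A

df["DesiredCol"] = result
print(df["DesiredCol"].iloc[0])
```

Swapping `"\n\n"` for `"\n"` gives a single line break between the joined lines.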

B) Original Solution

> Providing a breakdown of the dataframe operations. *Note: `applymap` is a simple way of applying a function to every element (or cell) of a dataframe and is shown here to illustrate the intermediate states of the dataframe, but there is usually a faster approach. In pandas 2.1 and later, `applymap` is deprecated in favor of `DataFrame.map`.

Full code:

Computationally more efficient solution:

> However, solution "A" above will be faster and is recommended. Using the transpositions as shown here is not the most efficient approach.

df = pd.DataFrame(data=d)
df2 = df.T.apply(
    lambda x: list(
        map(" ".join, list(zip(*[y.split("\n") for y in x.values])))
    )
).T
cols = df2.columns
df['col4'] = df2[cols].astype(str).apply('\n'.join, axis=1)

Full code for solution as used in explained steps below:

df2 = df.applymap(lambda s: s.split('\n'))
df3 = df2.T.apply(lambda x: list(zip(*x.values)))
df3 = df3.T.apply(lambda x: x.str.join(' '))
cols = df3.columns
df['col4'] = df3[cols].astype(str).apply('\n'.join, axis=1)

Breakdown of df operations

1) Use `applymap`* to split each cell in cols 1-3 on its line breaks

import pandas as pd
d = {
    "col1": [
        "ABC1" + "\n" + "ABC2" + "\n" + "ABC3",
        "BBC1" + "\n" + "BBC2" + "\n" + "BBC3",
    ],
    "col2": ["A" + "\n" + "B" + "\n" + "C", "A" + "\n" + "B" + "\n" + "C"],
    "col3": [
        "YES" + "\n" + "NO" + "\n" + "YES",
        "NO" + "\n" + "NO" + "\n" + "YES",
    ],
}
df = pd.DataFrame(data=d)
df2 = df.applymap(lambda s: s.split("\n"))
df2
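
Because `applymap` is deprecated in newer pandas releases, here is a hedged sketch of step 1 that should run on both older and newer versions. The name `elementwise` is only a local variable introduced for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
        "col2": ["A\nB\nC", "A\nB\nC"],
        "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
    }
)

# DataFrame.map is the newer name (pandas 2.1+); fall back to applymap otherwise.
elementwise = df.map if hasattr(df, "map") else df.applymap

# Every cell becomes a list of its lines.
df2 = elementwise(lambda s: s.split("\n"))
print(df2.loc[0, "col1"])
```

After this step each cell holds a list, which is the state that step 2 works on.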

2) Then, apply `zip` to respectively join the elements of the lists from the columns

df3 = df2.T.apply(lambda x: list(zip(*x.values)))
df3 = df3.T.apply(lambda x: x.str.join(" "))
df3
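
To see what the `zip` call contributes without the DataFrame machinery, here is a plain-Python sketch of the same idea for a single row. The variable names are illustrative only.

```python
# One row of the sample data, with each cell already split into lines.
row = [
    ["ABC1", "ABC2", "ABC3"],  # col1
    ["A", "B", "C"],           # col2
    ["YES", "NO", "YES"],      # col3
]

# zip(*row) regroups the lists so that the i-th items sit together.
paired = list(zip(*row))

# Joining each tuple with a space gives one output line per original line.
joined = [" ".join(parts) for parts in paired]
print(joined)
```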

3) And lastly join the zipped new multi-column-concatenated lists into newline character-delimited strings


cols = df3.columns
df["col4"] = df3[cols].astype(str).apply("\n".join, axis=1)
df

This gives the final result in the new `col4` column.

Answer 2

Score: 2

Using a list comprehension with zip:

df['out'] = [list(map(' '.join, zip(*(y.splitlines() for y in x))))
             for x in zip(df['col1'], df['col2'], df['col3'])]

               col1     col2          col3                                  out
0  ABC1\nABC2\nABC3  A\nB\nC  YES\nNO\nYES  [ABC1 A YES, ABC2 B NO, ABC3 C YES]
1  BBC1\nBBC2\nBBC3  A\nB\nC   NO\nNO\nYES   [BBC1 A NO, BBC2 B NO, BBC3 C YES]

And for a double line break:

df['out'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
             for x in zip(df['col1'], df['col2'], df['col3'])]

               col1     col2          col3                                    out
0  ABC1\nABC2\nABC3  A\nB\nC  YES\nNO\nYES  ABC1 A YES\n\nABC2 B NO\n\nABC3 C YES
1  BBC1\nBBC2\nBBC3  A\nB\nC   NO\nNO\nYES   BBC1 A NO\n\nBBC2 B NO\n\nBBC3 C YES

generalization to an arbitrary number of columns

cols = ['col1', 'col2', 'col3']
# for all columns
# cols = list(df)
df['DesiredCol'] = ['\n\n'.join(map(' '.join, zip(*(y.splitlines() for y in x))))
                    for x in zip(*(df[c] for c in cols))]

comparison of answers

difference in behavior

John's approach (the `applymap`/transposition answer) and Mark's approach (the `explode`/`groupby` answer) use all columns, so if you repeat the operation, the previous output column is included in the input. To avoid this in the timing, I added to all approaches a step that checks for and removes the output column before running the code. I also added an alternative approach in my answer to handle an arbitrary number of columns.
Pravash's approach uses the same string to join the string chunks and the joined substrings; I ignored this for the timing and treated it as valid output.

timings for a varying number of rows

John's code is likely slower because it uses `applymap` and relies on pandas for the transposition operations (constructing the intermediate DataFrames is expensive). The bottlenecks in Mark's approach are `explode` and `groupby`.
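
One caveat with any `zip`-based version: `zip` stops at the shortest input, so if one cell has fewer lines than its neighbours, the extra lines are silently dropped. Below is a minimal sketch of one way to handle that, assuming every cell is a string and that missing lines should simply be skipped. `concat_linewise` is a hypothetical helper name, not part of pandas or of any answer here.

```python
from itertools import zip_longest

import pandas as pd


def concat_linewise(df, cols, sep=" ", line_sep="\n"):
    """Join the i-th lines of the given columns, row by row (hypothetical helper)."""
    out = []
    for cells in zip(*(df[c] for c in cols)):
        lines = [cell.splitlines() for cell in cells]
        # zip_longest pads shorter cells with "" instead of dropping lines;
        # the empty padding is filtered out before joining.
        merged = (
            sep.join(part for part in parts if part)
            for parts in zip_longest(*lines, fillvalue="")
        )
        out.append(line_sep.join(merged))
    return out


# Column "b" has one line fewer than column "a".
ragged = pd.DataFrame({"a": ["A\nB"], "b": ["1"]})
ragged["joined"] = concat_linewise(ragged, ["a", "b"])
print(repr(ragged["joined"].iloc[0]))
```

When all cells have the same number of lines, this behaves like the plain `zip` version; passing `line_sep="\n\n"` reproduces the double line break.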

Answer 3

Score: 1

You can try the code below.

I used zip and split('\n') to build one tuple per line from the three columns in each row.

df["DesiredCol"] = df.apply(lambda row: " ".join(value for col in zip(row["col1"].split("\n"), row["col2"].split("\n"), row["col3"].split("\n")) for value in col), axis=1)
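
As written, this joins every piece with a single space, so the line structure is flattened. If the line breaks should be kept, one possible variation is sketched below next to the original behavior. The `join_row` function and its `cols` parameter are hypothetical names introduced for this illustration, not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
        "col2": ["A\nB\nC", "A\nB\nC"],
        "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
    }
)


def join_row(row, cols=("col1", "col2", "col3")):
    """Join the i-th lines with a space and the resulting lines with a newline."""
    parts = zip(*(row[c].split("\n") for c in cols))
    return "\n".join(" ".join(p) for p in parts)


# Behavior of the code above: everything ends up on one line.
flat = df.apply(
    lambda row: " ".join(
        value
        for col in zip(row["col1"].split("\n"), row["col2"].split("\n"), row["col3"].split("\n"))
        for value in col
    ),
    axis=1,
)

# Variation that keeps one output line per input line.
kept = df.apply(join_row, axis=1)
print(repr(flat.iloc[0]))
print(repr(kept.iloc[0]))
```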

Answer 4

Score: 1

Here's another answer:

o = df.apply(lambda x: x.str.split('\n').explode()).apply(lambda x: ' '.join(x), axis=1)
df['new col'] = o.groupby(o.index).apply(lambda x: '\n'.join(x)).reset_index(drop=True)
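
Since this code operates on every column of `df`, running it a second time would feed `new col` back in as input. A small sketch of one way around that, assuming the input columns are listed explicitly in a `cols` list (a name borrowed from the question, not from this answer):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["ABC1\nABC2\nABC3", "BBC1\nBBC2\nBBC3"],
        "col2": ["A\nB\nC", "A\nB\nC"],
        "col3": ["YES\nNO\nYES", "NO\nNO\nYES"],
    }
)
cols = ["col1", "col2", "col3"]  # explicit selection, so reruns ignore 'new col'

for _ in range(2):  # run twice to show that the result does not change
    o = (
        df[cols]
        .apply(lambda x: x.str.split("\n").explode())  # one row per line, index repeated
        .apply(lambda x: " ".join(x), axis=1)          # join the columns line by line
    )
    # Regroup the lines that came from the same original row.
    df["new col"] = o.groupby(o.index).apply(lambda x: "\n".join(x)).reset_index(drop=True)

print(repr(df["new col"].iloc[0]))
```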
