March 7, 2023, 07:20:24

Compare pandas dataframe columns ignoring number in front of text

Question

I am creating two DataFrames, concatenating them, swapping the column levels so that matching column names sit next to each other, and then comparing the columns. The values in my columns start with a number in a format like 1-1, followed by text. I would like to compare the text and ignore the number before it.

Example of what I have so far:

import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']})
    data2 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']})
    df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
    df_both = df_both.swaplevel(axis='columns')
    for column in df_both.columns.levels[0]:
        print(df_both.index[np.where(df_both[column]['Data1'] != df_both[column]['Data2'])])

Result: Int64Index([2, 3, 4], dtype='int64', name='ID')

Currently, IDs 2, 3, and 4 are returned because the full strings are not equal. I would like to find a way to ignore the number before the text (e.g. 1-1, 1-2, 1-3, 1-4) and compare only what follows it.

Answer 1

Score: 0

To ignore the number before the text, split each value on the first space character and keep only the second part. The str.split() method of a pandas Series does the split, and the str accessor then selects the second element of each resulting list. You can try this:

import pandas as pd

if __name__ == "__main__":
    data1 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']
    })
    data2 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']
    })
    df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
    df_both = df_both.swaplevel(axis='columns')
    for column in df_both.columns.levels[0]:
        # Split each value once, on the first space, and keep only the part after the number
        data1_col = df_both[column]['Data1'].str.split(' ', n=1).str[1]
        data2_col = df_both[column]['Data2'].str.split(' ', n=1).str[1]
        # Select the IDs whose remaining text differs
        mismatch = (data1_col != data2_col).to_numpy()
        print(df_both.index[mismatch])
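To see what the split step does in isolation, here is a minimal sketch on a standalone Series. The names `s`, `stripped`, and `stripped_re` are illustrative and not from the original post, and the regex variant assumes the prefix always has the form digits, hyphen, digits, followed by whitespace.

```python
import pandas as pd

# Illustrative sample values; the third one has a multi-digit prefix.
s = pd.Series(['1-1 Text here1', '1-4 Text here', '12-3 Other text'])

# Split once on the first space; element 0 is the prefix, element 1 is the rest.
stripped = s.str.split(' ', n=1).str[1]
print(stripped.tolist())

# Alternative (assumption: the prefix always looks like "<digits>-<digits> ").
stripped_re = s.str.replace(r'^\d+-\d+\s+', '', regex=True)
print(stripped_re.tolist())
```

Both approaches give the same result for these values. The split version yields NaN for a value that contains no space, while the regex version leaves such a value unchanged, so the choice depends on how malformed rows should be treated.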

Answer 2

Score: 0

import pandas as pd

data1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']})
data2 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']})
df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
df_both = df_both.swaplevel(axis='columns')
# Group 0 captures the number prefix, group 1 the text after it; keep only group 1
s1 = df_both.loc[:, ('Text', 'Data1')].str.extract(r'(\d+-\d+)\s*(.+)')[1]
s2 = df_both.loc[:, ('Text', 'Data2')].str.extract(r'(\d+-\d+)\s*(.+)')[1]
# True where the text after the prefix is identical in both frames
m = s1.eq(s2)
print(df_both[m])

              Text                
             Data1           Data2
ID                                
1   1-1 Text here1  1-1 Text here1
3   1-3 Text here3  1-5 Text here3
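The code in this answer prints the rows whose text matches, whereas the question asked for the IDs that differ. A minimal sketch of that inverse follows, assuming the same sample data. The helper `strip_prefix` and the names `mismatch` and `differing_ids` are hypothetical, introduced only for illustration.

```python
import pandas as pd

data1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']})
data2 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']})
df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
df_both = df_both.swaplevel(axis='columns')


def strip_prefix(series):
    """Hypothetical helper: return the text after a '<digits>-<digits>' prefix."""
    # expand=False returns a Series when the pattern has a single capture group.
    return series.str.extract(r'^\d+-\d+\s*(.+)$', expand=False)


s1 = strip_prefix(df_both[('Text', 'Data1')])
s2 = strip_prefix(df_both[('Text', 'Data2')])

# ne() is the element-wise "not equal" counterpart of eq().
mismatch = s1.ne(s2)
differing_ids = df_both.index[mismatch.to_numpy()].tolist()
print(differing_ids)  # [2, 4]
```

A value that does not match the pattern becomes NaN, and NaN never compares equal, so such rows would be reported as differing.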
