April 4, 2023, 18:03:52

How to clean a column with names inserted with different formats (separated by commas, by dots, etc.)

Question

Imagine you have a dataset with two columns, an ID and a name, but the name column was entered manually and contains names typed in different formats. Some are separated by dots instead of spaces. Others put the surname first, then a comma, then the first name. Some rows have middle names or even titles.

ID	Name
1	Ellie Joella
2	Antonio.Chaz
3	Dr. Ian Coretta
4	Doe, John
5	Marie.Eliza.Grey
6	Mason, Lary O
7	Winfred, Mr. Barry
8	Andrea.T.B.Shaw

How would you clean this column so that the result looks like: <title (if present)> <first name> <middle names (if present)> <surname>?

ID	Name
1	Ellie Joella
2	Antonio Chaz
3	Dr. Ian Coretta
4	John Doe
5	Marie Eliza Grey
6	Lary O Mason
7	Mr. Barry Winfred
8	Andrea T B Shaw

Thank you!

Answer 1

Score: 1

You can try replace with a dict of regexes:

df['Name'] = df['Name'].replace({r'(?<!Dr|Mr)(\.\s*)': ' ', r'([^,]+)\s*,\s*(.*)': r'\2 \1'}, regex=True)
print(df)

# Output
   ID               Name
0   1       Ellie Joella
1   2       Antonio Chaz
2   3    Dr. Ian Coretta
3   4           John Doe
4   5   Marie Eliza Grey
5   6       Lary O Mason
6   7  Mr. Barry Winfred
7   8    Andrea T B Shaw
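
The same two substitutions can also be tried outside pandas. The following is a minimal sketch using only the standard-library re module; the clean_name helper and the raw_names list are illustrative names that do not appear in the answer, and it assumes the only titles in the data are Dr and Mr.

```python
import re

# Dots become spaces unless they directly follow a title. Python's look-behind
# needs alternatives of equal width, so "Dr|Mr" works, but adding "Mrs" would
# require a separate look-behind (assumption: only these two titles occur).
DOT = re.compile(r'(?<!Dr|Mr)\.\s*')
# "Surname, Rest": capture both sides of the comma so they can be swapped.
COMMA = re.compile(r'([^,]+)\s*,\s*(.*)')


def clean_name(name):
    """Hypothetical helper: apply the two substitutions in order."""
    name = DOT.sub(' ', name)
    return COMMA.sub(r'\2 \1', name).strip()


raw_names = ['Ellie Joella', 'Antonio.Chaz', 'Dr. Ian Coretta', 'Doe, John',
             'Marie.Eliza.Grey', 'Mason, Lary O', 'Winfred, Mr. Barry',
             'Andrea.T.B.Shaw']
cleaned = [clean_name(n) for n in raw_names]
print(cleaned)
```

The order matters here: the dots are replaced first, so the comma pattern only has to deal with space-separated words when it swaps the two halves.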

Answer 2

Score: 0

I think you have to use the split method, and you will need several splits depending on what the text contains. I would first check whether the text contains spaces and, if so, split the string on them.
You would also need a list of exceptions (Mr., Dr., etc.), and if the first string after splitting is one of them, merge it with the second one.

You could start with a function like this:

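def person(data):
    # Titles that should stay attached to the name that follows them.
    titles = ('dr.', 'mr.')
    # Split on spaces when the text contains any; otherwise split on dots.
    if ' ' in data:
        pers = data.split(' ')
    else:
        pers = data.split('.')
    # Merge a leading title with the word after it.
    if len(pers) > 1 and pers[0].lower() in titles:
        pers[:2] = [' '.join(pers[:2])]
    return pers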

Then add whatever further checks you need.
I don't know how to tell whether a given word is a first name or a surname, though.
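
The splitting idea can be carried through to a complete cleaning function. What follows is only a sketch under stated assumptions: a comma always means "surname first", titles come from a small known list, and the names normalize_name and TITLES are hypothetical. It does not try to decide whether a word is a first name or a surname; it relies on the comma convention alone.

```python
# Assumed title list; extend it to match the data.
TITLES = {'dr.', 'mr.', 'mrs.', 'ms.'}


def normalize_name(raw):
    """Hypothetical helper: return '<title> <first> <middle> <surname>'."""
    surname = ''
    # "Surname, Rest": remember the surname and keep working on the rest.
    if ',' in raw:
        surname, raw = [part.strip() for part in raw.split(',', 1)]
    # Split on spaces when there are any; otherwise the words are dot-separated.
    words = raw.split() if ' ' in raw else raw.split('.')
    words = [w for w in words if w]
    # A dot-separated title loses its dot during the split, so restore it.
    if words and words[0].lower() + '.' in TITLES:
        words[0] += '.'
    if surname:
        words.append(surname)
    return ' '.join(words)


for raw in ['Doe, John', 'Winfred, Mr. Barry', 'Andrea.T.B.Shaw']:
    print(normalize_name(raw))
```

Compared with the regex approach, this version is longer but easier to extend, for example by adding more titles to TITLES.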
