How to clean a column of names entered in different formats (separated by commas, by dots, etc.)

Question

Imagine you have a dataset with two columns, an ID and a name, but the name column was entered manually, so the names appear in different formats. Some use dots instead of spaces as separators. Others put the surname first, followed by a comma and then the first name. Some rows also include middle names or titles.

ID Name
1 Ellie Joella
2 Antonio.Chaz
3 Dr. Ian Coretta
4 Doe, John
5 Marie.Eliza.Grey
6 Mason, Lary O
7 Winfred, Mr. Barry
8 Andrea.T.B.Shaw

How would you clean this column so that each value looks like: <title (if present)> <first name> <middle names (if present)> <surname>?

ID Name
1 Ellie Joella
2 Antonio Chaz
3 Dr. Ian Coretta
4 John Doe
5 Marie Eliza Grey
6 Lary O Mason
7 Mr. Barry Winfred
8 Andrea T B Shaw

Thank you!

Answer 1

Score: 1

You can try `replace` with a dict of regular expressions. The first pattern turns every dot into a space unless it follows "Dr" or "Mr"; the second moves the text before a comma to the end:

df['Name'] = df['Name'].replace({r'(?<!Dr|Mr)(\.\s*)': ' ', r'([^,]+)\s*,\s*(.*)': r'\2 \1'}, regex=True)
print(df)

# Output
   ID               Name
0   1       Ellie Joella
1   2       Antonio Chaz
2   3    Dr. Ian Coretta
3   4           John Doe
4   5   Marie Eliza Grey
5   6       Lary O Mason
6   7  Mr. Barry Winfred
7   8    Andrea T B Shaw
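
To see what each pattern does on its own, here is a minimal sketch that applies the same two substitutions with the standard `re` module instead of pandas. The names `DOT`, `COMMA` and `clean_name` are made up for this illustration, and the list of titles is assumed to be just "Dr" and "Mr", as in the sample data.

```python
import re

# Sample values copied from the question.
names = [
    "Ellie Joella",
    "Antonio.Chaz",
    "Dr. Ian Coretta",
    "Doe, John",
    "Marie.Eliza.Grey",
    "Mason, Lary O",
    "Winfred, Mr. Barry",
    "Andrea.T.B.Shaw",
]

# A dot (plus any following spaces) that is NOT preceded by "Dr" or "Mr".
DOT = re.compile(r"(?<!Dr|Mr)\.\s*")
# "surname, rest": group 1 is the text before the comma, group 2 the text after it.
COMMA = re.compile(r"([^,]+)\s*,\s*(.*)")


def clean_name(name):
    """Apply the dot rule first, then the comma rule (hypothetical helper)."""
    name = DOT.sub(" ", name)
    # Swap the two groups so the surname ends up last.
    return COMMA.sub(r"\2 \1", name)


cleaned = [clean_name(n) for n in names]
for before, after in zip(names, cleaned):
    print(f"{before!r} -> {after!r}")
```

Other titles (Mrs, Prof, and so on) are not protected by this lookbehind. Python's `re` needs every alternative in a lookbehind to have the same width, so longer titles would need a separate lookbehind each or a different approach.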

Answer 2

Score: 0

I think you need the split method, applied in different ways depending on what the text contains. I would first check whether the text contains spaces and, if so, split the string on them; otherwise split it on dots.
You would also need a list of exceptions (Mr., Dr., etc.); if the first piece after splitting is one of them, merge it with the second piece.

You could start with a function like this:

def person(data):
    # Titles to treat as exceptions (extend as needed); not used yet.
    titles = ('dr.', 'mr.')
    # Split on spaces when the text contains any, otherwise on dots.
    if ' ' in data:
        pers = data.split(' ')
    else:
        pers = data.split('.')
    return pers

and then add whatever further checks you need.
But I don't know how to tell whether a word is a first name or a surname.
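
One way to fill in the remaining checks is sketched below. It rests on assumptions that go beyond the answer: a comma always means "surname first", any other format already has the surname last, and the title list `TITLES` is a hypothetical one you would extend for real data. The helper `tokenize` is also invented for this example.

```python
# Hypothetical title list (lowercase, without the dot); extend it for your data.
TITLES = ("dr", "mr", "mrs", "ms")


def tokenize(part):
    """Split on whitespace if there is any, otherwise on dots."""
    part = part.strip()
    if " " in part:
        return part.split()
    # Drop empty pieces caused by leading, trailing or doubled dots.
    return [p for p in part.split(".") if p]


def person(data):
    """Return the name as '<title> <first> <middle...> <surname>'."""
    if "," in data:
        # "Surname, rest": move the surname to the end.
        surname, rest = data.split(",", 1)
        tokens = tokenize(rest) + tokenize(surname)
    else:
        tokens = tokenize(data)
    # Restore the dot on a title that lost it when splitting on dots.
    if tokens and tokens[0].lower() in TITLES:
        tokens[0] += "."
    return " ".join(tokens)


for raw in ["Antonio.Chaz", "Doe, John", "Winfred, Mr. Barry", "Andrea.T.B.Shaw"]:
    print(raw, "->", person(raw))
```

With pandas this could be applied row by row with `df['Name'].apply(person)`. The comma is the only signal used to decide which word is the surname, so a value like "John, Doe" typed by mistake would still come out in the wrong order.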



