How to clean a column with names inserted with different formats (separated by commas, by dots, etc.)
Question
Imagine that you have a dataset with two columns, an ID and a name, but the name column was entered manually, so the names appear in different formats. Some use dots instead of spaces as separators. Others put the surname first, then a comma, and then the first name. Some rows also include middle names or titles.
ID | Name |
---|---|
1 | Ellie Joella |
2 | Antonio.Chaz |
3 | Dr. Ian Coretta |
4 | Doe, John |
5 | Marie.Eliza.Grey |
6 | Mason, Lary O |
7 | Winfred, Mr. Barry |
8 | Andrea.T.B.Shaw |
How would you clean this column so that every name ends up in the form `<title (if present)> <first name> <middle names (if present)> <surname>`?
ID | Name |
---|---|
1 | Ellie Joella |
2 | Antonio Chaz |
3 | Dr. Ian Coretta |
4 | John Doe |
5 | Marie Eliza Grey |
6 | Lary O Mason |
7 | Mr. Barry Winfred |
8 | Andrea T B Shaw |
Thank you!
Answer 1
Score: 1
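You can try `replace` with a dict of regexes. The first pattern turns a dot (plus any whitespace after it) into a space unless the dot follows `Dr` or `Mr`; the second swaps the text before and after a comma:
df['Name'] = df['Name'].replace({r'(?<!Dr|Mr)(\.\s*)': ' ', r'([^,]+)\s*,\s*(.*)': r'\2 \1'}, regex=True)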
print(df)
# Output
ID Name
0 1 Ellie Joella
1 2 Antonio Chaz
2 3 Dr. Ian Coretta
3 4 John Doe
4 5 Marie Eliza Grey
5 6 Lary O Mason
6 7 Mr. Barry Winfred
7 8 Andrea T B Shaw
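For anyone who wants to run this end to end, here is a self-contained sketch of the same regex idea, written as two explicit `str.replace` passes so the order of the substitutions is visible. The names `TITLES` and `clean_names` are illustrative and not part of the answer above, and the title list is an assumption that would need extending for real data.

```python
import re
import pandas as pd

# Hypothetical title list; extend for real data (e.g. "Mrs", "Prof").
TITLES = ("Dr", "Mr")

def clean_names(names: pd.Series) -> pd.Series:
    # One lookbehind per title, so titles of different lengths also work.
    guards = "".join(f"(?<!{re.escape(t)})" for t in TITLES)
    # Pass 1: turn dots into spaces unless the dot belongs to a title.
    out = names.str.replace(guards + r"\.\s*", " ", regex=True)
    # Pass 2: "Surname, Rest" becomes "Rest Surname".
    out = out.str.replace(r"^([^,]+?)\s*,\s*(.*)$", r"\2 \1", regex=True)
    return out.str.strip()

df = pd.DataFrame({
    "ID": range(1, 9),
    "Name": ["Ellie Joella", "Antonio.Chaz", "Dr. Ian Coretta", "Doe, John",
             "Marie.Eliza.Grey", "Mason, Lary O", "Winfred, Mr. Barry",
             "Andrea.T.B.Shaw"],
})
df["Name"] = clean_names(df["Name"])
print(df)
```

Building the lookbehinds from a list keeps the pattern valid even when titles differ in length, since Python's `re` rejects a single lookbehind whose alternatives have different widths.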
Answer 2
Score: 0
I think you have to use the `split` method, possibly several times depending on what the text contains. I would first check whether the text contains spaces and, if so, split the string on them; otherwise split it on dots.
You would also need a list of exceptions (Mr., Dr., etc.), and if the first string after splitting is one of them, merge it with the second one.
**You could try a function like this:**
def person(data):
    # Titles that must keep their trailing dot (extend as needed)
    titles = ('dr.', 'mr.')
    # Split on spaces when the text contains any; otherwise split on dots
    if ' ' in data:
        pers = data.split(' ')
    else:
        pers = data.split('.')
    # Merge a leading title with the word that follows it
    if len(pers) > 1 and pers[0].lower() in titles:
        pers[:2] = [pers[0] + ' ' + pers[1]]
    return pers
Then add whatever further checks you need.
However, I don't know how to tell whether a given word is a first name or a surname.
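The split-based idea can be taken a step further without knowing which word is a first name, because in the sample data only a comma signals that the order is reversed. A minimal sketch, assuming that everything before a comma is the surname and that titles are limited to the hypothetical `TITLES` set; `normalize_name` is an illustrative name:

```python
TITLES = {"dr.", "mr."}  # assumed list; extend for real data

def normalize_name(raw: str) -> str:
    surname = ""
    # "Surname, Rest" form: remember the surname and keep working on the rest.
    if "," in raw:
        surname, raw = [part.strip() for part in raw.split(",", 1)]
    # Set a leading title aside so its dot survives the dot handling below.
    words = raw.split()
    title = ""
    if words and words[0].lower() in TITLES:
        title, words = words[0], words[1:]
    # Treat the remaining dots as separators, then drop empty pieces.
    parts = " ".join(words).replace(".", " ").split()
    return " ".join(p for p in [title, *parts, surname] if p)

for raw in ["Antonio.Chaz", "Winfred, Mr. Barry", "Andrea.T.B.Shaw"]:
    print(normalize_name(raw))
```

With pandas this could be applied as `df['Name'] = df['Name'].map(normalize_name)`. A title glued to the name with dots only (for example `Dr.Ian.Coretta`) is not recognized by this sketch and would lose its dot.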