Splitting "Check all that apply" survey column from Google Forms
Question
I have Google Forms data with a column, Reasons. Each cell holds one or more comma-separated reasons, depending on how many options the user checked on the survey. Here it is for 2 rows:
import pandas as pd
filename = 'Example.csv'
df = pd.read_csv(filename)
print(df.to_dict("list"))
Output:
{
'ID': [1, 2],
'Join Date': [
Timestamp('2023-01-01 00:00:00'),
Timestamp('2022-12-01 00:00:00')
],
'Reasons': [
'Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]',
'Career & Growth Opportunities [Learning & Development, Progression], Meaningful work'
]
}
I want it to look like:
{
'ID': [1, 1, 2, 2],
'Join Date': [
Timestamp('2023-01-01 00:00:00'),
Timestamp('2023-01-01 00:00:00'),
Timestamp('2022-12-01 00:00:00'),
Timestamp('2022-12-01 00:00:00')
],
'Reasons': [
'Benefits [Leave, Flexi, Dental, Insurance etc.]',
'Compensation [Salary & Bonus]',
'Career & Growth Opportunities [Learning & Development, Progression]',
'Meaningful work'
]
}
(converted back to a DataFrame).
After importing the data into Python as a DataFrame, how can I split this column and create a duplicate row for each reason a user checked?
I can't simply split on commas because some of the reasons contain commas themselves. Would explode() work?
Hopefully someone can help me.
Answer 1
Score: 0
Here is the regex pattern:
r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)"
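To see what this pattern returns before applying it to a whole column, here is a small sketch that runs it on one of the sample strings from the question. The variable names are only illustrative. The first alternative matches a label followed by a bracketed detail, the second matches a plain label without brackets. Because each alternative is a capture group, re.findall returns one tuple per match with one slot per group.

```python
import re

# Pattern from above: group 1 is "label [bracketed detail]",
# group 2 is a plain label with no brackets.
pattern = r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)"

sample = "Career & Growth Opportunities [Learning & Development, Progression], Meaningful work"

# Each match is a (group1, group2) tuple; the group that did not
# take part in the match is an empty string.
matches = re.findall(pattern, sample)
for group1, group2 in matches:
    print(repr(group1), repr(group2))
```

Note that the second match keeps the space that follows the separating comma, so the values still need stripping afterwards.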
Now we can apply this pattern to the DataFrame:
import re
df["Reasons"]=df["Reasons"].apply(lambda x: re.findall(r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)", x))
'''
| | ID | Join Date | Reasons |
|---:|-----:|:--------------------|:--------------------------------------------------------------------------------------------------------|
| 0 | 1 | 2023-01-01 00:00:00 | [('Benefits [Leave, Flexi, Dental, Insurance etc.]', ''), (' Compensation [Salary & Bonus]', '')] |
| 1 | 2 | 2022-12-01 00:00:00 | [('Career & Growth Opportunities [Learning & Development, Progression]', ''), ('', ' Meaningful work')] |
'''
As you can see, each match is a tuple holding the answer we need plus an empty string (one slot per capture group), and the tuples are collected in a list. Let's get rid of the empty strings:
df = df.explode("Reasons")  # one new row per list element
df["Reasons"] = df["Reasons"].apply(lambda x: [i for i in x if i != ""][0].strip())  # keep the single non-empty string in each tuple and strip the space left after the comma
Out:
| | ID | Join Date | Reasons |
|---:|-----:|:--------------------|:--------------------------------------------------------------------|
| 0 | 1 | 2023-01-01 00:00:00 | Benefits [Leave, Flexi, Dental, Insurance etc.] |
| 0 | 1 | 2023-01-01 00:00:00 | Compensation [Salary & Bonus] |
| 1 | 2 | 2022-12-01 00:00:00 | Career & Growth Opportunities [Learning & Development, Progression] |
| 1 | 2 | 2022-12-01 00:00:00 | Meaningful work |
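The tuple cleanup step can probably be skipped by rewriting the pattern without capture groups, since re.findall then returns plain strings. The sketch below is a variation on the pattern above, not the original answer's code: the bracketed part becomes an optional non-capturing group. The sample DataFrame is rebuilt by hand from the question's output, because Example.csv is not available.

```python
import re
import pandas as pd

# Sample data rebuilt from the question (stand-in for Example.csv).
df = pd.DataFrame({
    "ID": [1, 2],
    "Join Date": pd.to_datetime(["2023-01-01", "2022-12-01"]),
    "Reasons": [
        "Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]",
        "Career & Growth Opportunities [Learning & Development, Progression], Meaningful work",
    ],
})

# A label, optionally followed by one bracketed detail.
# (?:...) does not capture, so findall yields strings, not tuples.
pattern = r"[^,\[\]]+(?:\[[^\[\]]+\])?"

# Build a list of stripped reasons per row, then give each its own row.
df["Reasons"] = df["Reasons"].apply(
    lambda s: [m.strip() for m in re.findall(pattern, s)]
)
df = df.explode("Reasons").reset_index(drop=True)
print(df)
```

This assumes brackets are never nested and that an option never starts with a bracket, which holds for the sample data but may not hold for every form.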
Note:
I'm not good at regex, so I used ChatGPT to find this pattern. You can ask it something like: what is the regex for the xxxxx clause?
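On the question of whether splitting on commas can work at all: it can, as long as the split skips commas that sit inside brackets. The following is a minimal sketch of that alternative, using a negative lookahead with re.split and then explode(). It assumes the brackets are never nested and that only the bracketed parts contain commas. The sample DataFrame is rebuilt by hand from the question's output, and the names outside_brackets and result are just for illustration.

```python
import re
import pandas as pd

# Sample data rebuilt from the question (stand-in for Example.csv).
df = pd.DataFrame({
    "ID": [1, 2],
    "Join Date": pd.to_datetime(["2023-01-01", "2022-12-01"]),
    "Reasons": [
        "Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]",
        "Career & Growth Opportunities [Learning & Development, Progression], Meaningful work",
    ],
})

# Match a comma (plus trailing spaces) only if it is NOT followed by
# a "]" that comes before any other bracket, i.e. only if the comma
# lies outside a bracket pair.
outside_brackets = re.compile(r",\s*(?![^\[\]]*\])")

# Split each cell into a list of reasons, then one row per reason.
df["Reasons"] = df["Reasons"].apply(lambda s: outside_brackets.split(s))
result = df.explode("Reasons").reset_index(drop=True)
print(result.to_dict("list"))
```

So explode() does the row duplication, but only after the column has been turned into lists; the regex is what handles the commas.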