Splitting "Check all that apply" survey column from Google Forms

Question

I have Google Forms data with a column Reasons that holds every option a respondent checked, joined into one string. Here is what it looks like for 2 rows:

import pandas as pd

filename = 'Example.csv'
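# parse_dates makes 'Join Date' load as Timestamps, matching the output below
df = pd.read_csv(filename, parse_dates=['Join Date'])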
print(df.to_dict("list"))

Output:

{
  'ID': [1, 2], 
  'Join Date': [
    Timestamp('2023-01-01 00:00:00'), 
    Timestamp('2022-12-01 00:00:00')
  ], 
  'Reasons': [
    'Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]', 
    'Career & Growth Opportunities [Learning & Development, Progression], Meaningful work'
  ]
}

I want it to look like:

{
  'ID': [1, 1, 2, 2], 
  'Join Date': [
    Timestamp('2023-01-01 00:00:00'), 
    Timestamp('2023-01-01 00:00:00'), 
    Timestamp('2022-12-01 00:00:00'), 
    Timestamp('2022-12-01 00:00:00')
  ], 
  'Reasons': [
    'Benefits [Leave, Flexi, Dental, Insurance etc.]', 
    'Compensation [Salary & Bonus]', 
    'Career & Growth Opportunities [Learning & Development, Progression]',         
    'Meaningful work'
  ]
}

(and then converted back to a DataFrame).

After importing the data into Python as a DataFrame, how can I split this column and create duplicate rows, one for each reason checked by a user?

I can't simply split on commas because some of the reasons contain commas themselves. Would explode() work for this?

Hopefully someone can help me.

Answer 1

Score: 0

Here is the regex pattern:

r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)"

Now we can apply this pattern to the DataFrame column:

import re
# with two capturing groups, re.findall returns a list of 2-tuples per cell
df["Reasons"] = df["Reasons"].apply(lambda x: re.findall(r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)", x))

'''
|    |   ID | Join Date           | Reasons                                                                                                 |
|---:|-----:|:--------------------|:--------------------------------------------------------------------------------------------------------|
|  0 |    1 | 2023-01-01 00:00:00 | [('Benefits [Leave, Flexi, Dental, Insurance etc.]', ''), (' Compensation [Salary & Bonus]', '')]       |
|  1 |    2 | 2022-12-01 00:00:00 | [('Career & Growth Opportunities [Learning & Development, Progression]', ''), ('', ' Meaningful work')] |
'''

As you can see, each match is a tuple that holds the text we need plus an empty string (one slot per capturing group), and the tuples are collected in a list. Let's get rid of the empty strings:

df = df.explode("Reasons")  # one row per tuple in each list
# keep the non-empty element of each tuple and strip the space left after the separating comma
df["Reasons"] = df["Reasons"].apply(lambda x: [i for i in x if i != ""][0].strip())

Out:

|    |   ID | Join Date           | Reasons                                                             |
|---:|-----:|:--------------------|:--------------------------------------------------------------------|
|  0 |    1 | 2023-01-01 00:00:00 | Benefits [Leave, Flexi, Dental, Insurance etc.]                     |
|  0 |    1 | 2023-01-01 00:00:00 | Compensation [Salary & Bonus]                                       |
|  1 |    2 | 2022-12-01 00:00:00 | Career & Growth Opportunities [Learning & Development, Progression] |
|  1 |    2 | 2022-12-01 00:00:00 | Meaningful work                                                     |
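
As a possible alternative, the same result can probably be reached without the tuple cleanup by splitting only on commas that sit outside square brackets and then calling explode(). This is a minimal sketch, not the answer's method: the sample DataFrame is built inline instead of being read from Example.csv, the name `pattern` is just illustrative, and it assumes brackets are never nested and always closed.

```python
import pandas as pd

# Sample data mirroring the question (built inline, not read from Example.csv).
df = pd.DataFrame({
    "ID": [1, 2],
    "Join Date": pd.to_datetime(["2023-01-01", "2022-12-01"]),
    "Reasons": [
        "Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]",
        "Career & Growth Opportunities [Learning & Development, Progression], Meaningful work",
    ],
})

# Match a comma (plus trailing spaces) only when it is NOT followed by a "]"
# before any other bracket, i.e. when the comma is outside a [...] group.
pattern = r",\s*(?![^\[\]]*\])"

# A multi-character pattern is treated as a regular expression by str.split.
result = df.assign(Reasons=df["Reasons"].str.split(pattern)).explode("Reasons")
result["Reasons"] = result["Reasons"].str.strip()
result = result.reset_index(drop=True)

print(result)
```

The negative lookahead is what keeps the commas inside "[Leave, Flexi, Dental, Insurance etc.]" intact; ID and Join Date are repeated automatically by explode().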

Note:
I'm not good at regex, so I used ChatGPT to find this pattern. You can ask it something like: "What is the regex for xxxxx?"
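
For anyone who wants to see what the pattern is doing, here is a commented sketch of an equivalent form. It is a variation rather than the exact pattern above: the bracketed part is made optional inside a non-capturing group, so re.findall returns plain strings instead of tuples. The names `REASON` and `split_reasons` are made up for illustration, and it assumes option labels never contain nested brackets.

```python
import re

# Equivalent idea to the pattern above, written with re.VERBOSE so each
# piece can carry a comment. No capturing groups, so findall returns strings.
REASON = re.compile(r"""
    [^,\[\]]+                # label text: anything except commas and brackets
    (?: \[ [^\[\]]+ \] )?    # optional [detail] part; commas are allowed inside
""", re.VERBOSE)


def split_reasons(text):
    """Return the individual checked options from one Google Forms cell."""
    # Strip the space left after each separating comma and drop blank matches.
    return [m.strip() for m in REASON.findall(text) if m.strip()]


cell = "Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]"
print(split_reasons(cell))
```

With a helper like this, df["Reasons"].apply(split_reasons) followed by df.explode("Reasons") should give one row per reason directly.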

huangapple
  • Posted on May 17, 2023 at 17:59:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76270848.html