Splitting "Check all that apply" survey column from Google Forms

huangapple go评论92阅读模式
英文:

Splitting "Check all that apply" survey column from Google Forms

问题

I have a google forms data with a column Reasons which looks like this for 2 rows depending on how many the user checks on the survey:

  1. import pandas as pd
  2. filename = 'Example.csv'
  3. df = pd.read_csv(filename)
  4. print(df.to_dict("list"))

Output:

  1. {
  2. 'ID': [1, 2],
  3. 'Join Date': [
  4. Timestamp('2023-01-01 00:00:00'),
  5. Timestamp('2022-12-01 00:00:00')
  6. ],
  7. 'Reasons': [
  8. 'Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]',
  9. 'Career & Growth Opportunities [Learning & Development, Progression], Meaningful work'
  10. ]
  11. }

I want it to look like:

  1. {
  2. 'ID': [1, 1, 2, 2],
  3. 'Join Date': [
  4. Timestamp('2023-01-01 00:00:00'),
  5. Timestamp('2023-01-01 00:00:00'),
  6. Timestamp('2022-12-01 00:00:00'),
  7. Timestamp('2022-12-01 00:00:00')
  8. ],
  9. 'Reasons': [
  10. 'Benefits [Leave, Flexi, Dental, Insurance etc.]',
  11. 'Compensation [Salary & Bonus]',
  12. 'Career & Growth Opportunities [Learning & Development, Progression]',
  13. 'Meaningful work'
  14. ]
  15. }

Converted back to a dataframe

After importing the data into python as a dataframe, how can I split this up in python and create duplicate rows for each reason checked by a user?

I can't split it by comma because there are commas in the reasons provided. Will it work using explode()?

Hopefully someone can help me.

英文:

I have a google forms data with a column Reasons which looks like this for 2 rows depending on how many the user checks on the survey:

  1. import pandas as pd
  2. filename = 'Example.csv'
  3. df = pd.read_csv(filename)
  4. print(df.to_dict("list"))

Output:

  1. {
  2. 'ID': [1, 2],
  3. 'Join Date': [
  4. Timestamp('2023-01-01 00:00:00'),
  5. Timestamp('2022-12-01 00:00:00')
  6. ],
  7. 'Reasons': [
  8. 'Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]',
  9. 'Career & Growth Opportunities [Learning & Development, Progression], Meaningful work'
  10. ]
  11. }

I want it to look like:

  1. {
  2. 'ID': [1, 1, 2, 2],
  3. 'Join Date': [
  4. Timestamp('2023-01-01 00:00:00'),
  5. Timestamp('2023-01-01 00:00:00'),
  6. Timestamp('2022-12-01 00:00:00'),
  7. Timestamp('2022-12-01 00:00:00')
  8. ],
  9. 'Reasons': [
  10. 'Benefits [Leave, Flexi, Dental, Insurance etc.]',
  11. 'Compensation [Salary & Bonus]',
  12. 'Career & Growth Opportunities [Learning & Development, Progression]',
  13. 'Meaningful work'
  14. ]
  15. }

Converted back to a dataframe

After importing the data into python as a dataframe, how can i split this up in python and create duplicate rows for each reason checked by a user?

I cant split it by comma because there are commas in the reasons provided. Will it work using explode()?

Hopefully someone can help me.

答案1

得分: 0

这是正则表达式规则部分:

  1. r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)"

这是数据框使用该规则的部分:

  1. import re
  2. df["Reasons"]=df["Reasons"].apply(lambda x: re.findall(r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)", x))

如你所见,空字符串和我们需要的答案存储在元组中,然后放在列表中。让我们去掉空字符串的部分:

  1. df = df.explode("Reasons") # 列表值拆分为新行
  2. df["Reasons"]=df["Reasons"].apply(lambda x: [i for i in x if i != ""][0]) # 如果值不等于空字符串,则将其放入列表中。然后获取列表中的第一个元素。

希望这些部分有所帮助。

英文:

Here is the regex rule:

  1. r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)"

Now, we can use this rule for dataframe:

  1. import re
  2. df["Reasons"]=df["Reasons"].apply(lambda x: re.findall(r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)", x))
  3. '''
  4. | | ID | Join Date | Reasons |
  5. |---:|-----:|:--------------------|:--------------------------------------------------------------------------------------------------------|
  6. | 0 | 1 | 2023-01-01 00:00:00 | [('Benefits [Leave, Flexi, Dental, Insurance etc.]', ''), (' Compensation [Salary & Bonus]', '')] |
  7. | 1 | 2 | 2022-12-01 00:00:00 | [('Career & Growth Opportunities [Learning & Development, Progression]', ''), ('', ' Meaningful work')] |
  8. '''

As you can see, empty strings and the answers we need are stored in tuples and in a list. Let's get rid of empty strings:

  1. df = df.explode("Reasons") #list values to new rows
  2. df["Reasons"]=df["Reasons"].apply(lambda x: [i for i in x if i != ""][0]) # if value not equal empty string put it on a list. And get the first element in that we have a list of one element.

Out:

  1. | | ID | Join Date | Reasons |
  2. |---:|-----:|:--------------------|:--------------------------------------------------------------------|
  3. | 0 | 1 | 2023-01-01 00:00:00 | Benefits [Leave, Flexi, Dental, Insurance etc.] |
  4. | 0 | 1 | 2023-01-01 00:00:00 | Compensation [Salary & Bonus] |
  5. | 1 | 2 | 2022-12-01 00:00:00 | Career & Growth Opportunities [Learning & Development, Progression] |
  6. | 1 | 2 | 2022-12-01 00:00:00 | Meaningful work |

Note:
I'm not good at regex rules. So I used chatgpt to find the regex rule. You can ask like this: what is the regex rule of the xxxxx clause ?

huangapple
  • 本文由 发表于 2023年5月17日 17:59:19
  • 转载请务必保留本文链接:https://go.coder-hub.com/76270848.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定