Posted 2023-05-17 17:59:19


Splitting "Check all that apply" survey column from Google Forms

Question

I have Google Forms data with a column, Reasons, that stores every option a respondent checked as one comma-separated string, so its length depends on how many boxes the user ticked. Two example rows look like this:

import pandas as pd
filename = 'Example.csv'
df = pd.read_csv(filename)
print(df.to_dict("list"))

Output:

{
  'ID': [1, 2], 
  'Join Date': [
    Timestamp('2023-01-01 00:00:00'), 
    Timestamp('2022-12-01 00:00:00')
  ], 
  'Reasons': [
    'Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]', 
    'Career & Growth Opportunities [Learning & Development, Progression], Meaningful work'
  ]
}

I want it to look like:

{
  'ID': [1, 1, 2, 2], 
  'Join Date': [
    Timestamp('2023-01-01 00:00:00'), 
    Timestamp('2023-01-01 00:00:00'), 
    Timestamp('2022-12-01 00:00:00'), 
    Timestamp('2022-12-01 00:00:00')
  ], 
  'Reasons': [
    'Benefits [Leave, Flexi, Dental, Insurance etc.]', 
    'Compensation [Salary & Bonus]', 
    'Career & Growth Opportunities [Learning & Development, Progression]',         
    'Meaningful work'
  ]
}

The result should then be converted back to a DataFrame.

After importing the data into Python as a DataFrame, how can I split this column and create a duplicate row for each reason a user checked?

I can't simply split on commas because some of the reasons themselves contain commas. Would explode() work?

Hopefully someone can help me.


Answer 1

Score: 0


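Here is the regex. The first alternative matches an option followed by a bracketed group (commas are allowed inside the brackets); the second matches a plain option with no brackets:

r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)"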
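To see what the pattern returns before applying it to a whole column, here is a minimal sketch on a single string taken from the question. The variable names (`pattern`, `sample`, `matches`, `reasons`) are only for illustration. Because the pattern has two capture groups, `re.findall` returns one 2-tuple per match, with the unused group left as an empty string.

```python
import re

# The pattern from the answer: bracketed option | plain option
pattern = r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)"

sample = (
    "Career & Growth Opportunities [Learning & Development, Progression], "
    "Meaningful work"
)

# One 2-tuple per match: (bracketed option, plain option); one side is ""
matches = re.findall(pattern, sample)

# Keep whichever group matched and strip the space left after the comma
reasons = [(bracketed or plain).strip() for bracketed, plain in matches]
print(reasons)
```

Note that the second match keeps the space that followed the separating comma, which is why the sketch calls `strip()`.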

Now we can apply this pattern to the DataFrame:

import re
df["Reasons"] = df["Reasons"].apply(lambda x: re.findall(r"([^,\[\]]+\[[^\[\]]+\])|([^,\[\]]+)", x))
'''
|    |   ID | Join Date           | Reasons                                                                                                 |
|---:|-----:|:--------------------|:--------------------------------------------------------------------------------------------------------|
|  0 |    1 | 2023-01-01 00:00:00 | [('Benefits [Leave, Flexi, Dental, Insurance etc.]', ''), (' Compensation [Salary & Bonus]', '')]       |
|  1 |    2 | 2022-12-01 00:00:00 | [('Career & Growth Opportunities [Learning & Development, Progression]', ''), ('', ' Meaningful work')] |
'''

As you can see, each match is a tuple holding the answer we need plus an empty string, and the tuples are collected in a list. Let's get rid of the empty strings:

# Turn each tuple in the list into its own row
df = df.explode("Reasons")
# Keep the single non-empty element of each tuple, and strip the leading space left after the separating comma
df["Reasons"] = df["Reasons"].apply(lambda x: [i for i in x if i != ""][0].strip())

Out:

|    |   ID | Join Date           | Reasons                                                             |
|---:|-----:|:--------------------|:--------------------------------------------------------------------|
|  0 |    1 | 2023-01-01 00:00:00 | Benefits [Leave, Flexi, Dental, Insurance etc.]                     |
|  0 |    1 | 2023-01-01 00:00:00 | Compensation [Salary & Bonus]                                       |
|  1 |    2 | 2022-12-01 00:00:00 | Career & Growth Opportunities [Learning & Development, Progression] |
|  1 |    2 | 2022-12-01 00:00:00 | Meaningful work                                                     |
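As a possible alternative (not part of the original answer), the column can be split directly on the commas that sit outside square brackets, which avoids the tuple clean-up step. This is a minimal sketch that rebuilds the question's two rows inline instead of reading Example.csv, and it assumes brackets are balanced and never nested. The names `split_pattern` and `out` are illustrative.

```python
import re

import pandas as pd

# Rebuild the two example rows from the question
df = pd.DataFrame({
    "ID": [1, 2],
    "Join Date": pd.to_datetime(["2023-01-01", "2022-12-01"]),
    "Reasons": [
        "Benefits [Leave, Flexi, Dental, Insurance etc.], Compensation [Salary & Bonus]",
        "Career & Growth Opportunities [Learning & Development, Progression], Meaningful work",
    ],
})

# Match a comma (plus trailing whitespace) only if the next bracket
# character ahead is not a closing "]", i.e. the comma is outside brackets
split_pattern = r",\s*(?![^\[\]]*\])"

out = (
    df.assign(Reasons=df["Reasons"].apply(lambda s: re.split(split_pattern, s)))
      .explode("Reasons")        # one row per reason; ID and Join Date repeat
      .reset_index(drop=True)
)
print(out.to_dict("list"))
```

Either approach still splits incorrectly if a plain (unbracketed) option contains a comma of its own, so it is worth checking the exact option texts used in the form.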

Note:
I'm not good at regex, so I used ChatGPT to find this pattern. You can ask it something like: "What is the regex for the xxxxx clause?"
