January 9, 2023, 16:51:31

using regex in pivot_longer to unpivot multiple sets of columns with common grouping variable

Question

Follow-up from my last question:
https://stackoverflow.com/questions/74957441/pyjanitor-pivot-longer-multiple-sets-of-columns-with-common-grouping-variable-an/74957533#74957533

In my last question, the dataset I gave was oversimplified for the problem I was having. I have changed the column names to represent the ones in my dataset, as I couldn't figure out how to fix them myself using regex in pivot_longer. In the model dataset I gave, columns were written with the following pattern: number_word, but in my dataset the columns are in any order and never separated by underscores (e.g., wordnumber).

Note that the number needs to be the same grouping variable for each column set. So there should be a rating, estimate, and type for each number.

The dataset

df = pd.DataFrame({
    'id': [1, 1, 1],
    'ratingfirst': [1, 2, 3],
    'ratingsecond': [2.8, 2.9, 2.2],
    'ratingthird': [3.4, 3.8, 2.9],
    'firstestimate': [1.2, 2.4, 2.8],
    'secondestimate': [2.4, 3, 2.4],
    'thirdestimate':[3.4, 3.8, 2.9],
    'firsttype': ['red', 'green', 'blue'],
    'secondtype': ['red', 'green', 'yellow'],
    'thirdtype': ['red', 'red', 'blue'],
})

Desired output

The header of my desired output is the following:

id	category	rating	estimate	type
1	first	1.0	1.2	'red'

Answer 1

Score: 1

I think the easiest way would be to align the columns you have with what was used in the previous question, something like:

def fix_col_header(s, d):
    # Prefix each ordinal with "_"; if the ordinal led the name, move it to the end.
    for word, word_replace in d.items():
        s = s.replace(word, word_replace)
        if s.startswith("_"):
            s = s[len(word_replace):] + s[:len(word_replace)]
    return s

d = {"first": "_first", "second": "_second", "third": "_third"}
df.columns = [fix_col_header(col, d) for col in df.columns]

This will give the columns:

id, rating_first, rating_second, rating_third, estimate_first, estimate_second, estimate_third, type_first, type_second, type_third

Now you can apply the solution from the previous question (note that category and value are switched). For completeness I have added it here:

import janitor

(df
.pivot_longer(
    column_names="*_*",
    names_to=(".value", "category"),
    names_sep="_")
)
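
The renaming step in this answer could also be sketched with a single regular expression instead of chained string replacements. This is only an illustration using the standard library; the names ORDINALS, PATTERN and normalize are hypothetical, and it assumes every column to reshape contains exactly one of first, second or third.

```python
import re

# Assumption: these three ordinals are the only grouping labels in the data.
ORDINALS = ("first", "second", "third")
# Capture whatever precedes the ordinal, the ordinal itself, and whatever follows.
PATTERN = re.compile("^(.*?)(" + "|".join(ORDINALS) + ")(.*)$")

def normalize(col):
    """Rewrite e.g. 'ratingfirst' or 'firstestimate' as '<word>_<ordinal>'."""
    match = PATTERN.match(col)
    if match is None:
        return col  # columns such as 'id' are left untouched
    before, ordinal, after = match.groups()
    return f"{before}{after}_{ordinal}"

columns = [
    "id", "ratingfirst", "ratingsecond", "ratingthird",
    "firstestimate", "secondestimate", "thirdestimate",
    "firsttype", "secondtype", "thirdtype",
]
renamed = [normalize(col) for col in columns]
print(renamed)
```

With a dataframe, the same function would be applied as df.columns = [normalize(col) for col in df.columns] before calling pivot_longer with names_sep="_".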

Answer 2

Score: 1

pivot_longer supports multiple .value entries; you can take advantage of that to reshape your dataframe, using the names_sep parameter:

# pip install pyjanitor
import pandas as pd
import janitor

(df
.pivot_longer(
    index='id',
    names_to=(".value", "category", ".value"),
    names_sep="(first|second|third)")
)
   id category  rating  estimate    type
0   1    first     1.0       1.2     red
1   1    first     2.0       2.4   green
2   1    first     3.0       2.8    blue
3   1   second     2.8       2.4     red
4   1   second     2.9       3.0   green
5   1   second     2.2       2.4  yellow
6   1    third     3.4       3.4     red
7   1    third     3.8       3.8     red
8   1    third     2.9       2.9    blue

If you look at the breakdown, you can see what's going on here:

df.columns[1:].str.split("(first|second|third)")
Index([   ['rating', 'first', ''],   ['rating', 'second', ''],
          ['rating', 'third', ''],  ['', 'first', 'estimate'],
       ['', 'second', 'estimate'],  ['', 'third', 'estimate'],
            ['', 'first', 'type'],     ['', 'second', 'type'],
            ['', 'third', 'type']],
      dtype='object')

Note how each column name splits into three entries, one of which is an empty string. This matches our names_to argument, (".value", "category", ".value"). Once pivot_longer has matched the pieces, it combines the .value pieces into one in the final output: for ['rating', 'first', ''] it pulls out ('rating', '') and lumps them into rating, and so on for the rest of the columns.
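
The pairing described above can be mimicked with the standard library alone. This is a rough illustration of the idea, not pyjanitor's actual implementation; NAMES_TO, NAMES_SEP and split_column are hypothetical names.

```python
import re

# Mirrors the names_to / names_sep arguments passed to pivot_longer above.
NAMES_TO = (".value", "category", ".value")
NAMES_SEP = "(first|second|third)"

def split_column(col):
    """Split a column name and merge the pieces labelled '.value'."""
    # The capturing group keeps the separator, so each name yields three parts.
    parts = re.split(NAMES_SEP, col)
    value = "".join(p for label, p in zip(NAMES_TO, parts) if label == ".value")
    category = "".join(p for label, p in zip(NAMES_TO, parts) if label != ".value")
    return value, category

columns = [
    "ratingfirst", "ratingsecond", "ratingthird",
    "firstestimate", "secondestimate", "thirdestimate",
    "firsttype", "secondtype", "thirdtype",
]
pairs = [split_column(col) for col in columns]
for col, (value, category) in zip(columns, pairs):
    print(f"{col}: value={value}, category={category}")
```

Because one of the two .value pieces is always the empty string, joining them recovers the plain word (rating, estimate or type) regardless of whether the ordinal came first or last.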

Another option is with DataFrame.stack:

temp = df.set_index('id')
temp.columns = temp.columns.str.split("(first|second|third)", expand=True)
temp.columns.names = [None, 'category', None]
temp = temp.stack('category')
temp.columns = temp.columns.map("".join)
temp.reset_index()
   id category  estimate    type  rating
0   1    first       1.2     red     1.0
1   1   second       2.4     red     2.8
2   1    third       3.4     red     3.4
3   1    first       2.4   green     2.0
4   1   second       3.0   green     2.9
5   1    third       3.8     red     3.8
6   1    first       2.8    blue     3.0
7   1   second       2.4  yellow     2.2
8   1    third       2.9    blue     2.9
