使用正则表达式在`pivot_longer`中,将具有共同分组变量的多个列集合展开。

huangapple go评论57阅读模式
英文:

using regex in pivot_longer to unpivot multiple sets of columns with common grouping variable

问题

Follow-up from my last question:
https://stackoverflow.com/questions/74957441/pyjanitor-pivot-longer-multiple-sets-of-columns-with-common-grouping-variable-an/74957533#74957533

In my last question, the dataset I gave was oversimplified for the problem I was having. I have changed the column names to represent the ones in my dataset, as I couldn't figure out how to fix them myself using regex in pivot_longer. In the model dataset I gave, columns were written with the following pattern: number_word, but in my dataset the columns are in any order and never separated by underscores (e.g., wordnumber).

Note that the number needs to be the same grouping variable for each column set. So there should be a rating, estimate, and type for each number.

The dataset

df = pd.DataFrame({

    'id': [1, 1, 1],
    'ratingfirst': [1, 2, 3],
    'ratingsecond': [2.8, 2.9, 2.2],
    'ratingthird': [3.4, 3.8, 2.9],
    'firstestimate': [1.2, 2.4, 2.8],
    'secondestimate': [2.4, 3, 2.4],
    'thirdestimate':[3.4, 3.8, 2.9],
    'firsttype': ['red', 'green', 'blue'],
    'secondtype': ['red', 'green', 'yellow'],
    'thirdtype': ['red', 'red', 'blue'],
})

Desired output

The header of my desired output is the following:

id category rating estimate type
1 first 1.0 1.2 'red'
英文:

Follow-up from my last question:
https://stackoverflow.com/questions/74957441/pyjanitor-pivot-longer-multiple-sets-of-columns-with-common-grouping-variable-an/74957533#74957533

In my last question, the dataset I gave was oversimplified for the problem I was having. I have changed the column names to represent the ones in my dataset, as I couldn't figure out how to fix them myself using regex in pivot_longer. In the model dataset I gave, columns were written with the following pattern: number_word, but in my dataset the columns are in any order and never separated by underscores (e.g., wordnumber).

Note that the number needs to be the same grouping variable for each column set. So there should be a rating, estimate, and type for each number.

The dataset

df = pd.DataFrame({

    'id': [1, 1, 1],
    'ratingfirst': [1, 2, 3],
    'ratingsecond': [2.8, 2.9, 2.2],
    'ratingthird': [3.4, 3.8, 2.9],
    'firstestimate': [1.2, 2.4, 2.8],
    'secondestimate': [2.4, 3, 2.4],
    'thirdestimate':[3.4, 3.8, 2.9],
    'firsttype': ['red', 'green', 'blue'],
    'secondtype': ['red', 'green', 'yellow'],
    'thirdtype': ['red', 'red', 'blue'],
})

Desired output

The header of my desired output is the following:

id category rating estimate type
1 first 1.0 1.2 'red'

答案1

得分: 1

以下是翻译好的部分:

我认为最简单的方法是将您拥有的列与先前问题中使用的列对齐类似于

def fix_col_header(s, d):
    for word, word_replace in d.items():
        s = s.replace(word, word_replace)
        if s.startswith("_"):
            s = s[len(word_replace):] + s[:len(word_replace)]
    return s

d = {"first": "_first", "second": "_second", "third": "_third"}
df.columns = [fix_col_header(col, d) for col in df.columns]

这将给出以下列名:

id, rating_first, rating_second, rating_third, estimate_first, estimate_second, estimate_third, type_first, type_second, type_third

现在,您可以应用前一个问题中的解决方案(请注意,category 和 value 被互换了)。为了完整起见,我在此处添加了它:

import janitor

(df
.pivot_longer(
    column_names="*_*", 
    names_to = (".value", "category"), 
    names_sep="_")
)
英文:

I think the easiest way would be to align the columns you have with what was used in the previous question, something like:

def fix_col_header(s, d):
    for word, word_replace in d.items():
        s = s.replace(word, word_replace)
        if s.startswith("_"):
            s = s[len(word_replace):] + s[:len(word_replace)]
    return s

d = {"first":"_first", "second":"_second", "third": "_third"}
df.columns = [fix_col_header(col, d) for col in df.columns]

This will give the columns:

id, rating_first, rating_second, rating_third, estimate_first, estimate_second, estimate_third, type_first, type_second, type_third

Now you can apply the solution from the previous question (note that category and value are switched). For completeness I have added it here:

import janitor

(df
.pivot_longer(
    column_names="*_*", 
    names_to = (".value", "category"), 
    names_sep="_")
)

答案2

得分: 1

pivot_longer 支持多个 .value - 您可以利用这一点来重新整理数据框,使用 names_sep 参数:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
    index='id',
    names_to = (".value", "category", ".value"),
    names_sep = "(first|second|third)")
)
   id category  rating  estimate    type
0   1    first     1.0       1.2     red
1   1    first     2.0       2.4   green
2   1    first     3.0       2.8    blue
3   1   second     2.8       2.4     red
4   1   second     2.9       3.0   green
5   1   second     2.2       2.4  yellow
6   1    third     3.4       3.4     red
7   1    third     3.8       3.8     red
8   1    third     2.9       2.9    blue

如果您查看分解,您可以看到这里发生了什么:

df.columns[1:].str.split("(first|second|third)")
Index([   ['rating', 'first', ''],   ['rating', 'second', ''],
          ['rating', 'third', ''],  ['', 'first', 'estimate'],
       ['', 'second', 'estimate'],  ['', 'third', 'estimate'],
            ['', 'first', 'type'],     ['', 'second', 'type'],
            ['', 'third', 'type']],
      dtype='object')

请注意,我们有三个条目,其中一个为空字符串。这与我们的 names_to 参数匹配 - > (".value", "category", ".value");一旦 pivot_longer 匹配,然后在最终输出中将 .value 组合成一个,所以对于 ['rating', 'first', ''],它提取出 ('rating', ''),最终合并成一个 - > rating,对于其他列也是如此。

另一个选项是使用 pd.stack

temp = df.set_index('id')
temp.columns = temp.columns.str.split("(first|second|third)", expand = True)
temp.columns.names = [None, 'category', None]
temp = temp.stack('category')
temp.columns = temp.columns.map("".join)
temp.reset_index()

   id category  estimate    type  rating
0   1    first       1.2     red     1.0
1   1   second       2.4     red     2.8
2   1    third       3.4     red     3.4
3   1    first       2.4   green     2.0
4   1   second       3.0   green     2.9
5   1    third       3.8     red     3.8
6   1    first       2.8    blue     3.0
7   1   second       2.4  yellow     2.2
8   1    third       2.9    blue     2.9
英文:

pivot_longer supports multiple .value - you can take advantage of that to reshape your dataframe, using the names_sep parameter:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
    index='id',
    names_to = (".value", "category", ".value"),
    names_sep = "(first|second|third)")
)
   id category  rating  estimate    type
0   1    first     1.0       1.2     red
1   1    first     2.0       2.4   green
2   1    first     3.0       2.8    blue
3   1   second     2.8       2.4     red
4   1   second     2.9       3.0   green
5   1   second     2.2       2.4  yellow
6   1    third     3.4       3.4     red
7   1    third     3.8       3.8     red
8   1    third     2.9       2.9    blue

If you look at the breakdown you can see what's going on here :

df.columns[1:].str.split("(first|second|third)")
Index([   ['rating', 'first', ''],   ['rating', 'second', ''],
          ['rating', 'third', ''],  ['', 'first', 'estimate'],
       ['', 'second', 'estimate'],  ['', 'third', 'estimate'],
            ['', 'first', 'type'],     ['', 'second', 'type'],
            ['', 'third', 'type']],
      dtype='object')

Note how we have three entries, and one of them is an empty string. This matches our names_to argument -> (".value", "category", ".value"); once pivot_longer matches, it then in the final output combine the .values into one, so for ['rating', 'first', ''], it pulls out ('rating', ''), and finally lumps them into one -> rating, and on and on for the rest of the columns.

Another option is with pd.stack:

temp = df.set_index('id')
temp.columns = temp.columns.str.split("(first|second|third)", expand = True)
temp.columns.names = [None, 'category', None]
temp = temp.stack('category')
temp.columns = temp.columns.map("".join)
temp.reset_index()

   id category  estimate    type  rating
0   1    first       1.2     red     1.0
1   1   second       2.4     red     2.8
2   1    third       3.4     red     3.4
3   1    first       2.4   green     2.0
4   1   second       3.0   green     2.9
5   1    third       3.8     red     3.8
6   1    first       2.8    blue     3.0
7   1   second       2.4  yellow     2.2
8   1    third       2.9    blue     2.9

huangapple
  • 本文由 发表于 2023年1月9日 16:51:31
  • 转载请务必保留本文链接:https://go.coder-hub.com/75054917.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定