英文:
using regex in pivot_longer to unpivot multiple sets of columns with common grouping variable
问题
Follow-up from my last question:
https://stackoverflow.com/questions/74957441/pyjanitor-pivot-longer-multiple-sets-of-columns-with-common-grouping-variable-an/74957533#74957533
In my last question, the dataset I gave was oversimplified for the problem I was having. I have changed the column names to represent the ones in my dataset, as I couldn't figure out how to fix them myself using regex in pivot_longer
. In the model dataset I gave, columns were written with the following pattern: number_word
, but in my dataset the columns are in any order and never separated by underscores (e.g., wordnumber).
Note that the number needs to be the same grouping variable for each column set. So there should be a rating, estimate, and type for each number.
The dataset
df = pd.DataFrame({
'id': [1, 1, 1],
'ratingfirst': [1, 2, 3],
'ratingsecond': [2.8, 2.9, 2.2],
'ratingthird': [3.4, 3.8, 2.9],
'firstestimate': [1.2, 2.4, 2.8],
'secondestimate': [2.4, 3, 2.4],
'thirdestimate':[3.4, 3.8, 2.9],
'firsttype': ['red', 'green', 'blue'],
'secondtype': ['red', 'green', 'yellow'],
'thirdtype': ['red', 'red', 'blue'],
})
Desired output
The header of my desired output is the following:
id | category | rating | estimate | type |
---|---|---|---|---|
1 | first | 1.0 | 1.2 | 'red' |
英文:
Follow-up from my last question:
https://stackoverflow.com/questions/74957441/pyjanitor-pivot-longer-multiple-sets-of-columns-with-common-grouping-variable-an/74957533#74957533
In my last question, the dataset I gave was oversimplified for the problem I was having. I have changed the column names to represent the ones in my dataset, as I couldn't figure out how to fix them myself using regex in pivot_longer
. In the model dataset I gave, columns were written with the following pattern: number_word
, but in my dataset the columns are in any order and never separated by underscores (e.g., wordnumber).
Note that the number needs to be the same grouping variable for each column set. So there should be a rating, estimate, and type for each number.
The dataset
df = pd.DataFrame({
'id': [1, 1, 1],
'ratingfirst': [1, 2, 3],
'ratingsecond': [2.8, 2.9, 2.2],
'ratingthird': [3.4, 3.8, 2.9],
'firstestimate': [1.2, 2.4, 2.8],
'secondestimate': [2.4, 3, 2.4],
'thirdestimate':[3.4, 3.8, 2.9],
'firsttype': ['red', 'green', 'blue'],
'secondtype': ['red', 'green', 'yellow'],
'thirdtype': ['red', 'red', 'blue'],
})
Desired output
The header of my desired output is the following:
id | category | rating | estimate | type |
---|---|---|---|---|
1 | first | 1.0 | 1.2 | 'red' |
答案1
得分: 1
以下是翻译好的部分:
我认为最简单的方法是将您拥有的列与先前问题中使用的列对齐,类似于:
def fix_col_header(s, d):
for word, word_replace in d.items():
s = s.replace(word, word_replace)
if s.startswith("_"):
s = s[len(word_replace):] + s[:len(word_replace)]
return s
d = {"first": "_first", "second": "_second", "third": "_third"}
df.columns = [fix_col_header(col, d) for col in df.columns]
这将给出以下列名:
id, rating_first, rating_second, rating_third, estimate_first, estimate_second, estimate_third, type_first, type_second, type_third
现在,您可以应用前一个问题中的解决方案(请注意,category 和 value 被互换了)。为了完整起见,我在此处添加了它:
import janitor
(df
.pivot_longer(
column_names="*_*",
names_to = (".value", "category"),
names_sep="_")
)
英文:
I think the easiest way would be to align the columns you have with what was used in the previous question, something like:
def fix_col_header(s, d):
for word, word_replace in d.items():
s = s.replace(word, word_replace)
if s.startswith("_"):
s = s[len(word_replace):] + s[:len(word_replace)]
return s
d = {"first":"_first", "second":"_second", "third": "_third"}
df.columns = [fix_col_header(col, d) for col in df.columns]
This will give the columns:
id, rating_first, rating_second, rating_third, estimate_first, estimate_second, estimate_third, type_first, type_second, type_third
Now you can apply the solution from the previous question (note that category and value are switched). For completeness I have added it here:
import janitor
(df
.pivot_longer(
column_names="*_*",
names_to = (".value", "category"),
names_sep="_")
)
答案2
得分: 1
pivot_longer
支持多个 .value
- 您可以利用这一点来重新整理数据框,使用 names_sep
参数:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index='id',
names_to = (".value", "category", ".value"),
names_sep = "(first|second|third)")
)
id category rating estimate type
0 1 first 1.0 1.2 red
1 1 first 2.0 2.4 green
2 1 first 3.0 2.8 blue
3 1 second 2.8 2.4 red
4 1 second 2.9 3.0 green
5 1 second 2.2 2.4 yellow
6 1 third 3.4 3.4 red
7 1 third 3.8 3.8 red
8 1 third 2.9 2.9 blue
如果您查看分解,您可以看到这里发生了什么:
df.columns[1:].str.split("(first|second|third)")
Index([ ['rating', 'first', ''], ['rating', 'second', ''],
['rating', 'third', ''], ['', 'first', 'estimate'],
['', 'second', 'estimate'], ['', 'third', 'estimate'],
['', 'first', 'type'], ['', 'second', 'type'],
['', 'third', 'type']],
dtype='object')
请注意,我们有三个条目,其中一个为空字符串。这与我们的 names_to
参数匹配 - > (".value", "category", ".value")
;一旦 pivot_longer
匹配,然后在最终输出中将 .value
组合成一个,所以对于 ['rating', 'first', '']
,它提取出 ('rating', '')
,最终合并成一个 - > rating
,对于其他列也是如此。
另一个选项是使用 pd.stack
:
temp = df.set_index('id')
temp.columns = temp.columns.str.split("(first|second|third)", expand = True)
temp.columns.names = [None, 'category', None]
temp = temp.stack('category')
temp.columns = temp.columns.map("".join)
temp.reset_index()
id category estimate type rating
0 1 first 1.2 red 1.0
1 1 second 2.4 red 2.8
2 1 third 3.4 red 3.4
3 1 first 2.4 green 2.0
4 1 second 3.0 green 2.9
5 1 third 3.8 red 3.8
6 1 first 2.8 blue 3.0
7 1 second 2.4 yellow 2.2
8 1 third 2.9 blue 2.9
英文:
pivot_longer
supports multiple .value
- you can take advantage of that to reshape your dataframe, using the names_sep
parameter:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index='id',
names_to = (".value", "category", ".value"),
names_sep = "(first|second|third)")
)
id category rating estimate type
0 1 first 1.0 1.2 red
1 1 first 2.0 2.4 green
2 1 first 3.0 2.8 blue
3 1 second 2.8 2.4 red
4 1 second 2.9 3.0 green
5 1 second 2.2 2.4 yellow
6 1 third 3.4 3.4 red
7 1 third 3.8 3.8 red
8 1 third 2.9 2.9 blue
If you look at the breakdown you can see what's going on here :
df.columns[1:].str.split("(first|second|third)")
Index([ ['rating', 'first', ''], ['rating', 'second', ''],
['rating', 'third', ''], ['', 'first', 'estimate'],
['', 'second', 'estimate'], ['', 'third', 'estimate'],
['', 'first', 'type'], ['', 'second', 'type'],
['', 'third', 'type']],
dtype='object')
Note how we have three entries, and one of them is an empty string. This matches our names_to
argument -> (".value", "category", ".value")
; once pivot_longer
matches, it then in the final output combine the .value
s into one, so for ['rating', 'first', '']
, it pulls out ('rating', '')
, and finally lumps them into one -> rating
, and on and on for the rest of the columns.
Another option is with pd.stack
:
temp = df.set_index('id')
temp.columns = temp.columns.str.split("(first|second|third)", expand = True)
temp.columns.names = [None, 'category', None]
temp = temp.stack('category')
temp.columns = temp.columns.map("".join)
temp.reset_index()
id category estimate type rating
0 1 first 1.2 red 1.0
1 1 second 2.4 red 2.8
2 1 third 3.4 red 3.4
3 1 first 2.4 green 2.0
4 1 second 3.0 green 2.9
5 1 third 3.8 red 3.8
6 1 first 2.8 blue 3.0
7 1 second 2.4 yellow 2.2
8 1 third 2.9 blue 2.9
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论