英文:
Creating a table in Python using values of another table as new columns
问题
I am working in Numpy and Pandas.
我正在使用Numpy和Pandas。
I have a table of loans with the features country and sector.
我有一张贷款表格,其中包含country和sector两个特征。
# LOAN | SECTOR | COUNTRY | |
---|---|---|---|
Loan 1 | food | germany | |
Loan 2 | telecom | italy | |
Loan 3 | auto | japan | |
Loan 4 | food | japan | |
Loan 5 | telecom | germany | |
Loan 6 | auto | italy |
I need to drop the duplicates by the sector and the country, ie. select the unique values of these 2 features, and use them as columns creating a table with boolean 1/0 if the loan is active in that country or sector, as follows:
我需要根据sector和country去掉重复项,即选择这两个特征的唯一值,并将它们用作列,创建一个表格,如果贷款在该国家或部门活跃,则为布尔值1/0,如下所示:
# LOAN | food | telecom | auto | germany | italy | japan |
---|---|---|---|---|---|---|
Loan 1 | 1 | 0 | 0 | 1 | 0 | 0 |
Loan 2 | 0 | 1 | 0 | 0 | 1 | 0 |
Loan 3 | 0 | 0 | 1 | 0 | 0 | 1 |
Loan 4 | 1 | 0 | 0 | 0 | 0 | 1 |
Loan 5 | 0 | 1 | 0 | 1 | 0 | 0 |
Loan 6 | 0 | 0 | 1 | 0 | 1 | 0 |
So, Loan1 in the first table had food as sector and germany as country; then, in the second table it has 1 on columns food and germany and 0 on all the other columns.
因此,在第一张表中,Loan1的sector是food,country是germany;然后,在第二张表中,它在food和germany列上为1,在其他列上为0。
It seems a pivot_table but I don't understand how I could put the 1/0 as values?
这看起来像是一个pivot_table,但我不明白如何将1/0作为值放入其中?
Btw, what's the easiest way?
顺便问一下,最简单的方法是什么?
Thanks
谢谢。
英文:
I am working in Numpy and Pandas.
I have a table of loans with the features country and sector.
# LOAN | SECTOR | COUNTRY | |
---|---|---|---|
Loan 1 | food | germany | |
Loan 2 | telecom | italy | |
Loan 3 | auto | japan | |
Loan 4 | food | japan | |
Loan 5 | telecom | germany | |
Loan 6 | auto | italy |
I need to drop the duplicates by the sector and the country, ie. select the unique values of these 2 features, and use them as columns creating a table with boolean 1/0 if the loan is active in that country or sector, as follows:
# LOAN | food | telecom | auto | germany | italy | japan |
---|---|---|---|---|---|---|
Loan 1 | 1 | 0 | 0 | 1 | 0 | 0 |
Loan 2 | 0 | 1 | 0 | 0 | 1 | 0 |
Loan 3 | 0 | 0 | 1 | 0 | 0 | 1 |
Loan 4 | 1 | 0 | 0 | 0 | 0 | 1 |
Loan 5 | 0 | 1 | 0 | 1 | 0 | 0 |
Loan 6 | 0 | 0 | 1 | 0 | 1 | 0 |
So, Loan1 in the first table had food as sector and germany as country; then, in the second table it has 1 on columns food and germany and 0 on all the other columns.
It seems a pivot_table but I don't understand how I could put the 1/0 as values?
Btw, what's the easiest way?
Thanks
答案1
得分: 3
你可以使用 get_dummies
和 groupby.sum
:
out = df[['# LOAN']].join(pd.get_dummies(df[['SECTOR', 'COUNTRY']].stack())
.groupby(level=0).sum())
注意:如果两列可能具有相同的值,可以使用 .groupby(level=0).max().astype(int)
。
输出结果:
# LOAN auto food germany italy japan telecom
0 Loan 1 0 1 1 0 0 0
1 Loan 2 0 0 0 1 0 1
2 Loan 3 1 0 0 0 1 0
3 Loan 4 0 1 0 0 1 0
4 Loan 5 0 0 1 0 0 1
5 Loan 6 1 0 0 1 0 0
其他替代方法:
使用 str.get_dummies
:
out = df[['# LOAN']].join(df[['SECTOR', 'COUNTRY']]
.agg('|'.join, axis=1)
.str.get_dummies()
)
或者使用 crosstab
:
cols = ['SECTOR', 'COUNTRY']
out = (pd.concat(pd.crosstab(df['# LOAN'], df[c]) for c in cols)
.groupby(level=0).sum().reset_index()
)
英文:
You can use get_dummies
and groupby.sum
:
out = df[['# LOAN']].join(pd.get_dummies(df[['SECTOR', 'COUNTRY']].stack())
.groupby(level=0).sum())
NB. use .groupby(level=0).max().astype(int)
if there is a chance that both columns can have the same value.
Output:
# LOAN auto food germany italy japan telecom
0 Loan 1 0 1 1 0 0 0
1 Loan 2 0 0 0 1 0 1
2 Loan 3 1 0 0 0 1 0
3 Loan 4 0 1 0 0 1 0
4 Loan 5 0 0 1 0 0 1
5 Loan 6 1 0 0 1 0 0
Alternatives:
With str.get_dummies
out = df[['# LOAN']].join(df[['SECTOR', 'COUNTRY']]
.agg('|'.join, axis=1)
.str.get_dummies()
)
Or with crosstab
cols = ['SECTOR', 'COUNTRY']
out = (pd.concat(pd.crosstab(df['# LOAN'], df[c]) for c in cols)
.groupby(level=0).sum().reset_index()
)
答案2
得分: 3
使用sklearn
时,您可以使用OneHotEncoder
和ColumnTransformer
:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
[('OHE', OneHotEncoder(dtype=int), ['SECTOR', 'COUNTRY'])],
remainder='passthrough', verbose_feature_names_out=False
)
out = pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())
输出:
>>> out
SECTOR_auto SECTOR_food SECTOR_telecom COUNTRY_germany COUNTRY_italy COUNTRY_japan # LOAN
0 0 1 0 1 0 0 Loan 1
1 0 0 1 0 1 0 Loan 2
2 1 0 0 0 0 1 Loan 3
3 0 1 0 0 0 1 Loan 4
4 0 0 1 1 0 0 Loan 5
5 1 0 0 0 1 0 Loan 6
英文:
In cas you use sklearn
, you can use OneHotEncoder
and ColumnTransformer
:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
[('OHE', OneHotEncoder(dtype=int), ['SECTOR', 'COUNTRY'])],
remainder='passthrough', verbose_feature_names_out=False
)
out = pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())
Output:
>>> out
SECTOR_auto SECTOR_food SECTOR_telecom COUNTRY_germany COUNTRY_italy COUNTRY_japan # LOAN
0 0 1 0 1 0 0 Loan 1
1 0 0 1 0 1 0 Loan 2
2 1 0 0 0 0 1 Loan 3
3 0 1 0 0 0 1 Loan 4
4 0 0 1 1 0 0 Loan 5
5 1 0 0 0 1 0 Loan 6
答案3
得分: 1
你所寻找的是一热编码。关于如何从pd.DataFrame()进行一热编码,有一个很好的帖子在这个链接中:https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python
Cybernetic的回答非常详细
编辑:这个帖子上mozway的回答完全正确 —— get_dummies 是用于一热编码的pandas函数,我相信他们在链接的帖子中使用了相同的函数。
英文:
What you're looking for is one-hot encoding. There's a great thread on how to get that from a pd.DataFrame() on this thread: https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python
Cybernetic's answer was pretty thorough
Edit: mozway's answer on this thread is exactly right -- the get dummy's is the pandas fxn for one-hot encoding, I believe they used the same one in the linked thread
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论