在Python中使用另一张表的值创建表格,作为新的列。

huangapple go评论58阅读模式
英文:

Creating a table in Python using values of another table as new columns

问题

I am working in Numpy and Pandas.
我正在使用Numpy和Pandas。

I have a table of loans with the features country and sector.
我有一张贷款表格,其中包含countrysector两个特征。

# LOAN SECTOR COUNTRY
Loan 1 food germany
Loan 2 telecom italy
Loan 3 auto japan
Loan 4 food japan
Loan 5 telecom germany
Loan 6 auto italy

I need to drop the duplicates by the sector and the country, ie. select the unique values of these 2 features, and use them as columns creating a table with boolean 1/0 if the loan is active in that country or sector, as follows:
我需要根据sectorcountry去掉重复项,即选择这两个特征的唯一值,并将它们用作列,创建一个表格,如果贷款在该国家或部门活跃,则为布尔值1/0,如下所示:

# LOAN food telecom auto germany italy japan
Loan 1 1 0 0 1 0 0
Loan 2 0 1 0 0 1 0
Loan 3 0 0 1 0 0 1
Loan 4 1 0 0 0 0 1
Loan 5 0 1 0 1 0 0
Loan 6 0 0 1 0 1 0

So, Loan1 in the first table had food as sector and germany as country; then, in the second table it has 1 on columns food and germany and 0 on all the other columns.
因此,在第一张表中,Loan1的sector是food,country是germany;然后,在第二张表中,它在foodgermany列上为1,在其他列上为0。

It seems a pivot_table but I don't understand how I could put the 1/0 as values?
这看起来像是一个pivot_table,但我不明白如何将1/0作为值放入其中?
Btw, what's the easiest way?
顺便问一下,最简单的方法是什么?

Thanks
谢谢。

英文:

I am working in Numpy and Pandas.

I have a table of loans with the features country and sector.

# LOAN SECTOR COUNTRY
Loan 1 food germany
Loan 2 telecom italy
Loan 3 auto japan
Loan 4 food japan
Loan 5 telecom germany
Loan 6 auto italy

I need to drop the duplicates by the sector and the country, ie. select the unique values of these 2 features, and use them as columns creating a table with boolean 1/0 if the loan is active in that country or sector, as follows:

# LOAN food telecom auto germany italy japan
Loan 1 1 0 0 1 0 0
Loan 2 0 1 0 0 1 0
Loan 3 0 0 1 0 0 1
Loan 4 1 0 0 0 0 1
Loan 5 0 1 0 1 0 0
Loan 6 0 0 1 0 1 0

So, Loan1 in the first table had food as sector and germany as country; then, in the second table it has 1 on columns food and germany and 0 on all the other columns.

It seems a pivot_table but I don't understand how I could put the 1/0 as values?
Btw, what's the easiest way?

Thanks

答案1

得分: 3

你可以使用 get_dummiesgroupby.sum

out = df[['# LOAN']].join(pd.get_dummies(df[['SECTOR', 'COUNTRY']].stack())
                            .groupby(level=0).sum())

注意:如果两列可能具有相同的值,可以使用 .groupby(level=0).max().astype(int)

输出结果:

   # LOAN  auto  food  germany  italy  japan  telecom
0  Loan 1     0     1        1      0      0        0
1  Loan 2     0     0        0      1      0        1
2  Loan 3     1     0        0      0      1        0
3  Loan 4     0     1        0      0      1        0
4  Loan 5     0     0        1      0      0        1
5  Loan 6     1     0        0      1      0        0

其他替代方法:

使用 str.get_dummies

out = df[['# LOAN']].join(df[['SECTOR', 'COUNTRY']]
                          .agg('|'.join, axis=1)
                          .str.get_dummies()
                          )

或者使用 crosstab

cols = ['SECTOR', 'COUNTRY']

out = (pd.concat(pd.crosstab(df['# LOAN'], df[c]) for c in cols)
         .groupby(level=0).sum().reset_index()
       )
英文:

You can use get_dummies and groupby.sum:

out = df[['# LOAN']].join(pd.get_dummies(df[['SECTOR', 'COUNTRY']].stack())
                            .groupby(level=0).sum())

NB. use .groupby(level=0).max().astype(int) if there is a chance that both columns can have the same value.

Output:

   # LOAN  auto  food  germany  italy  japan  telecom
0  Loan 1     0     1        1      0      0        0
1  Loan 2     0     0        0      1      0        1
2  Loan 3     1     0        0      0      1        0
3  Loan 4     0     1        0      0      1        0
4  Loan 5     0     0        1      0      0        1
5  Loan 6     1     0        0      1      0        0

Alternatives:

With str.get_dummies

out = df[['# LOAN']].join(df[['SECTOR', 'COUNTRY']]
                          .agg('|'.join, axis=1)
                          .str.get_dummies()
                          )

Or with crosstab

cols = ['SECTOR', 'COUNTRY']

out = (pd.concat(pd.crosstab(df['# LOAN'], df[c]) for c in cols)
         .groupby(level=0).sum().reset_index()
       )

答案2

得分: 3

使用sklearn时,您可以使用OneHotEncoderColumnTransformer

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('OHE', OneHotEncoder(dtype=int), ['SECTOR', 'COUNTRY'])],
    remainder='passthrough', verbose_feature_names_out=False
)
out = pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

输出:

>>> out
  SECTOR_auto SECTOR_food SECTOR_telecom COUNTRY_germany COUNTRY_italy COUNTRY_japan  # LOAN
0           0           1              0               1             0             0  Loan 1
1           0           0              1               0             1             0  Loan 2
2           1           0              0               0             0             1  Loan 3
3           0           1              0               0             0             1  Loan 4
4           0           0              1               1             0             0  Loan 5
5           1           0              0               0             1             0  Loan 6
英文:

In cas you use sklearn, you can use OneHotEncoder and ColumnTransformer:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('OHE', OneHotEncoder(dtype=int), ['SECTOR', 'COUNTRY'])],
    remainder='passthrough', verbose_feature_names_out=False
)
out = pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

Output:

>>> out
  SECTOR_auto SECTOR_food SECTOR_telecom COUNTRY_germany COUNTRY_italy COUNTRY_japan  # LOAN
0           0           1              0               1             0             0  Loan 1
1           0           0              1               0             1             0  Loan 2
2           1           0              0               0             0             1  Loan 3
3           0           1              0               0             0             1  Loan 4
4           0           0              1               1             0             0  Loan 5
5           1           0              0               0             1             0  Loan 6

答案3

得分: 1

你所寻找的是一热编码。关于如何从pd.DataFrame()进行一热编码,有一个很好的帖子在这个链接中:https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python

Cybernetic的回答非常详细

编辑:这个帖子上mozway的回答完全正确 —— get_dummies 是用于一热编码的pandas函数,我相信他们在链接的帖子中使用了相同的函数。

英文:

What you're looking for is one-hot encoding. There's a great thread on how to get that from a pd.DataFrame() on this thread: https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python

Cybernetic's answer was pretty thorough

Edit: mozway's answer on this thread is exactly right -- the get dummy's is the pandas fxn for one-hot encoding, I believe they used the same one in the linked thread

huangapple
  • 本文由 发表于 2023年6月6日 02:48:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/76409213.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定