June 6, 2023, 02:48:12

Creating a table in Python using values of another table as new columns

Question

I am working in NumPy and pandas.

I have a table of loans with the features country and sector.

# LOAN	SECTOR	COUNTRY
Loan 1	food	germany
Loan 2	telecom	italy
Loan 3	auto	japan
Loan 4	food	japan
Loan 5	telecom	germany
Loan 6	auto	italy

I need to take the unique values of the sector and country features and use them as columns, creating a table with a 1/0 flag indicating whether the loan is active in that sector or country, as follows:

# LOAN	food	telecom	auto	germany	italy	japan
Loan 1	1	0	0	1	0	0
Loan 2	0	1	0	0	1	0
Loan 3	0	0	1	0	0	1
Loan 4	1	0	0	0	0	1
Loan 5	0	1	0	1	0	0
Loan 6	0	0	1	0	1	0

So Loan 1, which has food as its sector and germany as its country in the first table, gets a 1 in the food and germany columns of the second table and a 0 in all the other columns.

This looks like a job for pivot_table, but I don't understand how to get 1/0 as the values. What is the easiest way to do this?

Thanks

Answer 1

Score: 3

You can use get_dummies and groupby.sum:

out = df[['# LOAN']].join(pd.get_dummies(df[['SECTOR', 'COUNTRY']].stack())
                            .groupby(level=0).sum())

Note: use .groupby(level=0).max().astype(int) instead if the same value can appear in both columns, so that every cell stays 0 or 1.

Output:

   # LOAN  auto  food  germany  italy  japan  telecom
0  Loan 1     0     1        1      0      0        0
1  Loan 2     0     0        0      1      0        1
2  Loan 3     1     0        0      0      1        0
3  Loan 4     0     1        0      0      1        0
4  Loan 5     0     0        1      0      0        1
5  Loan 6     1     0        0      1      0        0
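
For a self-contained run, the approach above can be sketched as follows. The construction of df is an assumption (the answer does not show it); it simply rebuilds the question's table with the column names # LOAN, SECTOR and COUNTRY. The intermediate names stacked, dummies and indicators are illustrative only.

```python
import pandas as pd

# Assumed reconstruction of the question's table.
df = pd.DataFrame({
    "# LOAN": [f"Loan {i}" for i in range(1, 7)],
    "SECTOR": ["food", "telecom", "auto", "food", "telecom", "auto"],
    "COUNTRY": ["germany", "italy", "japan", "japan", "germany", "italy"],
})

# Stack SECTOR and COUNTRY into one long Series: two entries per loan,
# indexed by (original row, source column).
stacked = df[["SECTOR", "COUNTRY"]].stack()

# One indicator column per distinct value; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(stacked, dtype=int)

# Collapse the two stacked rows of each loan back into a single row.
indicators = dummies.groupby(level=0).sum()

# Re-attach the loan names by row index.
out = df[["# LOAN"]].join(indicators)
print(out)
```

The indicator columns come out in alphabetical order rather than the order in the question; reindex the columns afterwards if a specific order matters.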

Alternatives:

With str.get_dummies:

out = df[['# LOAN']].join(df[['SECTOR', 'COUNTRY']]
                          .agg('|'.join, axis=1)
                          .str.get_dummies()
                          )

Or with crosstab:

cols = ['SECTOR', 'COUNTRY']

out = (pd.concat(pd.crosstab(df['# LOAN'], df[c]) for c in cols)
         .groupby(level=0).sum().reset_index()
       )
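
To illustrate the note about a value shared by both columns, here is a small sketch with made-up data in which the hypothetical value "other" is used as both a sector and a country. It is not part of the original question; it only shows why sum and max differ in that case.

```python
import pandas as pd

# Hypothetical data: "other" appears in both SECTOR and COUNTRY for Loan 1.
df = pd.DataFrame({
    "# LOAN": ["Loan 1", "Loan 2"],
    "SECTOR": ["other", "food"],
    "COUNTRY": ["other", "italy"],
})

dummies = pd.get_dummies(df[["SECTOR", "COUNTRY"]].stack(), dtype=int)

# sum counts how many of the two columns hold the value, so it can reach 2.
summed = dummies.groupby(level=0).sum()

# max only records whether the value is present at all, so it stays 0 or 1.
capped = dummies.groupby(level=0).max().astype(int)

print(summed)
print(capped)
```

With sum, Loan 1 gets a 2 in the other column; with max, it gets a 1, which keeps the table a pure 0/1 indicator.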

Answer 2

Score: 3

If you use sklearn, you can use OneHotEncoder and ColumnTransformer:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode SECTOR and COUNTRY; pass the remaining column (# LOAN) through unchanged
ct = ColumnTransformer(
    [('OHE', OneHotEncoder(dtype=int), ['SECTOR', 'COUNTRY'])],
    remainder='passthrough', verbose_feature_names_out=False
)
out = pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

Output:

>>> out
  SECTOR_auto SECTOR_food SECTOR_telecom COUNTRY_germany COUNTRY_italy COUNTRY_japan  # LOAN
0           0           1              0               1             0             0  Loan 1
1           0           0              1               0             1             0  Loan 2
2           1           0              0               0             0             1  Loan 3
3           0           1              0               0             0             1  Loan 4
4           0           0              1               1             0             0  Loan 5
5           1           0              0               0             1             0  Loan 6

Answer 3

Score: 1

What you're looking for is one-hot encoding. There's a great thread on how to get that from a pd.DataFrame() here: https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python

Cybernetic's answer there is pretty thorough.

Edit: mozway's answer on this thread is exactly right. get_dummies is the pandas function for one-hot encoding, and I believe the linked thread uses the same one.
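
As a minimal sketch of one-hot encoding with get_dummies applied directly to the DataFrame: the construction of df below is an assumption that rebuilds the question's table, and the empty prefix and prefix_sep are a choice made here so that the new columns carry the bare values (food, germany, ...) as in the desired output. This only works cleanly when SECTOR and COUNTRY share no values; otherwise duplicate column names would result.

```python
import pandas as pd

# Assumed reconstruction of the question's table.
df = pd.DataFrame({
    "# LOAN": [f"Loan {i}" for i in range(1, 7)],
    "SECTOR": ["food", "telecom", "auto", "food", "telecom", "auto"],
    "COUNTRY": ["germany", "italy", "japan", "japan", "germany", "italy"],
})

# Encode only SECTOR and COUNTRY; "# LOAN" is left untouched.
# prefix="" and prefix_sep="" drop the usual "SECTOR_" / "COUNTRY_" prefixes.
out = pd.get_dummies(
    df,
    columns=["SECTOR", "COUNTRY"],
    prefix="",
    prefix_sep="",
    dtype=int,
)
print(out)
```

Here the sector columns come first and the country columns second, each group sorted alphabetically.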
