2023-02-19 05:37:58

Why do these two commands yield different results?

Question

I have a numpy array `x_pcaed` of shape (2250, 2) and a pandas dataframe `seeds_train` with 2250 rows, which has a column `y` whose values are 0 or 1.
I made `x_pcaed` into a dataframe as follows:

pca_df = pd.DataFrame(data=x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])

I then created a new dataframe as follows:

test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)

I'm confused why the following two commands yield different outputs:

1)

test1.loc[test1['y'] == 0, 'pr_comp_1'].values

2)

x_pcaed[seeds_train.y == 0, 0]

My understanding is that they should give the same result, so I must be missing something.

Update: Here's the full code. It uses Pumpkin Seeds Dataset from here.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
seeds = pd.read_excel('Pumpkin_Seeds_Dataset.xlsx')
seeds['y'] = 0
seeds.loc[seeds.Class == 'Ürgüp Sivrisi', 'y'] = 1
seeds_train, seeds_test = train_test_split(seeds.copy(),
                                           shuffle=True,
                                           random_state=123,
                                           test_size=.1,
                                           stratify=seeds.y.values)
features = seeds_train.columns[:-2]
x = seeds_train.loc[:, features].values
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
pca = PCA(n_components=2)
x_pcaed = pca.fit_transform(x_scaled)
pca_df = pd.DataFrame(data=x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])
test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)
print(test1.loc[test1['y'] == 0, 'pr_comp_1'].values)
print('----------')
print(x_pcaed[seeds_train.y == 0, 0])

Output:

[-0.79984874 -2.75176272 -0.26329661 ... -2.03461928 -2.38149466
 -1.46663563]
----------
[-1.36392527 -0.26329661 -4.91873745 ... -1.46508442 -1.07096868
 -4.79462993]

Answer 1

Score: 1

Update

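You did not pass an index when creating `pca_df`, so it gets a default index (0 to 2249), while `seeds_train` keeps its original row labels, shuffled by `train_test_split`. `pd.concat` with `axis=1` aligns rows on index labels, not on position, so the rows of `pca_df` and `seeds_train` are paired incorrectly. Use: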

pca_df = pd.DataFrame(data=x_pcaed,
                      columns=['pr_comp_1', 'pr_comp_2'],
                      index=seeds_train.index)
test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)
r1 = test1.loc[test1['y'] == 0, 'pr_comp_1'].values
r2 = x_pcaed[seeds_train.y == 0, 0]
print(r1)
print(r2)
# Output
array([-1.36392527, -0.26329661, -4.91873745, ..., -1.46508442,
       -1.07096868, -4.79462993])
array([-1.36392527, -0.26329661, -4.91873745, ..., -1.46508442,
       -1.07096868, -4.79462993])
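
To make the alignment issue easier to see, here is a minimal sketch on toy data. The names `x_small` and `train_small` and their values are hypothetical stand-ins, not taken from the Pumpkin Seeds Dataset; the shuffled labels only imitate what a shuffled split leaves behind.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical data, not the real dataset).
x_small = np.array([[10.0, 0.1], [20.0, 0.2], [30.0, 0.3], [40.0, 0.4]])
# Non-default row labels, as a shuffled train/test split would leave them.
train_small = pd.DataFrame({'y': [0, 1, 0, 1]}, index=[3, 0, 5, 1])

cols = ['pr_comp_1', 'pr_comp_2']

# No index given: labels 0..3 are matched against labels 3, 0, 5, 1.
misaligned = pd.concat(
    [pd.DataFrame(x_small, columns=cols), train_small[['y']]], axis=1
)

# Index taken from train_small: rows pair up by position, as intended.
aligned = pd.concat(
    [pd.DataFrame(x_small, columns=cols, index=train_small.index),
     train_small[['y']]],
    axis=1,
)

# Purely positional selection on the raw array, for comparison.
expected = x_small[(train_small['y'] == 0).to_numpy(), 0]
from_misaligned = misaligned.loc[misaligned['y'] == 0, 'pr_comp_1'].to_numpy()
from_aligned = aligned.loc[aligned['y'] == 0, 'pr_comp_1'].to_numpy()

print(len(misaligned), len(aligned))  # the misaligned frame has extra rows
print(expected)
print(from_misaligned)  # wrong pairing, plus NaN for the unmatched label
print(from_aligned)
```

In the misaligned frame the row count grows to the union of both label sets, and the rows with `y == 0` pick up values belonging to other positions (or NaN), which is the same symptom as in the question.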

When both objects share the same default index, as in this synthetic example, the two commands give the same result:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2023)
x_pcaed = rng.random((2250, 2))
# seeds_train gets the default index 0..2249 here, the same as pca_df
seeds_train = pd.DataFrame({'y': rng.choice([0, 1], 2250)})
pca_df = pd.DataFrame(data=x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])
test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)
r1 = test1.loc[test1['y'] == 0, 'pr_comp_1'].values
r2 = x_pcaed[seeds_train.y == 0, 0]

Test:

>>> np.all(r1 == r2)
True
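
If the original row labels are not needed afterwards, one alternative to passing `index=` is to drop the labels before concatenating, with a quick check of whether the two indexes match. This is a sketch of that alternative on hypothetical toy data (`x_demo`, `train_demo`); it is not part of the answer's own fix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2023)
x_demo = rng.random((6, 2))  # hypothetical stand-in for x_pcaed
# Non-default labels, similar to what a shuffled split leaves behind.
train_demo = pd.DataFrame({'y': [0, 1, 1, 0, 0, 1]}, index=[4, 9, 1, 7, 0, 3])

pca_demo = pd.DataFrame(x_demo, columns=['pr_comp_1', 'pr_comp_2'])

# Cheap guard: concat(axis=1) pairs rows correctly only if the labels match.
labels_match = pca_demo.index.equals(train_demo.index)

# Discard the old labels so both frames are labelled 0..n-1.
y_reset = train_demo[['y']].reset_index(drop=True)
combined = pd.concat([pca_demo, y_reset], axis=1)

r1_demo = combined.loc[combined['y'] == 0, 'pr_comp_1'].to_numpy()
r2_demo = x_demo[(train_demo['y'] == 0).to_numpy(), 0]
print(labels_match, np.array_equal(r1_demo, r2_demo))
```

Resetting the index loses the link back to the rows of the original dataframe, so passing `index=seeds_train.index` is usually preferable when that link matters.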

Note: after creating `pca_df`, modifying `x_pcaed` in place can also modify the dataframe, because pandas may wrap the NumPy array without copying it (whether it does depends on the pandas version and on whether copy-on-write is enabled). The `test1` dataframe is different: `pd.concat` returns a copy, so its data are detached from the original array.
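
To avoid relying on version-specific behaviour, the copy can be requested explicitly. The sketch below uses a hypothetical array `arr` in place of `x_pcaed` and the `copy=True` argument of the `DataFrame` constructor; what the default-constructed frame prints is deliberately left open, since it depends on the installed pandas.

```python
import numpy as np
import pandas as pd

arr = np.zeros((3, 2))  # hypothetical stand-in for x_pcaed
cols = ['pr_comp_1', 'pr_comp_2']

# copy=True forces the DataFrame to own its data.
df_copy = pd.DataFrame(arr, columns=cols, copy=True)
# Default construction: may or may not share memory with arr,
# depending on the pandas version and copy-on-write settings.
df_default = pd.DataFrame(arr, columns=cols)

arr[0, 0] = 99.0  # mutate the source array afterwards

print(df_copy.iloc[0, 0])     # not affected by the mutation
print(df_default.iloc[0, 0])  # 99.0 if memory is shared, 0.0 otherwise
```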
