Why do these two commands yield different results?

Question

I have a numpy array x_pcaed of shape (2250, 2) and a pandas dataframe seeds_train with 2250 rows, which has a column y whose values are 0 or 1.
I made x_pcaed into a dataframe as follows:

pca_df = pd.DataFrame(data = x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])

I then created a new dataframe as follows:

test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)

I'm confused why the following two commands yield different outputs:

test1.loc[test1['y']==0,'pr_comp_1'].values
x_pcaed[seeds_train.y==0, 0]

My understanding is that they should give the same result, so I must be missing something.


Update: Here's the full code. It uses the Pumpkin Seeds Dataset (linked from the original post).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

seeds = pd.read_excel('Pumpkin_Seeds_Dataset.xlsx')
seeds['y'] = 0
seeds.loc[seeds.Class=='Ürgüp Sivrisi', 'y']=1

seeds_train, seeds_test = train_test_split(seeds.copy(),
                                              shuffle=True,
                                              random_state=123,
                                              test_size=.1,
                                              stratify=seeds.y.values)

features = seeds_train.columns[:-2]
x = seeds_train.loc[:,features].values
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
pca = PCA(n_components = 2)
x_pcaed = pca.fit_transform(x_scaled)
pca_df = pd.DataFrame(data = x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])
test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)

print(test1.loc[test1['y']==0,'pr_comp_1'].values)
print('----------')
print(x_pcaed[seeds_train.y==0, 0])

Output:

[-0.79984874 -2.75176272 -0.26329661 ... -2.03461928 -2.38149466
 -1.46663563]
----------
[-1.36392527 -0.26329661 -4.91873745 ... -1.46508442 -1.07096868
 -4.79462993]

Answer 1

Score: 1

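Update

You did not pass an index when creating pca_df, so it gets a default 0..2249 index, while seeds_train keeps the shuffled row labels it inherited from seeds through train_test_split. pd.concat with axis=1 aligns rows by index label rather than by position, so a row of pca_df generally ends up next to the y of a different row, and labels present in only one frame are filled with NaN. Pass the index explicitly: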

pca_df = pd.DataFrame(data=x_pcaed,
                      columns=['pr_comp_1', 'pr_comp_2'], 
                      index=seeds_train.index)

test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)

r1 = test1.loc[test1['y']==0, 'pr_comp_1'].values
r2 = x_pcaed[seeds_train.y==0, 0]
print(r1)
print(r2)

# Output
array([-1.36392527, -0.26329661, -4.91873745, ..., -1.46508442,
       -1.07096868, -4.79462993])
array([-1.36392527, -0.26329661, -4.91873745, ..., -1.46508442,
       -1.07096868, -4.79462993])
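
To see the alignment behaviour in isolation, here is a minimal sketch on made-up data. The names train, values and pc1 are hypothetical stand-ins for seeds_train, x_pcaed and pr_comp_1, and the shuffled labels only imitate what train_test_split would leave behind.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical data): 4 rows whose index labels are shuffled,
# as they would be after train_test_split(shuffle=True).
train = pd.DataFrame({"y": [0, 1, 0, 1]}, index=[7, 2, 9, 0])
values = np.array([[10.0], [20.0], [30.0], [40.0]])  # row i belongs to train.iloc[i]

# Default RangeIndex (0..3): concat aligns on labels, not on position,
# so the result is the union of both indexes with NaN in the gaps.
misaligned = pd.concat([pd.DataFrame(values, columns=["pc1"]), train[["y"]]], axis=1)
print(misaligned)

# Reusing the training index keeps each row paired with its own label.
aligned = pd.concat(
    [pd.DataFrame(values, columns=["pc1"], index=train.index), train[["y"]]], axis=1
)
print(aligned)
```

In the misaligned frame the row count grows from 4 to 6 and both columns contain NaN, which is a quick symptom to look for: comparing test1.shape with x_pcaed.shape would likely have exposed the problem in the original code.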

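If both frames have the default index, as in the setup described in the question before the update, there is no problem with your code:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2023)

# Synthetic stand-ins with the same shapes as in the question
x_pcaed = rng.random((2250, 2))
seeds_train = pd.DataFrame({'y': rng.choice([0, 1], 2250)})

pca_df = pd.DataFrame(data=x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])
test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)

r1 = test1.loc[test1['y']==0, 'pr_comp_1'].values
r2 = x_pcaed[seeds_train.y==0, 0]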

Test:

>>> np.all(r1 == r2)
True
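
The check above passes because both frames carry the same default index. When the label frame has shuffled labels instead, passing index= is one option; two alternatives are sketched below on small made-up data (the array contents and the index order are assumptions for illustration only): dropping the shuffled labels with reset_index(drop=True), or assigning the raw values so that no alignment happens at all.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical stand-ins for x_pcaed and seeds_train, with a shuffled index.
x_pcaed = rng.random((6, 2))
seeds_train = pd.DataFrame({"y": [0, 1, 0, 1, 1, 0]}, index=[4, 0, 5, 2, 1, 3])

pca_df = pd.DataFrame(x_pcaed, columns=["pr_comp_1", "pr_comp_2"])

# Option A: discard the shuffled labels so both frames share 0..n-1.
test_a = pd.concat([pca_df, seeds_train[["y"]].reset_index(drop=True)], axis=1)

# Option B: assign the raw values, which bypasses index alignment entirely.
test_b = pca_df.assign(y=seeds_train["y"].to_numpy())

# Positional reference result computed directly on the NumPy array.
expected = x_pcaed[seeds_train["y"].to_numpy() == 0, 0]
print(np.array_equal(test_a.loc[test_a["y"] == 0, "pr_comp_1"].to_numpy(), expected))
print(np.array_equal(test_b.loc[test_b["y"] == 0, "pr_comp_1"].to_numpy(), expected))
```

Both options throw away the original row labels, so they fit best when those labels are not needed later; keeping index=seeds_train.index preserves the link back to the rows of seeds.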

Note: be careful after creating pca_df. Depending on the pandas version and its copy settings, the dataframe may share memory with x_pcaed rather than copy it, in which case modifying x_pcaed also modifies pca_df. It's different for test1, because pd.concat returns a copy, so its data are detached from the original array.
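
A small sketch of how to check this on your own installation, and how to opt out of any sharing. The array arr and the column names are made up; whether the first frame shares memory is deliberately printed rather than assumed, since it can differ between pandas versions and copy-on-write settings.

```python
import numpy as np
import pandas as pd

arr = np.zeros((3, 2))

# This frame may or may not share memory with `arr`, depending on the
# pandas version and its copy settings, so the result is only printed.
maybe_view = pd.DataFrame(arr, columns=["a", "b"])
print(np.shares_memory(arr, maybe_view.to_numpy()))

# copy=True gives the frame its own data regardless of the default.
independent = pd.DataFrame(arr, columns=["a", "b"], copy=True)

arr[0, 0] = 99.0  # mutate the source array after both frames exist

print(maybe_view.iloc[0, 0])   # reflects the change only if memory is shared
print(independent.iloc[0, 0])  # unaffected by the change to arr
```

Passing copy=True (or x_pcaed.copy()) costs one extra copy of the array, which is usually negligible at this size and removes the dependence on version-specific defaults.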

huangapple
  • Published on 2023-02-19 05:37:58
  • When reposting, please keep this link: https://go.coder-hub.com/75496544.html