Why do these two commands yield different results?
Question
I have a numpy array `x_pcaed` of shape (2250, 2) and a pandas dataframe `seeds_train` with 2250 rows, which has a column `y` whose values are 0 or 1.
I made x_pcaed into a dataframe as follows:
pca_df = pd.DataFrame(data = x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])
I then created a new dataframe as follows:
test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)
I'm confused why the following two commands yield different outputs:
test1.loc[test1['y']==0,'pr_comp_1'].values
x_pcaed[seeds_train.y==0, 0]
My understanding is that they should give the same result, so I must be missing something.
Update: Here's the full code. It uses the Pumpkin Seeds Dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
seeds = pd.read_excel('Pumpkin_Seeds_Dataset.xlsx')
seeds['y'] = 0
seeds.loc[seeds.Class=='Ürgüp Sivrisi', 'y']=1
seeds_train, seeds_test = train_test_split(seeds.copy(),
                                           shuffle=True,
                                           random_state=123,
                                           test_size=.1,
                                           stratify=seeds.y.values)
features = seeds_train.columns[:-2]
x = seeds_train.loc[:,features].values
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
pca = PCA(n_components = 2)
x_pcaed = pca.fit_transform(x_scaled)
pca_df = pd.DataFrame(data = x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])
test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)
print(test1.loc[test1['y']==0,'pr_comp_1'].values)
print('----------')
print(x_pcaed[seeds_train.y==0, 0])
Output:
[-0.79984874 -2.75176272 -0.26329661 ... -2.03461928 -2.38149466
-1.46663563]
----------
[-1.36392527 -0.26329661 -4.91873745 ... -1.46508442 -1.07096868
-4.79462993]
Answer 1
Score: 1
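Update
You did not pass an `index` when you created `pca_df`, so it gets a default `RangeIndex` (0 to 2249), while `seeds_train` keeps the shuffled row labels left by `train_test_split`. `pd.concat` with `axis=1` aligns rows by index label rather than by position, so the rows of `pca_df` and `seeds_train` are not matched up. Use: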
pca_df = pd.DataFrame(data=x_pcaed,
columns=['pr_comp_1', 'pr_comp_2'],
index=seeds_train.index)
test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)
r1 = test1.loc[test1['y']==0, 'pr_comp_1'].values
r2 = x_pcaed[seeds_train.y==0, 0]
print(r1)
print(r2)
# Output
array([-1.36392527, -0.26329661, -4.91873745, ..., -1.46508442,
-1.07096868, -4.79462993])
array([-1.36392527, -0.26329661, -4.91873745, ..., -1.46508442,
-1.07096868, -4.79462993])
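To make the misalignment easier to see, here is a minimal sketch on toy data. The names `x_small` and `train_small` are hypothetical stand-ins for `x_pcaed` and `seeds_train`, and the index labels are made up to mimic a shuffled split; it is an illustration, not the exact situation in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for x_pcaed and seeds_train.
x_small = np.array([[10.0, 0.1], [20.0, 0.2], [30.0, 0.3], [40.0, 0.4]])
# Shuffled row labels, similar to what train_test_split leaves behind.
train_small = pd.DataFrame({"y": [0, 1, 0, 1]}, index=[7, 2, 9, 0])
cols = ["pr_comp_1", "pr_comp_2"]

# No index given: the frame gets labels 0..3, so concat matches rows by label.
misaligned = pd.concat([pd.DataFrame(x_small, columns=cols), train_small[["y"]]], axis=1)
print(misaligned.shape)         # more rows than the 4 we started with
print(misaligned.isna().sum())  # NaNs where a label exists in only one frame

# Fix 1: reuse the index of the other frame.
aligned = pd.concat(
    [pd.DataFrame(x_small, columns=cols, index=train_small.index), train_small[["y"]]],
    axis=1,
)

# Fix 2: drop the shuffled labels so both frames use 0..n-1.
aligned_reset = pd.concat(
    [pd.DataFrame(x_small, columns=cols), train_small[["y"]].reset_index(drop=True)],
    axis=1,
)

# Positional selection on the raw array, as in the second command of the question.
positional = x_small[train_small["y"].to_numpy() == 0, 0]
print(np.array_equal(aligned.loc[aligned["y"] == 0, "pr_comp_1"].to_numpy(), positional))
print(np.array_equal(aligned_reset.loc[aligned_reset["y"] == 0, "pr_comp_1"].to_numpy(), positional))
```

A quick check such as `test1.shape` or `test1.isna().sum()` on the real data should reveal the same symptom: extra rows and NaN values after the concat.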
Original answer: when both frames have the default index, there is no problem with your code:
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2023)
x_pcaed = rng.random((2250, 2))
seeds_train = pd.DataFrame({'y': rng.choice([0, 1], 2250)})
pca_df = pd.DataFrame(data=x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])
test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)
r1 = test1.loc[test1['y']==0, 'pr_comp_1'].values
r2 = x_pcaed[seeds_train.y==0, 0]
Test:
>>> np.all(r1 == r2)
True
Note: take care after creating `pca_df`. If you modify `x_pcaed`, your dataframe may be modified too, because pandas can wrap a numpy array without copying it (whether it does depends on the pandas version and its Copy-on-Write setting). The `test1` dataframe is different: `pd.concat` returns a copy, so its data are detached from the original array.
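If you want to check this on your own installation, the following sketch may help. `x_demo` is a hypothetical stand-in for `x_pcaed`. The first result is deliberately not stated, since it varies between pandas versions; passing `copy=True` to the constructor is one way to get an independent frame regardless.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for x_pcaed.
x_demo = np.arange(6, dtype=float).reshape(3, 2)
cols = ["pr_comp_1", "pr_comp_2"]

# Default constructor: may or may not share memory with x_demo,
# depending on the pandas version and Copy-on-Write settings.
df_default = pd.DataFrame(x_demo, columns=cols)
shares_default = np.shares_memory(x_demo, df_default.to_numpy())
print("default constructor shares memory:", shares_default)

# Explicit copy: the frame owns its data.
df_copied = pd.DataFrame(x_demo, columns=cols, copy=True)

# Modify the array after both frames were built.
x_demo[0, 0] = -999.0
print(df_copied.iloc[0, 0])  # still 0.0, unaffected by the change
```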