Why do these two commands yield different results?

Question


I have a NumPy array `x_pcaed` of shape (2250, 2) and a pandas DataFrame `seeds_train` with 2250 rows, which has a column `y` whose values are 0 or 1.

I converted `x_pcaed` into a DataFrame as follows:

    pca_df = pd.DataFrame(data=x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])

I then created a new DataFrame as follows:

    test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)

I'm confused about why the following two commands yield different outputs:

    test1.loc[test1['y']==0, 'pr_comp_1'].values
    x_pcaed[seeds_train.y==0, 0]

My understanding is that they should give the same result, so I must be missing something.


Update: Here's the full code. It uses the Pumpkin Seeds Dataset (`Pumpkin_Seeds_Dataset.xlsx`).

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    seeds = pd.read_excel('Pumpkin_Seeds_Dataset.xlsx')

    # Binary target: 1 for 'Ürgüp Sivrisi', 0 otherwise
    seeds['y'] = 0
    seeds.loc[seeds.Class=='Ürgüp Sivrisi', 'y']=1

    # Shuffled, stratified 90/10 split
    seeds_train, seeds_test = train_test_split(seeds.copy(),
                                               shuffle=True,
                                               random_state=123,
                                               test_size=.1,
                                               stratify=seeds.y.values)

    # Every column except the last two is used as a feature
    features = seeds_train.columns[:-2]
    x = seeds_train.loc[:,features].values

    # Standardize, then project onto two principal components
    scaler = StandardScaler()
    x_scaled = scaler.fit_transform(x)
    pca = PCA(n_components = 2)
    x_pcaed = pca.fit_transform(x_scaled)

    pca_df = pd.DataFrame(data = x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])
    test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)

    print(test1.loc[test1['y']==0,'pr_comp_1'].values)
    print('----------')
    print(x_pcaed[seeds_train.y==0, 0])

Output:

    [-0.79984874 -2.75176272 -0.26329661 ... -2.03461928 -2.38149466
     -1.46663563]
    ----------
    [-1.36392527 -0.26329661 -4.91873745 ... -1.46508442 -1.07096868
     -4.79462993]

Answer 1

Score: 1


Update

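You did not pass an index when creating `pca_df`, so it gets the default RangeIndex (0 to 2249), while `seeds_train` keeps the shuffled row labels from `train_test_split`. `pd.concat` with `axis=1` aligns rows on index labels, not on position, so the rows of `pca_df` and `seeds_train` are not matched up correctly. Pass the index explicitly: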

    pca_df = pd.DataFrame(data=x_pcaed,
                          columns=['pr_comp_1', 'pr_comp_2'],
                          index=seeds_train.index)
    test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)
    r1 = test1.loc[test1['y']==0, 'pr_comp_1'].values
    r2 = x_pcaed[seeds_train.y==0, 0]
    print(r1)
    print(r2)

    # Output
    array([-1.36392527, -0.26329661, -4.91873745, ..., -1.46508442,
           -1.07096868, -4.79462993])
    array([-1.36392527, -0.26329661, -4.91873745, ..., -1.46508442,
           -1.07096868, -4.79462993])
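
To see why the index matters, here is a minimal sketch with made-up data. The array values, the column names `c1`/`c2`, and the row labels 2, 0, 5 are purely illustrative and not taken from the dataset. A frame with the default RangeIndex and a frame with shuffled labels overlap only partially, so selecting on `y` after the concat picks different rows than positional masking does:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for x_pcaed and seeds_train.
x = np.array([[10.0, 0.1], [20.0, 0.2], [30.0, 0.3]])
labels = pd.DataFrame({'y': [0, 1, 0]}, index=[2, 0, 5])  # shuffled row labels

# Default RangeIndex (0, 1, 2) only partly overlaps with (2, 0, 5).
misaligned = pd.concat([pd.DataFrame(x, columns=['c1', 'c2']), labels], axis=1)
print(misaligned.shape)          # (4, 3): rows are the union of both indexes
print(misaligned.isna().sum())   # unmatched cells are filled with NaN
r_bad = misaligned.loc[misaligned['y'] == 0, 'c1'].to_numpy()

# Reusing the labels' index puts each row of x next to its own y.
aligned = pd.concat(
    [pd.DataFrame(x, columns=['c1', 'c2'], index=labels.index), labels],
    axis=1,
)
r1 = aligned.loc[aligned['y'] == 0, 'c1'].to_numpy()

# Positional masking on the raw array, as in the question.
r2 = x[labels['y'].to_numpy() == 0, 0]

print(np.array_equal(r_bad, r2))  # False: label alignment scrambled the rows
print(np.array_equal(r1, r2))     # True: same rows once the indexes match
```

A quick check on real data is to compare `test1.shape` with `pca_df.shape`: extra rows or NaN values after the concat suggest the indexes did not line up.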

Original answer (before the full code was posted): with synthetic data, where both objects have the default RangeIndex, your code gives identical results:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=2023)
    x_pcaed = rng.random((2250, 2))
    seeds_train = pd.DataFrame({'y': rng.choice([0, 1], 2250)})
    pca_df = pd.DataFrame(data=x_pcaed, columns=['pr_comp_1', 'pr_comp_2'])
    test1 = pd.concat([pca_df, seeds_train[['y']]], axis=1)
    r1 = test1.loc[test1['y']==0, 'pr_comp_1'].values
    r2 = x_pcaed[seeds_train.y==0, 0]

Test:

    >>> np.all(r1 == r2)
    True
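
The synthetic example agrees because both objects carry the default RangeIndex, so labels and positions coincide. A shuffled split keeps the original row labels instead. The sketch below uses `DataFrame.sample` as a stand-in for `train_test_split` (an assumption made only to keep the example free of extra dependencies; the frame `full` and the columns `feature` and `pc` are hypothetical). It also shows an alternative fix, which is to reset the index before concatenating:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows with a binary label.
rng = np.random.default_rng(seed=0)
full = pd.DataFrame({'feature': rng.random(100), 'y': rng.choice([0, 1], 100)})

# Shuffled 90% subset; the rows keep their original labels.
train = full.sample(frac=0.9, random_state=123)
print(train.index[:5].tolist())                 # no longer 0, 1, 2, ...
print(isinstance(train.index, pd.RangeIndex))   # False

# A positional array derived from the subset (stand-in for x_pcaed).
x = train[['feature']].to_numpy() * 2.0

# Alternative fix: drop the old labels so both frames use 0..n-1.
y_reset = train[['y']].reset_index(drop=True)
combined = pd.concat([pd.DataFrame(x, columns=['pc']), y_reset], axis=1)

r1 = combined.loc[combined['y'] == 0, 'pc'].to_numpy()
r2 = x[train['y'].to_numpy() == 0, 0]
print(np.array_equal(r1, r2))
```

Resetting the index discards the original row labels, so passing `index=seeds_train.index` is probably the better choice when you still need to trace rows back to the source data.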

Note: be careful after creating `pca_df`. If you modify `x_pcaed`, the DataFrame may be modified too, because the NumPy array is not copied into the DataFrame but shared with it (this can depend on the pandas version and its Copy-on-Write setting). `test1` is different: `pd.concat` returns a copy, so its data are detached from the original array.
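
Whether the DataFrame really shares memory with the array is easy to check rather than assume. The following is a small sketch with hypothetical names (`x`, `df`, `other`, `combined`). As far as I know, the result of the first check depends on the pandas version: with Copy-on-Write enabled the constructor copies the array, so treat the printed value as something to verify in your own environment.

```python
import numpy as np
import pandas as pd

# Hypothetical small array and frames.
x = np.zeros((3, 2))
df = pd.DataFrame(x, columns=['a', 'b'])
other = pd.DataFrame({'y': [0, 1, 0]})
combined = pd.concat([df, other], axis=1)

# Does the DataFrame still point at the array's memory?
# (Version dependent, so no expected value is given here.)
shares = np.shares_memory(x, df.to_numpy())
print('df shares memory with x:', shares)

# Modify the array in place and see which objects change.
x[0, 0] = 99.0
print(df.iloc[0, 0])        # changes only if the memory is shared
print(combined.iloc[0, 0])  # stays 0.0: the concat result holds its own data
```

If you want `pca_df` to be independent of later changes to `x_pcaed` regardless of version, passing `x_pcaed.copy()` to the constructor is a simple way to be explicit.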

huangapple
  • Published on February 19, 2023 at 05:37:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75496544.html