PCA on limited data in generative model latent space: what does almost 100% of variance captured mean?
Question
I am studying the latent space of a generative model, where the data in my latent space have a shape of (64, 64, 3). I would like to visualize a subset of this data, say n=5, in a 2D plot. To achieve this, I have reshaped the data to have a shape of (5, 12288) and used PCA to reduce it to the first 2 principal components, which I then plot using matplotlib.
However, I am uncertain about the amount of variance captured by the PCA. When I check, it shows that more than 99% of the variance is captured. I think this might be due to the small number of samples I used: with n = 5 samples there are at most 5 singular values (and, since PCA centers the data, at most 4 non-zero principal components). Is my understanding correct? Does this mean that the variance captured by the PCA is not meaningful for the full latent space?
Here is the code I used to reshape my data, reduce it with PCA, and check the captured variance:
import numpy as np
from sklearn.decomposition import PCA
def matrix_to_point(A):
# Convert a matrix to a point by flattening it
return A.reshape(-1)
n = 5
latent_sample = np.random.rand(n, *(64, 64, 3))
data = np.asarray([matrix_to_point(m) for m in latent_sample])
pca = PCA(n_components=2)
pca.fit(data)
reduced_data = pca.transform(data)
print(f'Variance captured by the PCA: {pca.explained_variance_ratio_}')
#output with the posted code: Variance captured by the PCA: [0.25629761 0.25391076]
#output with the complete code: Variance captured by the PCA: [0.96852827 0.03129395]
In this code, I substituted the actual latent sample with random samples to make it executable. Thank you in advance for your assistance.
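As a sanity check of the rank argument above, here is a minimal sketch (again using random stand-in data) showing that with n = 5 samples the mean-centered data matrix has rank at most n - 1 = 4, so 4 components already account for all of the variance:
import numpy as np
from sklearn.decomposition import PCA
n = 5
data = np.random.rand(n, 64 * 64 * 3)
# With n samples, the centered data matrix has rank at most n - 1 = 4,
# so no more than 4 components can carry non-zero variance.
pca = PCA(n_components=4)
pca.fit(data)
print(pca.explained_variance_ratio_)        # 4 non-zero ratios
print(pca.explained_variance_ratio_.sum())  # sums to ~1.0
Since uniform random data is nearly isotropic, each of the 4 components explains roughly a quarter of the variance, which is why the posted code prints ratios close to 0.25.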
Answer 1
Score: 1
I would try judging the quality of the PCA by inverse-transforming the reduced data and evaluating the result. Here I used the RMSE, but you can use another metric if that suits your use case better:
import numpy as np
from sklearn.decomposition import PCA
def matrix_to_point(A):
# Convert a matrix to a point by flattening it
return A.reshape(-1)
n = 5
latent_sample = np.random.rand(n, *(64, 64, 3))
data = np.asarray([matrix_to_point(m) for m in latent_sample])
pca = PCA(n_components=2)
pca.fit(data)
reduced_data = pca.transform(data)
print(f'Variance captured by the PCA: {pca.explained_variance_ratio_}')
expanded_data = pca.inverse_transform(reduced_data)
# RMSE: square root of the mean of the squared reconstruction errors
rmse = np.sqrt(np.mean((expanded_data - data)**2))
print(f'Root mean square error: {rmse}')
If your data actually lies in a two-dimensional subspace of the entire space, the fit will be very good and the RMSE will be very small.
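As a quick illustration of that last point, here is a minimal sketch (with synthetic data constructed just for this example) in which the samples lie exactly in a two-dimensional subspace; the 2-component PCA then reconstructs them almost perfectly and the RMSE comes out at essentially zero:
import numpy as np
from sklearn.decomposition import PCA
# Construct 5 samples that lie exactly in a 2D subspace of the 12288-dim space
rng = np.random.default_rng(0)
basis = rng.random((2, 64 * 64 * 3))  # two spanning directions
coeffs = rng.random((5, 2))           # coordinates within the subspace
data = coeffs @ basis
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
expanded_data = pca.inverse_transform(reduced_data)
rmse = np.sqrt(np.mean((expanded_data - data)**2))
print(f'Root mean square error: {rmse}')  # on the order of 1e-16, i.e. essentially zero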