NVidia Rapids: Non-Euclidean metric in cuml UMAP
Question
I am trying to use a GPU (A100) to speed up UMAP. The Euclidean metric does not work for my data at all, but correlation/cosine look promising. However, the code below seems to produce only Euclidean-based results on the GPU, while the same metrics work well on the CPU.
Tools:
cuml 23.04.01 cuda11_py310_230421_g958186d07_0 rapidsai
libcuml 23.04.01 cuda11_230421_g958186d07_0 rapidsai
libcumlprims 23.04.00 cuda11_230412_g7502d8e_0 nvidia
python 3.10.11 he550d4f_0_cpython conda-forge
Relevant code:
    import os

    def umap_cpu(ip_mat, n_components, n_neighbors, metric):
        # CPU reference: umap-learn on standardized features
        import umap
        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        ip_std = scaler.fit_transform(ip_mat)
        reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
        umap_embed = reducer.fit_transform(ip_std)
        return umap_embed

    def umap_gpu(ip_mat, n_components, n_neighbors, metric):
        # Select the GPU before importing cuml so the settings take effect
        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"
        from cuml.manifold import UMAP
        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        ip_std = scaler.fit_transform(ip_mat)
        # metric selects the distance used for the KNN graph
        reducer = UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
        umap_embed = reducer.fit_transform(ip_std)
        return umap_embed
Using help, I noticed that other metrics are listed as supported. However, I found an older discussion post that said otherwise:
> PR will allow the metric for the input KNN graph to be changed but the
> only supported target metrics currently remain to be categorical and
> Euclidean. We can support different target metrics (and we have issue
> open to support them) but they will require a slightly different
> objective function in the SGD. I do believe there's an error in the
> throwing of the Python exception (pointed out in this issue)
I would like to know whether support for other metrics has since been implemented, or whether the help output shows wrong information.
> metric : string (default='euclidean').
> Distance metric to use. Supported distances are ['l1', 'cityblock', 'taxicab',
> 'manhattan', 'euclidean', 'l2', 'sqeuclidean', 'canberra', 'minkowski', 'chebyshev', 'linf', 'cosine', 'correlation', 'hellinger', 'hamming', 'jaccard']. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary.
TIA
Answer 1
Score: 2
The metric argument of cuml UMAP specifies the distance metric used to build the KNN graph. It supports many distance metrics.
There is another argument, target_metric, that supports only Euclidean and categorical.
From your question, it seems that support for more target_metric options is actually what you are looking for.
Feel free to show your interest in adding this feature on this GitHub issue.
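To make the distinction between the two arguments concrete, here is a minimal sketch. The helper names umap_params and make_gpu_umap are hypothetical and not part of cuml. The metric list is copied from the docstring quoted in the question, the two target_metric values are the ones named above, and the 'categorical' default is an assumption about the installed version.

```python
# Metrics listed in the cuml UMAP docstring quoted in the question (KNN graph).
KNN_METRICS = frozenset({
    "l1", "cityblock", "taxicab", "manhattan", "euclidean", "l2",
    "sqeuclidean", "canberra", "minkowski", "chebyshev", "linf",
    "cosine", "correlation", "hellinger", "hamming", "jaccard",
})

# Per the answer, target_metric accepts only these two values.
TARGET_METRICS = frozenset({"categorical", "euclidean"})


def umap_params(metric="euclidean", target_metric="categorical", **extra):
    """Hypothetical helper: validate and assemble keyword arguments for UMAP.

    The 'categorical' default for target_metric is an assumption.
    """
    if metric not in KNN_METRICS:
        raise ValueError(f"unsupported KNN metric: {metric!r}")
    if target_metric not in TARGET_METRICS:
        raise ValueError(f"unsupported target_metric: {target_metric!r}")
    return {"metric": metric, "target_metric": target_metric, **extra}


def make_gpu_umap(**params):
    """Hypothetical helper: build a cuml UMAP estimator.

    Needs a RAPIDS install and a CUDA GPU; cuml is imported lazily so that
    umap_params can be used and checked on a machine without cuml.
    """
    from cuml.manifold import UMAP
    return UMAP(**params)
```

With this sketch, make_gpu_umap(**umap_params("cosine", n_components=2, n_neighbors=15)) would request a cosine KNN graph, while umap_params("cosine", target_metric="cosine") raises ValueError, mirroring the restriction described in the answer.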