NVidia Rapids: Non-Euclidean metric in cuml UMAP

Question

I am trying to use a GPU (A100) to speed up UMAP. The Euclidean metric does not work for my data at all, but the correlation and cosine metrics look promising. However, the code below seems to produce only Euclidean-based results on the GPU, while the same metrics work well on the CPU.

Tools:

cuml                      23.04.01        cuda11_py310_230421_g958186d07_0    rapidsai
libcuml                   23.04.01        cuda11_230421_g958186d07_0          rapidsai
libcumlprims              23.04.00        cuda11_230412_g7502d8e_0            nvidia
python                    3.10.11         he550d4f_0_cpython                  conda-forge

Relevant code:

def umap_cpu(ip_mat, n_components, n_neighbors, metric):
    import umap
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    ip_std = scaler.fit_transform(ip_mat)

    reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
    umap_embed = reducer.fit_transform(ip_std)

    return umap_embed

def umap_gpu(ip_mat, n_components, n_neighbors, metric):
    import os

    # Select the GPU before cuml is imported, so the setting is in place
    # before CUDA is initialised.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    from cuml.manifold import UMAP
    from sklearn.preprocessing import StandardScaler

    # Standardise each feature (column) to zero mean and unit variance.
    scaler = StandardScaler()
    ip_std = scaler.fit_transform(ip_mat)

    reducer = UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
    umap_embed = reducer.fit_transform(ip_std)

    return umap_embed

Using help(), I noticed that other metrics are listed as supported. However, I found an old discussion post that says otherwise:

> PR will allow the metric for the input KNN graph to be changed but the
> only supported target metrics currently remain to be categorical and
> Euclidean. We can support different target metrics (and we have issue
> open to support them) but they will require a slightly different
> objective function in the SGD. I do believe there's an error in the
> throwing of the Python exception (pointed out in this issue)

I would like to know whether other metrics have since been implemented, or whether the help text shows wrong information.

> metric : string (default='euclidean').
> Distance metric to use. Supported distances are ['l1', 'cityblock', 'taxicab',
> 'manhattan', 'euclidean', 'l2', 'sqeuclidean', 'canberra', 'minkowski',
> 'chebyshev', 'linf', 'cosine', 'correlation', 'hellinger', 'hamming', 'jaccard'].
> Metrics that take arguments (such as minkowski) can have arguments passed via
> the metric_kwds dictionary.

TIA

Answer 1

Score: 2

The metric argument of cuml UMAP specifies the distance metric used to build the KNN graph. It supports many distance metrics.

There is another argument, target_metric, which supports only euclidean and categorical.
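To make the distinction concrete, here is a minimal sketch of how the two arguments might be passed. The helper names build_umap_kwargs and umap_gpu_embed are hypothetical, introduced only for illustration. The comment on target_metric reflects my understanding that it matters only for supervised fits (when labels y are passed), so please verify it against the docstring of your cuml version.

```python
def build_umap_kwargs(n_components, n_neighbors, metric="cosine"):
    """Collect constructor arguments for cuml.manifold.UMAP (hypothetical helper)."""
    return {
        "n_components": n_components,
        "n_neighbors": n_neighbors,
        # Distance used to build the KNN graph of the input data.
        "metric": metric,
        # Distance on the labels; assumed to matter only when y is
        # passed to fit()/fit_transform().
        "target_metric": "categorical",
    }


def umap_gpu_embed(X, y=None, **kwargs):
    """Fit cuml UMAP; needs a RAPIDS install and a CUDA-capable GPU."""
    # Imported lazily so build_umap_kwargs() can be used without a GPU.
    from cuml.manifold import UMAP

    reducer = UMAP(**build_umap_kwargs(**kwargs))
    return reducer.fit_transform(X, y=y)
```

A call such as umap_gpu_embed(ip_std, n_components=2, n_neighbors=15, metric="correlation") would then change only the KNN-graph metric and leave target_metric at its categorical setting.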

From your question, it seems that support for more target_metric options is actually what you are looking for.
If you would like this feature added, feel free to show your interest on this GitHub issue.
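If it is unclear whether the GPU run honours metric at all, one possible check (a sketch of mine, not something proposed in the answer above) relies on a standard identity. For rows scaled to unit length, the squared Euclidean distance equals 2 − 2·cos, so Euclidean and cosine distances rank neighbours identically. The pure-Python snippet below illustrates this on a made-up four-point dataset; all function names are hypothetical. Note that StandardScaler standardises columns, not rows, so it does not perform this normalisation.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_normalise(row):
    """Scale a row to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]


def neighbour_order(data, query_idx, dist):
    """Indices of the other rows, sorted from nearest to farthest."""
    others = [i for i in range(len(data)) if i != query_idx]
    return sorted(others, key=lambda i: dist(data[query_idx], data[i]))


# Made-up toy data: row 1 points in nearly the same direction as row 0
# but is far away in Euclidean terms.
rows = [
    [1.0, 0.0],
    [10.0, 1.0],
    [0.0, 1.5],
    [1.0, 1.0],
]
unit_rows = [l2_normalise(r) for r in rows]

raw_euclidean = neighbour_order(rows, 0, euclidean)
cosine_order = neighbour_order(rows, 0, cosine_distance)
unit_euclidean = neighbour_order(unit_rows, 0, euclidean)

print(raw_euclidean)   # ranking under plain Euclidean distance
print(cosine_order)    # ranking under cosine distance
print(unit_euclidean)  # Euclidean ranking after row normalisation
```

On this toy data the cosine ranking differs from the raw Euclidean ranking but matches the Euclidean ranking of the normalised rows. Under that assumption, fitting with metric='euclidean' on row-normalised input and comparing it with metric='cosine' on the original input may help reveal whether the metric is taking effect. Centring each row before normalising gives the analogous check for correlation.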

huangapple
  • Posted on June 22, 2023, 02:13:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76526082.html