Scipy clustering; use the physics Minkowski metric?

Question

So this morning I learnt that "the Minkowski metric" does not always mean the relativistic metric used in physics (the original post showed it here as an image). See Wolfram for details.

Apparently in scipy it is just a p-norm. Scipy has an option to weight the p-norm, but only with positive weights, so that cannot achieve the relativistic Minkowski metric.

I would like to do hierarchical clustering on points in relativistic 4-dimensional space. For two points:

a = [a_time, a_x, a_y, a_z]

b = [b_time, b_x, b_y, b_z]

The distance between them should be:

invariant_s(a, b) = sqrt(-(a_time-b_time)^2 + (a_x-b_x)^2 + (a_y-b_y)^2 + (a_z-b_z)^2)

I'm working in Python, ideally using scipy's fcluster. Before I go and write my own clustering, is there any way to get this metric in fcluster? Can I add to the list of available metrics?

Edit: it appears that only fclusterdata supports metrics in the first place.

Answer 1

Score: 1

The bad news is that the built-in metrics (and especially the one named Minkowski) indeed don't support negative weights. I suspect the reason is that a proper metric must satisfy d(x, y) = 0 if and only if x = y, and the relativistic Minkowski metric violates this: any two light-like separated points are at zero distance from each other. This is probably why none of the weighted metrics in scipy support negative weights; see also the remarks in this github thread.
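To make the violated axiom concrete, here is a minimal sketch that evaluates the squared interval from the question for two hand-picked pairs. The helper name squared_interval and the sample points are illustrative assumptions, not part of scipy.

```python
import numpy as np

# Signature (-, +, +, +), matching the formula in the question.
ETA = np.array([-1.0, 1.0, 1.0, 1.0])

def squared_interval(a, b):
    """Squared Minkowski interval between two 4-vectors [t, x, y, z]."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.dot(ETA * d, d))

origin = [0.0, 0.0, 0.0, 0.0]
on_light_cone = [1.0, 1.0, 0.0, 0.0]  # light-like separated from the origin
time_like = [2.0, 1.0, 0.0, 0.0]      # time separation exceeds space separation

print(squared_interval(origin, on_light_cone))  # 0.0 although the points differ
print(squared_interval(origin, time_like))      # -3.0, so the square root is not real
```

The first pair shows two distinct points at zero "distance"; the second shows why a square root of the interval cannot be taken for time-like pairs.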

The good news is that the documentation of scipy.cluster.hierarchy.fclusterdata was buggy (now fixed in master), because it claimed

metric: str, optional

    The distance metric for calculating pairwise distances.
    See distance.pdist for descriptions and linkage to verify
    compatibility with the linkage method.

Whereas the actual implementation of fclusterdata simply passes the metric input parameter along to pdist, which allows custom callables to be passed as metric:

metric: str or function, optional

Sure enough, we can define our own Minkowski metric function and pass it on to fclusterdata, but we have to make sure that all pairs of points are space-like separated. Otherwise the squared interval is negative and the distance would be imaginary: np.sqrt returns nan for a negative input, and fclusterdata fails loudly (complaining about "finite" data, because nan fails the np.isfinite check in linkage). With this reasonable caveat something like the following works:

import numpy as np
from scipy.cluster.hierarchy import fclusterdata
from numpy.random import default_rng  # only for dummy data

# generate random data, use new random machinery for best practices
N = 10
rng = default_rng()
X = rng.random((N, 4)) * [0.01, 1, 1, 1]  # tiny time spread to make the pairs space-like

def physical_minkowski(v1, v2):
    """Return the Minkowski interval between two 4-vectors, signature -+++"""
    d = v1 - v2  # the interval depends on the difference of the two points
    return np.sqrt(([-1, 1, 1, 1] * d).dot(d))

fclusterdata(X, t=1, metric=physical_minkowski)
# returns an integer array with one flat-cluster label per point
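The space-like requirement mentioned above can be checked before any clustering is attempted. The following is a sketch only; the function name all_pairs_space_like and the sample arrays are hypothetical, and the broadcasting approach assumes N is small enough that an (N, N, 4) array fits in memory.

```python
import numpy as np

ETA = np.array([-1.0, 1.0, 1.0, 1.0])  # signature (-, +, +, +)

def all_pairs_space_like(X):
    """Return True if every pair of distinct rows of X has a positive squared interval."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]                # shape (N, N, 4)
    s2 = np.einsum("ijk,k,ijk->ij", diffs, ETA, diffs)   # squared interval per pair
    upper = s2[np.triu_indices(len(X), k=1)]             # each unordered pair once
    return bool(np.all(upper > 0))

good = np.array([[0.0, 0.0, 0.0, 0.0],
                 [0.1, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 2.0, 0.0]])
bad = np.array([[0.0, 0.0, 0.0, 0.0],
                [1.0, 0.5, 0.0, 0.0]])  # this pair is time-like

print(all_pairs_space_like(good))  # True
print(all_pairs_space_like(bad))   # False
```

Running such a check first turns the opaque "finite data" failure into an explicit statement about which inputs are unsuitable.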

Since the above function might get called many times, it can make sense to compile it with numba.njit for improved performance. It only needs a small change to make that possible:

import numba
import numpy as np

@numba.njit
def jitted_minkowski(v1, v2):
    d = v1 - v2  # the interval depends on the difference of the two points
    return np.sqrt((np.array([-1.0, 1.0, 1.0, 1.0]) * d).dot(d))

I timed both of the above metric functions using IPython's built-in %timeit magic with N = 1000 for a reasonable comparison:

>>> %timeit scipy.spatial.distance.pdist(X, metric=physical_minkowski)
... %timeit scipy.spatial.distance.pdist(X, metric=jitted_minkowski)
2.2 s ± 90.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
385 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This means that for larger sets of 4-vectors the JIT-compiled version is more than 5 times faster (2.2 s versus 385 ms here), and compilation only has to be done once (you can even cache the compiled function on disk, for example with numba's cache=True option, so that you don't have to compile it each time you run your script).
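As an alternative to a per-pair callable, the intervals can be computed in one vectorized pass and handed to linkage as a condensed distance vector, after which fcluster (as asked in the question) applies directly. This is a sketch under assumptions: the helper condensed_intervals, the seed, the single-linkage method and the threshold t=0.3 are illustrative choices, not something the original answer timed or recommended.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

ETA = np.array([-1.0, 1.0, 1.0, 1.0])  # signature (-, +, +, +)

def condensed_intervals(X):
    """Condensed (pdist-style) vector of Minkowski intervals for the rows of X."""
    X = np.asarray(X, dtype=float)
    i, j = np.triu_indices(len(X), k=1)  # same pair order as pdist
    d = X[i] - X[j]
    s2 = (d * d) @ ETA                   # squared interval per pair
    if np.any(s2 < 0):
        raise ValueError("some pairs are time-like; the interval is not real")
    return np.sqrt(s2)

rng = np.random.default_rng(0)
X = rng.random((20, 4)) * [0.01, 1, 1, 1]  # tiny time spread, as in the answer

Z = linkage(condensed_intervals(X), method="single")
labels = fcluster(Z, t=0.3, criterion="distance")
print(labels.shape)  # (20,)
```

The explicit ValueError replaces the nan that np.sqrt would otherwise produce for time-like pairs.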

huangapple
  • Posted on January 3, 2020 at 19:02:29
  • Please keep the link to this article when reposting: https://go.coder-hub.com/59577418.html