Scipy clustering; use the physics Minkowski metric?
Question
So this morning I learnt that "Minkowski metric" does not always mean the same thing; see Wolfram for details.
Apparently in scipy it is just a p-norm. Scipy has an option to weight the p-norm, but only with positive weights, so that cannot achieve the relativistic Minkowski metric.
I would like to do hierarchical clustering on points in relativistic 4-dimensional space. For two points:
    a = [a_time, a_x, a_y, a_z]
    b = [b_time, b_x, b_y, b_z]
the distance between them should be:
    invariant_s(a, b) = sqrt(-(a_time - b_time)^2 + (a_x - b_x)^2 + (a_y - b_y)^2 + (a_z - b_z)^2)
I'm working in Python, ideally using scipy's fcluster. Before I go and write my own clustering, is there any way to get this metric into fcluster? Can I add to the list of available metrics?
Edit: it appears that only fclusterdata supports metrics in the first place.
Answer 1
Score: 1
The bad news is that indeed the built-in metrics (and especially the one named Minkowski) don't support negative weights. I suspect the reason for this is that a proper metric can only have d(x, y) = 0 if and only if x = y, which is violated by the Minkowski metric. This is probably the reason for the lack of support for negative weights in any of the weighted metrics in scipy; see also the remarks in this github thread.
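The violation is easy to exhibit: two distinct null-separated events (lying on the same light ray) have zero interval, so d(x, y) = 0 does not imply x = y. A minimal sketch, not part of the original answer:

```python
import numpy as np

signature = np.array([-1, 1, 1, 1])  # -+++ signature

a = np.array([0.0, 0.0, 0.0, 0.0])  # event at the origin
b = np.array([1.0, 1.0, 0.0, 0.0])  # event one light-travel-time away along x

diff = a - b
interval_squared = (signature * diff).dot(diff)
print(interval_squared)  # -> 0.0, even though a != b
```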
The good news is that the documentation of scipy.cluster.hierarchy.fclusterdata is buggy (now fixed in master), because it claimed

    metric: str, optional
        The distance metric for calculating pairwise distances.
        See distance.pdist for descriptions and linkage to verify
        compatibility with the linkage method.

whereas the actual implementation of fclusterdata simply passes the metric input parameter along to pdist, which allows custom callables to be passed as metric:

    metric: str or function, optional
Sure enough, we can define our own Minkowski metric function and pass it on to fclusterdata, but we have to make sure that all the points are spatially separated; otherwise we get complex distances and pdist will loudly fail (complaining about "finite" data, because np.sqrt returns nan when given a negative number, and nan fails the np.isfinite check in linkage). With this reasonable caveat, something like the following works:
    import numpy as np
    from scipy.cluster.hierarchy import fclusterdata
    from numpy.random import default_rng  # only for dummy data

    # generate random data; use the new random machinery as best practice
    N = 10
    rng = default_rng()
    X = rng.random((N, 4)) * [0.01, 1, 1, 1]  # make them all space-like separated

    def physical_minkowski(v1, v2):
        """Return the proper Minkowski interval of v1 - v2 with signature -+++."""
        d = v1 - v2  # the invariant interval is computed on the difference 4-vector
        return np.sqrt(([-1, 1, 1, 1] * d).dot(d))

    fclusterdata(X, t=1, metric=physical_minkowski)
    # returns the uninteresting array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
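To see the caveat fail loudly, feed pdist a time-like separated pair: the squared interval is negative, np.sqrt yields nan, and linkage rejects the result. A sketch using a hypothetical helper interval_metric (the invariant interval of the difference 4-vector):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

def interval_metric(v1, v2):
    # hypothetical helper: invariant interval of the difference, -+++ signature
    d = v1 - v2
    return np.sqrt(([-1, 1, 1, 1] * d).dot(d))

# time difference (2) exceeds spatial difference (1): time-like separation
X_bad = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
])

dists = pdist(X_bad, metric=interval_metric)
print(dists)  # [nan]: the squared interval is -4 + 1 = -3

try:
    linkage(dists)
except ValueError as err:
    print(err)  # linkage's finiteness check rejects the nan
```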
Since the above function might get called many times, it can make sense to compile it with numba.njit for improved performance. Only a small change is needed to make that possible:

    import numba
    import numpy as np

    @numba.njit
    def jitted_minkowski(v1, v2):
        d = v1 - v2
        return np.sqrt((np.array([-1.0, 1.0, 1.0, 1.0]) * d).dot(d))
I timed both of the above metric functions using IPython's built-in %timeit magic with N = 1000 for a reasonable comparison:

    >>> %timeit scipy.spatial.distance.pdist(X, metric=physical_minkowski)
    ... %timeit scipy.spatial.distance.pdist(X, metric=jitted_minkowski)
    2.2 s ± 90.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    385 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This means that for larger sets of 4-vectors the JIT-compiled version is 5 times faster, and compilation only has to be done once (you can even cache the compiled function on disk so that you don't have to compile it each time you run your script).