How does multiscale feature matching work? ORB, SIFT, etc.


Question


When reading about classic computer vision, I am confused about how multiscale feature matching works.

Suppose we use an image pyramid:

  1. How do you deal with the same feature being detected at multiple scales? How do you decide which scale to create a descriptor for?

  2. How do you connect features between scales? For example, say a feature is detected and matched to a descriptor at scale 0.5. Is this location then translated to its location at the initial scale?

Answer 1

Score: 2


I can share something about SIFT that might answer question (1) for you.
I'm not really sure what you mean in your question (2) though, so please clarify?


SIFT (Scale-Invariant Feature Transform) was designed specifically to find features that remain identifiable across different image scales, rotations, and transformations.

When you run SIFT on an image of some object (e.g. a car), SIFT will try to create the same descriptor for the same feature (e.g. the license plate), no matter what image transformation you apply.

Ideally, SIFT will only produce a single descriptor for each feature in an image.
However, this obviously doesn't always happen in practice, as you can see in an OpenCV example here:

[image: SIFT keypoints drawn by OpenCV as circles of different sizes, many of them overlapping]

OpenCV illustrates each SIFT descriptor as a circle of different size. You can see many cases where the circles overlap. I assume this is what you meant in question (1) by "the same feature being detected at multiple scales".
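
If you want to reproduce that kind of picture yourself, something along these lines should do it. This is only a minimal sketch, assuming a recent opencv-python build where `cv2.SIFT_create` is available; the file names are placeholders:

```python
import cv2

# Load any test image; "car.jpg" is just a placeholder name.
img = cv2.imread("car.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints = sift.detect(img, None)

# DRAW_RICH_KEYPOINTS draws each keypoint as a circle whose radius reflects
# the scale it was detected at; the overlapping circles are the near-duplicate
# detections discussed above.
vis = cv2.drawKeypoints(
    img, keypoints, None,
    flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", vis)

# Each keypoint also records which pyramid octave and scale it came from.
for kp in keypoints[:5]:
    print(kp.pt, kp.size, kp.octave)
```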

And to my knowledge, SIFT doesn't really care about this issue. If by scaling the image enough you end up creating multiple descriptors from "the same feature", then those are distinct descriptors to SIFT.

During descriptor matching, you simply brute-force compare your list of descriptors, regardless of which scale each one was generated from, and try to find the closest match.
The whole point of SIFT as a function is to take in some image feature under different transformations and produce a similar numerical output at the end.

So if you do end up with multiple descriptors of the same feature, you'll just end up doing more computational work, but you will still essentially match the same pairs of features across the two images regardless.
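
To make that concrete, here is a rough sketch of that brute-force matching step in OpenCV. The image paths and the 0.75 ratio threshold are placeholder choices, not anything SIFT mandates:

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("train.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Compare every descriptor against every descriptor of the other image,
# no matter which pyramid level each one came from.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(des1, des2, k=2):
    # Lowe's ratio test: keep a match only if it is clearly better than the
    # second-best candidate.
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

print(f"{len(good)} matches survived the ratio test")
```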


Edit:

If you are asking about how to convert coordinates from the scaled images in the image pyramid back into original image coordinates, then David Lowe's SIFT paper dedicates section 4 on that topic.

The naive approach would be to simply calculate the ratios of the scaled coordinates vs the scaled image dimensions, then extrapolate back to the original image coordinates and dimensions. However, this is inaccurate, and becomes increasingly so as you scale down an image.

> Example: You start with a 1000x1000 pixel image, where a feature is located at coordinates (123,456). If you had scaled down the image to 100x100 pixel, then the scaled keypoint coordinate would be something like (12,46). Extrapolating back to the original coordinates naively would give the coordinates (120,460).
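
Spelling out the arithmetic from that example (the numbers are just the ones above):

```python
# Naive back-projection: the rounding that happens in the downscaled image
# gets multiplied back up by the scale factor.
orig_size, small_size = 1000, 100
scale = orig_size / small_size                    # 10x downscaling

small_x = round(123 / scale)                      # -> 12
small_y = round(456 / scale)                      # -> 46

recovered = (int(small_x * scale), int(small_y * scale))
print(recovered)  # (120, 460), off by (3, 4) pixels from the true (123, 456)
```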

So SIFT fits a Taylor expansion of the Difference-of-Gaussian function to try and locate the original interesting keypoint down to sub-pixel accuracy, which you can then use to extrapolate back to the original image coordinates.

Unfortunately, the math for this part is quite beyond me. But if you are fluent in math and C programming, and want to know specifically how SIFT is implemented, I suggest you dive into Rob Hess' SIFT implementation; lines 467 through 648 are probably the most detailed reference you can get.
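
For what it's worth, the quadratic fit in section 4 of Lowe's paper reduces to solving a small 3x3 linear system (offset = -H^-1 * gradient, with the derivatives estimated by finite differences). Below is a bare numpy sketch of just that step; the function name and indexing convention are my own, not taken from Rob Hess' code:

```python
import numpy as np

def refine_extremum(dog, x, y, s):
    """Sub-pixel refinement sketch: 'dog' is a 3D array of Difference-of-
    Gaussian values indexed [scale, row, col]; (x, y, s) is the integer
    location of a detected extremum. Returns the (dx, dy, ds) offset of the
    peak of the fitted quadratic."""
    # Gradient of D at the sample point, by central differences (order: x, y, s).
    grad = np.array([
        (dog[s, y, x + 1] - dog[s, y, x - 1]) / 2.0,
        (dog[s, y + 1, x] - dog[s, y - 1, x]) / 2.0,
        (dog[s + 1, y, x] - dog[s - 1, y, x]) / 2.0,
    ])
    # Hessian of D, also by finite differences.
    dxx = dog[s, y, x + 1] - 2 * dog[s, y, x] + dog[s, y, x - 1]
    dyy = dog[s, y + 1, x] - 2 * dog[s, y, x] + dog[s, y - 1, x]
    dss = dog[s + 1, y, x] - 2 * dog[s, y, x] + dog[s - 1, y, x]
    dxy = (dog[s, y + 1, x + 1] - dog[s, y + 1, x - 1]
           - dog[s, y - 1, x + 1] + dog[s, y - 1, x - 1]) / 4.0
    dxs = (dog[s + 1, y, x + 1] - dog[s + 1, y, x - 1]
           - dog[s - 1, y, x + 1] + dog[s - 1, y, x - 1]) / 4.0
    dys = (dog[s + 1, y + 1, x] - dog[s + 1, y - 1, x]
           - dog[s - 1, y + 1, x] + dog[s - 1, y - 1, x]) / 4.0
    H = np.array([[dxx, dxy, dxs],
                  [dxy, dyy, dys],
                  [dxs, dys, dss]])
    # Peak of the fitted quadratic, as an offset from (x, y, s).
    # (Real implementations repeat the fit at a neighbouring sample if any
    # component of the offset exceeds 0.5.)
    return -np.linalg.solve(H, grad)
```

The refined (x + dx, y + dy) is still in the octave's own pixel grid; multiplying by that octave's scale factor (2**octave in the usual pyramid setup) takes it back to original-image coordinates.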
