Finding the difference between two arrays in PHP, Node and Golang

Question

Here is a typical example of what I need to do:

$testArr = array(2.05080E6,29400,420);

$stockArrays =  array(
                      array(2.05080E6,29400,0),
                      array(2.05080E6,9800,420),
                      array(1.715E6,24500,280),
                      array(2.05080E6,29400,140),
                      array(2.05080E6,4900,7));

I need to identify the stockArray that is the least different from testArr. A few clarifications:

  • The numeric values of array elements at each position are guaranteed not to overlap (i.e. arr[0] will always have the biggest values, arr[1] will be at least an order of magnitude smaller, etc.).

  • The absolute values of the differences do not count when determining least different. Only the number of differing array indices matters.

  • Positional differences do have a weighting. Thus in my example stockArr[1] is "more different" even though it, like its stockArr[0] & stockArr[3] counterparts, differs in only one index position, because the value at that index position is bigger.

  • The number of stockArrays elements will typically be less than 10 but could potentially be much more (though never into 3 figures).

  • The stock arrays will always have the same number of elements. The test array will have the same or fewer elements. However, when it has fewer, testArr is padded out so that potentially matching elements are always in the same place as in the stockArray, e.g.

    $testArray(29400,140)

would be transformed to

$testArray(0,29400,140);

prior to being subjected to difference testing.

  • Finally, a tie is possible. For instance, in my example above the matches would be stockArrays[0] and stockArrays[3].
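The padding rule described in the list above can be sketched in Go as follows. This is only a minimal illustration: padLeft is a hypothetical helper name, and it assumes the padding value is 0 and that padding always goes on the left, as in the $testArray example.

```go
package main

import "fmt"

// padLeft returns a copy of test that is left-padded with zeros to length n.
// If test already has n or more elements it is copied unchanged.
// padLeft is a hypothetical helper name, not something from the question.
func padLeft(test []float64, n int) []float64 {
	if len(test) >= n {
		return append([]float64(nil), test...)
	}
	padded := make([]float64, n) // make zero-fills the slice
	copy(padded[n-len(test):], test)
	return padded
}

func main() {
	fmt.Println(padLeft([]float64{29400, 140}, 3)) // prints [0 29400 140]
}
```

Padding once, before any comparison, keeps the comparison loop itself free of length checks.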

In my example the result would be

$result = array(0=>array(0,0,1),3=>array(0,0,1));

indicating that the least different stock arrays are at indices 0 & 3 with the differences being at position 2.
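A result of that shape, including ties, could be produced along the following lines in Go. This is a sketch under stated assumptions rather than a definitive implementation: diffMask and leastDifferent are hypothetical names, fewer differing positions is assumed to win outright, and the positional weighting is assumed to act only as a tie-breaker in which a difference at an earlier index counts as heavier.

```go
package main

import "fmt"

// diffMask returns a 0/1 slice marking the positions where stock differs
// from test, the number of differing positions, and a positional weight.
// The weight reads the mask as a binary number with index 0 as the most
// significant bit, so a difference at an earlier index weighs more.
// ASSUMPTION: both slices have the same length (test is already padded).
func diffMask(test, stock []float64) (mask []int, count, weight int) {
	mask = make([]int, len(test))
	for i := range test {
		weight <<= 1
		if stock[i] != test[i] {
			mask[i] = 1
			count++
			weight |= 1
		}
	}
	return mask, count, weight
}

// leastDifferent returns every stock index tied for least different,
// mapped to its difference mask.
// ASSUMPTION: fewer differing positions wins; weight only breaks count ties.
func leastDifferent(test []float64, stocks [][]float64) map[int][]int {
	result := map[int][]int{}
	bestCount, bestWeight := -1, -1
	for i, stock := range stocks {
		mask, count, weight := diffMask(test, stock)
		if bestCount < 0 || count < bestCount ||
			(count == bestCount && weight < bestWeight) {
			result = map[int][]int{} // a strictly better candidate discards earlier ties
			bestCount, bestWeight = count, weight
		}
		if count == bestCount && weight == bestWeight {
			result[i] = mask
		}
	}
	return result
}

func main() {
	test := []float64{2.05080e6, 29400, 420}
	stocks := [][]float64{
		{2.05080e6, 29400, 0},
		{2.05080e6, 9800, 420},
		{1.715e6, 24500, 280},
		{2.05080e6, 29400, 140},
		{2.05080e6, 4900, 7},
	}
	fmt.Println(leastDifferent(test, stocks)) // prints map[0:[0 0 1] 3:[0 0 1]]
}
```

The same two-function split (mask per array, then a single pass keeping the best candidates) should carry over to PHP and JavaScript with little change.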

In PHP I would handle all of this with array_diff as my starting point. For Node/JavaScript I would probably be tempted to use the php.js array_diff port, though I would be inclined to explore a bit given that in the worst case scenario it is an O(n^2) affair.

I am a newbie when it comes to Golang so I am not sure how I would implement this problem there. I have noted that Node does have an array_diff npm module.

One off-beat idea I have had is converting the array to a padded string (smaller array elements are 0 padded) and effectively doing an XOR on the ordinal value of each character, but I have dismissed that as probably a rather nutty thing to do.

I am concerned with speed but not at all costs. In an ideal world the same solution (algorithm) would be used in each target language though in reality the differences between them might mean that is not possible/not a good idea.

Perhaps someone here might be able to point me to less pedestrian ways of accomplishing this - i.e. not just array_diff ports.

Answer 1

Score: 1

Here's the equivalent of the array_diff solution (assuming I didn't make a mistake):

package main

import "fmt"

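// FindLeastDifferent returns the index of the first haystack entry that
// differs from needle in the fewest positions, or -1 if haystack is empty.
// It assumes every haystack entry is at least as long as needle.
func FindLeastDifferent(needle []float64, haystack [][]float64) int {
	if len(haystack) == 0 {
		return -1
	}
	var currentIndex, currentDiff int
	for i, arr := range haystack {
		// Count the positions at which this entry differs from needle.
		diff := 0
		for j := range needle {
			if arr[j] != needle[j] {
				diff++
			}
		}
		// Keep the first entry seen with the smallest count; ties are ignored.
		if i == 0 || diff < currentDiff {
			currentDiff = diff
			currentIndex = i
		}
	}

	return currentIndex
}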

func main() {
	idx := FindLeastDifferent(
		[]float64{2.05080E6, 29400, 420},
		[][]float64{
			{2.05080E6, 29400, 0},
			{2.05080E6, 9800, 420},
			{1.715E6, 24500, 280},
			{2.05080E6, 29400, 140},
			{2.05080E6, 4900, 7},
			{2.05080E6, 29400, 420},
		},
	)
	fmt.Println(idx)
}

Like you said, it's O(n * m), where n is the number of elements in the needle array and m is the number of arrays in the haystack.

If you don't know the haystack ahead of time, then there's probably not much you can do to improve this. But if, instead, you're storing this list in a database, I think your intuition about string search has some potential. PostgreSQL for example supports string similarity indexes. (And here's an explanation of a similar idea for regular expressions: http://swtch.com/~rsc/regexp/regexp4.html)
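If the haystack is known ahead of time but lives in memory rather than in a database, one possible analogue of an index is a lookup table built once and reused for every query. The Go sketch below is an assumption-laden illustration, not something the question or PostgreSQL prescribes: key, buildIndex and matchCounts are hypothetical names, and it counts matching positions, so the number of differing positions is len(test) minus the count.

```go
package main

import "fmt"

// key identifies a value at a particular position.
type key struct {
	pos int
	val float64
}

// buildIndex maps each (position, value) pair to the stock indices holding
// that value at that position. It is built once for a known haystack.
func buildIndex(stocks [][]float64) map[key][]int {
	idx := make(map[key][]int)
	for i, stock := range stocks {
		for pos, val := range stock {
			k := key{pos, val}
			idx[k] = append(idx[k], i)
		}
	}
	return idx
}

// matchCounts returns, for each stock index, how many positions equal test.
// ASSUMPTION: test is already padded to the stock array length.
func matchCounts(idx map[key][]int, test []float64, numStocks int) []int {
	counts := make([]int, numStocks)
	for pos, val := range test {
		for _, i := range idx[key{pos, val}] {
			counts[i]++
		}
	}
	return counts
}

func main() {
	stocks := [][]float64{
		{2.05080e6, 29400, 0},
		{2.05080e6, 9800, 420},
		{1.715e6, 24500, 280},
		{2.05080e6, 29400, 140},
		{2.05080e6, 4900, 7},
	}
	idx := buildIndex(stocks)
	fmt.Println(matchCounts(idx, []float64{2.05080e6, 29400, 420}, len(stocks))) // prints [2 2 0 2 1]
}
```

A query then costs roughly the number of matching (position, value) pairs instead of n * m, which only pays off when most stock arrays share few values with the test array; for the sizes in the question the plain loop is likely just as fast.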

One other idea: if your arrays are really big you can calculate fuzzy hashes (http://ssdeep.sourceforge.net/) which would make your n smaller.

huangapple
  • Posted on February 13, 2015 at 17:11:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/28495993.html