Fastest way to remove duplicates of a sorted Go slice

huangapple go评论87阅读模式
英文:

Fastest way to remove duplicates of a sorted Go slice

问题

mySlice := make([]uint32, 0, 4294967290)

// ...

// 对切片进行排序
sort.Slice(mySlice, func(i, j int) bool {
	x := mySlice[i]
	y := mySlice[j]
	return x < y
})

如何最快地删除切片中的重复元素?

如何利用切片已经排序的事实?

更新

我想到了以下解决方案:

func RemoveDuplicates(s []uint32) {
	if len(s) < 2 {
		return
	}
	tmp := make([]uint32, 0, len(s))

	for i := uint32(0); i < uint32(len(s))-1; i++ {
		// 如果当前元素不等于下一个元素,则将当前元素存储起来
		if s[i] != s[i+1] {
			tmp = append(tmp, s[i])
		}
	}

	// 最后一个元素必须存储
	// 注意,如果最后一个元素重复了,重复的元素不会被存储
	tmp = append(tmp, s[len(s)-1])

	// 修改原始切片
	s = nil
	s = append(s, tmp...)
}

有任何错误吗?有任何 bug 吗?有任何改进的方法吗?

更新

如 @mh-cbon 所指出的,循环的最大值应为 i < uint32(len(s)) - 1

for i := uint32(0); i < uint32(len(s)) - 1; i++ {
英文:
	mySlice := make([]uint32, 0, 4294967290)

// ...

        // Sort the slice
		sort.Slice(mySlice, func(i, j int) bool {
			x := mySlice[i]
			y := mySlice[j]
			return x &lt; y
		})

What's the fastest way to remove slice duplicates?

How can I take advantage of the fact that slice is already sorted?

Update

I came up with this:

func RemoveDuplicates(s []uint32) {
	if len(s) &lt; 2 {
		return
	}
	tmp := make([]uint32, 0, len(s))

	for i := uint32(0); i &lt; uint32(len(s)); i++ {
		// If current is not equal to next then store the current
		if s[i] != s[i+1] {
			tmp = append(tmp, s[i])
		}
	}

	// The last must be stored
	// Note that if it was repeated, the duplicates are NOT stored before
	tmp = append(tmp, s[len(s)-1])

	// Modify original slice
	s = nil
	s = append(s, tmp...)
}

Any mistake? Any bug? Any way to improve?

Update

As noted by @mh-cbon the correct loop max should be i &lt; uint32(len(s)) - 1:

for i := uint32(0); i &lt; uint32(len(s)) - 1; i++ {

答案1

得分: 5

不是关于最快方法的答案,而是关于使用Go语言优化代码的方法论的逐步说明。

首先,让我们编写一个main函数:

package main

import (
	"math/rand"
	"sort"
)

func main() {
}

func randSlice(max int) (ret []uint32) {
	// 检查max是否超过maxUINT32
	ret = make([]uint32, 0, max)
	r := rand.New(rand.NewSource(99))
	for i := 0; i < max; i++ {
		ret = append(ret, uint32(r.Intn(max)))
	}
	sort.Slice(ret, func(i, j int) bool {
		return ret[i] < ret[j]
	})
	return
}

func dedup1(s []uint32) []uint32 {
	if len(s) < 2 {
		return s
	}
	tmp := make([]uint32, 0, len(s))

	for i := uint32(0); i < uint32(len(s)); i++ {
		// 如果当前元素不等于下一个元素,则存储当前元素
		if s[i] != s[i+1] {
			tmp = append(tmp, s[i])
		}
	}

	// 最后一个元素必须存储
	// 注意,如果它是重复的,则重复项不会被存储在前面
	tmp = append(tmp, s[len(s)-1])

	// 修改原始切片
	s = nil
	s = append(s, tmp...)
	return s
}

接下来,编写相应的测试:

package main

import "testing"

func TestDedup1(t *testing.T) {
	s := randSlice(10)
	res := dedup1(s)
	uniq := map[uint32]bool{}
	for _, r := range res {
		_, ok := uniq[r]
		if ok {
			t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
		}
		uniq[r] = true
	}
}

运行测试:

$ go test -v .

然后修复代码中的错误。在dedup1函数中,将条件if s[i] != s[i+1] {修改为if i+1 < uint32(len(s)) && s[i] != s[i+1] {,或者更好地减少迭代的最大值for i := uint32(0); i < uint32(len(s))-1; i++ {

接下来,编写一个生成带有重复元素的随机切片的函数:

func randSliceWithDups(max int) (ret []uint32) {
	ret = randSlice(max / 2)
	ret = append(ret, ret...)
	sort.Slice(ret, func(i, j int) bool {
		return ret[i] < ret[j]
	})
	return
}

更新测试:

func TestDedup1_with_dups(t *testing.T) {
	s := randSliceWithDups(10)
	res := dedup1(s)
	uniq := map[uint32]bool{}
	for _, r := range res {
		_, ok := uniq[r]
		if ok {
			t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
		}
		uniq[r] = true
	}
}

接下来,添加一个基准测试,以便发现性能问题并保持性能稳定:

func BenchmarkDedup1_1000(b *testing.B) {
	s := randSliceWithDups(100)
	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = dedup1(s)
	}
}

运行基准测试:

$ go test -v . -bench=.

接下来,我们需要找到一个更好的算法,它能够产生更少的执行次数和查找次数。

然而,最后一个版本中存在一个错误!你发现了吗?

修复错误:

func dedup2(s []uint32) []uint32 {
	if len(s) < 2 {
		return s
	}

	var e int = 1
	for i := 1; i < len(s); i++ {
		if s[i] == s[i-1] {
			continue
		}
		s[e] = s[i]
		e++
	}

	return s[:e]
}

再次添加测试和基准测试,并检查结果。

最后,运行测试和基准测试:

$ go test -v . -bench=.

这样就完成了。你可以根据需要进一步优化代码,但以上步骤应该能帮助你开始优化Go语言代码的过程。

英文:

not an answer as to the fastest, rather a step by step about the methodology to apply using the Go language to optimize a piece of code.

For a very formal insight of what is the fastest, see https://stackoverflow.com/a/6072100/4466350

Your code is buggy. Always write a test.

First, let use write a main

package main

import (
	&quot;math/rand&quot;
	&quot;sort&quot;
)

func main() {
}

func randSlice(max int) (ret []uint32) {
	// we should check that max does not exceed maxUINT32
	ret = make([]uint32, 0, max)
	r := rand.New(rand.NewSource(99))
	for i := 0; i &lt; max; i++ {
		ret = append(ret, uint32(r.Intn(max)))
	}
	sort.Slice(ret, func(i, j int) bool {
		return ret[i] &lt; ret[j]
	})
	return
}

func dedup1(s []uint32) []uint32 {
	if len(s) &lt; 2 {
		return s
	}
	tmp := make([]uint32, 0, len(s))

	for i := uint32(0); i &lt; uint32(len(s)); i++ {
		// If current is not equal to next then store the current
		if s[i] != s[i+1] {
			tmp = append(tmp, s[i])
		}
	}

	// The last must be stored
	// Note that if it was repeated, the duplicates are NOT stored before
	tmp = append(tmp, s[len(s)-1])

	// Modify original slice
	s = nil
	s = append(s, tmp...)
	return s
}

And the accompanying test

package main

import &quot;testing&quot;

func TestDedup1(t *testing.T) {
	s := randSlice(10)
	res := dedup1(s)
	uniq := map[uint32]bool{}
	for _, r := range res {
		_, ok := uniq[r]
		if ok {
			t.Fatalf(&quot;found duplicates\ninput=%#v\nresult=%#v\n&quot;, s, res)
		}
		uniq[r] = true
	}
}

running this we get

$ go test -v . 
=== RUN   TestDedup1
--- FAIL: TestDedup1 (0.00s)
panic: runtime error: index out of range [10] with length 10 [recovered]
	panic: runtime error: index out of range [10] with length 10

goroutine 18 [running]:
testing.tRunner.func1.1(0x536680, 0xc0000da040)
	/home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1076 +0x30d
testing.tRunner.func1(0xc000082600)
	/home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1079 +0x41a
panic(0x536680, 0xc0000da040)
	/home/mh-cbon/.gvm/gos/go1.15.2/src/runtime/panic.go:969 +0x175
test/d/dup.dedup1(0xc000094060, 0xa, 0xa, 0xa, 0x6124a0, 0xc00003c770)
	/home/mh-cbon/gow/src/test/d/dup/main.go:32 +0x248
test/d/dup.TestDedup1(0xc000082600)
	/home/mh-cbon/gow/src/test/d/dup/main_test.go:7 +0x70
testing.tRunner(0xc000082600, 0x54fbf0)
	/home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1127 +0xef
created by testing.(*T).Run
	/home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1178 +0x386
FAIL	test/d/dup	0.006s
FAIL

we fix this by appropriately checking for the slice bounds.

In dedup1, change this condition if s[i] != s[i+1] { to if i+1 &lt; uint32(len(s)) &amp;&amp; s[i] != s[i+1] {, or even better, reduce iteration max value by one for i := uint32(0); i &lt; uint32(len(s))-1; i++ {

Next, write a function to generate a slice with random duplicates.

func randSliceWithDups(max int) (ret []uint32) {
	ret = randSlice(max / 2)
	ret = append(ret, ret...)
	sort.Slice(ret, func(i, j int) bool {
		return ret[i] &lt; ret[j]
	})
	return
}

writing randSliceWithDups(50) get slices such as [0 0 1 1 2 2 3 3 3 3 4 4 5 5 5 5 6 6 6 6 7 7 8 8 9 9 9 9 12 12 13 13 18 18 18 18 18 18 19 19 20 20 20 20 21 21 22 22 24 24]

update the tests again

func TestDedup1_with_dups(t *testing.T) {
	s := randSliceWithDups(10)
	res := dedup1(s)
	uniq := map[uint32]bool{}
	for _, r := range res {
		_, ok := uniq[r]
		if ok {
			t.Fatalf(&quot;found duplicates\ninput=%#v\nresult=%#v\n&quot;, s, res)
		}
		uniq[r] = true
	}
}

Next, add a benchmark; It will help us spot performance issue and maintain performance over time.

func BenchmarkDedup1_1000(b *testing.B) {
	s := randSliceWithDups(100)
	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i &lt; b.N; i++ {
		_ = dedup1(s)
	}
}

running this we get :

$ go test -v . -bench=.
=== RUN   TestDedup1
--- PASS: TestDedup1 (0.00s)
=== RUN   TestDedup1_with_dups
--- PASS: TestDedup1_with_dups (0.00s)
goos: linux
goarch: amd64
pkg: test/d/dup
BenchmarkDedup1_1000
BenchmarkDedup1_1000-4   	  172087	      6353 ns/op	    5504 B/op	       2 allocs/op
PASS
ok  	test/d/dup	1.174s

Let us state the obvious every one has spotted reading your initial code without even writing a benchmark, you are allocating.

That raises the question as to figure out if you are allowed to modify the input slice in place or not. If you can change it in place, we might take advantage of this to prevent that allocations and speed up your program.

One solution, wrote from scratch (consider search on geekforgeeks like website for a generally accepted solution), is to iterate over the slice and maintain an index of the next position to write. When a non duplicate is found, save the non duplicate to this last position, then increment that index by one. Finally, return the slice up to the value of the incremented indice.

func dedup2(s []uint32) []uint32 {
	if len(s) &lt; 2 {
		return s
	}

	var e int
	for i := 1; i &lt; len(s); i++ {
		if s[i] == s[i-1] {
			continue
		}
		s[e] = s[i]
		e++
	}

	return s[:e]
}

Again, add tests and benchs, and check for the result.

func TestDedup2(t *testing.T) {
	s := randSlice(10)
	res := dedup2(s)
	uniq := map[uint32]bool{}
	for _, r := range res {
		_, ok := uniq[r]
		if ok {
			t.Fatalf(&quot;found duplicates\ninput=%#v\nresult=%#v\n&quot;, s, res)
		}
		uniq[r] = true
	}
}

func TestDedup2_with_dups(t *testing.T) {
	s := randSliceWithDups(10)
	res := dedup2(s)
	uniq := map[uint32]bool{}
	for _, r := range res {
		_, ok := uniq[r]
		if ok {
			t.Fatalf(&quot;found duplicates\ninput=%#v\nresult=%#v\n&quot;, s, res)
		}
		uniq[r] = true
	}
}

func BenchmarkDedup2_1000(b *testing.B) {
	s := randSliceWithDups(100)
	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i &lt; b.N; i++ {
		_ = dedup2(s)
	}
}

Which yields:

$ go test -v . -bench=.
=== RUN   TestDedup1
--- PASS: TestDedup1 (0.00s)
=== RUN   TestDedup1_with_dups
--- PASS: TestDedup1_with_dups (0.00s)
=== RUN   TestDedup2
--- PASS: TestDedup2 (0.00s)
=== RUN   TestDedup2_with_dups
--- PASS: TestDedup2_with_dups (0.00s)
goos: linux
goarch: amd64
pkg: test/d/dup
BenchmarkDedup1_1000
BenchmarkDedup1_1000-4   	 1764574	       673 ns/op	     544 B/op	       2 allocs/op
BenchmarkDedup2_1000
BenchmarkDedup2_1000-4   	 7758907	       152 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	test/d/dup	3.224s

a 4x improvement ! cool ! What s next ? Next is to find an even better algorithm which produces less executions, less lookup and so on.

Though, the last version contains a bug ! Have you spot it ?

See this test, which is better than the other because it does not rely on random numbers, but on static values with strong equality checks. Using those kind of tests you can tailor made your input to check for fine grained situation.

func TestDedup2_static(t *testing.T) {
	type expectation struct {
		input []uint32
		want  []uint32
	}

	expectations := []expectation{
		expectation{
			input: []uint32{0, 0, 1, 2, 3, 3, 3, 4, 4, 5},
			want:  []uint32{0, 1, 2, 3, 4, 5},
		},
		expectation{
			input: []uint32{0, 1, 2, 3, 3, 3, 4, 4, 5},
			want:  []uint32{0, 1, 2, 3, 4, 5},
		},
	}

	for _, e := range expectations {
		res := dedup2(e.input)
		if !reflect.DeepEqual(res, e.want) {
			t.Fatalf(&quot;invlaid result, wanted=%#v\ngot=%#v\n&quot;, e.want, res)
		}
	}
}

It uses table drive testing as described at https://dave.cheney.net/2013/06/09/writing-table-driven-tests-in-go

Lets fix this:

func dedup2(s []uint32) []uint32 {
	if len(s) &lt; 2 {
		return s
	}

	var e int = 1
	for i := 1; i &lt; len(s); i++ {
		if s[i] == s[i-1] {
			continue
		}
		s[e] = s[i]
		e++
	}

	return s[:e]
}

答案2

得分: 3

要删除切片中的重复元素,你可以创建一个映射(map),将切片的值作为映射的键,然后遍历映射并将键的值追加到新的切片中。以下是相同逻辑的代码:

package main

import (
	"fmt"
	"sort"
)

func removeDupe(slc []int) []int {
	var tmpSlice []int
	chkMap := make(map[int]bool)

	for _, val := range slc {
		chkMap[val] = true
	}

	for k := range chkMap {
		tmpSlice = append(tmpSlice, k)
	}
	sort.Ints(tmpSlice)
	return tmpSlice
}

func main() {

	mySlice := []int{1, 2, 3, 4, 5, 1, 2, 3, 9, 0}
	formattedSlice := removeDupe(mySlice)

	fmt.Println(formattedSlice)

}

输出结果:

[0 1 2 3 4 5 9]
英文:

To remove duplicate elements of a slice you can create a map and assign the the slice values as keys of the map ,then iterate over the map and append the key values to the new slice.Here is the code for the same logic:

package main

import (
	&quot;fmt&quot;
	&quot;sort&quot;
)

func removeDupe(slc []int) []int {
	var tmpSlice []int
	chkMap := make(map[int]bool)

	for _, val := range slc {
		chkMap[val] = true
	}

	for k, _ := range chkMap {
		tmpSlice = append(tmpSlice, k)
	}
	sort.Ints(tmpSlice)
	return tmpSlice
}

func main() {

	mySlice := []int{1, 2, 3, 4, 5, 1, 2, 3, 9, 0}
	formattedSlice := removeDupe(mySlice)

	fmt.Println(formattedSlice)

} 

Output:

[0 1 2 3 4 5 9]

答案3

得分: 0

以下是翻译好的内容:

以下是一个通用的方法,适用于任何可比较的类型,它简单地检查项目是否已经存在于映射中,如果不存在,则假设它是唯一的,并将其添加到最终的数组中。

这种方法避免了上面迭代两个数组的方法,这样我们只需要迭代一个数组。

这段代码保持了数组的顺序。

如果你想要去重指针数组,只要每个指向的对象实现了.String()方法,就可以做到。

在这里,数组的顺序也被保持了。

英文:

What about the following generic approach, this works on any comparable type and simply checks if the item is already in the map, if it's not it assumes it's unique and adds it to our final array.

This avoids the approach above of iterating over two arrays, in this way we only iterate over one.

package main

import &quot;fmt&quot;

func RemoveDuplicates[T comparable](slice []T) []T {
	itemMap := make(map[T]bool, len(slice))
	items := make([]T, 0)
	for _, item := range slice {
		if _, ok := itemMap[item]; !ok {
			items = append(items, item)
			itemMap[item] = true
		}
	}
	return items
}

func main() {
	duplicateStr := []string{&quot;a&quot;, &quot;a&quot;, &quot;b&quot;, &quot;c&quot;}
	dedupeStr := RemoveDuplicates(duplicateStr)
	fmt.Printf(&quot;%v\n&quot;, dedupeStr)

	duplicateInts := []int{0, 1, 2, 3, 3, 2}
	dedupeInts := RemoveDuplicates(duplicateInts)
	fmt.Printf(&quot;%v\n&quot;, dedupeInts)
}

This keeps the order of the array.

[a b c]
[0 1 2 3]

Golang Playground: https://go.dev/play/p/9Y8AHe2AkIM

If you want to deduplicate an array of pointers you can do it as long as each object pointed to implements .String() method.

package main

import &quot;fmt&quot;

type Stringer interface {
	String() string
}

// Remove all duplicate objects in an array of pointers
func RemoveDuplicateStringers[T Stringer](slice []*T) []*T {
	itemMap := make(map[string]*T)
	items := make([]*T, 0)
	for _, item := range slice {
		if item != nil {
			key := (*item).String()
			if _, ok := itemMap[key]; !ok {
				itemMap[key] = item
				items = append(items, item)
			}
		}
	}
	return items
}

func main() {
	// Use net.IP as a reference
	ip1 := net.ParseIP(&quot;1.1.1.1&quot;)
	ip2 := net.ParseIP(&quot;1.0.0.1&quot;)
	ip3 := net.ParseIP(&quot;1.1.1.1&quot;)
	ip4 := net.ParseIP(&quot;1.2.3.4&quot;)
	ip5 := net.ParseIP(&quot;1.1.1.1&quot;)
	ip6 := net.ParseIP(&quot;1.0.0.1&quot;)
	ip7 := net.ParseIP(&quot;1.2.3.4&quot;)
	duplicates := []*net.IP{&amp;ip1, &amp;ip2, &amp;ip3, &amp;ip4, &amp;ip5, &amp;ip6, &amp;ip7}
	dedupe := RemoveDuplicateStringers(duplicates)
    fmt.Printf(&quot;%s&quot;, dedupe)
}

Here the order of the array is also kept

[1.1.1.1 1.0.0.1 1.2.3.4]

Goland Playground: https://go.dev/play/p/Q5yr_2FxQ_H

huangapple
  • 本文由 发表于 2021年8月9日 19:16:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/68711149.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定