Fastest way to remove duplicates of a sorted Go slice

huangapple go评论118阅读模式
英文:

Fastest way to remove duplicates of a sorted Go slice

问题

  1. mySlice := make([]uint32, 0, 4294967290)
  2. // ...
  3. // 对切片进行排序
  4. sort.Slice(mySlice, func(i, j int) bool {
  5. x := mySlice[i]
  6. y := mySlice[j]
  7. return x < y
  8. })

如何最快地删除切片中的重复元素?

如何利用切片已经排序的事实?

更新

我想到了以下解决方案:

  1. func RemoveDuplicates(s []uint32) {
  2. if len(s) < 2 {
  3. return
  4. }
  5. tmp := make([]uint32, 0, len(s))
  6. for i := uint32(0); i < uint32(len(s))-1; i++ {
  7. // 如果当前元素不等于下一个元素,则将当前元素存储起来
  8. if s[i] != s[i+1] {
  9. tmp = append(tmp, s[i])
  10. }
  11. }
  12. // 最后一个元素必须存储
  13. // 注意,如果最后一个元素重复了,重复的元素不会被存储
  14. tmp = append(tmp, s[len(s)-1])
  15. // 修改原始切片
  16. s = nil
  17. s = append(s, tmp...)
  18. }

有任何错误吗?有任何 bug 吗?有任何改进的方法吗?

更新

如 @mh-cbon 所指出的,循环的最大值应为 i < uint32(len(s)) - 1

  1. for i := uint32(0); i < uint32(len(s)) - 1; i++ {
英文:
  1. mySlice := make([]uint32, 0, 4294967290)
  2. // ...
  3. // Sort the slice
  4. sort.Slice(mySlice, func(i, j int) bool {
  5. x := mySlice[i]
  6. y := mySlice[j]
  7. return x &lt; y
  8. })

What's the fastest way to remove slice duplicates?

How can I take advantage of the fact that slice is already sorted?

Update

I came up with this:

  1. func RemoveDuplicates(s []uint32) {
  2. if len(s) &lt; 2 {
  3. return
  4. }
  5. tmp := make([]uint32, 0, len(s))
  6. for i := uint32(0); i &lt; uint32(len(s)); i++ {
  7. // If current is not equal to next then store the current
  8. if s[i] != s[i+1] {
  9. tmp = append(tmp, s[i])
  10. }
  11. }
  12. // The last must be stored
  13. // Note that if it was repeated, the duplicates are NOT stored before
  14. tmp = append(tmp, s[len(s)-1])
  15. // Modify original slice
  16. s = nil
  17. s = append(s, tmp...)
  18. }

Any mistake? Any bug? Any way to improve?

Update

As noted by @mh-cbon the correct loop max should be i &lt; uint32(len(s)) - 1:

  1. for i := uint32(0); i &lt; uint32(len(s)) - 1; i++ {

答案1

得分: 5

不是关于最快方法的答案,而是关于使用Go语言优化代码的方法论的逐步说明。

首先,让我们编写一个main函数:

  1. package main
  2. import (
  3. "math/rand"
  4. "sort"
  5. )
  6. func main() {
  7. }
  8. func randSlice(max int) (ret []uint32) {
  9. // 检查max是否超过maxUINT32
  10. ret = make([]uint32, 0, max)
  11. r := rand.New(rand.NewSource(99))
  12. for i := 0; i < max; i++ {
  13. ret = append(ret, uint32(r.Intn(max)))
  14. }
  15. sort.Slice(ret, func(i, j int) bool {
  16. return ret[i] < ret[j]
  17. })
  18. return
  19. }
  20. func dedup1(s []uint32) []uint32 {
  21. if len(s) < 2 {
  22. return s
  23. }
  24. tmp := make([]uint32, 0, len(s))
  25. for i := uint32(0); i < uint32(len(s)); i++ {
  26. // 如果当前元素不等于下一个元素,则存储当前元素
  27. if s[i] != s[i+1] {
  28. tmp = append(tmp, s[i])
  29. }
  30. }
  31. // 最后一个元素必须存储
  32. // 注意,如果它是重复的,则重复项不会被存储在前面
  33. tmp = append(tmp, s[len(s)-1])
  34. // 修改原始切片
  35. s = nil
  36. s = append(s, tmp...)
  37. return s
  38. }

接下来,编写相应的测试:

  1. package main
  2. import "testing"
  3. func TestDedup1(t *testing.T) {
  4. s := randSlice(10)
  5. res := dedup1(s)
  6. uniq := map[uint32]bool{}
  7. for _, r := range res {
  8. _, ok := uniq[r]
  9. if ok {
  10. t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
  11. }
  12. uniq[r] = true
  13. }
  14. }

运行测试:

  1. $ go test -v .

然后修复代码中的错误。在dedup1函数中,将条件if s[i] != s[i+1] {修改为if i+1 < uint32(len(s)) && s[i] != s[i+1] {,或者更好地减少迭代的最大值for i := uint32(0); i < uint32(len(s))-1; i++ {

接下来,编写一个生成带有重复元素的随机切片的函数:

  1. func randSliceWithDups(max int) (ret []uint32) {
  2. ret = randSlice(max / 2)
  3. ret = append(ret, ret...)
  4. sort.Slice(ret, func(i, j int) bool {
  5. return ret[i] < ret[j]
  6. })
  7. return
  8. }

更新测试:

  1. func TestDedup1_with_dups(t *testing.T) {
  2. s := randSliceWithDups(10)
  3. res := dedup1(s)
  4. uniq := map[uint32]bool{}
  5. for _, r := range res {
  6. _, ok := uniq[r]
  7. if ok {
  8. t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
  9. }
  10. uniq[r] = true
  11. }
  12. }

接下来,添加一个基准测试,以便发现性能问题并保持性能稳定:

  1. func BenchmarkDedup1_1000(b *testing.B) {
  2. s := randSliceWithDups(100)
  3. b.ResetTimer()
  4. b.ReportAllocs()
  5. for i := 0; i < b.N; i++ {
  6. _ = dedup1(s)
  7. }
  8. }

运行基准测试:

  1. $ go test -v . -bench=.

接下来,我们需要找到一个更好的算法,它能够产生更少的执行次数和查找次数。

然而,最后一个版本中存在一个错误!你发现了吗?

修复错误:

  1. func dedup2(s []uint32) []uint32 {
  2. if len(s) < 2 {
  3. return s
  4. }
  5. var e int = 1
  6. for i := 1; i < len(s); i++ {
  7. if s[i] == s[i-1] {
  8. continue
  9. }
  10. s[e] = s[i]
  11. e++
  12. }
  13. return s[:e]
  14. }

再次添加测试和基准测试,并检查结果。

最后,运行测试和基准测试:

  1. $ go test -v . -bench=.

这样就完成了。你可以根据需要进一步优化代码,但以上步骤应该能帮助你开始优化Go语言代码的过程。

英文:

not an answer as to the fastest, rather a step by step about the methodology to apply using the Go language to optimize a piece of code.

For a very formal insight of what is the fastest, see https://stackoverflow.com/a/6072100/4466350

Your code is buggy. Always write a test.

First, let use write a main

  1. package main
  2. import (
  3. &quot;math/rand&quot;
  4. &quot;sort&quot;
  5. )
  6. func main() {
  7. }
  8. func randSlice(max int) (ret []uint32) {
  9. // we should check that max does not exceed maxUINT32
  10. ret = make([]uint32, 0, max)
  11. r := rand.New(rand.NewSource(99))
  12. for i := 0; i &lt; max; i++ {
  13. ret = append(ret, uint32(r.Intn(max)))
  14. }
  15. sort.Slice(ret, func(i, j int) bool {
  16. return ret[i] &lt; ret[j]
  17. })
  18. return
  19. }
  20. func dedup1(s []uint32) []uint32 {
  21. if len(s) &lt; 2 {
  22. return s
  23. }
  24. tmp := make([]uint32, 0, len(s))
  25. for i := uint32(0); i &lt; uint32(len(s)); i++ {
  26. // If current is not equal to next then store the current
  27. if s[i] != s[i+1] {
  28. tmp = append(tmp, s[i])
  29. }
  30. }
  31. // The last must be stored
  32. // Note that if it was repeated, the duplicates are NOT stored before
  33. tmp = append(tmp, s[len(s)-1])
  34. // Modify original slice
  35. s = nil
  36. s = append(s, tmp...)
  37. return s
  38. }

And the accompanying test

  1. package main
  2. import &quot;testing&quot;
  3. func TestDedup1(t *testing.T) {
  4. s := randSlice(10)
  5. res := dedup1(s)
  6. uniq := map[uint32]bool{}
  7. for _, r := range res {
  8. _, ok := uniq[r]
  9. if ok {
  10. t.Fatalf(&quot;found duplicates\ninput=%#v\nresult=%#v\n&quot;, s, res)
  11. }
  12. uniq[r] = true
  13. }
  14. }

running this we get

  1. $ go test -v .
  2. === RUN TestDedup1
  3. --- FAIL: TestDedup1 (0.00s)
  4. panic: runtime error: index out of range [10] with length 10 [recovered]
  5. panic: runtime error: index out of range [10] with length 10
  6. goroutine 18 [running]:
  7. testing.tRunner.func1.1(0x536680, 0xc0000da040)
  8. /home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1076 +0x30d
  9. testing.tRunner.func1(0xc000082600)
  10. /home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1079 +0x41a
  11. panic(0x536680, 0xc0000da040)
  12. /home/mh-cbon/.gvm/gos/go1.15.2/src/runtime/panic.go:969 +0x175
  13. test/d/dup.dedup1(0xc000094060, 0xa, 0xa, 0xa, 0x6124a0, 0xc00003c770)
  14. /home/mh-cbon/gow/src/test/d/dup/main.go:32 +0x248
  15. test/d/dup.TestDedup1(0xc000082600)
  16. /home/mh-cbon/gow/src/test/d/dup/main_test.go:7 +0x70
  17. testing.tRunner(0xc000082600, 0x54fbf0)
  18. /home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1127 +0xef
  19. created by testing.(*T).Run
  20. /home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1178 +0x386
  21. FAIL test/d/dup 0.006s
  22. FAIL

we fix this by appropriately checking for the slice bounds.

In dedup1, change this condition if s[i] != s[i+1] { to if i+1 &lt; uint32(len(s)) &amp;&amp; s[i] != s[i+1] {, or even better, reduce iteration max value by one for i := uint32(0); i &lt; uint32(len(s))-1; i++ {

Next, write a function to generate a slice with random duplicates.

  1. func randSliceWithDups(max int) (ret []uint32) {
  2. ret = randSlice(max / 2)
  3. ret = append(ret, ret...)
  4. sort.Slice(ret, func(i, j int) bool {
  5. return ret[i] &lt; ret[j]
  6. })
  7. return
  8. }

writing randSliceWithDups(50) get slices such as [0 0 1 1 2 2 3 3 3 3 4 4 5 5 5 5 6 6 6 6 7 7 8 8 9 9 9 9 12 12 13 13 18 18 18 18 18 18 19 19 20 20 20 20 21 21 22 22 24 24]

update the tests again

  1. func TestDedup1_with_dups(t *testing.T) {
  2. s := randSliceWithDups(10)
  3. res := dedup1(s)
  4. uniq := map[uint32]bool{}
  5. for _, r := range res {
  6. _, ok := uniq[r]
  7. if ok {
  8. t.Fatalf(&quot;found duplicates\ninput=%#v\nresult=%#v\n&quot;, s, res)
  9. }
  10. uniq[r] = true
  11. }
  12. }

Next, add a benchmark; It will help us spot performance issue and maintain performance over time.

  1. func BenchmarkDedup1_1000(b *testing.B) {
  2. s := randSliceWithDups(100)
  3. b.ResetTimer()
  4. b.ReportAllocs()
  5. for i := 0; i &lt; b.N; i++ {
  6. _ = dedup1(s)
  7. }
  8. }

running this we get :

  1. $ go test -v . -bench=.
  2. === RUN TestDedup1
  3. --- PASS: TestDedup1 (0.00s)
  4. === RUN TestDedup1_with_dups
  5. --- PASS: TestDedup1_with_dups (0.00s)
  6. goos: linux
  7. goarch: amd64
  8. pkg: test/d/dup
  9. BenchmarkDedup1_1000
  10. BenchmarkDedup1_1000-4 172087 6353 ns/op 5504 B/op 2 allocs/op
  11. PASS
  12. ok test/d/dup 1.174s

Let us state the obvious every one has spotted reading your initial code without even writing a benchmark, you are allocating.

That raises the question as to figure out if you are allowed to modify the input slice in place or not. If you can change it in place, we might take advantage of this to prevent that allocations and speed up your program.

One solution, wrote from scratch (consider search on geekforgeeks like website for a generally accepted solution), is to iterate over the slice and maintain an index of the next position to write. When a non duplicate is found, save the non duplicate to this last position, then increment that index by one. Finally, return the slice up to the value of the incremented indice.

  1. func dedup2(s []uint32) []uint32 {
  2. if len(s) &lt; 2 {
  3. return s
  4. }
  5. var e int
  6. for i := 1; i &lt; len(s); i++ {
  7. if s[i] == s[i-1] {
  8. continue
  9. }
  10. s[e] = s[i]
  11. e++
  12. }
  13. return s[:e]
  14. }

Again, add tests and benchs, and check for the result.

  1. func TestDedup2(t *testing.T) {
  2. s := randSlice(10)
  3. res := dedup2(s)
  4. uniq := map[uint32]bool{}
  5. for _, r := range res {
  6. _, ok := uniq[r]
  7. if ok {
  8. t.Fatalf(&quot;found duplicates\ninput=%#v\nresult=%#v\n&quot;, s, res)
  9. }
  10. uniq[r] = true
  11. }
  12. }
  13. func TestDedup2_with_dups(t *testing.T) {
  14. s := randSliceWithDups(10)
  15. res := dedup2(s)
  16. uniq := map[uint32]bool{}
  17. for _, r := range res {
  18. _, ok := uniq[r]
  19. if ok {
  20. t.Fatalf(&quot;found duplicates\ninput=%#v\nresult=%#v\n&quot;, s, res)
  21. }
  22. uniq[r] = true
  23. }
  24. }
  25. func BenchmarkDedup2_1000(b *testing.B) {
  26. s := randSliceWithDups(100)
  27. b.ResetTimer()
  28. b.ReportAllocs()
  29. for i := 0; i &lt; b.N; i++ {
  30. _ = dedup2(s)
  31. }
  32. }

Which yields:

  1. $ go test -v . -bench=.
  2. === RUN TestDedup1
  3. --- PASS: TestDedup1 (0.00s)
  4. === RUN TestDedup1_with_dups
  5. --- PASS: TestDedup1_with_dups (0.00s)
  6. === RUN TestDedup2
  7. --- PASS: TestDedup2 (0.00s)
  8. === RUN TestDedup2_with_dups
  9. --- PASS: TestDedup2_with_dups (0.00s)
  10. goos: linux
  11. goarch: amd64
  12. pkg: test/d/dup
  13. BenchmarkDedup1_1000
  14. BenchmarkDedup1_1000-4 1764574 673 ns/op 544 B/op 2 allocs/op
  15. BenchmarkDedup2_1000
  16. BenchmarkDedup2_1000-4 7758907 152 ns/op 0 B/op 0 allocs/op
  17. PASS
  18. ok test/d/dup 3.224s

a 4x improvement ! cool ! What s next ? Next is to find an even better algorithm which produces less executions, less lookup and so on.

Though, the last version contains a bug ! Have you spot it ?

See this test, which is better than the other because it does not rely on random numbers, but on static values with strong equality checks. Using those kind of tests you can tailor made your input to check for fine grained situation.

  1. func TestDedup2_static(t *testing.T) {
  2. type expectation struct {
  3. input []uint32
  4. want []uint32
  5. }
  6. expectations := []expectation{
  7. expectation{
  8. input: []uint32{0, 0, 1, 2, 3, 3, 3, 4, 4, 5},
  9. want: []uint32{0, 1, 2, 3, 4, 5},
  10. },
  11. expectation{
  12. input: []uint32{0, 1, 2, 3, 3, 3, 4, 4, 5},
  13. want: []uint32{0, 1, 2, 3, 4, 5},
  14. },
  15. }
  16. for _, e := range expectations {
  17. res := dedup2(e.input)
  18. if !reflect.DeepEqual(res, e.want) {
  19. t.Fatalf(&quot;invlaid result, wanted=%#v\ngot=%#v\n&quot;, e.want, res)
  20. }
  21. }
  22. }

It uses table drive testing as described at https://dave.cheney.net/2013/06/09/writing-table-driven-tests-in-go

Lets fix this:

  1. func dedup2(s []uint32) []uint32 {
  2. if len(s) &lt; 2 {
  3. return s
  4. }
  5. var e int = 1
  6. for i := 1; i &lt; len(s); i++ {
  7. if s[i] == s[i-1] {
  8. continue
  9. }
  10. s[e] = s[i]
  11. e++
  12. }
  13. return s[:e]
  14. }

答案2

得分: 3

要删除切片中的重复元素,你可以创建一个映射(map),将切片的值作为映射的键,然后遍历映射并将键的值追加到新的切片中。以下是相同逻辑的代码:

  1. package main
  2. import (
  3. "fmt"
  4. "sort"
  5. )
  6. func removeDupe(slc []int) []int {
  7. var tmpSlice []int
  8. chkMap := make(map[int]bool)
  9. for _, val := range slc {
  10. chkMap[val] = true
  11. }
  12. for k := range chkMap {
  13. tmpSlice = append(tmpSlice, k)
  14. }
  15. sort.Ints(tmpSlice)
  16. return tmpSlice
  17. }
  18. func main() {
  19. mySlice := []int{1, 2, 3, 4, 5, 1, 2, 3, 9, 0}
  20. formattedSlice := removeDupe(mySlice)
  21. fmt.Println(formattedSlice)
  22. }

输出结果:

  1. [0 1 2 3 4 5 9]
英文:

To remove duplicate elements of a slice you can create a map and assign the the slice values as keys of the map ,then iterate over the map and append the key values to the new slice.Here is the code for the same logic:

  1. package main
  2. import (
  3. &quot;fmt&quot;
  4. &quot;sort&quot;
  5. )
  6. func removeDupe(slc []int) []int {
  7. var tmpSlice []int
  8. chkMap := make(map[int]bool)
  9. for _, val := range slc {
  10. chkMap[val] = true
  11. }
  12. for k, _ := range chkMap {
  13. tmpSlice = append(tmpSlice, k)
  14. }
  15. sort.Ints(tmpSlice)
  16. return tmpSlice
  17. }
  18. func main() {
  19. mySlice := []int{1, 2, 3, 4, 5, 1, 2, 3, 9, 0}
  20. formattedSlice := removeDupe(mySlice)
  21. fmt.Println(formattedSlice)
  22. }

Output:

  1. [0 1 2 3 4 5 9]

答案3

得分: 0

以下是翻译好的内容:

以下是一个通用的方法,适用于任何可比较的类型,它简单地检查项目是否已经存在于映射中,如果不存在,则假设它是唯一的,并将其添加到最终的数组中。

这种方法避免了上面迭代两个数组的方法,这样我们只需要迭代一个数组。

这段代码保持了数组的顺序。

如果你想要去重指针数组,只要每个指向的对象实现了.String()方法,就可以做到。

在这里,数组的顺序也被保持了。

英文:

What about the following generic approach, this works on any comparable type and simply checks if the item is already in the map, if it's not it assumes it's unique and adds it to our final array.

This avoids the approach above of iterating over two arrays, in this way we only iterate over one.

  1. package main
  2. import &quot;fmt&quot;
  3. func RemoveDuplicates[T comparable](slice []T) []T {
  4. itemMap := make(map[T]bool, len(slice))
  5. items := make([]T, 0)
  6. for _, item := range slice {
  7. if _, ok := itemMap[item]; !ok {
  8. items = append(items, item)
  9. itemMap[item] = true
  10. }
  11. }
  12. return items
  13. }
  14. func main() {
  15. duplicateStr := []string{&quot;a&quot;, &quot;a&quot;, &quot;b&quot;, &quot;c&quot;}
  16. dedupeStr := RemoveDuplicates(duplicateStr)
  17. fmt.Printf(&quot;%v\n&quot;, dedupeStr)
  18. duplicateInts := []int{0, 1, 2, 3, 3, 2}
  19. dedupeInts := RemoveDuplicates(duplicateInts)
  20. fmt.Printf(&quot;%v\n&quot;, dedupeInts)
  21. }

This keeps the order of the array.

  1. [a b c]
  2. [0 1 2 3]

Golang Playground: https://go.dev/play/p/9Y8AHe2AkIM

If you want to deduplicate an array of pointers you can do it as long as each object pointed to implements .String() method.

  1. package main
  2. import &quot;fmt&quot;
  3. type Stringer interface {
  4. String() string
  5. }
  6. // Remove all duplicate objects in an array of pointers
  7. func RemoveDuplicateStringers[T Stringer](slice []*T) []*T {
  8. itemMap := make(map[string]*T)
  9. items := make([]*T, 0)
  10. for _, item := range slice {
  11. if item != nil {
  12. key := (*item).String()
  13. if _, ok := itemMap[key]; !ok {
  14. itemMap[key] = item
  15. items = append(items, item)
  16. }
  17. }
  18. }
  19. return items
  20. }
  21. func main() {
  22. // Use net.IP as a reference
  23. ip1 := net.ParseIP(&quot;1.1.1.1&quot;)
  24. ip2 := net.ParseIP(&quot;1.0.0.1&quot;)
  25. ip3 := net.ParseIP(&quot;1.1.1.1&quot;)
  26. ip4 := net.ParseIP(&quot;1.2.3.4&quot;)
  27. ip5 := net.ParseIP(&quot;1.1.1.1&quot;)
  28. ip6 := net.ParseIP(&quot;1.0.0.1&quot;)
  29. ip7 := net.ParseIP(&quot;1.2.3.4&quot;)
  30. duplicates := []*net.IP{&amp;ip1, &amp;ip2, &amp;ip3, &amp;ip4, &amp;ip5, &amp;ip6, &amp;ip7}
  31. dedupe := RemoveDuplicateStringers(duplicates)
  32. fmt.Printf(&quot;%s&quot;, dedupe)
  33. }

Here the order of the array is also kept

  1. [1.1.1.1 1.0.0.1 1.2.3.4]

Goland Playground: https://go.dev/play/p/Q5yr_2FxQ_H

huangapple
  • 本文由 发表于 2021年8月9日 19:16:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/68711149.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定