Fastest way to remove duplicates of a sorted Go slice
Question
mySlice := make([]uint32, 0, 4294967290)
// ...
// Sort the slice
sort.Slice(mySlice, func(i, j int) bool {
x := mySlice[i]
y := mySlice[j]
return x < y
})
What's the fastest way to remove slice duplicates?
How can I take advantage of the fact that the slice is already sorted?
Update
I came up with this:
func RemoveDuplicates(s []uint32) {
if len(s) < 2 {
return
}
tmp := make([]uint32, 0, len(s))
for i := uint32(0); i < uint32(len(s)); i++ {
// If current is not equal to next then store the current
if s[i] != s[i+1] {
tmp = append(tmp, s[i])
}
}
// The last must be stored
// Note that if it was repeated, the duplicates are NOT stored before
tmp = append(tmp, s[len(s)-1])
// Modify original slice
s = nil
s = append(s, tmp...)
}
Any mistake? Any bug? Any way to improve?
Update
As noted by @mh-cbon, the correct loop upper bound should be i < uint32(len(s)) - 1:
for i := uint32(0); i < uint32(len(s)) - 1; i++ {
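For completeness, here is a minimal sketch of RemoveDuplicates with that loop bound applied; it returns the deduplicated slice, because reassigning the s parameter inside the function does not change the slice the caller holds (the return-value signature is an assumption made for this sketch, not part of the original function):
func RemoveDuplicates(s []uint32) []uint32 {
	if len(s) < 2 {
		return s
	}
	tmp := make([]uint32, 0, len(s))
	// Stop at len(s)-1 so that s[i+1] never goes out of range
	for i := uint32(0); i < uint32(len(s))-1; i++ {
		// Store the current element only if it differs from the next one
		if s[i] != s[i+1] {
			tmp = append(tmp, s[i])
		}
	}
	// The last element is always kept; any duplicates of it were skipped above
	tmp = append(tmp, s[len(s)-1])
	return tmp
}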
Answer 1
Score: 5
Not an answer as to the fastest, but rather a step-by-step walkthrough of the methodology to apply when using the Go language to optimize a piece of code.
For a more formal insight into what is fastest, see https://stackoverflow.com/a/6072100/4466350
Your code is buggy. Always write a test.
First, let us write a main:
package main
import (
"math/rand"
"sort"
)
func main() {
}
func randSlice(max int) (ret []uint32) {
// we should check that max does not exceed maxUINT32
ret = make([]uint32, 0, max)
r := rand.New(rand.NewSource(99))
for i := 0; i < max; i++ {
ret = append(ret, uint32(r.Intn(max)))
}
sort.Slice(ret, func(i, j int) bool {
return ret[i] < ret[j]
})
return
}
func dedup1(s []uint32) []uint32 {
if len(s) < 2 {
return s
}
tmp := make([]uint32, 0, len(s))
for i := uint32(0); i < uint32(len(s)); i++ {
// If current is not equal to next then store the current
if s[i] != s[i+1] {
tmp = append(tmp, s[i])
}
}
// The last must be stored
// Note that if it was repeated, the duplicates are NOT stored before
tmp = append(tmp, s[len(s)-1])
// Modify original slice
s = nil
s = append(s, tmp...)
return s
}
And the accompanying test
package main
import "testing"
func TestDedup1(t *testing.T) {
s := randSlice(10)
res := dedup1(s)
uniq := map[uint32]bool{}
for _, r := range res {
_, ok := uniq[r]
if ok {
t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
}
uniq[r] = true
}
}
Running this, we get:
$ go test -v .
=== RUN TestDedup1
--- FAIL: TestDedup1 (0.00s)
panic: runtime error: index out of range [10] with length 10 [recovered]
panic: runtime error: index out of range [10] with length 10
goroutine 18 [running]:
testing.tRunner.func1.1(0x536680, 0xc0000da040)
/home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1076 +0x30d
testing.tRunner.func1(0xc000082600)
/home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1079 +0x41a
panic(0x536680, 0xc0000da040)
/home/mh-cbon/.gvm/gos/go1.15.2/src/runtime/panic.go:969 +0x175
test/d/dup.dedup1(0xc000094060, 0xa, 0xa, 0xa, 0x6124a0, 0xc00003c770)
/home/mh-cbon/gow/src/test/d/dup/main.go:32 +0x248
test/d/dup.TestDedup1(0xc000082600)
/home/mh-cbon/gow/src/test/d/dup/main_test.go:7 +0x70
testing.tRunner(0xc000082600, 0x54fbf0)
/home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1127 +0xef
created by testing.(*T).Run
/home/mh-cbon/.gvm/gos/go1.15.2/src/testing/testing.go:1178 +0x386
FAIL test/d/dup 0.006s
FAIL
We fix this by checking the slice bounds appropriately.
In dedup1, change the condition if s[i] != s[i+1] { to if i+1 < uint32(len(s)) && s[i] != s[i+1] {, or even better, reduce the loop's upper bound by one: for i := uint32(0); i < uint32(len(s))-1; i++ {
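With the second option applied, dedup1 becomes the following (a minimal sketch of the corrected function, spelled out here for clarity):
func dedup1(s []uint32) []uint32 {
	if len(s) < 2 {
		return s
	}
	tmp := make([]uint32, 0, len(s))
	// Stop at len(s)-1 so that s[i+1] stays in range
	for i := uint32(0); i < uint32(len(s))-1; i++ {
		// If current is not equal to next then store the current
		if s[i] != s[i+1] {
			tmp = append(tmp, s[i])
		}
	}
	// The last element must be stored; its duplicates (if any) were skipped above
	tmp = append(tmp, s[len(s)-1])
	// Copy the result back; appending to a nil slice allocates a second time
	s = nil
	s = append(s, tmp...)
	return s
}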
Next, write a function to generate a slice with random duplicates.
func randSliceWithDups(max int) (ret []uint32) {
ret = randSlice(max / 2)
ret = append(ret, ret...)
sort.Slice(ret, func(i, j int) bool {
return ret[i] < ret[j]
})
return
}
Calling randSliceWithDups(50) yields slices such as [0 0 1 1 2 2 3 3 3 3 4 4 5 5 5 5 6 6 6 6 7 7 8 8 9 9 9 9 12 12 13 13 18 18 18 18 18 18 19 19 20 20 20 20 21 21 22 22 24 24]
Update the tests again:
func TestDedup1_with_dups(t *testing.T) {
s := randSliceWithDups(10)
res := dedup1(s)
uniq := map[uint32]bool{}
for _, r := range res {
_, ok := uniq[r]
if ok {
t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
}
uniq[r] = true
}
}
Next, add a benchmark; it will help us spot performance issues and maintain performance over time.
func BenchmarkDedup1_1000(b *testing.B) {
s := randSliceWithDups(100)
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = dedup1(s)
}
}
Running this, we get:
$ go test -v . -bench=.
=== RUN TestDedup1
--- PASS: TestDedup1 (0.00s)
=== RUN TestDedup1_with_dups
--- PASS: TestDedup1_with_dups (0.00s)
goos: linux
goarch: amd64
pkg: test/d/dup
BenchmarkDedup1_1000
BenchmarkDedup1_1000-4 172087 6353 ns/op 5504 B/op 2 allocs/op
PASS
ok test/d/dup 1.174s
Let us state the obvious that everyone spots when reading your initial code, even without writing a benchmark: you are allocating.
That raises the question of whether you are allowed to modify the input slice in place or not. If you can change it in place, we can take advantage of that to avoid the allocation and speed up the program.
One solution, written from scratch (consider searching a site such as GeeksforGeeks for a generally accepted solution), is to iterate over the slice while maintaining an index of the next position to write. When a non-duplicate is found, save it at that write position, then increment the index by one. Finally, return the slice truncated to the incremented index.
func dedup2(s []uint32) []uint32 {
if len(s) < 2 {
return s
}
var e int
for i := 1; i < len(s); i++ {
if s[i] == s[i-1] {
continue
}
s[e] = s[i]
e++
}
return s[:e]
}
Again, add tests and benchmarks, and check the results.
func TestDedup2(t *testing.T) {
s := randSlice(10)
res := dedup2(s)
uniq := map[uint32]bool{}
for _, r := range res {
_, ok := uniq[r]
if ok {
t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
}
uniq[r] = true
}
}
func TestDedup2_with_dups(t *testing.T) {
s := randSliceWithDups(10)
res := dedup2(s)
uniq := map[uint32]bool{}
for _, r := range res {
_, ok := uniq[r]
if ok {
t.Fatalf("found duplicates\ninput=%#v\nresult=%#v\n", s, res)
}
uniq[r] = true
}
}
func BenchmarkDedup2_1000(b *testing.B) {
s := randSliceWithDups(100)
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = dedup2(s)
}
}
Which yields:
$ go test -v . -bench=.
=== RUN TestDedup1
--- PASS: TestDedup1 (0.00s)
=== RUN TestDedup1_with_dups
--- PASS: TestDedup1_with_dups (0.00s)
=== RUN TestDedup2
--- PASS: TestDedup2 (0.00s)
=== RUN TestDedup2_with_dups
--- PASS: TestDedup2_with_dups (0.00s)
goos: linux
goarch: amd64
pkg: test/d/dup
BenchmarkDedup1_1000
BenchmarkDedup1_1000-4 1764574 673 ns/op 544 B/op 2 allocs/op
BenchmarkDedup2_1000
BenchmarkDedup2_1000-4 7758907 152 ns/op 0 B/op 0 allocs/op
PASS
ok test/d/dup 3.224s
A 4x improvement! Cool! What's next? Next would be to find an even better algorithm that performs fewer iterations, fewer lookups, and so on.
However, the last version contains a bug! Have you spotted it?
See the following test, which is better than the previous ones because it does not rely on random numbers but on static values with strict equality checks (it uses reflect.DeepEqual, so the test file must also import "reflect"). Using this kind of test you can tailor the input to exercise fine-grained situations.
func TestDedup2_static(t *testing.T) {
type expectation struct {
input []uint32
want []uint32
}
expectations := []expectation{
expectation{
input: []uint32{0, 0, 1, 2, 3, 3, 3, 4, 4, 5},
want: []uint32{0, 1, 2, 3, 4, 5},
},
expectation{
input: []uint32{0, 1, 2, 3, 3, 3, 4, 4, 5},
want: []uint32{0, 1, 2, 3, 4, 5},
},
}
for _, e := range expectations {
res := dedup2(e.input)
if !reflect.DeepEqual(res, e.want) {
t.Fatalf("invlaid result, wanted=%#v\ngot=%#v\n", e.want, res)
}
}
}
It uses table-driven testing as described at https://dave.cheney.net/2013/06/09/writing-table-driven-tests-in-go
Let's fix this:
func dedup2(s []uint32) []uint32 {
if len(s) < 2 {
return s
}
var e int = 1
for i := 1; i < len(s); i++ {
if s[i] == s[i-1] {
continue
}
s[e] = s[i]
e++
}
return s[:e]
}
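As a side note, if you are on a newer Go version (1.21 or later), the standard library's slices.Compact implements this same in-place removal of consecutive duplicates from a sorted slice; a minimal usage sketch:
package main

import (
	"fmt"
	"slices"
)

func main() {
	s := []uint32{0, 0, 1, 2, 3, 3, 3, 4, 4, 5}
	// Compact replaces consecutive runs of equal elements with a single copy,
	// working in place and returning the shortened slice.
	s = slices.Compact(s)
	fmt.Println(s) // [0 1 2 3 4 5]
}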
Answer 2
Score: 3
To remove the duplicate elements of a slice you can create a map and use the slice values as the map's keys, then iterate over the map and append its keys to a new slice. Here is the code for that logic:
package main
import (
"fmt"
"sort"
)
func removeDupe(slc []int) []int {
var tmpSlice []int
chkMap := make(map[int]bool)
for _, val := range slc {
chkMap[val] = true
}
for k := range chkMap {
tmpSlice = append(tmpSlice, k)
}
sort.Ints(tmpSlice)
return tmpSlice
}
func main() {
mySlice := []int{1, 2, 3, 4, 5, 1, 2, 3, 9, 0}
formattedSlice := removeDupe(mySlice)
fmt.Println(formattedSlice)
}
Output:
[0 1 2 3 4 5 9]
Answer 3
Score: 0
What about the following generic approach? It works on any comparable type and simply checks whether the item is already in the map; if it is not, it is treated as unique and added to the final slice.
This avoids the approach above, which iterates over two collections; this way we only iterate over one.
package main
import "fmt"
func RemoveDuplicates[T comparable](slice []T) []T {
itemMap := make(map[T]bool, len(slice))
items := make([]T, 0)
for _, item := range slice {
if _, ok := itemMap[item]; !ok {
items = append(items, item)
itemMap[item] = true
}
}
return items
}
func main() {
duplicateStr := []string{"a", "a", "b", "c"}
dedupeStr := RemoveDuplicates(duplicateStr)
fmt.Printf("%v\n", dedupeStr)
duplicateInts := []int{0, 1, 2, 3, 3, 2}
dedupeInts := RemoveDuplicates(duplicateInts)
fmt.Printf("%v\n", dedupeInts)
}
This keeps the order of the slice:
[a b c]
[0 1 2 3]
Golang Playground: https://go.dev/play/p/9Y8AHe2AkIM
If you want to deduplicate a slice of pointers, you can do it as long as each object pointed to implements the .String() method.
package main
import "fmt"
type Stringer interface {
String() string
}
// Remove all duplicate objects in an array of pointers
func RemoveDuplicateStringers[T Stringer](slice []*T) []*T {
itemMap := make(map[string]*T)
items := make([]*T, 0)
for _, item := range slice {
if item != nil {
key := (*item).String()
if _, ok := itemMap[key]; !ok {
itemMap[key] = item
items = append(items, item)
}
}
}
return items
}
func main() {
// Use net.IP as a reference
ip1 := net.ParseIP("1.1.1.1")
ip2 := net.ParseIP("1.0.0.1")
ip3 := net.ParseIP("1.1.1.1")
ip4 := net.ParseIP("1.2.3.4")
ip5 := net.ParseIP("1.1.1.1")
ip6 := net.ParseIP("1.0.0.1")
ip7 := net.ParseIP("1.2.3.4")
duplicates := []*net.IP{&ip1, &ip2, &ip3, &ip4, &ip5, &ip6, &ip7}
dedupe := RemoveDuplicateStringers(duplicates)
fmt.Printf("%s", dedupe)
}
Here the order of the slice is also kept:
[1.1.1.1 1.0.0.1 1.2.3.4]
Golang Playground: https://go.dev/play/p/Q5yr_2FxQ_H