英文:
leaner algorithm to group a slice of structs by multiple criterion
问题
在下面的代码中,我实现了一个简单的策略,依赖于map
来按DepartmentID
和Country
对数据进行分组。然后我通过Team
对数据进行子分组。
一个可运行的示例,包括基准测试工具,可以在https://play.golang.org/p/oDx39XtEsQs找到。
代码的相关部分如下:
package main
type Country struct {
Code string `json:"code"`
Employees []Employee `json:"employees"`
}
type Department struct {
ID int `json:"department"`
Employees []Employee `json:"employees"`
}
type DepartmentCountry struct {
keyDepartmentCountry
Employees []Employee `json:"employees"`
}
type DepartmentCountryTeam struct {
keyDepartmentCountryTeam
Employees []Employee `json:"employees"`
}
type keyDepartmentCountry struct {
ID int `json:"department"`
Code string `json:"code"`
}
type keyDepartmentCountryTeam struct {
ID int `json:"department"`
Code string `json:"code"`
Team string `json:"team"`
}
type Employee struct {
ID int `json:"id"`
Name string `json:"name"`
DepartmentID int `json:"departmentid"`
Country string `json:"country"`
Team string `json:"team"`
}
func main(){
// ...
// 按部门和国家值进行分组。
byDeptAndCty := make(map[keyDepartmentCountry]DepartmentCountry)
for _, employee := range employees {
key := keyDepartmentCountry{ID: employee.DepartmentID, Code: employee.Country}
byDeptAndCty[key] = DepartmentCountry{
keyDepartmentCountry: key,
Employees: append(byDeptAndCty[key].Employees, employee),
}
}
// 对于每个分组,
for _, dept := range byDeptAndCty {
fmt.Printf("g1 %#v\n", dept)
// 按团队对员工进行分组。
byDeptAndCtyAndTeam := make(map[keyDepartmentCountryTeam]DepartmentCountryTeam)
for _, employee := range dept.Employees {
key := keyDepartmentCountryTeam{ID: employee.DepartmentID, Code: employee.Country, Team: employee.Team}
byDeptAndCtyAndTeam[key] = DepartmentCountryTeam{
keyDepartmentCountryTeam: key,
Employees: append(byDeptAndCtyAndTeam[key].Employees, employee),
}
}
// 打印团队成员
for _, team := range byDeptAndCtyAndTeam {
fmt.Printf(" sub %#v\n", team)
}
}
// ...
}
我正在寻找一种不需要分配太多内存的算法,而且可能执行速度更快,因为我的数据集太大,无法同时在我们计划使用的硬件上保留map
和原始数据。
英文:
In below code i have implemented a simple strategy relying on map
to group the data by DepartmentID
and Country
. I then sub-group the data by Team
.
A runnable example, including benchmark facilities, is at https://play.golang.org/p/oDx39XtEsQs
The relevant pieces of code
package main
type Country struct {
Code string `json:"code"`
Employees []Employee `json:"employees"`
}
type Department struct {
ID int `json:"department"`
Employees []Employee `json:"employees"`
}
type DepartmentCountry struct {
keyDepartmentCountry
Employees []Employee `json:"employees"`
}
type DepartmentCountryTeam struct {
keyDepartmentCountryTeam
Employees []Employee `json:"employees"`
}
type keyDepartmentCountry struct {
ID int `json:"department"`
Code string `json:"code"`
}
type keyDepartmentCountryTeam struct {
ID int `json:"department"`
Code string `json:"code"`
Team string `json:"team"`
}
type Employee struct {
ID int `json:"id"`
Name string `json:"name"`
DepartmentID int `json:"departmentid"`
Country string `json:"country"`
Team string `json:"team"`
}
func main(){
// ...
// group by department and country values.
byDeptAndCty := make(map[keyDepartmentCountry]DepartmentCountry)
for _, employee := range employees {
key := keyDepartmentCountry{ID: employee.DepartmentID, Code: employee.Country}
byDeptAndCty[key] = DepartmentCountry{
keyDepartmentCountry: key,
Employees: append(byDeptAndCty[key].Employees, employee),
}
}
// foreach groups,
for _, dept := range byDeptAndCty {
fmt.Printf("g1 %#v\n", dept)
// group employees by team.
byDeptAndCtyAndTeam := make(map[keyDepartmentCountryTeam]DepartmentCountryTeam)
for _, employee := range dept.Employees {
key := keyDepartmentCountryTeam{ID: employee.DepartmentID, Code: employee.Country, Team: employee.Team}
byDeptAndCtyAndTeam[key] = DepartmentCountryTeam{
keyDepartmentCountryTeam: key,
Employees: append(byDeptAndCtyAndTeam[key].Employees, employee),
}
}
// print team members
for _, team := range byDeptAndCtyAndTeam {
fmt.Printf(" sub %#v\n", team)
}
}
// ...
}
I am searching to implement an algorithm that does not need to allocate so much,
and which possibly performs faster, because my dataset is too big to keep both the map
and the original data in memory on the hardware we plan to use.
答案1
得分: 2
可以使用类似以下的代码:
countries := make(map[string]*CountryData)
type CountryData struct {
Country *Country
Depts map[int]*DeptData
}
type DeptData struct {
Dept *Department
Employees []*Employee // Employees in country-dept
Teams map[int][]*Employee // Employees in country-dept-team
}
上述数据结构保存了对员工/国家/部门结构的指针,因此它们在内存中只有一个实例。带有嵌套映射的树结构允许您使用不同的键遍历树。
英文:
Something like this can be used:
countries:=map[string]*CountryData{}
type CountryData struct {
Country *Country
Depts map[int]*DeptData
}
type DeptData struct {
Dept *Department
Employees []*Employee // Employees in country-dept
Teams map[int][]*Employee // Employees in country-dept-team
}
The above data structure keeps pointers to the employee/country/department structures, so only one instance of them are in memory. The tree structure with nested maps allow you to traverse the tree using different keys.
答案2
得分: 1
之前我发现了一种基于排序的解决方案。
func withSort(employees []Employee, quiet bool) {
sort.Slice(employees, multiless(
func(i, j int) bool { return employees[i].DepartmentID < employees[j].DepartmentID },
func(i, j int) bool { return employees[i].Country < employees[j].Country },
func(i, j int) bool { return employees[i].Team < employees[j].Team },
))
g1 := and(
func(i, j int) bool { return employees[i].DepartmentID == employees[j].DepartmentID },
func(i, j int) bool { return employees[i].Country == employees[j].Country },
)
g2 := and(g1,
func(i, j int) bool { return employees[i].Team == employees[j].Team },
)
iter2 := iter2(len(employees), g1)
for iter2.More() {
i, j := iter2.Next()
if !quiet {
fmt.Printf("g1 %#v\n", employees[i:j])
}
sub := iter2.Sub(g2)
for sub.More() {
i, j = sub.Next()
if !quiet {
fmt.Printf(" sub %#v\n", employees[i:j])
}
}
}
}
内存分配确实减少了,但是网络速度不如被接受的解决方案好。
$ go test -v -bench=. -count=1 | prettybench
benchmark iter time/iter bytes alloc allocs
--------- ---- --------- ----------- ------
BenchmarkAll/_____map-__getEmployees-________10-4 44650 26.15 μs/op 2729 B/op 23 allocs/op
BenchmarkAll/_____map-__getEmployees-_______100-4 10000 102.03 μs/op 44968 B/op 111 allocs/op
BenchmarkAll/_____map-__getEmployees-______1000-4 1834 621.99 μs/op 360355 B/op 177 allocs/op
BenchmarkAll/_____map-__getEmployees-_____10000-4 194 6187.37 μs/op 2883473 B/op 243 allocs/op
BenchmarkAll/_____map-_getEmployees2-________10-4 46804 25.66 μs/op 2688 B/op 17 allocs/op
BenchmarkAll/_____map-_getEmployees2-_______100-4 13699 87.88 μs/op 44672 B/op 40 allocs/op
BenchmarkAll/_____map-_getEmployees2-______1000-4 2038 586.26 μs/op 392832 B/op 59 allocs/op
BenchmarkAll/_____map-_getEmployees2-_____10000-4 126 8893.38 μs/op 5832321 B/op 96 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-________10-4 47112 25.35 μs/op 1561 B/op 24 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_______100-4 14434 83.64 μs/op 6842 B/op 112 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-______1000-4 2450 477.21 μs/op 46267 B/op 178 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_____10000-4 276 4400.12 μs/op 361643 B/op 244 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-________10-4 50901 23.82 μs/op 416 B/op 15 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-_______100-4 18030 66.31 μs/op 6176 B/op 31 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-______1000-4 2732 420.04 μs/op 49184 B/op 43 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-_____10000-4 284 4234.28 μs/op 660512 B/op 69 allocs/op
BenchmarkAll/____tree-__getEmployees-________10-4 35811 33.35 μs/op 8765 B/op 112 allocs/op
BenchmarkAll/____tree-__getEmployees-_______100-4 16935 70.20 μs/op 14045 B/op 200 allocs/op
BenchmarkAll/____tree-__getEmployees-______1000-4 4036 286.15 μs/op 53469 B/op 266 allocs/op
BenchmarkAll/____tree-__getEmployees-_____10000-4 498 2434.40 μs/op 368865 B/op 332 allocs/op
BenchmarkAll/____tree-_getEmployees2-________10-4 51200 23.19 μs/op 1616 B/op 30 allocs/op
BenchmarkAll/____tree-_getEmployees2-_______100-4 24274 49.31 μs/op 6864 B/op 53 allocs/op
BenchmarkAll/____tree-_getEmployees2-______1000-4 4863 246.96 μs/op 50384 B/op 72 allocs/op
BenchmarkAll/____tree-_getEmployees2-_____10000-4 510 2393.52 μs/op 647888 B/op 106 allocs/op
BenchmarkAll/____sort-__getEmployees-________10-4 49437 22.20 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-__getEmployees-_______100-4 17683 67.53 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-__getEmployees-______1000-4 2367 484.35 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-__getEmployees-_____10000-4 244 4939.36 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-_getEmployees2-________10-4 54217 22.17 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-_getEmployees2-_______100-4 20193 59.68 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-_getEmployees2-______1000-4 3055 386.61 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-_getEmployees2-_____10000-4 310 3758.04 μs/op 496 B/op 15 allocs/op
ok test/d/sort/post 48.982s
以一些内存为代价,被接受的解决方案提供了更好的网络速度和更容易维护的优势。
BenchmarkAll/_____map-__getEmployees-_____10000-4 194 6187.37 μs/op 2883473 B/op 243 allocs/op
BenchmarkAll/____sort-__getEmployees-_____10000-4 244 4939.36 μs/op 496 B/op 15 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_____10000-4 276 4400.12 μs/op 361643 B/op 244 allocs/op
BenchmarkAll/____tree-__getEmployees-_____10000-4 498 2434.40 μs/op 368865 B/op 332 allocs/op
英文:
previously i had found that sort based solution.
https://play.golang.org/p/RfRmAuJgsj8
func withSort(employees []Employee, quiet bool) {
sort.Slice(employees, multiless(
func(i, j int) bool { return employees[i].DepartmentID < employees[j].DepartmentID },
func(i, j int) bool { return employees[i].Country < employees[j].Country },
func(i, j int) bool { return employees[i].Team < employees[j].Team },
))
g1 := and(
func(i, j int) bool { return employees[i].DepartmentID == employees[j].DepartmentID },
func(i, j int) bool { return employees[i].Country == employees[j].Country },
)
g2 := and(g1,
func(i, j int) bool { return employees[i].Team == employees[j].Team },
)
iter2 := iter2(len(employees), g1)
for iter2.More() {
i, j := iter2.Next()
if !quiet {
fmt.Printf("g1 %#v\n", employees[i:j])
}
sub := iter2.Sub(g2)
for sub.More() {
i, j = sub.Next()
if !quiet {
fmt.Printf(" sub %#v\n", employees[i:j])
}
}
}
}
memory allocations is really reduced, but, the net speed is not as good as the accepted solution.
$ go test -v -bench=. -count=1 | prettybench
benchmark iter time/iter bytes alloc allocs
--------- ---- --------- ----------- ------
BenchmarkAll/_____map-__getEmployees-________10-4 44650 26.15 μs/op 2729 B/op 23 allocs/op
BenchmarkAll/_____map-__getEmployees-_______100-4 10000 102.03 μs/op 44968 B/op 111 allocs/op
BenchmarkAll/_____map-__getEmployees-______1000-4 1834 621.99 μs/op 360355 B/op 177 allocs/op
BenchmarkAll/_____map-__getEmployees-_____10000-4 194 6187.37 μs/op 2883473 B/op 243 allocs/op
BenchmarkAll/_____map-_getEmployees2-________10-4 46804 25.66 μs/op 2688 B/op 17 allocs/op
BenchmarkAll/_____map-_getEmployees2-_______100-4 13699 87.88 μs/op 44672 B/op 40 allocs/op
BenchmarkAll/_____map-_getEmployees2-______1000-4 2038 586.26 μs/op 392832 B/op 59 allocs/op
BenchmarkAll/_____map-_getEmployees2-_____10000-4 126 8893.38 μs/op 5832321 B/op 96 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-________10-4 47112 25.35 μs/op 1561 B/op 24 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_______100-4 14434 83.64 μs/op 6842 B/op 112 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-______1000-4 2450 477.21 μs/op 46267 B/op 178 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_____10000-4 276 4400.12 μs/op 361643 B/op 244 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-________10-4 50901 23.82 μs/op 416 B/op 15 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-_______100-4 18030 66.31 μs/op 6176 B/op 31 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-______1000-4 2732 420.04 μs/op 49184 B/op 43 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-_____10000-4 284 4234.28 μs/op 660512 B/op 69 allocs/op
BenchmarkAll/____tree-__getEmployees-________10-4 35811 33.35 μs/op 8765 B/op 112 allocs/op
BenchmarkAll/____tree-__getEmployees-_______100-4 16935 70.20 μs/op 14045 B/op 200 allocs/op
BenchmarkAll/____tree-__getEmployees-______1000-4 4036 286.15 μs/op 53469 B/op 266 allocs/op
BenchmarkAll/____tree-__getEmployees-_____10000-4 498 2434.40 μs/op 368865 B/op 332 allocs/op
BenchmarkAll/____tree-_getEmployees2-________10-4 51200 23.19 μs/op 1616 B/op 30 allocs/op
BenchmarkAll/____tree-_getEmployees2-_______100-4 24274 49.31 μs/op 6864 B/op 53 allocs/op
BenchmarkAll/____tree-_getEmployees2-______1000-4 4863 246.96 μs/op 50384 B/op 72 allocs/op
BenchmarkAll/____tree-_getEmployees2-_____10000-4 510 2393.52 μs/op 647888 B/op 106 allocs/op
BenchmarkAll/____sort-__getEmployees-________10-4 49437 22.20 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-__getEmployees-_______100-4 17683 67.53 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-__getEmployees-______1000-4 2367 484.35 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-__getEmployees-_____10000-4 244 4939.36 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-_getEmployees2-________10-4 54217 22.17 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-_getEmployees2-_______100-4 20193 59.68 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-_getEmployees2-______1000-4 3055 386.61 μs/op 496 B/op 15 allocs/op
BenchmarkAll/____sort-_getEmployees2-_____10000-4 310 3758.04 μs/op 496 B/op 15 allocs/op
ok test/d/sort/post 48.982s
At the cost of some memory the accepted solution offers an easier maintenance with much better net speed.
BenchmarkAll/_____map-__getEmployees-_____10000-4 194 6187.37 μs/op 2883473 B/op 243 allocs/op
BenchmarkAll/____sort-__getEmployees-_____10000-4 244 4939.36 μs/op 496 B/op 15 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_____10000-4 276 4400.12 μs/op 361643 B/op 244 allocs/op
BenchmarkAll/____tree-__getEmployees-_____10000-4 498 2434.40 μs/op 368865 B/op 332 allocs/op
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论