一个用于按多个标准对结构体切片进行分组的更高效的算法。

huangapple go评论80阅读模式
英文:

leaner algorithm to group a slice of structs by multiple criterion

问题

在下面的代码中,我实现了一个简单的策略,依赖于map来按DepartmentIDCountry对数据进行分组。然后我通过Team对数据进行子分组。

一个可运行的示例,包括基准测试工具,可以在https://play.golang.org/p/oDx39XtEsQs找到。

代码的相关部分如下:

package main

type Country struct {
	Code      string     `json:"code"`
	Employees []Employee `json:"employees"`
}
type Department struct {
	ID        int        `json:"department"`
	Employees []Employee `json:"employees"`
}

type DepartmentCountry struct {
	keyDepartmentCountry
	Employees []Employee `json:"employees"`
}
type DepartmentCountryTeam struct {
	keyDepartmentCountryTeam
	Employees []Employee `json:"employees"`
}
type keyDepartmentCountry struct {
	ID   int    `json:"department"`
	Code string `json:"code"`
}
type keyDepartmentCountryTeam struct {
	ID   int    `json:"department"`
	Code string `json:"code"`
	Team string `json:"team"`
}

type Employee struct {
	ID           int    `json:"id"`
	Name         string `json:"name"`
	DepartmentID int    `json:"departmentid"`
	Country      string `json:"country"`
	Team         string `json:"team"`
}

func main(){
// ...
	// 按部门和国家值进行分组。
	byDeptAndCty := make(map[keyDepartmentCountry]DepartmentCountry)
	for _, employee := range employees {
		key := keyDepartmentCountry{ID: employee.DepartmentID, Code: employee.Country}
		byDeptAndCty[key] = DepartmentCountry{
			keyDepartmentCountry: key,
			Employees:            append(byDeptAndCty[key].Employees, employee),
		}
	}

	// 对于每个分组,
	for _, dept := range byDeptAndCty {
		fmt.Printf("g1 %#v\n", dept)
		// 按团队对员工进行分组。
		byDeptAndCtyAndTeam := make(map[keyDepartmentCountryTeam]DepartmentCountryTeam)
		for _, employee := range dept.Employees {
			key := keyDepartmentCountryTeam{ID: employee.DepartmentID, Code: employee.Country, Team: employee.Team}
			byDeptAndCtyAndTeam[key] = DepartmentCountryTeam{
				keyDepartmentCountryTeam: key,
				Employees:                append(byDeptAndCtyAndTeam[key].Employees, employee),
			}
		}
		// 打印团队成员
		for _, team := range byDeptAndCtyAndTeam {
			fmt.Printf("  sub %#v\n", team)
		}
	}
// ...
}

我正在寻找一种不需要分配太多内存的算法,而且可能执行速度更快,因为我的数据集太大,无法同时在我们计划使用的硬件上保留map和原始数据。

英文:

In below code i have implemented a simple strategy relying on map to group the data by DepartmentID
and Country. I then sub-group the data by Team.

A runnable example, including benchmark facilities, is at https://play.golang.org/p/oDx39XtEsQs

The relevant pieces of code

package main

type Country struct {
	Code      string     `json:"code"`
	Employees []Employee `json:"employees"`
}
type Department struct {
	ID        int        `json:"department"`
	Employees []Employee `json:"employees"`
}

type DepartmentCountry struct {
	keyDepartmentCountry
	Employees []Employee `json:"employees"`
}
type DepartmentCountryTeam struct {
	keyDepartmentCountryTeam
	Employees []Employee `json:"employees"`
}
type keyDepartmentCountry struct {
	ID   int    `json:"department"`
	Code string `json:"code"`
}
type keyDepartmentCountryTeam struct {
	ID   int    `json:"department"`
	Code string `json:"code"`
	Team string `json:"team"`
}

type Employee struct {
	ID           int    `json:"id"`
	Name         string `json:"name"`
	DepartmentID int    `json:"departmentid"`
	Country      string `json:"country"`
	Team         string `json:"team"`
}

func main(){
// ...
	// group by department and country values.
	byDeptAndCty := make(map[keyDepartmentCountry]DepartmentCountry)
	for _, employee := range employees {
		key := keyDepartmentCountry{ID: employee.DepartmentID, Code: employee.Country}
		byDeptAndCty[key] = DepartmentCountry{
			keyDepartmentCountry: key,
			Employees:            append(byDeptAndCty[key].Employees, employee),
		}
	}

	// foreach groups,
	for _, dept := range byDeptAndCty {
		fmt.Printf("g1 %#v\n", dept)
		// group employees by team.
		byDeptAndCtyAndTeam := make(map[keyDepartmentCountryTeam]DepartmentCountryTeam)
		for _, employee := range dept.Employees {
			key := keyDepartmentCountryTeam{ID: employee.DepartmentID, Code: employee.Country, Team: employee.Team}
			byDeptAndCtyAndTeam[key] = DepartmentCountryTeam{
				keyDepartmentCountryTeam: key,
				Employees:                append(byDeptAndCtyAndTeam[key].Employees, employee),
			}
		}
		// print team members
		for _, team := range byDeptAndCtyAndTeam {
			fmt.Printf("  sub %#v\n", team)
		}
	}
// ...
}

I am searching to implement an algorithm that does not need to allocate so much,
and which possibly performs faster, because my dataset is too big to keep both the map
and the original data in memory on the hardware we plan to use.

答案1

得分: 2

可以使用类似以下的代码:

countries := make(map[string]*CountryData)

type CountryData struct {
   Country *Country
   Depts   map[int]*DeptData
}

type DeptData struct {
   Dept      *Department
   Employees []*Employee  // Employees in country-dept
   Teams     map[int][]*Employee // Employees in country-dept-team
}

上述数据结构保存了对员工/国家/部门结构的指针,因此它们在内存中只有一个实例。带有嵌套映射的树结构允许您使用不同的键遍历树。

英文:

Something like this can be used:

countries:=map[string]*CountryData{}
type CountryData struct {
Country *Country
Depts map[int]*DeptData
}
type DeptData struct {
Dept *Department
Employees []*Employee  // Employees in country-dept
Teams map[int][]*Employee // Employees in country-dept-team
}

The above data structure keeps pointers to the employee/country/department structures, so only one instance of them are in memory. The tree structure with nested maps allow you to traverse the tree using different keys.

答案2

得分: 1

之前我发现了一种基于排序的解决方案。

func withSort(employees []Employee, quiet bool) {
	sort.Slice(employees, multiless(
		func(i, j int) bool { return employees[i].DepartmentID < employees[j].DepartmentID },
		func(i, j int) bool { return employees[i].Country < employees[j].Country },
		func(i, j int) bool { return employees[i].Team < employees[j].Team },
	))
	g1 := and(
		func(i, j int) bool { return employees[i].DepartmentID == employees[j].DepartmentID },
		func(i, j int) bool { return employees[i].Country == employees[j].Country },
	)
	g2 := and(g1,
		func(i, j int) bool { return employees[i].Team == employees[j].Team },
	)
	iter2 := iter2(len(employees), g1)
	for iter2.More() {
		i, j := iter2.Next()
		if !quiet {
			fmt.Printf("g1 %#v\n", employees[i:j])
		}
		sub := iter2.Sub(g2)
		for sub.More() {
			i, j = sub.Next()
			if !quiet {
				fmt.Printf("  sub %#v\n", employees[i:j])
			}
		}
	}
}

内存分配确实减少了,但是网络速度不如被接受的解决方案好。

$ go test -v -bench=. -count=1 | prettybench
benchmark                                            iter        time/iter    bytes alloc          allocs
---------                                            ----        ---------    -----------          ------
BenchmarkAll/_____map-__getEmployees-________10-4   44650      26.15 μs/op      2729 B/op    23 allocs/op
BenchmarkAll/_____map-__getEmployees-_______100-4   10000     102.03 μs/op     44968 B/op   111 allocs/op
BenchmarkAll/_____map-__getEmployees-______1000-4    1834     621.99 μs/op    360355 B/op   177 allocs/op
BenchmarkAll/_____map-__getEmployees-_____10000-4     194    6187.37 μs/op   2883473 B/op   243 allocs/op
BenchmarkAll/_____map-_getEmployees2-________10-4   46804      25.66 μs/op      2688 B/op    17 allocs/op
BenchmarkAll/_____map-_getEmployees2-_______100-4   13699      87.88 μs/op     44672 B/op    40 allocs/op
BenchmarkAll/_____map-_getEmployees2-______1000-4    2038     586.26 μs/op    392832 B/op    59 allocs/op
BenchmarkAll/_____map-_getEmployees2-_____10000-4     126    8893.38 μs/op   5832321 B/op    96 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-________10-4   47112      25.35 μs/op      1561 B/op    24 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_______100-4   14434      83.64 μs/op      6842 B/op   112 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-______1000-4    2450     477.21 μs/op     46267 B/op   178 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_____10000-4     276    4400.12 μs/op    361643 B/op   244 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-________10-4   50901      23.82 μs/op       416 B/op    15 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-_______100-4   18030      66.31 μs/op      6176 B/op    31 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-______1000-4    2732     420.04 μs/op     49184 B/op    43 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-_____10000-4     284    4234.28 μs/op    660512 B/op    69 allocs/op
BenchmarkAll/____tree-__getEmployees-________10-4   35811      33.35 μs/op      8765 B/op   112 allocs/op
BenchmarkAll/____tree-__getEmployees-_______100-4   16935      70.20 μs/op     14045 B/op   200 allocs/op
BenchmarkAll/____tree-__getEmployees-______1000-4    4036     286.15 μs/op     53469 B/op   266 allocs/op
BenchmarkAll/____tree-__getEmployees-_____10000-4     498    2434.40 μs/op    368865 B/op   332 allocs/op
BenchmarkAll/____tree-_getEmployees2-________10-4   51200      23.19 μs/op      1616 B/op    30 allocs/op
BenchmarkAll/____tree-_getEmployees2-_______100-4   24274      49.31 μs/op      6864 B/op    53 allocs/op
BenchmarkAll/____tree-_getEmployees2-______1000-4    4863     246.96 μs/op     50384 B/op    72 allocs/op
BenchmarkAll/____tree-_getEmployees2-_____10000-4     510    2393.52 μs/op    647888 B/op   106 allocs/op
BenchmarkAll/____sort-__getEmployees-________10-4   49437      22.20 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-__getEmployees-_______100-4   17683      67.53 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-__getEmployees-______1000-4    2367     484.35 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-__getEmployees-_____10000-4     244    4939.36 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-_getEmployees2-________10-4   54217      22.17 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-_getEmployees2-_______100-4   20193      59.68 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-_getEmployees2-______1000-4    3055     386.61 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-_getEmployees2-_____10000-4     310    3758.04 μs/op       496 B/op    15 allocs/op
ok  	test/d/sort/post	48.982s

以一些内存为代价,被接受的解决方案提供了更好的网络速度和更容易维护的优势。

BenchmarkAll/_____map-__getEmployees-_____10000-4     194    6187.37 μs/op   2883473 B/op   243 allocs/op
BenchmarkAll/____sort-__getEmployees-_____10000-4     244    4939.36 μs/op       496 B/op    15 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_____10000-4     276    4400.12 μs/op    361643 B/op   244 allocs/op
BenchmarkAll/____tree-__getEmployees-_____10000-4     498    2434.40 μs/op    368865 B/op   332 allocs/op
英文:

previously i had found that sort based solution.

https://play.golang.org/p/RfRmAuJgsj8


func withSort(employees []Employee, quiet bool) {
	sort.Slice(employees, multiless(
		func(i, j int) bool { return employees[i].DepartmentID &lt; employees[j].DepartmentID },
		func(i, j int) bool { return employees[i].Country &lt; employees[j].Country },
		func(i, j int) bool { return employees[i].Team &lt; employees[j].Team },
	))
	g1 := and(
		func(i, j int) bool { return employees[i].DepartmentID == employees[j].DepartmentID },
		func(i, j int) bool { return employees[i].Country == employees[j].Country },
	)
	g2 := and(g1,
		func(i, j int) bool { return employees[i].Team == employees[j].Team },
	)
	iter2 := iter2(len(employees), g1)
	for iter2.More() {
		i, j := iter2.Next()
		if !quiet {
			fmt.Printf(&quot;g1 %#v\n&quot;, employees[i:j])
		}
		sub := iter2.Sub(g2)
		for sub.More() {
			i, j = sub.Next()
			if !quiet {
				fmt.Printf(&quot;  sub %#v\n&quot;, employees[i:j])
			}
		}
	}
}

memory allocations is really reduced, but, the net speed is not as good as the accepted solution.

$ go test -v -bench=. -count=1 | prettybench
benchmark                                            iter        time/iter    bytes alloc          allocs
---------                                            ----        ---------    -----------          ------
BenchmarkAll/_____map-__getEmployees-________10-4   44650      26.15 μs/op      2729 B/op    23 allocs/op
BenchmarkAll/_____map-__getEmployees-_______100-4   10000     102.03 μs/op     44968 B/op   111 allocs/op
BenchmarkAll/_____map-__getEmployees-______1000-4    1834     621.99 μs/op    360355 B/op   177 allocs/op
BenchmarkAll/_____map-__getEmployees-_____10000-4     194    6187.37 μs/op   2883473 B/op   243 allocs/op
BenchmarkAll/_____map-_getEmployees2-________10-4   46804      25.66 μs/op      2688 B/op    17 allocs/op
BenchmarkAll/_____map-_getEmployees2-_______100-4   13699      87.88 μs/op     44672 B/op    40 allocs/op
BenchmarkAll/_____map-_getEmployees2-______1000-4    2038     586.26 μs/op    392832 B/op    59 allocs/op
BenchmarkAll/_____map-_getEmployees2-_____10000-4     126    8893.38 μs/op   5832321 B/op    96 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-________10-4   47112      25.35 μs/op      1561 B/op    24 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_______100-4   14434      83.64 μs/op      6842 B/op   112 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-______1000-4    2450     477.21 μs/op     46267 B/op   178 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_____10000-4     276    4400.12 μs/op    361643 B/op   244 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-________10-4   50901      23.82 μs/op       416 B/op    15 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-_______100-4   18030      66.31 μs/op      6176 B/op    31 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-______1000-4    2732     420.04 μs/op     49184 B/op    43 allocs/op
BenchmarkAll/__mapPtr-_getEmployees2-_____10000-4     284    4234.28 μs/op    660512 B/op    69 allocs/op
BenchmarkAll/____tree-__getEmployees-________10-4   35811      33.35 μs/op      8765 B/op   112 allocs/op
BenchmarkAll/____tree-__getEmployees-_______100-4   16935      70.20 μs/op     14045 B/op   200 allocs/op
BenchmarkAll/____tree-__getEmployees-______1000-4    4036     286.15 μs/op     53469 B/op   266 allocs/op
BenchmarkAll/____tree-__getEmployees-_____10000-4     498    2434.40 μs/op    368865 B/op   332 allocs/op
BenchmarkAll/____tree-_getEmployees2-________10-4   51200      23.19 μs/op      1616 B/op    30 allocs/op
BenchmarkAll/____tree-_getEmployees2-_______100-4   24274      49.31 μs/op      6864 B/op    53 allocs/op
BenchmarkAll/____tree-_getEmployees2-______1000-4    4863     246.96 μs/op     50384 B/op    72 allocs/op
BenchmarkAll/____tree-_getEmployees2-_____10000-4     510    2393.52 μs/op    647888 B/op   106 allocs/op
BenchmarkAll/____sort-__getEmployees-________10-4   49437      22.20 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-__getEmployees-_______100-4   17683      67.53 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-__getEmployees-______1000-4    2367     484.35 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-__getEmployees-_____10000-4     244    4939.36 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-_getEmployees2-________10-4   54217      22.17 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-_getEmployees2-_______100-4   20193      59.68 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-_getEmployees2-______1000-4    3055     386.61 μs/op       496 B/op    15 allocs/op
BenchmarkAll/____sort-_getEmployees2-_____10000-4     310    3758.04 μs/op       496 B/op    15 allocs/op
ok  	test/d/sort/post	48.982s

At the cost of some memory the accepted solution offers an easier maintenance with much better net speed.

BenchmarkAll/_____map-__getEmployees-_____10000-4     194    6187.37 μs/op   2883473 B/op   243 allocs/op
BenchmarkAll/____sort-__getEmployees-_____10000-4     244    4939.36 μs/op       496 B/op    15 allocs/op
BenchmarkAll/__mapPtr-__getEmployees-_____10000-4     276    4400.12 μs/op    361643 B/op   244 allocs/op
BenchmarkAll/____tree-__getEmployees-_____10000-4     498    2434.40 μs/op    368865 B/op   332 allocs/op

huangapple
  • 本文由 发表于 2021年9月8日 03:32:55
  • 转载请务必保留本文链接:https://go.coder-hub.com/69093695.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定