英文:
Grouping numbers based on occurrences?
问题
给定以下三个数字序列,我想找出如何将这些数字分组以找到它们之间最接近的关系。
1,2,3,4
4,3,5
2,1,3
...
我不确定我要找的算法叫什么,但我们可以看到某些数字之间的关系比其他数字更强。
这些数字出现在一起两次:
1和2
1和3
2和3
3和4
出现在一起一次:
1和4
2和4
3和5
4和5
例如,我们可以看到1, 2和3
之间必定存在关系,因为它们至少出现在一起两次。你也可以说3和4
之间关系密切,因为它们也出现了两次。然而,算法可能会选择[1,2,3]
(而不是[3,4]
),因为它是一个更大的分组(更包容)。
如果我们将最常用的数字放在一起组成一个组,我们可以形成以下任意一种分组:
[1,2,3]和[4,5]
[1,2]和[3,4]和[5]
[1,2]和[3,4,5]
[1,2]和[3,4]和[5]
如果允许重复,甚至可以得到以下分组:
[1,2,3,4] [1,2,3] [3,4] [5]
我无法确定哪种分组最“正确”,但这四种组合都以不同的方式半正确地分组了这些数字。我不是在寻找特定的分组方式,只是一个相当好理解且工作良好的通用聚类算法。
我相信还有许多其他使用出现次数来分组的方法。对于这些数字,什么是一个好的基本分组算法?最好提供Go、JavaScript或PHP的示例代码。
英文:
Given the following three sequences of numbers, I would like to figure out how to group the numbers to find the closest relations between them.
1,2,3,4
4,3,5
2,1,3
...
I'm not sure what the algorithm(s) I'm looking for are called, but we can see stronger relations with some of the numbers than with others.
These numbers appear together twice:
1 & 2
1 & 3
2 & 3
3 & 4
Together once:
1 & 4
2 & 4
3 & 5
4 & 5
So for example, we can see there must be a relationship between 1, 2, & 3
since they all appear together at least twice. You could also say that 3 & 4
are closely related since they also appear twice. However, the algorithm might pick [1,2,3]
(over [3,4]
) since it's a bigger grouping (more inclusive).
We can form any of the following groupings if we stick the numbers used most often together in a group:
[1,2,3] & [4,5]
[1,2] & [3,4] & [5]
[1,2] & [3,4,5]
[1,2] & [3,4] & [5]
If duplicates are allowed, you could even end up with the following groups:
[1,2,3,4] [1,2,3] [3,4] [5]
I can't say which grouping is most "correct", but all four of these combos all find different ways of semi-correctly grouping the numbers. I'm not looking for a specific grouping - just a general cluster algorithm that works fairly well and is easy to understand.
I'm sure there are many other ways to use the occurrence count to group them as well. What would be a good base grouping algorithm for these? Samples in Go, Javascript, or PHP are preferred.
答案1
得分: 31
你的三个序列可以理解为多重图中的团。在一个团中,每个顶点都与其他每个顶点相连。
下图表示了你的示例情况,其中每个团的边分别用红色、蓝色和绿色表示。
正如你已经展示的那样,我们可以根据它们之间的边的数量对顶点对进行分类。在图中,我们可以看到有四对顶点由两条边连接,还有四对顶点由一条边连接。
我们可以继续根据顶点出现在的团的数量对顶点进行分类。在某种意义上,我们正在根据它们的连通性对顶点进行排序。出现在k
个团中的顶点可以被认为与出现在k
个团中的其他顶点具有相同的连接程度。在图中,我们可以看到三组顶点:顶点3出现在三个团中;顶点1、2和4分别出现在两个团中;顶点5出现在一个团中。
下面的Go程序计算了边的分类以及顶点的分类。程序的输入包含在第一行中,其中包括顶点的数量n
和团的数量m
。我们假设顶点从1到n
编号。输入中的每个后续的m
行都是一个由空格分隔的属于一个团的顶点列表。因此,问题中给出的实例可以用以下输入表示:
<!-- language=lang-html -->
5 3
1 2 3 4
4 3 5
2 1 3
相应的输出是:
<!-- language=lang-html -->
Number of edges between pairs of vertices:
2 edges: (1, 2) (1, 3) (2, 3) (3, 4)
1 edge: (1, 4) (2, 4) (3, 5) (4, 5)
Number of cliques in which a vertex appears:
3 cliques: 3
2 cliques: 1 2 4
1 clique: 5
以下是Go程序的代码:
<!-- language=lang-go -->
package main
import (
"bufio"
"fmt"
"os"
"strconv"
"strings"
)
func main() {
// 设置输入和输出。
reader := bufio.NewReader(os.Stdin)
writer := bufio.NewWriter(os.Stdout)
defer writer.Flush()
// 从第一行获取顶点的数量和团的数量。
line, err := reader.ReadString('\n')
if err != nil {
fmt.Fprintf(os.Stderr, "读取第一行时出错:%s\n", err)
return
}
var numVertices, numCliques int
numScanned, err := fmt.Sscanf(line, "%d %d", &numVertices, &numCliques)
if numScanned != 2 || err != nil {
fmt.Fprintf(os.Stderr, "解析输入参数时出错:%s\n", err)
return
}
// 初始化边的计数和顶点的计数。
edgeCounts := make([][]int, numVertices+1)
for u := 1; u <= numVertices; u++ {
edgeCounts[u] = make([]int, numVertices+1)
}
vertexCounts := make([]int, numVertices+1)
// 读取每个团并更新边的计数。
for c := 0; c < numCliques; c++ {
line, err = reader.ReadString('\n')
if err != nil {
fmt.Fprintf(os.Stderr, "读取团时出错:%s\n", err)
return
}
tokens := strings.Split(strings.TrimSpace(line), " ")
clique := make([]int, len(tokens))
for i, token := range tokens {
u, err := strconv.Atoi(token)
if err != nil {
fmt.Fprintf(os.Stderr, "Atoi 错误:%s\n", err)
return
}
vertexCounts[u]++
clique[i] = u
for j := 0; j < i; j++ {
v := clique[j]
edgeCounts[u][v]++
edgeCounts[v][u]++
}
}
}
// 计算每对顶点之间的边的数量。
count2edges := make([][][]int, numCliques+1)
for u := 1; u < numVertices; u++ {
for v := u + 1; v <= numVertices; v++ {
count := edgeCounts[u][v]
count2edges[count] = append(count2edges[count],
[]int{u, v})
}
}
writer.WriteString("顶点对之间的边的数量:\n")
for count := numCliques; count >= 1; count-- {
edges := count2edges[count]
if len(edges) == 0 {
continue
}
label := "边"
if count > 1 {
label += "s:"
} else {
label += ":"
}
writer.WriteString(fmt.Sprintf("%5d %s", count, label))
for _, edge := range edges {
writer.WriteString(fmt.Sprintf(" (%d, %d)",
edge[0], edge[1]))
}
writer.WriteString("\n")
}
// 根据团的成员数量对顶点进行分组。
count2vertices := make([][]int, numCliques+1)
for u := 1; u <= numVertices; u++ {
count := vertexCounts[u]
count2vertices[count] = append(count2vertices[count], u)
}
writer.WriteString("\n顶点出现在的团的数量:\n")
for count := numCliques; count >= 1; count-- {
vertices := count2vertices[count]
if len(vertices) == 0 {
continue
}
label := "团"
if count > 1 {
label += "s:"
} else {
label += ":"
}
writer.WriteString(fmt.Sprintf("%5d %s", count, label))
for _, u := range vertices {
writer.WriteString(fmt.Sprintf(" %d", u))
}
writer.WriteString("\n")
}
}
英文:
Each of your three sequences can be understood as a clique in a multigraph. Within a clique, every vertex is connected to every other vertex.
The following graph represents your sample case with the edges in each clique colored red, blue, and green, respectively.
As you have already shown, we can classify pairs of vertices according to the number of edges between them. In the illustration, we can see that four pairs of vertices are connected by two edges each, and four other pairs of vertices are connected by one edge each.
We can go on to classify vertices according to the number of cliques in which they appear. In some sense we are ranking vertices according to their connectedness. A vertex that appears in k
cliques can be thought of as connected to the same degree as other vertices that appear in k
cliques. In the image, we see three groups of vertices: vertex 3 appears in three cliques; vertices 1, 2, and 4 each appear in two cliques; vertex 5 appears in one clique.
The Go program below computes the edge classification as well as the vertex classification. The input to the program contains, on the first line, the number of vertices n
and the number of cliques m
. We assume that the vertices are numbered from 1 to n
. Each of the succeeding m
lines of input is a space-separated list of vertices belonging to a clique. Thus, the problem instance given in the question is represented by this input:
<!-- language=lang-html -->
5 3
1 2 3 4
4 3 5
2 1 3
The corresponding output is:
<!-- language=lang-html -->
Number of edges between pairs of vertices:
2 edges: (1, 2) (1, 3) (2, 3) (3, 4)
1 edge: (1, 4) (2, 4) (3, 5) (4, 5)
Number of cliques in which a vertex appears:
3 cliques: 3
2 cliques: 1 2 4
1 clique: 5
And here is the Go program:
<!-- language=lang-go -->
package main
import (
"bufio"
"fmt"
"os"
"strconv"
"strings"
)
func main() {
// Set up input and output.
reader := bufio.NewReader(os.Stdin)
writer := bufio.NewWriter(os.Stdout)
defer writer.Flush()
// Get the number of vertices and number of cliques from the first line.
line, err := reader.ReadString('\n')
if err != nil {
fmt.Fprintf(os.Stderr, "Error reading first line: %s\n", err)
return
}
var numVertices, numCliques int
numScanned, err := fmt.Sscanf(line, "%d %d", &numVertices, &numCliques)
if numScanned != 2 || err != nil {
fmt.Fprintf(os.Stderr, "Error parsing input parameters: %s\n", err)
return
}
// Initialize the edge counts and vertex counts.
edgeCounts := make([][]int, numVertices+1)
for u := 1; u <= numVertices; u++ {
edgeCounts[u] = make([]int, numVertices+1)
}
vertexCounts := make([]int, numVertices+1)
// Read each clique and update the edge counts.
for c := 0; c < numCliques; c++ {
line, err = reader.ReadString('\n')
if err != nil {
fmt.Fprintf(os.Stderr, "Error reading clique: %s\n", err)
return
}
tokens := strings.Split(strings.TrimSpace(line), " ")
clique := make([]int, len(tokens))
for i, token := range tokens {
u, err := strconv.Atoi(token)
if err != nil {
fmt.Fprintf(os.Stderr, "Atoi error: %s\n", err)
return
}
vertexCounts[u]++
clique[i] = u
for j := 0; j < i; j++ {
v := clique[j]
edgeCounts[u][v]++
edgeCounts[v][u]++
}
}
}
// Compute the number of edges between each pair of vertices.
count2edges := make([][][]int, numCliques+1)
for u := 1; u < numVertices; u++ {
for v := u + 1; v <= numVertices; v++ {
count := edgeCounts[u][v]
count2edges[count] = append(count2edges[count],
[]int{u, v})
}
}
writer.WriteString("Number of edges between pairs of vertices:\n")
for count := numCliques; count >= 1; count-- {
edges := count2edges[count]
if len(edges) == 0 {
continue
}
label := "edge"
if count > 1 {
label += "s:"
} else {
label += ": "
}
writer.WriteString(fmt.Sprintf("%5d %s", count, label))
for _, edge := range edges {
writer.WriteString(fmt.Sprintf(" (%d, %d)",
edge[0], edge[1]))
}
writer.WriteString("\n")
}
// Group vertices according to the number of clique memberships.
count2vertices := make([][]int, numCliques+1)
for u := 1; u <= numVertices; u++ {
count := vertexCounts[u]
count2vertices[count] = append(count2vertices[count], u)
}
writer.WriteString("\nNumber of cliques in which a vertex appears:\n")
for count := numCliques; count >= 1; count-- {
vertices := count2vertices[count]
if len(vertices) == 0 {
continue
}
label := "clique"
if count > 1 {
label += "s:"
} else {
label += ": "
}
writer.WriteString(fmt.Sprintf("%5d %s", count, label))
for _, u := range vertices {
writer.WriteString(fmt.Sprintf(" %d", u))
}
writer.WriteString("\n")
}
}
答案2
得分: 10
如前所述,这是关于团的问题。如果你想要确切的答案,你将面临最大团问题,这是一个NP完全问题。因此,只有在你的符号(数字)的字母表大小合理时,下面的所有内容才有意义。在这种情况下,最直接、不太优化的最大团问题的伪代码算法如下:
Function Main
Cm ← ∅ // 最大团
Clique(∅,V) // V 顶点集合
return Cm
End function Main
Function Clique(set C, set P) // C 当前团,P 候选集合
if (|C| > |Cm|) then
Cm ← C
End if
if (|C|+|P|>|Cm|)then
for all p ∈ P in predetermined order, do
P ← P \ {p}
Cp ←C ∪ {p}
Pp ←P ∩ N(p) //N(p) 是与 p 相邻的顶点集合
Clique(Cp,Pp)
End for
End if
End function Clique
由于Go是我选择的语言,这里是实现代码:
package main
import (
"bufio"
"fmt"
"sort"
"strconv"
"strings"
)
var adjmatrix map[int]map[int]int = make(map[int]map[int]int)
var Cm []int = make([]int, 0)
var frequency int
// 用于过滤
type resoult [][]int
var res resoult
var filter map[int]bool = make(map[int]bool)
var bf int
// 用于过滤
// 用于排序
func (r resoult) Less(i, j int) bool {
return len(r[i]) > len(r[j])
}
func (r resoult) Swap(i, j int) {
r[i], r[j] = r[j], r[i]
}
func (r resoult) Len() int {
return len(r)
}
// 用于排序
// 工作在这里
func Clique(C []int, P map[int]bool) {
if len(C) >= len(Cm) {
Cm = make([]int, len(C))
copy(Cm, C)
}
if len(C)+len(P) >= len(Cm) {
for k, _ := range P {
delete(P, k)
Cp := make([]int, len(C)+1)
copy(Cp, append(C, k))
Pp := make(map[int]bool)
for n, m := range adjmatrix[k] {
_, ok := P[n]
if ok && m >= frequency {
Pp[n] = true
}
}
Clique(Cp, Pp)
res = append(res, Cp)
// 清理结果
bf := 0
for _, v := range Cp {
bf += 1 << uint(v)
}
_, ok := filter[bf]
if !ok {
filter[bf] = true
res = append(res, Cp)
}
// 清理结果
}
}
}
// 工作在这里
func main() {
var toks []string
var numbers []int
var number int
// 解析输入
StrReader := strings.NewReader(`1,2,3
4,3,5
4,1,6
4,2,7
4,1,7
2,1,3
5,1,2
3,6`)
scanner := bufio.NewScanner(StrReader)
for scanner.Scan() {
toks = strings.Split(scanner.Text(), ",")
numbers = []int{}
for _, v := range toks {
number, _ = strconv.Atoi(v)
numbers = append(numbers, number)
}
for k, v := range numbers {
for _, m := range numbers[k:] {
_, ok := adjmatrix[v]
if !ok {
adjmatrix[v] = make(map[int]int)
}
_, ok = adjmatrix[m]
if !ok {
adjmatrix[m] = make(map[int]int)
}
if m != v {
adjmatrix[v][m]++
adjmatrix[m][v]++
if adjmatrix[v][m] > frequency {
frequency = adjmatrix[v][m]
}
}
}
}
}
// 解析输入
P1 := make(map[int]bool)
// 迭代出现频率
for ; frequency > 0; frequency-- {
for k, _ := range adjmatrix {
P1[k] = true
}
Cm = make([]int, 0)
res = make(resoult, 0)
Clique(make([]int, 0), P1)
sort.Sort(res)
fmt.Print(frequency, "x-times ", res, " ")
}
// 迭代出现频率
}
你可以在这里看到它的工作原理:https://play.golang.org/p/ZiJfH4Q6GJ,并尝试不同的输入数据。但是再次强调,这种方法适用于字母表大小合理(任意大小的输入数据)。
英文:
As already been mentioned it's about clique. If you want exact answer you will face Maximum Clique Problem which is NP-complete. So all below make any sense only if alphabet of your symbols(numbers) has reasonable size. In this case strait-forward, not very optimised algorithm for Maximum Clique Problem in pseudo-code would be
Function Main
Cm ← ∅ // the maximum clique
Clique(∅,V) // V vertices set
return Cm
End function Main
Function Clique(set C, set P) // C the current clique, P candidat set
if (|C| > |Cm|) then
Cm ← C
End if
if (|C|+|P|>|Cm|)then
for all p ∈ P in predetermined order, do
P ← P \ {p}
Cp ←C ∪ {p}
Pp ←P ∩ N(p) //N(p) set of the vertices adjacent to p
Clique(Cp,Pp)
End for
End if
End function Clique
Because of Go is my language of choice here is implementation
package main
import (
"bufio"
"fmt"
"sort"
"strconv"
"strings"
)
var adjmatrix map[int]map[int]int = make(map[int]map[int]int)
var Cm []int = make([]int, 0)
var frequency int
//For filter
type resoult [][]int
var res resoult
var filter map[int]bool = make(map[int]bool)
var bf int
//For filter
//That's for sorting
func (r resoult) Less(i, j int) bool {
return len(r[i]) > len(r[j])
}
func (r resoult) Swap(i, j int) {
r[i], r[j] = r[j], r[i]
}
func (r resoult) Len() int {
return len(r)
}
//That's for sorting
//Work done here
func Clique(C []int, P map[int]bool) {
if len(C) >= len(Cm) {
Cm = make([]int, len(C))
copy(Cm, C)
}
if len(C)+len(P) >= len(Cm) {
for k, _ := range P {
delete(P, k)
Cp := make([]int, len(C)+1)
copy(Cp, append(C, k))
Pp := make(map[int]bool)
for n, m := range adjmatrix[k] {
_, ok := P[n]
if ok && m >= frequency {
Pp[n] = true
}
}
Clique(Cp, Pp)
res = append(res, Cp)
//Cleanup resoult
bf := 0
for _, v := range Cp {
bf += 1 << uint(v)
}
_, ok := filter[bf]
if !ok {
filter[bf] = true
res = append(res, Cp)
}
//Cleanup resoult
}
}
}
//Work done here
func main() {
var toks []string
var numbers []int
var number int
//Input parsing
StrReader := strings.NewReader(`1,2,3
4,3,5
4,1,6
4,2,7
4,1,7
2,1,3
5,1,2
3,6`)
scanner := bufio.NewScanner(StrReader)
for scanner.Scan() {
toks = strings.Split(scanner.Text(), ",")
numbers = []int{}
for _, v := range toks {
number, _ = strconv.Atoi(v)
numbers = append(numbers, number)
}
for k, v := range numbers {
for _, m := range numbers[k:] {
_, ok := adjmatrix[v]
if !ok {
adjmatrix[v] = make(map[int]int)
}
_, ok = adjmatrix[m]
if !ok {
adjmatrix[m] = make(map[int]int)
}
if m != v {
adjmatrix[v][m]++
adjmatrix[m][v]++
if adjmatrix[v][m] > frequency {
frequency = adjmatrix[v][m]
}
}
}
}
}
//Input parsing
P1 := make(map[int]bool)
//Iterating for frequency of appearance in group
for ; frequency > 0; frequency-- {
for k, _ := range adjmatrix {
P1[k] = true
}
Cm = make([]int, 0)
res = make(resoult, 0)
Clique(make([]int, 0), P1)
sort.Sort(res)
fmt.Print(frequency, "x-times ", res, " ")
}
//Iterating for frequency of appearing together
}
And here you can see it works https://play.golang.org/p/ZiJfH4Q6GJ and play with input data. But once more, this approach is for reasonable size alphabet(and input data of any size).
答案3
得分: 3
这个问题经常在分析销售数据时出现,特别是在规则挖掘的背景下。(哪些商品经常一起购买?这样它们可以在超市中放在一起。)
我遇到的一类算法是关联规则学习。其中一个固有的步骤是找到与你的任务匹配的频繁项集。一个算法是Apriori算法。但当你搜索这些关键词时,你会找到更多的算法。
英文:
This problem often arises in the context of rule mining when analyzing sales data. (Which items are bought together? So they can be placed next to each other in the supermarket)
One class of algorithms I came across is Association Rule Learning. And one inherent step is finding frequent itemsets which matches your task. One algorithm is Apriori. But you can find a lot more when searching for those keywords.
答案4
得分: 1
如果你能描述一下这种分组的目标,那会更好。如果没有,我可以尝试提供一种简单(我认为是简单的)且最基本的方法。这种方法不适用于需要计算大量分散的数字(如1、999999、31)或大数或非正数的情况。
你可以按照以下方式重新排列数字集合的数组位置:
|1|2|3|4|5|6| - 数字作为数组位置
==============
*1|1|1|1|1|0|0| *1
*2|0|0|1|1|1|0| *2
*4|1|1|1|0|0|0| *4
==============
+|2|2|3|2|1|0 - 只是出现次数的计数器
*|5|5|7|3|2|0 - 因此,对于第一列数字1,掩码将为:1*1+1*4 = 5
在+行中,你可以看到最常见的组合是[3],然后是[1,2]和[4],然后是[5],你还可以指示和区分不同组合的共现。
function grps(a) {
var r = [];
var sum = []; var mask = [];
var max = 0;
var val;
for (i=0; i < a.length; i++) {
for (j=0; j < a[i].length; j++) {
val = a[i][j];
//r[i][val] = 1;
sum[val] = sum[val]?sum[val]+1:1;
mask[val] = mask[val]?mask[val]+Math.pow(2, i):1;
if (val > max) { max = val; }
}
}
for (j = 0; j < max; j++){
for (i = 0; i < max; i++){
r[sum[j]][mask[j]] = j;
}
}
return r;
}
希望这能帮到你!
英文:
It would be better it you describe goal of such a grouping. If no i may try to suggest the simples (as i think) approach, and thow most limeted. It is not suitable if you need to count huge amount of wide spreaded (like 1, 999999, 31) or big or nonpositive numbers .
you can rearrange number sets in array positions like so:
|1|2|3|4|5|6| - numers as array positions
==============
*1|1|1|1|1|0|0| *1
*2|0|0|1|1|1|0| *2
*4|1|1|1|0|0|0| *4
==============
+|2|2|3|2|1|0 - just a counters of occurence
*|5|5|7|3|2|0 - so for first column number 1 mask will be: 1*1+1*4 = 5
here you can see in + row that most frequent combination is [3], then [1,2] and [4] and then [5],
also you can indicate and distinguish the cooccurence of different combinations
function grps(a) {
var r = [];
var sum = []; var mask = [];
var max = 0;
var val;
for (i=0; i < a.length; i++) {
for (j=0; j < a[i].length; j++) {
val = a[i][j];
//r[i][val] = 1;
sum[val] = sum[val]?sum[val]+1:1;
mask[val] = mask[val]?mask[val]+Math.pow(2, i):1;
if (val > max) { max = val; }
}
}
for (j = 0; j < max; j++){
for (i = 0; i < max; i++){
r[sum[j]][mask[j]] = j;
}
}
return r;
}
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论