Why does the following golang program throw a runtime out of memory error?

huangapple go评论83阅读模式
英文:

Why does the following golang program throw a runtime out of memory error?

问题

这个程序的作用是读取一个由一对整数组成的文件(每行一对),并删除重复的对。虽然它在小文件上运行正常,但在大文件上(比如1.5GB的文件)会抛出运行时错误。最初,我以为是map数据结构导致了这个问题,但即使将其注释掉,它仍然会耗尽内存。有什么想法为什么会发生这种情况?如何纠正它?这是一个会耗尽内存的数据文件:http://snap.stanford.edu/data/com-Orkut.html

package main
import (
	"fmt"
	"bufio"
	"os"
	"strings"
	"strconv"
)

func main() {
	file, err := os.Open(os.Args[1])
	if err != nil {
		panic(err.Error())
	}
	defer file.Close()
	type Edge struct {
		u, v int
	}
	//seen := make(map[Edge]bool)
	edges := []Edge{}
	scanner := bufio.NewScanner(file)
	
	for i, _ := strconv.Atoi(os.Args[2]); i > 0; i-- {
		scanner.Scan()
	}
	
	for scanner.Scan() {
		str := scanner.Text()
		edge := strings.Split(str, ",")
		u, _ := strconv.Atoi(edge[0])
		v, _ := strconv.Atoi(edge[1])
		var key Edge
		if u < v {
			key = Edge{u,v}
		} else {
			key = Edge{v,u}
		}
		//if seen[key] {
		//	continue
		//}
		//seen[key] = true
		edges = append(edges, key)
	}
	for _, e := range edges {
		s := strconv.Itoa(e.u) + "," + strconv.Itoa(e.v)
		fmt.Println(s)
	}
}

以下是一个示例输入。可以按照以下方式运行程序(最后一个输入表示要跳过多少行)。
go run undup.go a.txt 1

# 3072441,117185083
1,2
1,3
1,4
1,5
1,6
1,7
1,8
英文:

This program is supposed to read a file consisting of pairs of ints (one pair per line) and remove duplicate pairs. While it works on small files, it throws a runtime error on huge files (say a file of 1.5 GB). Initially, I thought that it is the map data structure which is causing this, but even after commenting it out, it still runs out of memory. Any ideas why this is happening? How to rectify it? Here's a data file on which it runs out of memory: http://snap.stanford.edu/data/com-Orkut.html

package main
import (
	&quot;fmt&quot;
	&quot;bufio&quot;
	&quot;os&quot;
	&quot;strings&quot;
	&quot;strconv&quot;
)

func main() {
	file, err := os.Open(os.Args[1])
	if err != nil {
		panic(err.Error())
	}
	defer file.Close()
	type Edge struct {
		u, v int
	}
	//seen := make(map[Edge]bool)
	edges := []Edge{}
	scanner := bufio.NewScanner(file)
	
	for i, _ := strconv.Atoi(os.Args[2]); i &gt; 0; i-- {
		scanner.Scan()
	}
	
	for scanner.Scan() {
		str := scanner.Text()
		edge := strings.Split(str, &quot;,&quot;)
		u, _ := strconv.Atoi(edge[0])
		v, _ := strconv.Atoi(edge[1])
		var key Edge
		if u &lt; v {
			key = Edge{u,v}
		} else {
			key = Edge{v,u}
		}
		//if seen[key] {
		//	continue
		//}
		//seen[key] = true
		edges = append(edges, key)
	}
	for _, e := range edges {
		s := strconv.Itoa(e.u) + &quot;,&quot; + strconv.Itoa(e.v)
		fmt.Println(s)
	}
}

A sample input is given below. The program can be run as follows (where the last input says how many lines to skip).
go run undup.go a.txt 1

# 3072441,117185083
1,2
1,3
1,4
1,5
1,6
1,7
1,8

答案1

得分: 5

我查看了这个文件:com-orkut.ungraph.txt,它包含117,185,082行。根据你的数据结构,每行至少占用16个字节(边是两个64位整数)。仅这一点就已经占用了1.7GB的空间。我过去也遇到过这个问题,它可能比较棘手。你是为了特定的用例(即所讨论的文件)还是一般情况下解决这个问题?

对于特定情况,你可以利用数据的一些特点:(1)键是排序的,(2)它似乎将每个连接存储了两次,(3)数字似乎不是很大。以下是一些想法:

  1. 如果你使用较小的类型作为键,将会使用更少的内存。尝试使用uint32

  2. 你可以通过检查第二列是否大于第一列,将键流式传输(而不使用映射)到另一个文件中:

     if u < v {
         // 将键写入另一个文件
     } else {
         // 跳过它,因为v最终会显示v -> u
     }
    

对于一般情况,你可以使用以下几种策略:

  1. 如果结果列表的顺序不重要:使用磁盘上的哈希表存储映射。有很多这样的哈希表可供选择:leveldb、sqlite、tokyo tyrant等。对于Go语言来说,一个非常好的选择是bolt

    在你的for循环中,你只需要检查一个桶是否包含给定的键即可(你可以使用encoding/binary将整数转换为字节切片)。如果包含,就跳过它并继续。你需要将第二个for循环的处理步骤移到第一个for循环中,这样就不需要存储所有的键了。

  2. 如果结果列表的顺序很重要(且无法保证输入是有序的):你也可以使用磁盘上的哈希表,但它需要是有序的。Bolt是有序的,所以可以使用它。将所有的键添加到哈希表中,然后在第二个循环中遍历它。

以下是一个示例程序(使用1亿条记录运行时间会比较长):

package main

import (
	&quot;bufio&quot;
	&quot;encoding/binary&quot;
	&quot;fmt&quot;
	&quot;github.com/boltdb/bolt&quot;
	&quot;os&quot;
	&quot;strconv&quot;
	&quot;strings&quot;
)

type Edge struct {
	u, v int
}

func FromKey(bs []byte) Edge {
	return Edge{int(binary.BigEndian.Uint64(bs[:8])), int(binary.BigEndian.Uint64(bs[8:]))}
}

func (e Edge) Key() [16]byte {
	var k [16]byte
	binary.BigEndian.PutUint64(k[:8], uint64(e.u))
	binary.BigEndian.PutUint64(k[8:], uint64(e.v))
	return k
}

func main() {
	file, err := os.Open(os.Args[1])
	if err != nil {
		panic(err.Error())
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)

	for i, _ := strconv.Atoi(os.Args[2]); i > 0; i-- {
		scanner.Scan()
	}

	db, _ := bolt.Open(&quot;ex.db&quot;, 0777, nil)
	defer db.Close()

	bucketName := []byte(&quot;edges&quot;)
	db.Update(func(tx *bolt.Tx) error {
		tx.CreateBucketIfNotExists(bucketName)
		return nil
	})

	batchSize := 10000
	total := 0
	batch := make([]Edge, 0, batchSize)
	writeBatch := func() {
		total += len(batch)
		fmt.Println(&quot;write batch. total:&quot;, total)
		db.Update(func(tx *bolt.Tx) error {
			bucket := tx.Bucket(bucketName)
			for _, edge := range batch {
				key := edge.Key()
				bucket.Put(key[:], nil)
			}
			return nil
		})
	}

	for scanner.Scan() {
		str := scanner.Text()
		edge := strings.Split(str, &quot;\t&quot;)
		u, _ := strconv.Atoi(edge[0])
		v, _ := strconv.Atoi(edge[1])
		var key Edge
		if u < v {
			key = Edge{u, v}
		} else {
			key = Edge{v, u}
		}
		batch = append(batch, key)
		if len(batch) == batchSize {
			writeBatch()
			// 重置批处理长度为0
			batch = batch[:0]
		}
	}
	// 写入剩余的内容
	writeBatch()

	db.View(func(tx *bolt.Tx) error {
		tx.Bucket(bucketName).ForEach(func(k, v []byte) error {
			edge := FromKey(k)
			fmt.Println(edge)
			return nil
		})
		return nil
	})
}
英文:

I looked at this file: com-orkut.ungraph.txt and it contains 117,185,082 lines. The way your data is structured, that's at least 16 bytes per line. (Edge is two 64bit ints) That alone is 1.7GB. I have had this problem in the past, and it can be a tricky one. Are you trying to solve this for a specific use case (the file in question) or the general case?

In the specific case there are a few things about the data you could leverage: (1) the keys are sorted and (2) it looks it stores every connection twice, (3) the numbers don't seem huge. Here are a couple ideas:

  1. If you use a smaller type for the key you will use less memory. Try a uint32.

  2. You could stream (without using a map) the keys to another file by simply seeing if the 2nd column is greater than the first:

     if u &lt; v {
         // write the key to another file
     } else {
         // skip it because v will eventually show v -&gt; u
     }
    

For the general case there are a couple strategies you could use:

  1. If the order of the resulting list doesn't matter: Use an on-disk hash table to store the map. There are a bunch of these: leveldb, sqlite, tokyo tyrant, ... A really nice one for go is bolt.

In your for loop you would just check to see if a bucket contains the given key. (You can convert the ints into byte slices using encoding/binary) If it does, just skip it and continue. You will need to move the second for loop processing step into the first for loop so that you don't have to store all the keys.

  1. If the order of the resulting list does matter (and you can't guarantee the input is in order): You can also use an on-disk hash table, but it needs to be sorted. Bolt is sorted so that will work. Add all the keys to it, then traverse it in the second loop.

Here is an example: (this program will take a while to run with 100 million records)

package main

import (
	&quot;bufio&quot;
	&quot;encoding/binary&quot;
	&quot;fmt&quot;
	&quot;github.com/boltdb/bolt&quot;
	&quot;os&quot;
	&quot;strconv&quot;
	&quot;strings&quot;
)

type Edge struct {
	u, v int
}

func FromKey(bs []byte) Edge {
	return Edge{int(binary.BigEndian.Uint64(bs[:8])), int(binary.BigEndian.Uint64(bs[8:]))}
}

func (e Edge) Key() [16]byte {
	var k [16]byte
	binary.BigEndian.PutUint64(k[:8], uint64(e.u))
	binary.BigEndian.PutUint64(k[8:], uint64(e.v))
	return k
}

func main() {
	file, err := os.Open(os.Args[1])
	if err != nil {
		panic(err.Error())
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)

	for i, _ := strconv.Atoi(os.Args[2]); i &gt; 0; i-- {
		scanner.Scan()
	}

	db, _ := bolt.Open(&quot;ex.db&quot;, 0777, nil)
	defer db.Close()

	bucketName := []byte(&quot;edges&quot;)
	db.Update(func(tx *bolt.Tx) error {
		tx.CreateBucketIfNotExists(bucketName)
		return nil
	})

	batchSize := 10000
	total := 0
	batch := make([]Edge, 0, batchSize)
	writeBatch := func() {
		total += len(batch)
		fmt.Println(&quot;write batch. total:&quot;, total)
		db.Update(func(tx *bolt.Tx) error {
			bucket := tx.Bucket(bucketName)
			for _, edge := range batch {
				key := edge.Key()
				bucket.Put(key[:], nil)
			}
			return nil
		})
	}

	for scanner.Scan() {
		str := scanner.Text()
		edge := strings.Split(str, &quot;\t&quot;)
		u, _ := strconv.Atoi(edge[0])
		v, _ := strconv.Atoi(edge[1])
		var key Edge
		if u &lt; v {
			key = Edge{u, v}
		} else {
			key = Edge{v, u}
		}
		batch = append(batch, key)
		if len(batch) == batchSize {
			writeBatch()
			// reset the batch length to 0
			batch = batch[:0]
		}
	}
	// write anything leftover
	writeBatch()

	db.View(func(tx *bolt.Tx) error {
		tx.Bucket(bucketName).ForEach(func(k, v []byte) error {
			edge := FromKey(k)
			fmt.Println(edge)
			return nil
		})
		return nil
	})
}

答案2

得分: 2

你在浪费内存。以下是如何纠正的方法。

你给出了示例输入 a.txt,大小为48字节。

# 3072441,117185083
1,2
1,3
1,4
1,5

http://snap.stanford.edu/data/com-Orkut.html 上,我找到了 http://snap.stanford.edu/data/bigdata/communities/com-orkut.ungraph.txt.gz,未压缩大小为1.8 GB,包含117,185,083条边。

# Undirected graph: ../../data/output/orkut.txt
# Orkut
# Nodes: 3072441 Edges: 117185083
# FromNodeId	ToNodeId
1	2
1	3
1	4
1	5

http://socialnetworks.mpi-sws.org/data-imc2007.html 上,我找到了 http://socialnetworks.mpi-sws.mpg.de/data/orkut-links.txt.gz,未压缩大小为3.4 GB,包含223,534,301条边。

1	2
1	3
1	4
1	5

由于它们的格式相似,一个程序可以处理所有格式。

你的 Edge 类型是

type Edge struct {
    u, v int
}

在64位架构上占用16字节。

使用

type Edge struct {
	U, V uint32
}

它占用8字节,足够了。

如果切片的容量不足以容纳额外的值,append 函数会分配一个新的足够大的底层数组,以适应现有切片元素和额外值的大小。否则,append 函数会重用底层数组。对于大型切片,新数组的大小是旧数组大小的1.25倍。在将旧数组复制到新数组时,需要旧数组大小的1 + 1.25 = 2.25倍的内存。因此,分配底层数组时要确保所有值都能容纳。

make(T, n) 用于初始化具有初始空间为 n 的类型为 T 的映射。为 n 提供一个值可以限制添加元素时的重组和碎片化成本。哈希函数通常不完美,导致浪费空间。因此,可以消除映射,因为它是不必要的。为了消除重复项,可以原地对切片进行排序,并将唯一元素向下移动。

string 是不可变的,因此为了将 scanner.Text()byte 切片缓冲区转换为新的 string,会分配一个新的 string。为了解析数字,我们使用 strconv 包。为了最小化临时分配,使用 scanner.Bytes() 并适应接受字节数组参数的 strconv.ParseUint 函数(bytconv)。

例如,

orkut.go

package main

import (
	&quot;bufio&quot;
	&quot;bytes&quot;
	&quot;errors&quot;
	&quot;fmt&quot;
	&quot;os&quot;
	&quot;runtime&quot;
	&quot;sort&quot;
	&quot;strconv&quot;
)

type Edge struct {
	U, V uint32
}

func (e Edge) String() string {
	return fmt.Sprintf(&quot;%d,%d&quot;, e.U, e.V)
}

type ByKey []Edge

func (a ByKey) Len() int      { return len(a) }
func (a ByKey) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a ByKey) Less(i, j int) bool {
	if a[i].U &lt; a[j].U {
		return true
	}
	if a[i].U == a[j].U &amp;&amp; a[i].V &lt; a[j].V {
		return true
	}
	return false
}

func countEdges(scanner *bufio.Scanner) int {
	var nNodes, nEdges int
	for scanner.Scan() {
		line := scanner.Bytes()
		if !(len(line) &gt; 0 &amp;&amp; line[0] == &#39;#&#39;) {
			nEdges++
			continue
		}
		n, err := fmt.Sscanf(string(line), &quot;# Nodes: %d Edges: %d&quot;, &amp;nNodes, &amp;nEdges)
		if err != nil || n != 2 {
			n, err = fmt.Sscanf(string(line), &quot;# %d,%d&quot;, &amp;nNodes, &amp;nEdges)
			if err != nil || n != 2 {
				continue
			}
		}
		fmt.Println(string(line))
		break
	}
	if err := scanner.Err(); err != nil {
		panic(err.Error())
	}
	fmt.Println(nEdges)
	return nEdges
}

func loadEdges(filename string) []Edge {
	file, err := os.Open(filename)
	if err != nil {
		panic(err.Error())
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	nEdges := countEdges(scanner)
	edges := make([]Edge, 0, nEdges)
	offset, err := file.Seek(0, os.SEEK_SET)
	if err != nil || offset != 0 {
		panic(err.Error())
	}

	var sep byte = &#39;\t&#39;
	scanner = bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Bytes()
		if len(line) &gt; 0 &amp;&amp; line[0] == &#39;#&#39; {
			continue
		}
		i := bytes.IndexByte(line, sep)
		if i &lt; 0 || i+1 &gt;= len(line) {
			sep = &#39;,&#39;
			i = bytes.IndexByte(line, sep)
			if i &lt; 0 || i+1 &gt;= len(line) {
				err := errors.New(&quot;Invalid line format: &quot; + string(line))
				panic(err.Error())
			}
		}
		u, err := ParseUint(line[:i], 10, 32)
		if err != nil {
			panic(err.Error())
		}
		v, err := ParseUint(line[i+1:], 10, 32)
		if err != nil {
			panic(err.Error())
		}
		if u &gt; v {
			u, v = v, u
		}
		edges = append(edges, Edge{uint32(u), uint32(v)})
	}
	if err := scanner.Err(); err != nil {
		panic(err.Error())
	}

	if len(edges) &lt;= 1 {
		return edges
	}
	sort.Sort(ByKey(edges))
	j := 0
	i := j + 1
	for ; i &lt; len(edges); i, j = i+1, j+1 {
		if edges[i] == edges[j] {
			break
		}
	}
	for ; i &lt; len(edges); i++ {
		if edges[i] != edges[j] {
			j++
			edges[j] = edges[i]
		}
	}
	edges = edges[:j+1]
	return edges
}

func main() {
	if len(os.Args) &lt;= 1 {
		err := errors.New(&quot;Missing file name&quot;)
		panic(err.Error())
	}
	filename := os.Args[1]
	fmt.Println(filename)
	edges := loadEdges(filename)

	var ms runtime.MemStats
	runtime.ReadMemStats(&amp;ms)
	fmt.Println(ms.Alloc, ms.TotalAlloc, ms.Sys, ms.Mallocs, ms.Frees)
	fmt.Println(len(edges), cap(edges))
	for i, e := range edges {
		fmt.Println(e)
		if i &gt;= 10 {
			break
		}
	}
}

// bytconv from strconv

// Return the first number n such that n*base &gt;= 1&lt;&lt;64.
func cutoff64(base int) uint64 {
	if base &lt; 2 {
		return 0
	}
	return (1&lt;&lt;64-1)/uint64(base) + 1
}

// ParseUint is like ParseInt but for unsigned numbers.
func ParseUint(s []byte, base int, bitSize int) (n uint64, err error) {
	var cutoff, maxVal uint64

	if bitSize == 0 {
		bitSize = int(strconv.IntSize)
	}

	s0 := s
	switch {
	case len(s) &lt; 1:
		err = strconv.ErrSyntax
		goto Error

	case 2 &lt;= base &amp;&amp; base &lt;= 36:
		// valid base; nothing to do

	case base == 0:
		// Look for octal, hex prefix.
		switch {
		case s[0] == &#39;0&#39; &amp;&amp; len(s) &gt; 1 &amp;&amp; (s[1] == &#39;x&#39; || s[1] == &#39;X&#39;):
			base = 16
			s = s[2:]
			if len(s) &lt; 1 {
				err = strconv.ErrSyntax
				goto Error
			}
		case s[0] == &#39;0&#39;:
			base = 8
		default:
			base = 10
		}

	default:
		err = errors.New(&quot;invalid base &quot; + strconv.Itoa(base))
		goto Error
	}

	n = 0
	cutoff = cutoff64(base)
	maxVal = 1&lt;&lt;uint(bitSize) - 1

	for i := 0; i &lt; len(s); i++ {
		var v byte
		d := s[i]
		switch {
		case &#39;0&#39; &lt;= d &amp;&amp; d &lt;= &#39;9&#39;:
			v = d - &#39;0&#39;
		case &#39;a&#39; &lt;= d &amp;&amp; d &lt;= &#39;z&#39;:
			v = d - &#39;a&#39; + 10
		case &#39;A&#39; &lt;= d &amp;&amp; d &lt;= &#39;Z&#39;:
			v = d - &#39;A&#39; + 10
		default:
			n = 0
			err = strconv.ErrSyntax
			goto Error
		}
		if int(v) &gt;= base {
			n = 0
			err = strconv.ErrSyntax
			goto Error
		}

		if n &gt;= cutoff {
			// n*base overflows
			n = 1&lt;&lt;64 - 1
			err = strconv.ErrRange
			goto Error
		}
		n *= uint64(base)

		n1 := n + uint64(v)
		if n1 &lt; n || n1 &gt; maxVal {
			// n+v overflows
			n = 1&lt;&lt;64 - 1
			err = strconv.ErrRange
			goto Error
		}
		n = n1
	}

	return n, nil

Error:
	return n, &amp;strconv.NumError{&quot;ParseUint&quot;, string(s0), err}
}

输出:

$ go build orkut.go
$ time ./orkut ~/release-orkut-links.txt
/home/peter/release-orkut-links.txt
223534301
1788305680 1788327856 1904683256 135 50
117185083 223534301
1,2
1,3
1,4
1,5
1,6
1,7
1,8
1,9
1,10
1,11
1,12
real	2m53.203s
user	2m51.584s
sys	0m1.628s
$

orkut.go 程序使用 release-orkut-links.txt 文件(3,372,855,860(3.4 GB)字节,包含223,534,301条边)大约使用了1.8 GiB的内存。消除重复项后,剩下117,185,083条唯一边。这与117,185,083条唯一边的 com-orkut.ungraph.txt 文件相匹配。

在你的机器上,如果有8 GB 的内存,可以加载更大的文件。

英文:

You are squandering memory. Here's how to rectify it.

You give the sample input a.txt, 48 bytes.

# 3072441,117185083
1,2
1,3
1,4
1,5

On http://snap.stanford.edu/data/com-Orkut.html, I found http://snap.stanford.edu/data/bigdata/communities/com-orkut.ungraph.txt.gz, 1.8 GB uncompressed, 117,185,083 edges.

# Undirected graph: ../../data/output/orkut.txt
# Orkut
# Nodes: 3072441 Edges: 117185083
# FromNodeId	ToNodeId
1	2
1	3
1	4
1	5

On http://socialnetworks.mpi-sws.org/data-imc2007.html, I found http://socialnetworks.mpi-sws.mpg.de/data/orkut-links.txt.gz, 3.4 GB uncompressed, 223,534,301 edges.

1	2
1	3
1	4
1	5

Since they are similar, one program can handle all formats.

Your Edge type is

type Edge struct {
    u, v int
}

which is 16 bytes on a 64-bit architecture.

Use

type Edge struct {
	U, V uint32
}

which is 8 bytes, it is adequate.

If the capacity of a slice is not large enough to fit the additional values, append allocates a new, sufficiently large underlying array that fits both the existing slice elements and the additional values. Otherwise, append re-uses the underlying array. For a large slice, the new array is 1.25 times the size of the old array. While the old array is being copied to the new array, 1 + 1.25 = 2.25 times the memory for the old array is required. Therefore, allocate the underlying array so that all values fit.

make(T, n) initializes map of type T with initial space for n elements. Provide a value for n to limit the cost of reorganization and fragmentation as elements are added. Hashing functions are often imperfect which leads to wasted space. Eliminate the map as it's unneccesary. To eliminate duplicates, sort the slice in place and move the unique elements down.

A string is immutable, therefore a new string is allocated for scanner.Text() to convert from a byte slice buffer. To parse numbers we use strconv. To minimize temporary allocations, use scanner.Bytes() and adapt strconv.ParseUint to accept a byte array argument (bytconv).

For example,

orkut.go

package main

import (
	&quot;bufio&quot;
	&quot;bytes&quot;
	&quot;errors&quot;
	&quot;fmt&quot;
	&quot;os&quot;
	&quot;runtime&quot;
	&quot;sort&quot;
	&quot;strconv&quot;
)

type Edge struct {
	U, V uint32
}

func (e Edge) String() string {
	return fmt.Sprintf(&quot;%d,%d&quot;, e.U, e.V)
}

type ByKey []Edge

func (a ByKey) Len() int      { return len(a) }
func (a ByKey) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a ByKey) Less(i, j int) bool {
	if a[i].U &lt; a[j].U {
		return true
	}
	if a[i].U == a[j].U &amp;&amp; a[i].V &lt; a[j].V {
		return true
	}
	return false
}

func countEdges(scanner *bufio.Scanner) int {
	var nNodes, nEdges int
	for scanner.Scan() {
		line := scanner.Bytes()
		if !(len(line) &gt; 0 &amp;&amp; line[0] == &#39;#&#39;) {
			nEdges++
			continue
		}
		n, err := fmt.Sscanf(string(line), &quot;# Nodes: %d Edges: %d&quot;, &amp;nNodes, &amp;nEdges)
		if err != nil || n != 2 {
			n, err = fmt.Sscanf(string(line), &quot;# %d,%d&quot;, &amp;nNodes, &amp;nEdges)
			if err != nil || n != 2 {
				continue
			}
		}
		fmt.Println(string(line))
		break
	}
	if err := scanner.Err(); err != nil {
		panic(err.Error())
	}
	fmt.Println(nEdges)
	return nEdges
}

func loadEdges(filename string) []Edge {
	file, err := os.Open(filename)
	if err != nil {
		panic(err.Error())
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	nEdges := countEdges(scanner)
	edges := make([]Edge, 0, nEdges)
	offset, err := file.Seek(0, os.SEEK_SET)
	if err != nil || offset != 0 {
		panic(err.Error())
	}

	var sep byte = &#39;\t&#39;
	scanner = bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Bytes()
		if len(line) &gt; 0 &amp;&amp; line[0] == &#39;#&#39; {
			continue
		}
		i := bytes.IndexByte(line, sep)
		if i &lt; 0 || i+1 &gt;= len(line) {
			sep = &#39;,&#39;
			i = bytes.IndexByte(line, sep)
			if i &lt; 0 || i+1 &gt;= len(line) {
				err := errors.New(&quot;Invalid line format: &quot; + string(line))
				panic(err.Error())
			}
		}
		u, err := ParseUint(line[:i], 10, 32)
		if err != nil {
			panic(err.Error())
		}
		v, err := ParseUint(line[i+1:], 10, 32)
		if err != nil {
			panic(err.Error())
		}
		if u &gt; v {
			u, v = v, u
		}
		edges = append(edges, Edge{uint32(u), uint32(v)})
	}
	if err := scanner.Err(); err != nil {
		panic(err.Error())
	}

	if len(edges) &lt;= 1 {
		return edges
	}
	sort.Sort(ByKey(edges))
	j := 0
	i := j + 1
	for ; i &lt; len(edges); i, j = i+1, j+1 {
		if edges[i] == edges[j] {
			break
		}
	}
	for ; i &lt; len(edges); i++ {
		if edges[i] != edges[j] {
			j++
			edges[j] = edges[i]
		}
	}
	edges = edges[:j+1]
	return edges
}

func main() {
	if len(os.Args) &lt;= 1 {
		err := errors.New(&quot;Missing file name&quot;)
		panic(err.Error())
	}
	filename := os.Args[1]
	fmt.Println(filename)
	edges := loadEdges(filename)

	var ms runtime.MemStats
	runtime.ReadMemStats(&amp;ms)
	fmt.Println(ms.Alloc, ms.TotalAlloc, ms.Sys, ms.Mallocs, ms.Frees)
	fmt.Println(len(edges), cap(edges))
	for i, e := range edges {
		fmt.Println(e)
		if i &gt;= 10 {
			break
		}
	}
}

// bytconv from strconv

// Return the first number n such that n*base &gt;= 1&lt;&lt;64.
func cutoff64(base int) uint64 {
	if base &lt; 2 {
		return 0
	}
	return (1&lt;&lt;64-1)/uint64(base) + 1
}

// ParseUint is like ParseInt but for unsigned numbers.
func ParseUint(s []byte, base int, bitSize int) (n uint64, err error) {
	var cutoff, maxVal uint64

	if bitSize == 0 {
		bitSize = int(strconv.IntSize)
	}

	s0 := s
	switch {
	case len(s) &lt; 1:
		err = strconv.ErrSyntax
		goto Error

	case 2 &lt;= base &amp;&amp; base &lt;= 36:
		// valid base; nothing to do

	case base == 0:
		// Look for octal, hex prefix.
		switch {
		case s[0] == &#39;0&#39; &amp;&amp; len(s) &gt; 1 &amp;&amp; (s[1] == &#39;x&#39; || s[1] == &#39;X&#39;):
			base = 16
			s = s[2:]
			if len(s) &lt; 1 {
				err = strconv.ErrSyntax
				goto Error
			}
		case s[0] == &#39;0&#39;:
			base = 8
		default:
			base = 10
		}

	default:
		err = errors.New(&quot;invalid base &quot; + strconv.Itoa(base))
		goto Error
	}

	n = 0
	cutoff = cutoff64(base)
	maxVal = 1&lt;&lt;uint(bitSize) - 1

	for i := 0; i &lt; len(s); i++ {
		var v byte
		d := s[i]
		switch {
		case &#39;0&#39; &lt;= d &amp;&amp; d &lt;= &#39;9&#39;:
			v = d - &#39;0&#39;
		case &#39;a&#39; &lt;= d &amp;&amp; d &lt;= &#39;z&#39;:
			v = d - &#39;a&#39; + 10
		case &#39;A&#39; &lt;= d &amp;&amp; d &lt;= &#39;Z&#39;:
			v = d - &#39;A&#39; + 10
		default:
			n = 0
			err = strconv.ErrSyntax
			goto Error
		}
		if int(v) &gt;= base {
			n = 0
			err = strconv.ErrSyntax
			goto Error
		}

		if n &gt;= cutoff {
			// n*base overflows
			n = 1&lt;&lt;64 - 1
			err = strconv.ErrRange
			goto Error
		}
		n *= uint64(base)

		n1 := n + uint64(v)
		if n1 &lt; n || n1 &gt; maxVal {
			// n+v overflows
			n = 1&lt;&lt;64 - 1
			err = strconv.ErrRange
			goto Error
		}
		n = n1
	}

	return n, nil

Error:
	return n, &amp;strconv.NumError{&quot;ParseUint&quot;, string(s0), err}
}

Output:

$ go build orkut.go
$ time ./orkut ~/release-orkut-links.txt
/home/peter/release-orkut-links.txt
223534301
1788305680 1788327856 1904683256 135 50
117185083 223534301
1,2
1,3
1,4
1,5
1,6
1,7
1,8
1,9
1,10
1,11
1,12
real	2m53.203s
user	2m51.584s
sys	0m1.628s
$

The orkut.go program with the release-orkut-links.txt file (3,372,855,860 (3.4 GB) bytes with 223,534,301 edges) uses about 1.8 GiB of memory. After eliminating duplicates, 117,185,083 unique edges remain. This matches the 117,185,083 unique edge com-orkut.ungraph.txt file.

With 8 GB of memory on your machine, you can load much larger files.

huangapple
  • 本文由 发表于 2014年10月12日 17:48:15
  • 转载请务必保留本文链接:https://go.coder-hub.com/26323832.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定