优化字典的RAM使用情况

huangapple go评论64阅读模式
英文:

Optimize RAM usage of dictionary

问题

我可以使用字典来查找字符串数组中元素的频率。但是当数组中的元素数量达到大约5000万时,我的程序的RAM使用量约为8-9 GB。这远远超出了我的预期。

我的字典是一个Dictionary<string, int>,如果没有重复的键,那么5000万个键值对只会占用我约2到2.5GB的内存。

我想知道我哪里出错了。

获取元素频率的部分:

public IEnumerable<string> GetTopTenStrings(string path)
{
    // 用于存储结果的字典
    Dictionary<string, int> result = new Dictionary<string, int>();

    var txtFiles = Directory.EnumerateFiles(path, "*.dat");

    int i = 1;

    foreach (string currentFile in txtFiles)
    {
        using (FileStream fs = File.Open(currentFile, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader buffer = new StreamReader(bs))
        {
            Console.WriteLine("现在我们在文件 {0}", i);

            // 存储处理过的行                        
            string storage = buffer.ReadToEnd();

            Process(result, storage);  

            i++;
        }
    }

    // 对字典进行排序并返回所需的值
    var final = result.OrderByDescending(x => x.Value).Take(10);

    foreach (var element in final)
    {
        Console.WriteLine("{0} 出现 {1}", element.Key, element.Value);
    }

    var test = final.Select(x => x.Key).ToList();

    Console.WriteLine('\n');

    return test;
}

向字典添加键值的函数:

public void Process(Dictionary<string, int> dict, string storage)
{
    List<string> lines = new List<string>();

    string[] line = storage.Split(';');

    foreach (var item in line.ToList())
    {
        if(item.Trim().Length != 0)
        {
            if (dict.ContainsKey(item.ToLower()))
            {
                dict[item.ToLower()]++;
            }
            else
            {
                dict.Add(item.ToLower(), 1);
            }
        }
    }
}
英文:

I can use the dictionary to find the frequency of an element in a string array. But when the number of elements in an array reaches around 50 million, the RAM usage for my program is around 8-9 GB. This is way too high compared to what I expected.
My dictionary is a Dictionary&lt;string, int&gt;, 50 million key-value pairs (if there is no duplicate key) will only cost me around 2 - 2.5GB.
I wonder where I was wrong.

The part to get the frequency of elements:

public IEnumerable&lt;string&gt; GetTopTenStrings(string path)
{
    // Dictionary to store the result
    Dictionary&lt;string, int&gt; result = new Dictionary&lt;string, int&gt;();

    var txtFiles = Directory.EnumerateFiles(Path, &quot;*.dat&quot;);

    int i = 1;

    foreach (string currentFile in txtFiles)
    {
        using (FileStream fs = File.Open(currentFile, FileMode.Open,
            FileAccess.Read, FileShare.Read))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader buffer = new StreamReader(bs))
        {
            Console.WriteLine(&quot;Now we are at the file {0}&quot;, i);

            // Store processed lines                        
            string storage = buffer.ReadToEnd();

            Process(result, storage);  

            i++;
        }
    }

    // Sort the dictionary and return the needed values
    var final = result.OrderByDescending(x =&gt; x.Value).Take(10);

    foreach (var element in final)
    {
        Console.WriteLine(&quot;{0} appears {1}&quot;, element.Key, element.Value);
    }

    var test = final.Select(x =&gt; x.Key).ToList();

    Console.WriteLine(&#39;\n&#39;);

    return test;
}

The function to add key value to the dictionary:

public void Process(Dictionary&lt;string, int&gt; dict, string storage)
{
    List&lt;string&gt;lines = new List&lt;string&gt;();

    string[] line = storage.Split(&quot;;&quot;);

    foreach (var item in line.ToList())
    {
        if(item.Trim().Length != 0)
        {
            if (dict.ContainsKey(item.ToLower()))
            {
                dict[item.ToLower()]++;
            }
            else
            {
                dict.Add(item.ToLower(), 1);
            }
        }
    }
}

答案1

得分: 3

以下是翻译的内容:

这里有显著的性能和内存改进可能性:

  • 我们可以通过将整个文件保存在一个大字符串中,并使用 ReadOnlyMemory<char> 来保存对它的引用来改进内存使用。
  • 预先初始化字典到一些较大的大小,以避免重新调整大小。
  • 不要使用 .Trim(),而是使用 string.IsNullOrWhiteSpace 或类似方法。
  • 不要使用 .ToLower(),而是在字典上使用不区分大小写的比较器。如果你在使用字符串,可以使用 StringComparer,但对于 ReadOnlyMemory<char>,我们需要一个自定义的比较器。
  • 不要不必要地使用 .ToList()
  • BufferedStream 似乎是不必要的。
  • 在处理字符串之前关闭文件。
  • 我们可以使用 CollectionsMarshal.GetValueRefOrAddDefault 来避免对字典进行双重查找。
  • 不要使用 OrderByDescending.Take,它需要对整个列表进行排序,我们可以在固定大小的列表上使用插入排序。
    • 我们只需二进制搜索每个值的位置,然后插入到新位置。
    • 如果找不到值,二进制搜索返回索引的补码(负数)。

这一切都是基于重复字符串的数量不是非常高的假设。如果重复数量很高,那么将每个字符串拆分并使用 StringPool 进行 Intern 处理可能更有意义。

public IEnumerable<string> GetTopTenStrings(string path)
{
    // 用于存储结果的字典
    var result = new Dictionary<ReadOnlyMemory<char>, int>(一些大容量值, new MemoryComparer());

    int i = 1;

    foreach (string currentFile in Directory.EnumerateFiles(Path, "*.dat"))
    {
        string str;
        using (FileStream fs = File.Open(currentFile, FileMode.Open,
            FileAccess.Read, FileShare.Read))
        using (StreamReader sr = new StreamReader(fs))
        {
            Console.WriteLine($"现在我们在文件 {i} 上:{currentFile}");
            str = sr.ReadToEnd();

            i++;
        }
        Process(result, str);
    }

    // 对字典进行排序并返回所需的值
    var final = TopByOrderDesc(result, 10);

    foreach (var element in final)
    {
        Console.WriteLine("{0} 出现 {1}", element.Key, element.Value);
    }

    var test = final.Select(x => x.Key.ToString()).ToList();

    Console.WriteLine('\n');

    return test;
}
public void Process(Dictionary<ReadOnlyMemory<char>, int> dict, string str)
{
    var startIndex = 0;
    while (true)
    {
        // 搜索分隔符
        var endIndex = str.IndexOf(';', startIndex);
        if (endIndex <= 0)   // 未找到
            endIndex = str.Length;   // 到达字符串的末尾
        if (endIndex - startIndex > 0)    // 如果非零
        {
            var mem = str.AsMemory(startIndex, endIndex - startIndex);
            if (!MemoryExtensions.IsWhiteSpace(mem.Span))    // 并且不是空白
            {
                ref var val = ref CollectionsMarshal.GetValueRefOrAddDefault(dict, mem, out _);  // 获取字典中 KVP 位置的引用
                val++;    // 增加位置 1
            }
        }
        if (endIndex == str.Length)    // 字符串处理完成
            break;
        startIndex = endIndex + 1;    // 否则移到下一个字符
    }
}
public List<KeyValuePair<ReadOnlyMemory<char>, int>> TopByOrderDesc(Dictionary<ReadOnlyMemory<char>, int> source, int top)
{
    var list = new List<KeyValuePair<ReadOnlyMemory<char>, int>>(top + 1);  // 预初始化
    var comparer = Comparer<KeyValuePair<ReadOnlyMemory<char>, int>>.Create(
        (kvp1, kvp2) => kvp2.Value.CompareTo(kvp1.Value)
    );    // !!! 反向比较器 !!!

    foreach (var item in source)
    {
        var index = list.BinarySearch(item, comparer);
        if (index < 0)
            index = ~index;

        if (index < top) // 没有插入最后一个元素的意义
        {
            if (list.Count == top)
                list.RemoveAt(top - 1);

            list.InsertAt(index, item);
        }
    }
    return list;
}
class MemoryComparer : IEqualityComparer<ReadOnlyMemory<char>>
{
    public StringComparison Comparison { get; set; } = StringComparison.OrdinalIgnoreCase;

    public bool Equals(ReadOnlyMemory<char> a, ReadOnlyMemory<char> b) =>
        MemoryExtensions.Equals(b.Span, a.Span, Comparison);

    public int GetHashCode(ReadOnlyMemory<char> o) =>
        string.GetHashCode(o.Span, Comparison);
}
英文:

There are significant performance and memory improvements possible here:

  • We can improve memory usage by holding the whole file in one big string, and using ReadOnlyMemory&lt;char&gt; to hold references to it.
  • Pre-initialize the dictionary to some large size, so that resizes are not necessary.
  • Instead of .Trim() use string.IsNullOrWhiteSpace or similar.
  • Instead of .ToLower() use a case-insensitive comparer on the dictionary.
    If you were using strings you can use StringComparer, but we need a custom comparer for ReadOnlyMemory&lt;char&gt;.
  • Don't use .ToList() unnecessarily.
  • BufferedStream seems unnecessary.
  • Close the file before processing the string.
  • We can use CollectionsMarshal.GetValueRefOrAddDefault to avoid a double lookup on the dictionary.
  • Instead of using OrderByDescending.Take, which would require sorting the entire list, we can use an insertion sort on a fixed sized list.
    • We just binary-search each value's location, and then insert at the new location.
    • Binary search returns the two's-complement (negation) of the index if the value isn't found.

This all assumes that the number of duplicate strings isn't very high. If the number of duplicates is high then it would make more sense to split each string and Intern the results using a StringPool.

public IEnumerable&lt;string&gt; GetTopTenStrings(string path)
{
    // Dictionary to store the result
    var result = new Dictionary&lt;ReadOnlyMemory&lt;char&gt;, int&gt;(someLargeCapacityHere, new MemoryComparer());

    int i = 1;

    foreach (string currentFile in Directory.EnumerateFiles(Path, &quot;*.dat&quot;))
    {
        string str;
        using (FileStream fs = File.Open(currentFile, FileMode.Open,
            FileAccess.Read, FileShare.Read))
        using (StreamReader sr = new StreamReader(fs))
        {
            Console.WriteLine($&quot;Now we are at the file {i}: {currentFile}&quot;);
            str = sr.ReadToEnd();

            i++;
        }
        Process(result, str);
    }

    // Sort the dictionary and return the needed values
    var final = TopByOrderDesc(result, 10);

    foreach (var element in final)
    {
        Console.WriteLine(&quot;{0} appears {1}&quot;, element.Key, element.Value);
    }

    var test = final.Select(x =&gt; x.Key.ToString()).ToList();

    Console.WriteLine(&#39;\n&#39;);

    return test;
}
public void Process(Dictionary&lt;ReadOnlyMemory&lt;char&gt;, int&gt; dict, string str)
{
	var startIndex = 0;
	while (true)
	{
        // search for separator
		var endIndex = str.IndexOf(&#39;;&#39;, startIndex);
		if (endIndex &lt;= 0)   // not found
		    endIndex = str.Length;   // go til the end of string
		if (endIndex - startIndex &gt; 0)    // if non-zero
		{
			var mem = str.AsMemory(startIndex, endIndex - startIndex);
			if (!MemoryExtensions.IsWhiteSpace(mem.Span))    // and not whitespace
			{
				ref var val = ref CollectionsMarshal.GetValueRefOrAddDefault(dict, mem, out _);  // get ref of KVP location in dictionary
				val++;    // increment location by 1
			}
		}
		if (endIndex == str.Length)    // finished string
			break;
		startIndex = endIndex + 1;    // otherwise move to next char
	}
}
public List&lt;KeyValuePair&lt;ReadOnlyMemory&lt;char&gt;, int&gt;&gt; TopByOrderDesc(Dictionary&lt;ReadOnlyMemory&lt;char&gt;, int&gt; source, int top)
{
    var list = new List&lt;KeyValuePair&lt;ReadOnlyMemory&lt;char&gt;, int&gt;&gt;(top + 1);  //pre-initialize
    var comparer = Comparer&lt;KeyValuePair&lt;ReadOnlyMemory&lt;char&gt;, int&gt;&gt;.Create(
        (kvp1, kvp2) =&gt; kvp2.Value.CompareTo(kvp1.Value)
    );    // !!! Reverse comparer !!!

    foreach (var item in source)
    {
        var index = list.BinarySearch(item, comparer);
        if (index &lt; 0)
            index = ~index;

        if (index &lt; top) // no point inserting last one
        {
            if (list.Count == top)
                list.RemoveAt(top - 1);

            list.InsertAt(index, item);
        }
    }
    return list;
}
class MemoryComparer : IEqualityComparer&lt;ReadOnlyMemory&lt;char&gt;&gt;
{
    public StringComparison Comparison {get; set;} = StringComparison.OrdinalIgnoreCase;

    public bool Equals(ReadOnlyMemory&lt;char&gt; a, ReadOnlyMemory&lt;char&gt; b) =&gt;
        MemoryExtensions.Equals(b.Span, a.Span, Comparison);

    public int GetHashCode(ReadOnlyMemory&lt;char&gt; o) =&gt;
        string.GetHashCode(o.Span, Comparison);
}

huangapple
  • 本文由 发表于 2023年6月11日 23:27:39
  • 转载请务必保留本文链接:https://go.coder-hub.com/76451175.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定