你可以通过在Python中用None替换列表元素来节省内存吗?

huangapple go评论67阅读模式
英文:

Can I save memory by replacing list elements with None in Python?

问题

I have millions of rows of data like these:

	["1.0.0.0/24", 16777216, 16777471, "1.0.0.0", "1.0.0.255", 256, 13335, "AU", false, true, false, false],
	["1.0.1.0/23", 16777472, 16777983, "1.0.1.0", "1.0.2.255", 512, null, "CN", false, false, false, false],
	["1.0.3.0/24", 16777984, 16778239, "1.0.3.0", "1.0.3.255", 256, null, "CN", false, false, false, false]

我有数百万行类似于这样的数据:

	["1.0.0.0/24", 16777216, 16777471, "1.0.0.0", "1.0.0.255", 256, 13335, "AU", false, true, false, false],
	["1.0.1.0/23", 16777472, 16777983, "1.0.1.0", "1.0.2.255", 512, null, "CN", false, false, false, false],
	["1.0.3.0/24", 16777984, 16778239, "1.0.3.0", "1.0.3.255", 256, null, "CN", false, false, false, false]

I saved them in JSON files and also in an SQLITE3 database. I am going to pull all the data from the database at the start of a script, to make the data querying happen entirely in memory, thus save time by not using the slow filesystem calls.

我将它们保存在JSON文件和SQLITE3数据库中。我将在脚本的开头从数据库中提取所有数据,以便数据查询完全在内存中进行,从而节省时间,不使用慢速的文件系统调用。

And this also means it will take a lot of memory, I measured the memory usage of the data to be about 500MiB. I will put them into a list, I use binary search to find the index of closest starting IP address that is less than or equal to any given IP address, and then determine if the IP is inside the network located at the index. (I will pull the starts and ends out of the list)

这也意味着它将占用大量内存,我测量了数据的内存使用量约为500MiB。我将它们放入一个list中,我使用二进制搜索来查找最接近给定IP地址的起始IP地址的索引,然后确定IP是否在索引处的网络中。 (我将从list中拉出起始和结束)

If the IP is inside the network, the data will be put into a custom class to make the result strongly typed, the result will be cached so next time if the query is called with the same argument the cached result will be retrieved to save processing time, and the element located at the index will be deleted before the result is returned. (The key will be the index as well as the argument)

如果IP在网络内,数据将被放入一个自定义类中以使结果强类型化,结果将被缓存,所以下次如果使用相同的参数调用查询,将检索缓存的结果以节省处理时间,并且在返回结果之前将位于索引处的元素删除。(键将是索引以及参数)

Because I use binary search, naturally this requires that the indices to be invariant, but I want to remove the unnecessary element from the list to save memory, and this will cause the indices to change.

因为我使用二进制搜索,自然而然地要求索引不变,但我想要从list中删除不必要的元素以节省内存,这将导致索引发生变化。

A simple solution to this problem is to not delete the element at the index, but assign the list element located at the index to None. Another solution would be to convert the list to a dict with the indices as keys, but of course this would use more memory than using a list.

解决这个问题的一个简单方法是不删除索引处的元素,而是将位于索引处的list元素分配为None。另一个解决方案是将list转换为以索引为键的dict,但当然这会比使用list使用更多的内存。

But I don't know if doing so would save memory, I tried to create lists with the same length and containing the same element at all indices, and it seemed that lists of the same length always have the same size, and the size of elements don't matter:

但我不知道这样做是否会节省内存,我尝试创建具有相同长度并包含所有索引处相同元素的list,似乎具有相同长度的list总是具有相同的大小,元素的大小不重要:

In [200]: ([None]*18).__sizeof__()
Out[200]: 184

In [201]: ([None]*180).__sizeof__()
Out[201]: 1480

In [202]: ([0]*180).__sizeof__()
Out[202]: 1480

In [203]: ([object]*180).__sizeof__()
Out[203]: 1480

In [204]: (['']*180).__sizeof__()
Out[204]: 1480

In [205]: (['abcs']*180).__sizeof__()
Out[205]: 1480

In [206]: (['abcdefg']*180).__sizeof__()
Out[206]: 1480

In [207]: ({i: e for i, e in enumerate(['abcdefg']*180)}).__sizeof__()
Out[207]: 9296

In [208]: 9296/1480
Out[208]: 6.281081081081081

In [209]: ([('abcdefg', 1, 2, 3, 4, 5)]*180).__sizeof__()
Out[209]: 1480

So can replacing list elements with None save memory? If not, then what is a better way to remove items while keeping indices?

所以用None替换list元素

英文:

I have millions of rows of data like these:

	["1.0.0.0/24", 16777216, 16777471, "1.0.0.0", "1.0.0.255", 256, 13335, "AU", false, true, false, false],
	["1.0.1.0/23", 16777472, 16777983, "1.0.1.0", "1.0.2.255", 512, null, "CN", false, false, false, false],
	["1.0.3.0/24", 16777984, 16778239, "1.0.3.0", "1.0.3.255", 256, null, "CN", false, false, false, false]

I saved them in JSON files and also in an SQLITE3 database. I am going to pull all the data from the database at the start of a script, to make the data querying happen entirely in memory, thus save time by not using the slow filesystem calls.

And this also means it will take a lot of memory, I measured the memory usage of the data to be about 500MiB. I will put them into a list, I use binary search to find the index of closest starting IP address that is less than or equal to any given IP address, and then determine if the IP is inside the network located at the index. (I will pull the starts and ends out of the list)

If the IP is inside the network, the data will be put into a custom class to make the result strongly typed, the result will be cached so next time if the query is called with the same argument the cached result will be retrieved to save processing time, and the element located at the index will be deleted before the result is returned. (The key will be the index as well as the argument)

Because I use binary search, naturally this requires that the indices to be invariant, but I want to remove the unnecessary element from the list to save memory, and this will cause the indices to change.

A simple solution to this problem is to not delete the element at the index, but assign the list element located at the index to None. Another solution would be to convert the list to a dict with the indices as keys, but of course this would use more memory than using a list.

But I don't know if doing so would save memory, I tried to create lists with the same length and containing the same element at all indices, and it seemed that lists of the same length always have the same size, and the size of elements don't matter:

In [200]: ([None]*18).__sizeof__()
Out[200]: 184

In [201]: ([None]*180).__sizeof__()
Out[201]: 1480

In [202]: ([0]*180).__sizeof__()
Out[202]: 1480

In [203]: ([object]*180).__sizeof__()
Out[203]: 1480

In [204]: (['']*180).__sizeof__()
Out[204]: 1480

In [205]: (['abcs']*180).__sizeof__()
Out[205]: 1480

In [206]: (['abcdefg']*180).__sizeof__()
Out[206]: 1480

In [207]: ({i: e for i, e in enumerate(['abcdefg']*180)}).__sizeof__()
Out[207]: 9296

In [208]: 9296/1480
Out[208]: 6.281081081081081

In [209]: ([('abcdefg', 1, 2, 3, 4, 5)]*180).__sizeof__()
Out[209]: 1480

So can replacing list elements with None save memory? If not, then what is a better way to remove items while keeping indices?


It seems that a list containing a row of the data repeatedly also has the same size at the same length:

In [221]: import json

In [222]: l = json.loads('["1.0.0.0/24", 16777216, 16777471, "1.0.0.0", "1.0.0.255", 256, 13335, "AU", false, true, false, false]')

In [223]: l
Out[223]:
['1.0.0.0/24',
 16777216,
 16777471,
 '1.0.0.0',
 '1.0.0.255',
 256,
 13335,
 'AU',
 False,
 True,
 False,
 False]

In [224]: ([l]*180).__sizeof__()
Out[224]: 1480

I have made some other tests, but the result doesn't make sense:

In [224]: ([l]*180).__sizeof__()
Out[224]: 1480

In [225]: l.__sizeof__()
Out[225]: 168

In [226]: l = [l]*180

In [227]: l.__sizeof__()
Out[227]: 1480

In [228]: l[0:12] = [None]*12

In [229]: l.__sizeof__()
Out[229]: 1480

In [230]: list(range(180)).__sizeof__()
Out[230]: 1480

It seems that the size of a list is only related to its length and not related to its contents whatsoever, but this simply can't be true.


No the binary search won't be broken, since I will store the starting IP of networks as integers in a separate list, and ending IP of networks as integers in yet another list, and these two lists will not change.

It's like this:


STARTS = [row[1] for row in data]
ENDS = [row[2] for row in data]

store = {}
def query(ip):
    if ip in store:
        return store[ip]
    index = bisect(STARTS, ip) - 1
    if index >= 0:
        if not STARTS[index] <= ip <= ENDS[index]:
            return
        if index in store:
            result = store[index]
            store[ip] = result
            return result
        row = data[index]
        data[index] = None
        result = Network(row)
        store[index] = result
        store[ip] = result
        return result

I didn't actually write working code, though writing it is trivial, I just don't know if this would end up saving memory.


I have benchmarked the SQLite3 query and found it to take around 40 milliseconds to complete a single query:

In [232]: import sqlite3

In [233]: conn = sqlite3.connect('D:/network_guard/IPFire_locations.db')

In [234]: cur = conn.cursor()

In [235]: cur.execute('select * from IPv4 where start_integer = 3113839616;')
Out[235]: <sqlite3.Cursor at 0x20b5d9c66c0>

In [236]: cur.fetchone()
Out[236]:
('185.153.108.0/22',
 3113839616,
 3113840639,
 '185.153.108.0',
 '185.153.111.255',
 1024,
 3242,
 'IT',
 0,
 0,
 0,
 0)

In [237]: %timeit cur.execute('select * from IPv4 where start_integer = 3113839616;')
38.8 ms ± 805 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [238]: cur.execute('select * from IPv4 where start_integer = 3113839616;')
Out[238]: <sqlite3.Cursor at 0x20b5d9c66c0>

In [239]: %timeit cur.fetchone()
58.8 ns ± 0.672 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Using bisect takes under 1 microsecond to complete the same query, and there are 567778 rows for IPv4 addresses and 446631 rows for IPv6 addresses, for a total of 1014409 rows. Just fetching all the rows and creating the lists take about 500MiB memory.

In [246]: cur.execute('select * from IPv4;')
Out[246]: <sqlite3.Cursor at 0x20b5d9c66c0>

In [247]: data = cur.fetchall()

In [248]: STARTS = [row[1] for row in data]

In [249]: bisect(STARTS, 3113839616)
Out[249]: 366233

In [250]: %timeit bisect(STARTS, 3113839616)
341 ns ± 6.43 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [251]: len(data)
Out[251]: 567778

In [252]: cur.execute('select * from IPv6;')
Out[252]: <sqlite3.Cursor at 0x20b5d9c66c0>

In [253]: data6 = cur.fetchall()

In [254]: len(data6)
Out[254]: 446631

In [255]: 567778 + 446631
Out[255]: 1014409

I determined the memory usage by using Task Manager, just by checking the memory usage of the process right before fetching the rows and right after fetching the rows, to calculate the difference.

If I create all instances of custom classes upfront, I don't think I have enough RAM for all the objects even though I have 16GiB (I open multiple tabs in browsers and the browser take multiple GibiBytes of RAM, so I don't have much available RAM).

And I won't make any more edits to this post.

答案1

得分: 2

你的大列表对象(包含1014409行)占用了约8 MiB的内存。其余的约492 MiB用于行列表对象。每行大约占用485字节。所以,是的,你可以为每行替换为None来节省数百字节的内存。(具体节省多少取决于其中的元素对象因被其他地方引用而保持存活的数量。)

英文:

Your big list object (containing the 1014409 rows) takes ~8 MiB memory. The remaining ~492 MiB are for your row list objects. That's ~485 bytes per row. So yes, you could save hundreds of bytes for each row you replace with None. (How much depends on how many of its element objects stay alive due to being referenced elsewhere.)

huangapple
  • 本文由 发表于 2023年5月13日 22:11:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/76243166.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定