Python asyncio.sleep showing high memory usage
Question
I have an asynchronous function that creates about 390 tasks, sleeping for 1 second with asyncio.sleep after every 10 tasks, and I noticed that this Python script uses a lot of RAM. When I profiled it with memory_profiler, the most memory-consuming part, as far as I can see, is the asyncio.sleep line. How can I solve this?
Profiler Output:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
91 667.5 MiB 667.5 MiB 1 @profile
92 async def main(self):
93 667.5 MiB 0.0 MiB 1 tasks = []
94 667.5 MiB 0.0 MiB 1 start_time = time.time()
95 667.5 MiB 0.0 MiB 1 following_listings_count = len(self.following_listings)
96
97 1018.8 MiB 0.0 MiB 3 async with self.api.session.create_async_oauth2_session() as session:
98
99 988.2 MiB 0.0 MiB 351 for offset in range(0, following_listings_count, 100):
100 988.2 MiB 0.0 MiB 350 paged_listings = self.following_listings[offset:offset + 100]
101 988.2 MiB 0.0 MiB 36007 listing_ids = [listing.listing_id for listing in paged_listings]
102
103 988.2 MiB 0.0 MiB 700 tasks.append(
104 988.2 MiB 0.0 MiB 350 asyncio.create_task(self.get_listings(session, listing_ids, paged_listings))
105 )
106 988.2 MiB 0.0 MiB 350 if offset % 1000 == 0:
107 988.2 MiB 320.6 MiB 70 await asyncio.sleep(1.1)
108 988.2 MiB 0.0 MiB 350 print(offset)
109 1018.8 MiB 30.6 MiB 2 data = await asyncio.gather(*tasks)
110
111
112 1018.8 MiB 0.0 MiB 1 print(f"Data Scraping Finish Time: {time.time() - start_time}")
113 1018.8 MiB 0.0 MiB 1 """start_time = time.time()
114 all_listings = []
115 for listings in data:
116 all_listings += listings
117
self.create_reports(all_listings)
119
120 print(f"Finished Time: {time.time() - start_time}")
121 """
122 1018.8 MiB 0.0 MiB 1 return "Ok."
I used time.sleep instead of asyncio.sleep, but then I received a rate-limit warning, because the API I use has a limit of 10 requests per second.
Edit: I'm now using a thread pool executor, but my problem is getting items from the API: response.json() consumes 4-8 MB of memory on every call.
Answer 1
Score: 1
asyncio.sleep does not use that much memory - it is the profiler tool that is not asyncio-aware, so you have to know how to interpret its output. What it reports is that memory usage increased by about 300 MB while the line await asyncio.sleep was executing - but it is precisely during those steps that the other tasks run: the objects created by those tasks account for that much memory. In fact, in your code, the tasks only get to run at all while you are awaiting asyncio.sleep and asyncio.gather (so replacing asyncio.sleep with a synchronous sleep would do nothing but delay the start of execution of all tasks).
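To see the attribution effect in isolation, here is a toy script (worker, main, and the sizes are invented for the demonstration): memory_profiler bills the list allocated by worker() to the await asyncio.sleep line of main(), because that sleep is where the event loop gets its chance to run the task.

import asyncio
from memory_profiler import profile

async def worker():
    # The task allocates roughly 80 MB while it runs.
    return [0] * 10_000_000

@profile
async def main():
    task = asyncio.create_task(worker())
    # worker() actually executes during this sleep, so the profiler
    # reports its ~80 MB increment on this line, not inside worker().
    await asyncio.sleep(0.1)
    await task

asyncio.run(main())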
If you have to limit the total data rate or API usage, your use of asyncio.sleep in a single task would be incorrect anyway, as the other tasks keep fetching data regardless.
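If you do need a global cap, a shared limiter that every task awaits before each request is one way to get it. Below is a minimal sketch; the RateLimiter class is hypothetical (not part of asyncio), and the 10-per-second figure comes from the question. Create one limiter in your main coroutine and pass it to every task:

import asyncio
import time

class RateLimiter:
    """Allow at most `rate` acquisitions per second across all tasks."""
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def acquire(self):
        async with self._lock:
            now = time.monotonic()
            # Reserve the next free time slot, then sleep until it arrives.
            self._next_slot = max(self._next_slot + self.interval, now)
            delay = self._next_slot - now
        if delay > 0:
            await asyncio.sleep(delay)

async def fetch(session, url, limiter):
    await limiter.acquire()            # each task waits for its own slot
    response = await session.get(url)  # session/url stand in for your client
    return await response.json()

This way the sleeping happens where it actually throttles - in every task - instead of only in the coroutine that creates them.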
And, as you state, it looks like each API response uses a lot of RAM: the workaround there is to pick out only the data relevant to your app and make sure the rest of each response is discarded as soon as the request completes.
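As a sketch of that idea - the field names and response shape below are invented, so adapt them to the real API schema:

async def fetch_slim_listings(session, url):
    response = await session.get(url)
    payload = await response.json()  # the full 4-8 MB payload is decoded here
    # Copy out only what the reports need; once this function returns,
    # nothing references the big payload and it can be garbage-collected.
    return [
        {"id": item["listing_id"], "price": item["price"]}
        for item in payload.get("results", [])
    ]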
Your listing is hard to read, since it is not separated from the rest of the post. But the problem is obvious once one takes the time to look at it: you create all of your tasks in one batch and wait on all of them with a single asyncio.gather. The resources - including every local variable used in your task coroutines - are not freed until you collect the results, after all requests have been made: tasks keep their state alive.
That kind of code is only fine when you have memory and resources to spare for keeping everything in memory at once. If memory is a concern, you have to change your code to create a limited number of tasks, and create more as those complete and their results are recorded. asyncio.wait can be a better option than gather for this construct, as shown in the sketch below. Also, take a look at asyncio semaphores if limiting the number of concurrent requests is a concern.
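A minimal sketch of that bounded structure, reusing the loop from the question (the cap of 20 in-flight tasks is an arbitrary choice):

async def main(self):
    results = []
    pending = set()
    async with self.api.session.create_async_oauth2_session() as session:
        for offset in range(0, len(self.following_listings), 100):
            page = self.following_listings[offset:offset + 100]
            ids = [listing.listing_id for listing in page]
            pending.add(asyncio.create_task(self.get_listings(session, ids, page)))
            if len(pending) >= 20:
                # Wait for at least one task to finish and record its result,
                # so its coroutine state can be freed before scheduling more.
                done, pending = await asyncio.wait(
                    pending, return_when=asyncio.FIRST_COMPLETED
                )
                results.extend(task.result() for task in done)
        if pending:  # drain whatever is still in flight
            done, _ = await asyncio.wait(pending)
            results.extend(task.result() for task in done)
    return results

With FIRST_COMPLETED, at most 20 tasks (and their local state) exist at any moment, so memory stays bounded instead of growing with all 390 tasks at once.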