Python asyncio.sleep appears to use a lot of memory


Question

I have an asynchronous function that creates about 390 tasks and awaits asyncio.sleep for about 1 second after every 10 tasks. I noticed that this Python script uses a lot of RAM. When I profiled it with a memory profiler, the line with the largest memory increment, as far as I can see, was the asyncio.sleep call. How can I solve this?

Profiler Output:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    91    667.5 MiB    667.5 MiB           1       @profile
    92                                             async def main(self):
    93    667.5 MiB      0.0 MiB           1           tasks = []
    94    667.5 MiB      0.0 MiB           1           start_time = time.time()
    95    667.5 MiB      0.0 MiB           1           following_listings_count = len(self.following_listings)
    96
    97   1018.8 MiB      0.0 MiB           3           async with self.api.session.create_async_oauth2_session() as session:
    98
    99    988.2 MiB      0.0 MiB         351               for offset in range(0, following_listings_count, 100):
   100    988.2 MiB      0.0 MiB         350                   paged_listings = self.following_listings[offset:offset + 100]
   101    988.2 MiB      0.0 MiB       36007                   listing_ids = [listing.listing_id for listing in paged_listings]
   102
   103    988.2 MiB      0.0 MiB         700                   tasks.append(
   104    988.2 MiB      0.0 MiB         350                       asyncio.create_task(self.get_listings(session, listing_ids, paged_listings))
   105                                                         )
   106    988.2 MiB      0.0 MiB         350                   if offset % 1000 == 0:
   107    988.2 MiB    320.6 MiB          70                      await asyncio.sleep(1.1)
   108    988.2 MiB      0.0 MiB         350                   print(offset)
   109   1018.8 MiB     30.6 MiB           2               data = await asyncio.gather(*tasks)
   110
   111
   112   1018.8 MiB      0.0 MiB           1               print(f"Data Scraping Finish Time: {time.time() - start_time}")
   113   1018.8 MiB      0.0 MiB           1           """start_time = time.time()
   114                                                 all_listings = []
   115                                                 for listings in data:
   116                                                     all_listings += listings
   117
   118                                                 self.create_reports(all_listings)
   119
   120                                                 print(f"Finished Time: {time.time() - start_time}")
   121                                                 """
   122   1018.8 MiB      0.0 MiB           1           return "Ok."

I tried time.sleep instead of asyncio.sleep, but then I received a rate limit warning, because the API I use allows only 10 requests per second.

Edit: I'm now using a thread pool executor, but my problem is fetching items from the API: the requests call response.json() consumes 4-8 MB of memory every time.

Answer 1

Score: 1

asyncio.sleep does not use that much memory. The profiler is not asyncio-aware, so you have to know how to interpret its output.

What it says is that memory usage increased by roughly 300 MB while the line await asyncio.sleep was being executed, but it is during those steps that the other tasks run: the objects created by the other tasks account for that amount. In fact, in your code the tasks run only while main is awaiting asyncio.sleep or asyncio.gather (so replacing asyncio.sleep with a synchronous sleep does nothing but delay the start of all the tasks).
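
To see this effect in isolation, here is a minimal sketch. The names worker and demo are made up for illustration and are not part of the question's code. It creates a few tasks, then counts how many events they recorded after a blocking time.sleep and again after an awaited asyncio.sleep.

```python
import asyncio
import time


async def worker(events, i):
    # Each task records when it starts and when it finishes.
    events.append(("start", i))
    await asyncio.sleep(0)  # yield to the event loop once
    events.append(("done", i))


async def demo():
    events = []
    tasks = [asyncio.create_task(worker(events, i)) for i in range(3)]

    time.sleep(0.01)  # blocking sleep: the event loop cannot run any task
    after_blocking_sleep = len(events)

    await asyncio.sleep(0.01)  # main is suspended: the tasks run here
    after_async_sleep = len(events)

    await asyncio.gather(*tasks)
    return after_blocking_sleep, after_async_sleep


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

On CPython this should report 0 events after the blocking sleep and 6 after the awaited one. A line-based memory profiler attributes everything the tasks allocate during that await to the await asyncio.sleep line.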

If you need to limit the total data rate or API usage, calling asyncio.sleep in a single task is the wrong tool anyway, because the other tasks keep fetching data in the meantime.
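
One possible way to enforce a limit across all tasks is a limiter object shared by every task, so that each request waits for its own time slot. The sketch below is only an illustration under assumptions: RateLimiter, fake_request, and demo are hypothetical names, and the real HTTP call is replaced by a timestamp.

```python
import asyncio
import time


class RateLimiter:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate):
        self._interval = 1.0 / rate
        self._next_slot = 0.0  # monotonic time of the next free slot

    async def wait(self):
        now = time.monotonic()
        slot = max(now, self._next_slot)
        # Reserve the slot before awaiting, so concurrent callers get later slots.
        self._next_slot = slot + self._interval
        await asyncio.sleep(slot - now)


async def fake_request(limiter, started):
    await limiter.wait()
    started.append(time.monotonic())  # a real HTTP call would go here


async def demo(rate=50, count=5):
    limiter = RateLimiter(rate)
    started = []
    await asyncio.gather(*(fake_request(limiter, started) for _ in range(count)))
    return started


if __name__ == "__main__":
    stamps = asyncio.run(demo())
    print([round(b - a, 3) for a, b in zip(stamps, stamps[1:])])
```

No lock is needed here, because the slot is reserved before the first await and asyncio runs coroutines on a single thread. For the API in the question, the rate would be 10 instead of the 50 used to keep the demo short.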

And, as you state, each API response seems to use a lot of RAM: the workaround is to pick out the data that is relevant to your app and make sure the rest of the response is discarded after each request.
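
As a rough sketch of that idea, the function below parses a response body and returns only the fields the report needs. The "results" key and the "price" field are assumptions made up for this example, since the question does not show the API's response format. Only listing_id appears in the original code.

```python
import json

# Hypothetical: the only fields the report needs from each item.
WANTED_FIELDS = ("listing_id", "price")


def extract_relevant(raw_body):
    """Parse a JSON body and return only the wanted fields of each item."""
    payload = json.loads(raw_body)
    slim = [{key: item.get(key) for key in WANTED_FIELDS}
            for item in payload["results"]]
    # `payload` goes out of scope on return, so the full parsed response
    # can be reclaimed instead of living on inside the task's result.
    return slim


if __name__ == "__main__":
    body = json.dumps({"results": [
        {"listing_id": 1, "price": 9.5, "description": "x" * 1000},
    ]})
    print(extract_relevant(body))
```

If each task returns the slimmed-down list rather than the whole parsed response, the results held until asyncio.gather completes should be correspondingly smaller.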

Your listing is very hard to read, because the code is not separated from the profiler output. But the problem is obvious once you take the time to look at it: you create all your tasks in one batch and wait on all of them with a single asyncio.gather. The resources, including all local variables used in your task coroutines, won't be freed until you collect the results, after all requests have been made: tasks keep their state.

That kind of code is fine only when you have memory and resources to spare and can keep everything in memory. If memory is a concern, change your code to create a limited number of tasks, and create more only as those complete and their results are recorded. asyncio.wait can be a better fit than asyncio.gather for this pattern. Also, look into asyncio Semaphores if you need to limit the number of concurrent requests.
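
A minimal sketch of that structure, assuming a hypothetical helper run_bounded and a stand-in fake_fetch coroutine in place of the real request, could look like this. It keeps at most max_in_flight tasks alive and collects results as tasks finish.

```python
import asyncio


async def run_bounded(job_args, worker, max_in_flight=10):
    """Run worker(arg) for each arg, keeping at most max_in_flight tasks alive."""
    pending = set()
    results = []
    for arg in job_args:
        if len(pending) >= max_in_flight:
            # Wait until at least one task finishes before creating another.
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            results.extend(task.result() for task in done)
        pending.add(asyncio.create_task(worker(arg)))
    if pending:
        done, _ = await asyncio.wait(pending)
        results.extend(task.result() for task in done)
    return results  # completion order, not input order


async def demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(n):  # stand-in for a real API request
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.001)
        in_flight -= 1
        return n * 2

    results = await run_bounded(range(20), fake_fetch, max_in_flight=3)
    return sorted(results), peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Finished tasks are dropped from the pending set as soon as their results are recorded, so their coroutine state can be released early. An asyncio.Semaphore, by contrast, caps how many coroutines are inside a guarded section at once, but does not by itself limit how many task objects exist.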

huangapple
  • Posted on April 6, 2023, 19:43:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75949130.html