How can I speed up comparing a list of IDs against a list of known IDs from MongoDB?

Question

My code runs a slow web-scraping process, and for each scraped link I need to know whether its ID is already known:

    import re

    from pymongo import MongoClient
    from selenium.webdriver.common.by import By

    client = MongoClient('mongodb://localhost:27017/')
    mydb = client['db']
    mycol = mydb['collection']

    if __name__ == '__main__':
        [...]
        for item in driver.find_elements(By.XPATH, '//a'):
            flag = True
            url = item.get_attribute('href')
            # Keep only the ID captured by the group.
            myid = re.sub(r'.*item/([0-9a-f-]+)\?.*', r'\1', url)
            # I iterate over all known MongoDB IDs to look for a match.
            # This comparison takes too much time.
            for oupid in mycol.find({}, {"Id": 1, "_id": 0}):
                if myid == oupid['Id']:
                    flag = False
            if not flag:
                continue

Any recommendations?
Parsing the site takes 6 minutes, and most of that time is wasted comparing IDs.

Answer 1

Score: 0

I solved it like this, which is about twice as fast as the previous version:

    # FIXME compare time JS vs Selenium
    # Collect the href of every link in a single browser round trip.
    items_urls = driver.execute_script("""
        const xpath = '//a';
        const x = [];
        const result = document.evaluate(xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
        for (let i = 0; i < result.snapshotLength; i++) {
            const element = result.snapshotItem(i);
            x.push(element.href);
        }
        return x;
    """)
    items_ids = [re.sub(r'.*item/([0-9a-f-]+)\?.*', r'\1', url) for url in items_urls]
    # Fetch the known IDs from MongoDB once instead of once per link
    # (mycol is the collection defined in the question).
    oupids = mycol.find({}, {"Id": 1, "_id": 0})
    oupids = [item['Id'] for item in oupids]
    # Remove the IDs that are already known
    items_list = list(set(items_ids).difference(set(oupids)))
    for myid in items_list:
        item = driver.find_element(By.XPATH, f'//a[contains(@href, "/item/{myid}")]')
        url = item.get_attribute('href')
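
A possible further refinement, sketched here under assumptions rather than measured: instead of pulling every known ID out of MongoDB, the scraped IDs can be sent in one `$in` query so that only the matching documents come back. The helper names `extract_ids` and `unknown_ids` are hypothetical; the `Id` field and the URL shape (`.../item/<id>?...`) are taken from the code above, and `collection` is assumed to behave like a pymongo `Collection`.

```python
import re

# Assumed URL shape, taken from the pattern used above: ".../item/<hex-id>?..."
ITEM_ID_RE = re.compile(r'item/([0-9a-f-]+)\?')


def extract_ids(urls):
    """Return the set of item IDs found in urls, skipping links that do not match."""
    ids = set()
    for url in urls:
        match = ITEM_ID_RE.search(url or '')
        if match:
            ids.add(match.group(1))
    return ids


def unknown_ids(collection, candidate_ids):
    """Return the candidate IDs that have no document in collection.

    collection is assumed to behave like a pymongo Collection whose
    documents store the ID in an "Id" field, as in the question.
    """
    candidates = set(candidate_ids)
    if not candidates:
        return set()
    # One query: MongoDB returns only the documents whose Id is a candidate.
    cursor = collection.find({"Id": {"$in": list(candidates)}}, {"Id": 1, "_id": 0})
    known = {doc["Id"] for doc in cursor}
    return candidates - known
```

With the names from this thread, the call would look like `new_ids = unknown_ids(mycol, extract_ids(items_urls))`. If the collection is large, an index on the field (for example `mycol.create_index("Id")`) would probably help the `$in` lookup, though that has not been timed here.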

huangapple
  • Posted on May 26, 2023 at 09:32:23
  • If you repost this article, please keep the link to it: https://go.coder-hub.com/76337133.html