How can I speed up comparing a list of IDs against a known list of IDs from MongoDB?

Question

My code runs a time-consuming web-scraping process, and for each scraped link I need to know whether its ID is already known:

import re

from pymongo import MongoClient
from selenium.webdriver.common.by import By

client = MongoClient('mongodb://localhost:27017/')
mydb = client['db']
mycol = mydb['collection']

if __name__ == '__main__':
    [...]
    # find_elements (plural) returns the list of links to iterate over
    for item in driver.find_elements(By.XPATH, '//a'):
        flag = True
        url = item.get_attribute('href')
        # Keep only the captured ID (group 1) from the URL
        myid = re.sub(r'.*item/([0-9a-f-]+)\?.*', r'\1', url)
        # I iterate over all known MongoDB IDs to look for a match
        # Comparing this way takes too much time
        for oupid in mycol.find({}, {"Id": 1, "_id": 0}):
            if myid == oupid['Id']:
                flag = False

        if not flag:
            continue

Any recommendations?
Parsing the site takes 6 minutes, and most of that time is spent comparing IDs.

Answer 1

Score: 0

I solved it like this, which is about twice as fast as before:

# FIXME compare time JS vs Selenium
# Collect the href of every link in one call to the browser
items_urls = driver.execute_script("""
const xpath = '//a';
const x = [];
const result = document.evaluate(xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (let i = 0; i < result.snapshotLength; i++) {
    const element = result.snapshotItem(i);
    x.push(element.href);
}
return x;
""")

# Keep only the captured ID (group 1) from each URL
items_ids = [re.sub(r'.*item/([0-9a-f-]+)\?.*', r'\1', url) for url in items_urls]

# Fetch the known IDs once (mycol is the collection from the question)
oupids = mycol.find({}, {"Id": 1, "_id": 0})
oupids = [item['Id'] for item in oupids]

# Remove the IDs that are already known
items_list = list(set(items_ids).difference(set(oupids)))

for myid in items_list:
    item = driver.find_element(By.XPATH, f'//a[contains(@href, "/item/{myid}")]')
    url = item.get_attribute('href')
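
The set-difference step above can be sketched on its own, without Selenium or MongoDB, to make the idea easier to check. This is only a minimal illustration under a few assumptions: the helper names `extract_id` and `unknown_ids` are hypothetical, the sample URLs and the `known_docs` list are made-up stand-ins for the page links and the MongoDB documents, and links without an item ID are skipped rather than passed through as they would be with `re.sub`. Unlike a plain `set` difference, this version also keeps the page order.

```python
import re

# URL pattern taken from the snippets above; the helper names are hypothetical.
ITEM_ID_RE = re.compile(r'item/([0-9a-f-]+)\?')


def extract_id(url):
    """Return the item ID embedded in url, or None if there is none."""
    match = ITEM_ID_RE.search(url or '')
    return match.group(1) if match else None


def unknown_ids(urls, known_ids):
    """Return the IDs found in urls that are not in known_ids.

    Page order is preserved and duplicates are dropped.
    """
    known = set(known_ids)  # built once, so each membership test is O(1) on average
    seen = set()
    result = []
    for url in urls:
        item_id = extract_id(url)
        if item_id is None or item_id in known or item_id in seen:
            continue
        seen.add(item_id)
        result.append(item_id)
    return result


# Made-up data standing in for the scraped links and the MongoDB documents.
urls = [
    'https://example.com/item/ab12-cd34?ref=1',
    'https://example.com/item/ef56-7890?ref=2',
    'https://example.com/about',
    'https://example.com/item/ef56-7890?ref=3',
]
known_docs = [{'Id': 'ab12-cd34'}]

print(unknown_ids(urls, (doc['Id'] for doc in known_docs)))  # ['ef56-7890']
```

The gain comes from reading the known IDs once and testing membership against a set, instead of scanning the whole collection for every link. With pymongo, the known IDs could probably also be fetched with `mycol.distinct("Id")`, though whether that is faster than the `find` projection here has not been measured.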

huangapple
  • Posted on May 26, 2023 at 09:32:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76337133.html