Question

My code runs a slow web-scraping process, and I need to know which IDs are already known:

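import re

from pymongo import MongoClient
from selenium.webdriver.common.by import By

client = MongoClient('mongodb://localhost:27017/')
mydb = client['db']
mycol = mydb['collection']
if __name__ == '__main__':
    [...]
    # find_elements (plural) returns a list; find_element returns a single element
    for item in driver.find_elements(By.XPATH, '//a'):
        flag = True
        url = item.get_attribute('href')
        # Keep only the captured ID (backreference \1)
        myid = re.sub(r'.*item/([0-9a-f-]+)\?.*', r'\1', url)
        # I iterate over all known MongoDB IDs to look for a match
        # The comparison takes too much time
        for oupid in mycol.find({}, {"Id": 1, "_id": 0}):
            if myid == oupid['Id']:
                flag = False
        if not flag:
            continue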

Any recommendations? Parsing the site takes 6 minutes, and most of that time is spent comparing IDs.


Answer 1

Score: 0

Solved like this, roughly twice as fast as before:

import re

from selenium.webdriver.common.by import By

# FIXME compare timing of JS vs Selenium
# Collect every link's href in a single JavaScript call
items_urls = driver.execute_script("""
const xpath = '//a';
const x = [];
const result = document.evaluate(xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (let i = 0; i < result.snapshotLength; i++) {
    const element = result.snapshotItem(i);
    x.push(element.href);
}
return x;
""")
# Keep only the captured ID (backreference \1)
items_ids = [re.sub(r'.*item/([0-9a-f-]+)\?.*', r'\1', url) for url in items_urls]
# Load the known IDs once; collection is the pymongo collection (mycol in the question)
oupids = collection.find({}, {"Id": 1, "_id": 0})
oupids = [item['Id'] for item in oupids]
# Remove the IDs that are already known
items_list = list(set(items_ids).difference(set(oupids)))
for myid in items_list:
    item = driver.find_element(By.XPATH, f'//a[contains(@href, "/item/{myid}")]')
    url = item.get_attribute('href')
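
The ID extraction and the set difference used above can be pulled into two small helpers, which makes them easy to check without a browser or a database. This is only a minimal sketch, assuming item URLs look like `.../item/<hex-id>?...` as in the regex above; the names `extract_item_id` and `unknown_ids` are hypothetical and not part of the original code.

```python
import re

# Same ID pattern as in the code above, used with search() instead of sub()
ITEM_ID_RE = re.compile(r'item/([0-9a-f-]+)\?')


def extract_item_id(url):
    """Return the item ID found in url, or None if the URL has no item ID."""
    match = ITEM_ID_RE.search(url)
    return match.group(1) if match else None


def unknown_ids(scraped_ids, known_ids):
    """Return the scraped IDs that are not known yet, in page order, without duplicates."""
    known = set(known_ids)  # set membership tests are O(1) on average
    seen = set()
    result = []
    for item_id in scraped_ids:
        if item_id is None or item_id in known or item_id in seen:
            continue
        seen.add(item_id)
        result.append(item_id)
    return result
```

Unlike `re.sub`, which returns the whole URL unchanged when the pattern does not match, `extract_item_id` returns `None` for links that are not items, so they can be skipped explicitly. Unlike a plain set difference, `unknown_ids` also keeps the order in which the links appear on the page.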

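If the collection is large, loading every known ID on each run may itself become the bottleneck. One possible alternative, not taken from the original answer, is to ask MongoDB only about the IDs scraped from the current page with an `$in` query. The sketch below assumes the documents store the ID in an `Id` field, as in the code above; `find_known_ids` is a hypothetical helper name.

```python
def find_known_ids(collection, candidate_ids):
    """Return the subset of candidate_ids that already exists in the collection.

    collection is expected to behave like a pymongo Collection, i.e. to
    provide find(filter, projection).
    """
    candidates = list(set(candidate_ids))
    if not candidates:
        return set()
    # Only documents whose Id is one of the candidates are returned
    cursor = collection.find({"Id": {"$in": candidates}}, {"Id": 1, "_id": 0})
    return {doc["Id"] for doc in cursor}


# Possible usage with the names from the question (assumed, not run here):
#   mycol.create_index("Id")          # an index on Id may speed up the lookup
#   known = find_known_ids(mycol, items_ids)
#   new_ids = set(items_ids) - known
```

Whether this beats loading all IDs once depends on the collection size and on whether `Id` is indexed, so it is worth timing both variants.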



How can I speed up comparing a list of known IDs from MongoDB against another list of IDs?
