How to speed up comparing a list of IDs against a known list of IDs from MongoDB?
Question
In my code, I run a time-consuming web-scraping process, and I need to know which IDs are already known:
import re

from pymongo import MongoClient
from selenium.webdriver.common.by import By

client = MongoClient('mongodb://localhost:27017/')
mydb = client['db']
mycol = mydb['collection']

if __name__ == '__main__':
    [...]
    # find_elements (plural) returns a list that can be iterated
    for item in driver.find_elements(By.XPATH, '//a'):
        flag = True
        url = item.get_attribute('href')
        # \1 keeps the captured ID; an empty replacement would discard it
        myid = re.sub(r'.*item/([0-9a-f-]+)\?.*', r'\1', url)
        # I iterate over all known MongoDB IDs to look for a match
        # It takes too much time to compare
        for oupid in mycol.find({}, {"Id": 1, "_id": 0}):
            if myid == oupid['Id']:
                flag = False
        if not flag:
            continue
Any recommendations?
It takes 6 minutes to parse the site, and most of that time is wasted comparing IDs.
Answer 1
Score: 0
Solved it like this, about 2 times faster than before:
# FIXME compare time JS vs Selenium
# Collect every link's href in a single browser round trip
items_urls = driver.execute_script("""
    const xpath = '//a';
    const x = [];
    const result = document.evaluate(xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    for (let i = 0; i < result.snapshotLength; i++) {
        const element = result.snapshotItem(i);
        x.push(element.href);
    }
    return x;
""")
# \1 keeps the captured ID; an empty replacement would discard it
items_ids = [re.sub(r'.*item/([0-9a-f-]+)\?.*', r'\1', url) for url in items_urls]
# Fetch the known IDs once instead of once per link (mycol is the collection from the question)
oupids = mycol.find({}, {"Id": 1, "_id": 0})
oupids = [item['Id'] for item in oupids]
# Keep only the IDs that are not already stored in MongoDB
items_list = list(set(items_ids).difference(set(oupids)))
for myid in items_list:
    item = driver.find_element(By.XPATH, f'//a[contains(@href, "/item/{myid}")]')
    url = item.get_attribute('href')
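The filtering step above can be tried without a browser or a database. The following is only a minimal sketch, not the author's exact code: the helper names `extract_id` and `filter_new_ids` and the example.com URLs are made up for illustration, and the regex is a `search`-based variant of the pattern used above. It uses a set for the known IDs, so each lookup takes constant time on average instead of scanning the whole collection for every link.

```python
import re

# Variant of the pattern used above, compiled once for reuse (assumption:
# an ID is always followed by a '?' query string, as in the original regex)
ITEM_ID_RE = re.compile(r'item/([0-9a-f-]+)\?')


def extract_id(url):
    """Return the item ID embedded in url, or None if the URL has none."""
    match = ITEM_ID_RE.search(url or '')
    return match.group(1) if match else None


def filter_new_ids(urls, known_ids):
    """Return the IDs found in urls that are not in known_ids, in page order."""
    known = set(known_ids)  # set membership is O(1) on average
    seen = set()            # avoids returning the same ID twice
    new_ids = []
    for url in urls:
        item_id = extract_id(url)
        if item_id is None or item_id in known or item_id in seen:
            continue
        seen.add(item_id)
        new_ids.append(item_id)
    return new_ids


# Hypothetical sample data standing in for the scraped hrefs and MongoDB IDs
urls = [
    'https://example.com/item/0a1b-2c3d?ref=home',
    'https://example.com/item/ffff-0000?ref=home',
    'https://example.com/about',
    'https://example.com/item/0a1b-2c3d?ref=footer',
]
print(filter_new_ids(urls, known_ids=['ffff-0000']))  # ['0a1b-2c3d']
```

Unlike `re.sub`, which returns the whole URL unchanged when the pattern does not match, `search` lets links without an ID be skipped explicitly. In the real script, `urls` would be the list returned by `execute_script` and `known_ids` the `Id` values read from the collection.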