Optimizing a scraper


how to optimize a scraper

Question


I'm trying to figure out how to optimize some code. The purpose is to go through a word list (10k words), make a search query for each word, and then get the last result, printing it if the result is before a certain date.

The code:

```python
import requests
import json
import os

ln = 0
os.system("clear")

with open("wordlist.txt") as file:
    lines = file.readlines()  # the original post left this assignment blank

for line in lines:
    try:
        query = str(line)
        responsetext = requests.get("https://us-central1-sandtable-8d0f7.cloudfunctions.net/api/creations?title=" + query).text
        responsedict = json.loads(responsetext)
        length = int(len(responsedict))
        if length != 0:
            item = responsedict[length - 1]
            itemtimestamp = item["data"]["timestamp"]
            if str(itemtimestamp[:4]) == "2018" and int(itemtimestamp[8:10]) <= 14:
                itemtitle = item["data"]["title"]
                # itemid = item["data"]["id"]
                itemurl = "https://sandspiel.club/#" + item["data"]["id"]
                print(" Title: " + str(itemtitle))
                # print(" Post ID: " + itemid)
                print(" Post URL: " + itemurl)
                print(" Post date: " + itemtimestamp[:10])
                print(" Timestamp: " + itemtimestamp)
                print(" Word: " + query)
                # print(" Post time: " + itemtimestamp[12:19])
                open('posts.txt', 'w').writelines(itemtitle + "\n" + itemurl + "\n" + itemtimestamp + "\n")
        pass
    except:
        print(query + str(length) + " Error!")
        continue
    ln += 1

print("\n\n done!")
```

# Answer 1

**Score:** 1

Use the Session object from the requests library so you can reuse the underlying TCP connection. You could also use a single file object, so that you don't have to open and close the file on every iteration; f-strings are cleaner too. If possible, use a smaller word list or look into parallel processing.

```python
import os

import requests

os.system("clear")
session = requests.Session()  # reuses the underlying TCP connection across requests

with open("wordlist.txt") as file, open("posts.txt", "w") as output_file:
    lines = file.readlines()  # the original answer left this assignment blank
    for line in lines:
        length = 0  # defined up front so the except block can print it even if the request fails
        try:
            query = line.strip()  # drop the trailing newline before building the URL
            response = session.get(f"https://us-central1-sandtable-8d0f7.cloudfunctions.net/api/creations?title={query}")
            response_dict = response.json()
            length = len(response_dict)
            if length != 0:
                item = response_dict[length - 1]
                item_data = item["data"]
                item_timestamp = item_data["timestamp"]
                if item_timestamp.startswith("2018") and int(item_timestamp[8:10]) <= 14:
                    item_title = item_data["title"]
                    item_url = f"https://sandspiel.club/#{item_data['id']}"
                    print(f" Title: {item_title}")
                    print(f" Post URL: {item_url}")
                    print(f" Post date: {item_timestamp[:10]}")
                    print(f" Timestamp: {item_timestamp}")
                    print(f" Word: {query}")
                    output_file.writelines([item_title + "\n", item_url + "\n", item_timestamp + "\n"])
        except Exception as e:
            print(f"{query} {length} Error: {e}")
            continue

print("\n\n done!")
```
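The answer's last suggestion, parallel processing, can be sketched with the standard library's thread pool. This is a minimal sketch, not part of the original answer; `run_parallel` and the worker signature are illustrative names:

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(words, worker, max_workers=8):
    """Apply worker to each word concurrently, preserving input order.

    worker is any one-argument function; for this scraper it would wrap
    the session.get(...) call from the answer's loop body.
    """
    # Threads fit here because the job is I/O-bound: while one thread
    # waits on the network, the others can issue their own requests.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, words))


# Illustrative usage (assumed names, mirroring the answer's code):
# results = run_parallel(lines, lambda q: session.get(API + q.strip()).json())
```

Note that `pool.map` preserves input order and re-raises any worker exception when its result is consumed, so the per-word try/except belongs inside the worker function.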

huangapple
  • Posted on April 17, 2023 at 01:23:53
  • Please keep this link when reposting: https://go.coder-hub.com/76029287.html