Scrapy does not capture all of the markup that comes back in the response
Question
I'm trying to capture the markup that arrives in the HTTP response, but I only get part of it. For some reason the output cuts off in the middle. What could be causing this? Here is my code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.log import configure_logging


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions/tagged/python?tab=newest&page=1&pagesize=15']
    first_request_done = False

    def start_requests(self):
        if not self.first_request_done:
            self.first_request_done = True
            for url in self.start_urls:
                yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        if response.status == 200 and response.headers.get('Content-Type', '').startswith(b'text/html'):
            html = response.body.decode('utf-8')
            print(html)
            yield


configure_logging()
process = CrawlerProcess(settings={
    'LOG_ENABLED': False,
    'DOWNLOAD_DELAY': 1,
    'CONCURRENT_REQUESTS': 1
})
process.crawl(StackOverflowSpider)
process.start(stop_after_crawl=False)
Answer 1
Score: 0
This is just Python's print function not properly flushing the output when the whole page is printed in a single call; the response itself is complete. You can demonstrate this by splitting the page content into lines and printing them one at a time, or by writing the contents to a file and viewing the full output there.
For example, you can print it line by line:
def parse(self, response):
    for line in response.text.splitlines():
        print(line)
Or, if you want to write the contents to a file:
def parse(self, response):
    with open('response.html', "wt", encoding="utf8") as htmlfile:
        htmlfile.write(response.text)
    ...
    ...
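To check whether the truncation happens in the download or only on display, a small standalone sketch along these lines may help. The helper names `write_in_chunks` and `ends_with_closing_html` are hypothetical (they are not part of Scrapy), the chunk size of 4096 is an arbitrary assumption, and the closing-tag check is only a rough heuristic for "the page arrived whole".

```python
import io
import sys


def write_in_chunks(text, stream=sys.stdout, chunk_size=4096):
    """Write text to a stream in fixed-size chunks, flushing after each chunk.

    Returns the number of characters written, so the caller can compare
    it with len(text).
    """
    written = 0
    for start in range(0, len(text), chunk_size):
        written += stream.write(text[start:start + chunk_size])
        stream.flush()
    return written


def ends_with_closing_html(text):
    """Heuristic: does the document end with a closing </html> tag?"""
    return text.rstrip().lower().endswith("</html>")


if __name__ == "__main__":
    # Stand-in for response.text; a real spider would pass the page body here.
    sample = "<html><body>" + "x" * 10000 + "</body></html>\n"
    buffer = io.StringIO()
    count = write_in_chunks(sample, stream=buffer)
    print(count == len(sample), ends_with_closing_html(buffer.getvalue()))
```

Inside `parse`, one could call `write_in_chunks(response.text)` in place of `print(html)` and log the result of `ends_with_closing_html(response.text)`. If the check passes but the terminal still shows a cut-off page, the loss is on the display side rather than in the download.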