Question

I'm trying to intercept the markup that arrives in HTTP responses, but I only get part of it. For some reason the output is cut off in the middle. What could be causing this? Here is my code:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.log import configure_logging


class StackOverflowSpider(scrapy.Spider):
    
    name = 'stackoverflow'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions/tagged/python?tab=newest&page=1&pagesize=15']
    first_request_done = False
    
    def start_requests(self):
        if not self.first_request_done:
            self.first_request_done = True
            for url in self.start_urls:
                yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
            
    def parse(self, response):
        if response.status == 200 and response.headers.get('Content-Type', b'').startswith(b'text/html'):
            html = response.body.decode('utf-8')
            print(html)
        
        yield
    

configure_logging()
process = CrawlerProcess(settings={
    'LOG_ENABLED': False,
    'DOWNLOAD_DELAY': 1,
    'CONCURRENT_REQUESTS': 1
})
process.crawl(StackOverflowSpider)
process.start(stop_after_crawl=False)

Answer 1

Score: 0

This is just Python's print function not flushing the output properly. You can demonstrate this by splitting the page content into lines and printing them one at a time, or by writing the contents to a file and viewing the full output there.

For example, you can try this to print it out line by line:

def parse(self, response):
    for line in response.text.splitlines():
        print(line)

Or, if you want to write the contents to a file:

def parse(self, response):
    with open('response.html', "wt", encoding="utf8") as htmlfile:
        htmlfile.write(response.text)
    ...
    ...
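
To check whether the markup itself is complete and only the console output is cut off, a small standalone sketch like the one below may help. It is not part of Scrapy: `write_flushed` and `ends_with_closing_html` are hypothetical helper names, and `sample` is a made-up stand-in for `response.text`. The sketch writes the text line by line, flushes after each line, and returns the character count so it can be compared with the length of the original string.

```python
import io
import sys


def write_flushed(text, stream=None):
    """Write text one line at a time, flushing after each line.

    Returns the number of characters written, so the caller can
    compare it with len(text) to confirm nothing was dropped.
    """
    stream = sys.stdout if stream is None else stream
    written = 0
    for line in text.splitlines(keepends=True):
        stream.write(line)
        stream.flush()  # push each line out instead of relying on buffering
        written += len(line)
    return written


def ends_with_closing_html(text):
    """Rough completeness check: does the markup end with </html>?"""
    return text.rstrip().lower().endswith("</html>")


# Hypothetical stand-in for response.text (assumption, not real page data)
sample = "<html>\n<body>\n<p>hello</p>\n</body>\n</html>\n"

# Write into an in-memory buffer so the result can be inspected
buffer = io.StringIO()
count = write_flushed(sample, buffer)
```

Inside the spider's `parse` method, the same helpers could be called as `write_flushed(response.text)` and `ends_with_closing_html(response.text)`. If the count matches `len(response.text)` and the closing tag is present, the response was probably received in full and the truncation is on the display side.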
