Using FormRequest to extract data via HTTP POST
Question
Hey guys,
I want to crawl the details of all the products on the site https://bitsclassic.com/fa/ with Scrapy.
To get the URLs of the products, I have to send a POST request to the web service https://bitsclassic.com/fa/Product/ProductList.
I tried this, but it produces no output!
How do I send the POST request?
import re

import scrapy
from scrapy import FormRequest

from ..items import BitsclassicItem  # adjust to your project's items module


class BitsclassicSpider(scrapy.Spider):
    name = "bitsclassic"
    start_urls = ['https://bitsclassic.com/fa']

    def parse(self, response):
        """
        This method is the default callback function that will be
        executed when the spider starts crawling the website.
        """
        category_urls = response.css('ul.children a::attr(href)').getall()[1:]
        for category_url in category_urls:
            yield scrapy.Request(category_url, callback=self.parse_category)

    def parse_category(self, response):
        """
        This method is the callback function for the category requests.
        """
        category_id = re.search(r"/(\d+)-", response.url).group(1)
        num_products = 1000

        # Create the form data for the POST request
        form_data = {
            'Cats': str(category_id),
            'Size': str(num_products)
        }

        # Send a POST request to retrieve the product list
        yield FormRequest(
            url='https://bitsclassic.com/fa/Product/ProductList',
            method='POST',
            formdata=form_data,
            callback=self.parse_page
        )

    def parse_page(self, response):
        """
        This method is the callback function for the product page requests.
        """
        # Extract data from the response using XPath or CSS selectors
        title = response.css('p[itemrolep="name"]::text').get()
        url = response.url
        categories = response.xpath('//div[@class="con-main"]//a/text()').getall()
        price = response.xpath('//div[@id="priceBox"]//span[@data-role="price"]/text()').get()

        # Process the extracted data
        if price is not None:
            price = price.strip()
            product_exist = True
        else:
            price = None
            product_exist = False

        # Create a new item with the extracted data
        item = BitsclassicItem()
        item["title"] = title.strip()
        item["categories"] = categories[3:-1]
        item["product_exist"] = product_exist
        item["price"] = price
        item["url"] = response.url
        item["domain"] = "bitsclassic.com/fa"

        # Yield the item to pass it to the next pipeline stage
        yield item
Is the way I'm making the request correct?
Answer 1
Score: 1
The request is fine.
You have a couple of other problems:

- The response you're getting from the form request is a JSON response, and you need to treat it as JSON rather than as an HTML response (see the short check sketched after this list).
- You only get the first item from each page. You need to use a for loop over the products.
- There are some things you can do to improve your code; I did some of them.
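A quick way to confirm the first point is to log what the endpoint actually returns. This is a minimal debugging sketch of a callback that could temporarily replace parse_page in the original spider; it assumes the payload is a JSON object, and the 'Html' key is an assumption about this particular endpoint, taken from the revised code below (response.json() requires Scrapy 2.2+):

    def parse_page(self, response):
        # The endpoint answers with JSON, not an HTML page, so CSS/XPath
        # selectors applied directly to `response` match nothing.
        self.logger.info("Content-Type: %s", response.headers.get("Content-Type"))
        data = response.json()  # parse the JSON body
        self.logger.info("Top-level JSON keys: %s", list(data))
        # Assumption: 'Html' holds the rendered product markup
        self.logger.info("Html present: %s", bool(data.get("Html")))

The full revised spider: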
import scrapy
from scrapy import FormRequest
from scrapy.http import HtmlResponse

from ..items import BitsclassicItem  # adjust to your project's items module


class BitsclassicSpider(scrapy.Spider):
    name = "bitsclassic"
    start_urls = ['https://bitsclassic.com/fa']

    def parse(self, response):
        """
        This method is the default callback function that will be
        executed when the spider starts crawling the website.
        """
        category_urls = response.css('ul.children a')
        for category in category_urls[1:]:
            category_url = category.css('::attr(href)').get()
            category_id = category.re(r"/(\d+)-")[0]
            yield scrapy.Request(category_url, callback=self.parse_category,
                                 cb_kwargs={'category_id': category_id})

    def parse_category(self, response, category_id):
        """
        This method is the callback function for the category requests.
        """
        # Create the form data for the POST request
        form_data = {
            'Cats': str(category_id),
            'Size': '12',  # products per page
        }
        page = 1
        form_data['Page'] = str(page)
        yield FormRequest(
            url='https://bitsclassic.com/fa/Product/ProductList',
            method='POST',
            formdata=form_data,
            callback=self.parse_page,
            cb_kwargs={'url': response.url, 'form_data': form_data, 'page': page}
        )

    def parse_page(self, response, url, form_data, page):
        """
        This method is the callback function for the product list requests.
        """
        json_data = response.json()
        if not json_data:
            return
        html = json_data.get('Html', '')
        if not html.strip():
            return

        # Wrap the HTML fragment from the JSON payload in an HtmlResponse
        # so the usual selectors work on it.
        html_res = HtmlResponse(url=url, body=html, encoding='utf-8')
        for product in html_res.xpath('//div[@itemrole="item"]'):
            # Use relative XPath (.//) so each field comes from the current
            # product, not from the first match in the whole document.
            title = product.css('span[itemrole="name"]::text').get(default='').strip()
            # you need to check how to get the categories
            # categories = product.xpath('.//div[@class="con-main"]//a/text()').getall()
            price = product.xpath('.//span[@class="price"]/text()').get(default='').strip()
            product_url = product.xpath('.//a[@itemrole="productLink"]/@href').get()

            # Create a new item with the extracted data
            item = BitsclassicItem()
            item["title"] = title
            # item["categories"] = categories[3:-1]
            item["product_exist"] = bool(price)
            item["price"] = price
            item["url"] = product_url
            item["domain"] = "bitsclassic.com/fa"
            yield item

        # Pagination: keep requesting the next page until the endpoint
        # returns an empty result.
        page += 1
        form_data['Page'] = str(page)
        yield FormRequest(
            url='https://bitsclassic.com/fa/Product/ProductList',
            method='POST',
            formdata=form_data,
            callback=self.parse_page,
            cb_kwargs={'url': url, 'form_data': form_data, 'page': page}
        )
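Both versions of the code reference BitsclassicItem without showing its definition. A minimal sketch of what that item class could look like, with the field names taken from the spider code above (the module path depends on your project layout):

import scrapy

class BitsclassicItem(scrapy.Item):
    # Fields populated by the spider above
    title = scrapy.Field()
    categories = scrapy.Field()
    product_exist = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    domain = scrapy.Field()

With the item defined, the spider can be run from the project directory with, for example, scrapy crawl bitsclassic -o products.json to append the scraped items to a JSON feed file.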