Using FormRequest to extract data via HTTP POST



Hey guys,
I want to crawl the details of all the products on the site https://bitsclassic.com/fa/ with Scrapy.
To get the product URLs, I have to send a POST request to the web service https://bitsclassic.com/fa/Product/ProductList.
I tried this, but it produces no output!
How do I send the POST request?

```python
import re

import scrapy
from scrapy import FormRequest

from bitsclassic.items import BitsclassicItem  # adjust to your project's items module


class BitsclassicSpider(scrapy.Spider):
    name = "bitsclassic"
    start_urls = ['https://bitsclassic.com/fa']

    def parse(self, response):
        """
        This method is the default callback function that will be
        executed when the spider starts crawling the website.
        """
        category_urls = response.css('ul.children a::attr(href)').getall()[1:]
        for category_url in category_urls:
            yield scrapy.Request(category_url, callback=self.parse_category)

    def parse_category(self, response):
        """
        This method is the callback function for the category requests.
        """
        category_id = re.search(r"/(\d+)-", response.url).group(1)
        num_products = 1000
        # Create the form data for the POST request
        form_data = {
            'Cats': str(category_id),
            'Size': str(num_products)
        }
        # Send a POST request to retrieve the product list
        yield FormRequest(
            url='https://bitsclassic.com/fa/Product/ProductList',
            method='POST',
            formdata=form_data,
            callback=self.parse_page
        )

    def parse_page(self, response):
        """
        This method is the callback function for the product page requests.
        """
        # Extract data from the response using XPath or CSS selectors
        title = response.css('p[itemrolep="name"]::text').get()
        categories = response.xpath('//div[@class="con-main"]//a/text()').getall()
        price = response.xpath('//div[@id="priceBox"]//span[@data-role="price"]/text()').get()
        # Process the extracted data
        if price is not None:
            price = price.strip()
            product_exist = True
        else:
            price = None
            product_exist = False
        # Create a new item with the extracted data
        item = BitsclassicItem()
        item["title"] = title.strip()
        item["categories"] = categories[3:-1]
        item["product_exist"] = product_exist
        item["price"] = price
        item["url"] = response.url
        item["domain"] = "bitsclassic.com/fa"
        # Yield the item to pass it to the next pipeline stage for further processing
        yield item
```

I'm not sure whether the way I made the request is correct.
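For reference, `FormRequest` serializes `formdata` into a URL-encoded POST body (`application/x-www-form-urlencoded`). A minimal standard-library sketch of what goes over the wire, using example values for the fields the spider sends:

```python
from urllib.parse import urlencode

# Form fields as the spider above would send them for category 123
form_data = {"Cats": "123", "Size": "1000"}

# FormRequest encodes formdata as application/x-www-form-urlencoded
body = urlencode(form_data)
print(body)  # Cats=123&Size=1000
```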

Answer 1 (score: 1)


The request is fine. You have a couple of other problems.

  1. The response you're getting from the form request is JSON, so you need to treat it as JSON rather than as an HTML response.
  2. You only get the first item from each page. You need to use a for loop.
  3. There are some things you can do to improve your code; I did some of them.
```python
import scrapy
from scrapy import FormRequest
from scrapy.http import HtmlResponse

from bitsclassic.items import BitsclassicItem  # adjust to your project's items module


class BitsclassicSpider(scrapy.Spider):
    name = "bitsclassic"
    start_urls = ['https://bitsclassic.com/fa']

    def parse(self, response):
        """
        This method is the default callback function that will be
        executed when the spider starts crawling the website.
        """
        category_urls = response.css('ul.children a')
        for category in category_urls[1:]:
            category_url = category.css('::attr(href)').get()
            category_id = category.re(r"/(\d+)-")[0]
            yield scrapy.Request(category_url, callback=self.parse_category,
                                 cb_kwargs={'category_id': category_id})

    def parse_category(self, response, category_id):
        """
        This method is the callback function for the category requests.
        """
        # Create the form data for the POST request
        form_data = {
            'Cats': str(category_id),
            'Size': str(12)
        }
        page = 1
        form_data['Page'] = str(page)
        yield FormRequest(
            url='https://bitsclassic.com/fa/Product/ProductList',
            method='POST',
            formdata=form_data,
            callback=self.parse_page,
            cb_kwargs={'url': response.url, 'form_data': form_data, 'page': page}
        )

    def parse_page(self, response, url, form_data, page):
        """
        This method is the callback function for the product page requests.
        """
        json_data = response.json()
        if not json_data:
            return
        html = json_data.get('Html', '')
        if not html.strip():
            return
        html_res = HtmlResponse(url=url, body=html, encoding='utf-8')
        for product in html_res.xpath('//div[@itemrole="item"]'):
            # Extract data relative to each product node (".//", so every
            # product yields its own data instead of the first match)
            title = product.css('span[itemrole="name"]::text').get(default='').strip()
            # you need to check how to get the categories
            # categories = product.xpath('.//div[@class="con-main"]//a/text()').getall()
            price = product.xpath('.//span[@class="price"]/text()').get(default='').strip()
            product_url = product.xpath('.//a[@itemrole="productLink"]/@href').get()
            # Process the extracted data
            product_exist = bool(price)
            # Create a new item with the extracted data
            item = BitsclassicItem()
            item["title"] = title
            # item["categories"] = categories[3:-1]
            item["product_exist"] = product_exist
            item["price"] = price
            item["url"] = product_url
            item["domain"] = "bitsclassic.com/fa"
            # Yield the item to pass it to the next pipeline stage for further processing
            yield item
        # Pagination: request the next page; parse_page returns early once a page comes back empty
        page += 1
        form_data['Page'] = str(page)
        yield FormRequest(
            url='https://bitsclassic.com/fa/Product/ProductList',
            method='POST',
            formdata=form_data,
            callback=self.parse_page,
            cb_kwargs={'url': url, 'form_data': form_data, 'page': page}
        )
```
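The termination logic of the pagination above (keep incrementing `Page`, stop at the first empty payload) can be illustrated with a plain generator over canned pages; the fake pages below stand in for the web service:

```python
def paginate(pages):
    """Yield items page by page, stopping at the first empty payload,
    mirroring the spider's 'return when Html is empty' check."""
    page = 1
    while True:
        payload = pages.get(page, "")
        if not payload.strip():
            return
        yield from payload.split(",")
        page += 1

# Fake service: two non-empty pages, then an empty one
fake_pages = {1: "a,b", 2: "c", 3: ""}
items = list(paginate(fake_pages))
print(items)  # ['a', 'b', 'c']
```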

huangapple · Posted on 2023-05-28 18:58:21 · Original link: https://go.coder-hub.com/76351127.html