I wrote code to scrape data from a website and store it in a CSV, but I can only use the requests and lxml modules (XPath, not BeautifulSoup)


Question

import requests
from lxml import html
import os
import csv

s = requests.session()
headers_dict = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}

r = s.get("https://www.scrapethissite.com/pages/simple/", headers=headers_dict)
tree = html.fromstring(r.content)
rows = tree.xpath('//*[@id="countries"]/div')

# Create a CSV file to store the data
csv_file = open('CountryInfo.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Country', 'Capital', 'Population', 'Area'])

for row in rows:
    country_name = row.xpath('//*[@id="countries"]/div/div[4]/div[1]/h3')[0].text_content()
    country_capital = row.xpath('//*[@id="countries"]/div/div[4]/div[1]/div/span[1]')[0].text_content()
    population = row.xpath('//*[@id="countries"]/div/div[4]/div[1]/div/span[2]')[0].text_content()
    area = row.xpath('//*[@id="countries"]/div/div[4]/div[1]/div/span[3]')[0].text_content()
    csv_writer.writerow([country_name, country_capital, population, area])

csv_file.close()
print("Program Executed")


However, it only writes the first country's name, capital, etc. I understand why this is happening, but I want to scrape the details of every country. How can I do that?

I tried using a for loop, but I am not getting the desired output, which is a CSV file that contains all the countries with their details.

The website URL is: https://www.scrapethissite.com/pages/simple/

Answer 1

Score: 1


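In terms of XPath and Python: if you select rows = tree.xpath('//*[@id="countries"]/div') and then loop with for row in rows:, I would make every XPath selection on row relative to it, e.g. country_name = row.xpath('div[4]/div[1]/h3')[0] instead of country_name = (row.xpath('//*[@id="countries"]/div/div[4]/div[1]/h3'))[0]. A path that starts with // is evaluated from the document root even when it is called on row, so it returns the same first match on every iteration.

The same applies to the other fields.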
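To illustrate the difference between relative and absolute paths, here is a minimal sketch that runs against an inline HTML string instead of the live page. The markup in SAMPLE_HTML and the helper names extract_rows, extract_names_absolute and write_csv are made up for this example, and the relative paths ('h3', 'div/span[1]', ...) fit that sample only, not necessarily the real site. It also assumes that each element in rows is exactly one country; if the row selector matches a single wrapper element instead, the loop runs only once and the selector would need to target the per-country elements.

```python
import csv
import io

from lxml import html

# Hypothetical markup for illustration only; the real page may be nested differently.
SAMPLE_HTML = """
<html><body>
<section id="countries">
  <div class="country">
    <h3>Andorra</h3>
    <div><span>Andorra la Vella</span><span>84000</span><span>468.0</span></div>
  </div>
  <div class="country">
    <h3>Belgium</h3>
    <div><span>Brussels</span><span>10403000</span><span>30510.0</span></div>
  </div>
</section>
</body></html>
"""


def extract_rows(page_source):
    """Return one [name, capital, population, area] list per country element."""
    tree = html.fromstring(page_source)
    rows = tree.xpath('//*[@id="countries"]/div')
    records = []
    for row in rows:
        # No leading // here: each path is resolved starting from `row`.
        name = row.xpath('h3')[0].text_content().strip()
        capital = row.xpath('div/span[1]')[0].text_content().strip()
        population = row.xpath('div/span[2]')[0].text_content().strip()
        area = row.xpath('div/span[3]')[0].text_content().strip()
        records.append([name, capital, population, area])
    return records


def extract_names_absolute(page_source):
    """The pattern from the question: // restarts at the document root."""
    tree = html.fromstring(page_source)
    rows = tree.xpath('//*[@id="countries"]/div')
    return [
        row.xpath('//*[@id="countries"]/div/h3')[0].text_content().strip()
        for row in rows
    ]


def write_csv(records, out):
    """Write a header line plus all records to an open text stream."""
    writer = csv.writer(out)
    writer.writerow(['Country', 'Capital', 'Population', 'Area'])
    writer.writerows(records)


if __name__ == '__main__':
    buffer = io.StringIO()
    write_csv(extract_rows(SAMPLE_HTML), buffer)
    print(buffer.getvalue())
    # The absolute version returns the first country for every row.
    print(extract_names_absolute(SAMPLE_HTML))
```

To write to a real file, pass a file object to write_csv, for example one opened with open('CountryInfo.csv', 'w', newline=''); the csv module documentation recommends newline='' so that no extra blank lines appear on some platforms.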

Posted by huangapple on August 9, 2023, 17:39:36
Source: https://go.coder-hub.com/76866448.html