Question
I wrote code to scrape data from a website and store it in a CSV file, but I can only use the requests and lxml modules. I can use XPath, but not BeautifulSoup.
import requests
from lxml import html
import os
import csv
s = requests.session()
headers_dict = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}
#s.headers = headers_dict
r = s.get("https://www.scrapethissite.com/pages/simple/", headers = headers_dict)
tree = html.fromstring(r.content)
rows = tree.xpath('//*[@id="countries"]/div')
print(rows[0].text_content())
# Create a CSV file to store the data
csv_file = open('CountryInfo.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Country', 'Capital', 'Population', 'Area'])
for row in rows:
    country_name = (row.xpath('//*[@id="countries"]/div/div[4]/div[1]/h3'))[0]
    #print(country_name.text_content())
    country_capital = (row.xpath('//*[@id="countries"]/div/div[4]/div[1]/div/span[1]'))[0]
    population = (row.xpath('//*[@id="countries"]/div/div[4]/div[1]/div/span[2]'))[0]
    area = (row.xpath('//*[@id="countries"]/div/div[4]/div[1]/div/span[3]'))[0]
    csv_writer.writerow([country_name.text_content(), country_capital.text_content(), population.text_content(), area.text_content()])
print("Program Executed")
However, it only writes the first country's name, capital, and so on. I understand why this happens, but I want to scrape the details for all countries. How can I do that?
I tried using a for loop, but I did not get the desired output: a CSV file that contains every country with its details.
The website URL is: https://www.scrapethissite.com/pages/simple/
Answer 1
Score: 1
In terms of XPath and Python, if you select rows = tree.xpath('//*[@id="countries"]/div') and then loop with for row in rows, I would make all XPath selections on row relative, e.g. country_name = row.xpath('div[4]/div[1]/h3')[0] instead of country_name = (row.xpath('//*[@id="countries"]/div/div[4]/div[1]/h3'))[0]. A path that starts with // is evaluated from the document root no matter which element xpath() is called on, so every iteration returns the same first match.
The same applies to the other data.
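
As a rough illustration of the relative-path idea, here is a minimal sketch that runs against a small inline HTML snippet rather than the live site. The SAMPLE_PAGE markup, the helper names first_text, extract_countries and to_csv, and the shortened paths (h3, div/span[1]) are assumptions made for this example only; on the real page the relative paths would have to match its actual nesting.

```python
import csv
import io

from lxml import html

# Hypothetical stand-in for the real page; the live markup may be nested differently.
SAMPLE_PAGE = """
<html><body>
<section id="countries">
  <div>
    <h3>Andorra</h3>
    <div><span>Andorra la Vella</span><span>84000</span><span>468.0</span></div>
  </div>
  <div>
    <h3>United Arab Emirates</h3>
    <div><span>Abu Dhabi</span><span>4975593</span><span>82880.0</span></div>
  </div>
</section>
</body></html>
"""


def first_text(node, path):
    """Return the stripped text of the first match for a path evaluated from node."""
    return node.xpath(path)[0].text_content().strip()


def extract_countries(page_source):
    """Collect one [name, capital, population, area] record per row element."""
    tree = html.fromstring(page_source)
    records = []
    for row in tree.xpath('//*[@id="countries"]/div'):
        # No leading "//": each path starts at the current row, not at the document root.
        records.append([
            first_text(row, 'h3'),
            first_text(row, 'div/span[1]'),
            first_text(row, 'div/span[2]'),
            first_text(row, 'div/span[3]'),
        ])
    return records


def to_csv(records):
    """Serialize the records to CSV text in memory, with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['Country', 'Capital', 'Population', 'Area'])
    writer.writerows(records)
    return buffer.getvalue()


countries = extract_countries(SAMPLE_PAGE)
csv_text = to_csv(countries)
```

When writing to a real file instead of an in-memory buffer, the csv module documentation recommends opening it with newline='' so that no extra blank lines appear on Windows.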