2023-08-09 17:39:36

I wrote code to scrape data from a website and store it in a CSV file, but I can only use the requests and lxml modules. I can use XPath, not BeautifulSoup

Question

import requests
from lxml import html
import os
import csv

s = requests.session()
headers_dict = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}

r = s.get("https://www.scrapethissite.com/pages/simple/", headers=headers_dict)
tree = html.fromstring(r.content)
rows = tree.xpath('//*[@id="countries"]/div')

# Create a CSV file to store the data
csv_file = open('CountryInfo.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Country', 'Capital', 'Population', 'Area'])

for row in rows:
    country_name = row.xpath('//*[@id="countries"]/div/div[4]/div[1]/h3')[0].text_content()
    country_capital = row.xpath('//*[@id="countries"]/div/div[4]/div[1]/div/span[1]')[0].text_content()
    population = row.xpath('//*[@id="countries"]/div/div[4]/div[1]/div/span[2]')[0].text_content()
    area = row.xpath('//*[@id="countries"]/div/div[4]/div[1]/div/span[3]')[0].text_content()
    csv_writer.writerow([country_name, country_capital, population, area])

csv_file.close()
print("Program Executed")

This code writes only the first country's name, capital, population, and area. I understand why that happens, but I want to scrape the details for every country. How can I do that?

I tried using a for loop, but I did not get the desired output: a CSV file that contains all the countries with their details.

The website URL is: https://www.scrapethissite.com/pages/simple/

Answer 1

Score: 1

In terms of XPath and Python: if you select rows = tree.xpath('//*[@id="countries"]/div') and then loop with for row in rows:, I would make every XPath selection on row relative, e.g. country_name = row.xpath('div[4]/div[1]/h3')[0] instead of country_name = (row.xpath('//*[@id="countries"]/div/div[4]/div[1]/h3'))[0]. A path that starts with // is evaluated from the document root even when it is called on row, so every iteration returns the same first match.

The same applies to the other fields.
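As a rough illustration of the relative-path idea, here is a minimal sketch that runs offline against a small inline HTML sample instead of the live page. The sample markup, the helper name extract_countries, and the selector //*[@id="countries"]//div[h3] are assumptions made for this example, not something confirmed by the question. The selector picks each div that directly contains an h3, on the assumption that every country sits in its own block; if the original rows selector matches only a single wrapper element, the loop would run just once even with relative paths, so choosing a per-country element matters as much as making the inner paths relative.

```python
import csv
from lxml import html

# Hypothetical sample markup standing in for the live page (an assumption),
# so the sketch can run without a network request.
SAMPLE_HTML = """
<section id="countries">
  <div class="container">
    <div class="row">
      <div class="col-md-4 country">
        <h3> Andorra </h3>
        <div class="country-info">
          <strong>Capital:</strong> <span>Andorra la Vella</span><br>
          <strong>Population:</strong> <span>84000</span><br>
          <strong>Area:</strong> <span>468.0</span><br>
        </div>
      </div>
      <div class="col-md-4 country">
        <h3> United Arab Emirates </h3>
        <div class="country-info">
          <strong>Capital:</strong> <span>Abu Dhabi</span><br>
          <strong>Population:</strong> <span>4975593</span><br>
          <strong>Area:</strong> <span>82880.0</span><br>
        </div>
      </div>
    </div>
  </div>
</section>
"""


def extract_countries(tree):
    """Return one [name, capital, population, area] list per country block."""
    records = []
    # One element per country: every div that has an h3 as a direct child.
    for block in tree.xpath('//*[@id="countries"]//div[h3]'):
        # Paths starting with "./" are evaluated relative to `block`,
        # so each iteration reads its own country's fields.
        name = block.xpath('./h3')[0].text_content().strip()
        capital = block.xpath('./div/span[1]')[0].text_content().strip()
        population = block.xpath('./div/span[2]')[0].text_content().strip()
        area = block.xpath('./div/span[3]')[0].text_content().strip()
        records.append([name, capital, population, area])
    return records


tree = html.fromstring(SAMPLE_HTML.strip())
records = extract_countries(tree)

# newline='' avoids blank lines between rows on Windows; the with-block
# closes the file even if writing fails.
with open('CountryInfo.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Country', 'Capital', 'Population', 'Area'])
    writer.writerows(records)
```

To try this against the real site, the tree could be built from the requests response as in the question (html.fromstring(r.content)); the selectors may need adjusting if the actual markup differs from the sample assumed here.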
