Scrape only the text inside <ul>, without whitespace and tags
Question
I'm using xpath, I want to scrape from this URL: https://www.le-dictionnaire.com/definition/tout
I'm using this code, but it returns whitespace, newlines, and the <li> tags from the <ul>:
def parse(self, response):
    print("processing: " + response.url)
    # Extract data using css selectors
    # product_name = response.css('.product::text').extract()
    # price_range = response.css('.value::text').extract()
    # Extract data using xpath
    title = response.xpath("//b/text()").extract()
    genre1 = response.xpath("(//span/text())[2]").extract()
    def1 = response.xpath("((//*[self::ul])[1])").extract()
    genre2 = response.xpath("(//span/text())[3]").extract()
    def2 = response.xpath("((//*[self::ul])[2])").extract()
    row_data = zip(title, genre1, def1, genre2, def2)
    # Making extracted data row wise
    for item in row_data:
        # create a dictionary to store the scraped info
        scraped_info = {
            # key: value
            'page': response.url,
            'title': item[0],  # item[0] is the title, item[1] the first genre, and so on
            'genere1': item[1],
            'def1': item[2],
            'genere2': item[3],
            'def2': item[4],
        }
        # yield the scraped info to scrapy
        yield scraped_info
When I append /text() to the expressions:
def1 = response.xpath("((//*[self::ul])[1]/text())").extract()
def2 = response.xpath("((//*[self::ul])[2]/text())").extract()
it returns only whitespace.
Answer 1
Score: 1
This happens because the text you want is not a direct child of the <ul> element. /text() returns only the text nodes that are direct children of the selected element, and for this <ul> those are just the whitespace between the <li> tags. The text you want to scrape sits deeper, in descendants of the <ul>. To reach it, either use //text() instead of /text(), which returns every descendant text node, or narrow down the XPath expression (with n being the index of the definition box you want):
"//*[@class='defbox'][n]//ul/li/a/text()"
This gives a cleaner list as output, which you can then join into a single string:
>>> def1 = response.xpath("//*[@class='defbox'][1]//ul/li/a/text()").getall()
>>> ' '.join(def1)
'Qui comprend l’intégrité, l’entièreté, la totalité d’une chose considérée par rapport au nombre, à l’étendue ou à l’intensité de l’énergie.\n\nS’emploie devant un nom précédé ou non d’un article, d’un démonstratif ou d’un possessif. S’emploie aussi devant un nom propre. S’emploie également devant ceci, cela, ce que, ce qui, ceux qui et celles qui. S’emploie aussi comme attribut après le verbe.'
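To illustrate the difference between direct-child and descendant text, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The SAMPLE markup and the clean_fragments helper are hypothetical, made up for illustration, and the real page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup (an assumption, not copied from the real page).
SAMPLE = """
<div class="defbox">
  <ul>
    <li><a href="#">First definition.</a></li>
    <li><a href="#">Second   definition.</a></li>
  </ul>
</div>
"""


def clean_fragments(fragments):
    """Collapse whitespace inside each fragment and drop empty ones."""
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    return [text for text in cleaned if text]


root = ET.fromstring(SAMPLE)
ul = root.find("ul")

# Roughly what ul/text() selects: text nodes that are direct children of <ul>.
# Here that is only the indentation between the <li> elements.
direct = [ul.text or ""] + [(child.tail or "") for child in ul]

# Roughly what ul//text() selects: every text node anywhere below <ul>.
descendant = list(ul.itertext())

print(clean_fragments(direct))
print(clean_fragments(descendant))
```

In a spider, the same helper could be applied to the list returned by response.xpath("(//ul)[1]//text()").getall(). Alternatively, the XPath 1.0 function normalize-space() does the trimming inside the expression: response.xpath("normalize-space((//ul)[1])").get() returns the whole list as one string with runs of whitespace collapsed.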