January 7, 2020, 01:32:45

Scraping only the text inside a <ul>, without whitespace or tags

Question

I'm using XPath in a Scrapy spider to scrape this URL: https://www.le-dictionnaire.com/definition/tout

With the code below, the result includes whitespace, newlines, and the <li> tags of each <ul>:

def parse(self, response):
    print("processing: " + response.url)
    # Extract data using css selectors
    # product_name = response.css('.product::text').extract()
    # price_range = response.css('.value::text').extract()
    # Extract data using xpath
    title = response.xpath("//b/text()").extract()
    genre1 = response.xpath("(//span/text())[2]").extract()
    def1 = response.xpath("((//*[self::ul])[1])").extract()
    genre2 = response.xpath("(//span/text())[3]").extract()
    def2 = response.xpath("((//*[self::ul])[2])").extract()
    row_data = zip(title, genre1, def1, genre2, def2)
    # Making extracted data row wise
    for item in row_data:
        # create a dictionary to store the scraped info
        scraped_info = {
            # key:value
            'page': response.url,
            'title': item[0],  # index into the zipped row: 0 is the title, 1 is genre1, and so on
            'genere1': item[1],
            'def1': item[2],
            'genere2': item[3],
            'def2': item[4],
        }
        # yield or give the scraped info to scrapy
        yield scraped_info

When I append /text() to the expressions:

def1 = response.xpath("((//*[self::ul])[1]/text())").extract()
def2 = response.xpath("((//*[self::ul])[2]/text())").extract()

it returns only whitespace.

Answer 1

Score: 1

This happens because the text you want is not a direct child of the <ul> element. /text() returns only the text nodes that are direct children, and here those are just the whitespace between the <li> elements. The text you want to scrape sits deeper, in descendants of <ul>. To reach it, use //text() instead of /text(), or narrow the XPath expression, for example:

"//*[@class='defbox'][n]//ul/li/a/text()"

Here n is the 1-based position of the defbox block you want. This gives a cleaner list, which you can also join into a single string:

>>> def1 = response.xpath("//*[@class='defbox'][1]//ul/li/a/text()").getall()
>>> ' '.join(def1)
'Qui comprend l’intégrité, l’entièreté, la totalité d’une chose considérée par rapport au nombre, à l’étendue ou à l’intensité de l’énergie.\n\nS’emploie devant un nom précédé ou non d’un article, d’un démonstratif ou d’un possessif. S’emploie aussi devant un nom propre. S’emploie également devant ceci, cela, ce que, ce qui, ceux qui et celles qui. S’emploie aussi comme attribut après le verbe.'
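
The difference between direct-child text and descendant text can be seen without Scrapy. The sketch below is only an illustration: it uses the standard-library xml.etree.ElementTree instead of Scrapy's selectors, SAMPLE is a made-up fragment (not the real page markup), and direct_text_nodes and clean_join are hypothetical helper names. It also shows one way to get rid of the leftover newlines and runs of spaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a <ul> whose visible text sits inside <li><a>,
# as the answer above describes. Not copied from the real page.
SAMPLE = """<div class="defbox">
  <ul>
    <li><a href="#">First definition.</a></li>
    <li><a href="#">Second
        definition.</a></li>
  </ul>
</div>"""


def direct_text_nodes(element):
    """Text nodes that are direct children of element (roughly what ul/text() selects)."""
    nodes = [element.text or ""]
    nodes += [child.tail or "" for child in element]
    return nodes


def clean_join(text_nodes):
    """Collapse whitespace inside each node, drop empty nodes, join with single spaces."""
    collapsed = (" ".join(node.split()) for node in text_nodes)
    return " ".join(node for node in collapsed if node)


root = ET.fromstring(SAMPLE)
ul = root.find("ul")

direct = direct_text_nodes(ul)      # analogous to ul/text(): whitespace only
descendant = list(ul.itertext())    # analogous to ul//text(): includes the <a> text

print(repr(clean_join(direct)))
print(repr(clean_join(descendant)))
```

In a spider, the same clean_join helper could presumably be applied to the list returned by .getall(), for example clean_join(response.xpath("(//ul)[1]//text()").getall()), so that whitespace-only nodes are dropped before the value is yielded.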