Want to scrape only the text inside <ul>, without whitespace and tags

Question

I'm using XPath and want to scrape this URL: https://www.le-dictionnaire.com/definition/tout

The code below returns whitespace, newlines, and the <li> tags from each <ul>:

def parse(self, response):
    print("processing: " + response.url)
    # Extract data using css selectors
    # product_name = response.css('.product::text').extract()
    # price_range = response.css('.value::text').extract()
    # Extract data using xpath
    title = response.xpath("//b/text()").extract()
    genre1 = response.xpath("(//span/text())[2]").extract()
    def1 = response.xpath("((//*[self::ul])[1])").extract()
    genre2 = response.xpath("(//span/text())[3]").extract()
    def2 = response.xpath("((//*[self::ul])[2])").extract()

    row_data = zip(title, genre1, def1, genre2, def2)

    # Making extracted data row wise
    for item in row_data:
        # create a dictionary to store the scraped info
        scraped_info = {
            # key:value
            'page': response.url,
            'title': item[0],  # item[0] is the first value of the zipped row, and so on
            'genere1': item[1],
            'def1': item[2],
            'genere2': item[3],
            'def2': item[4],
        }

        # yield or give the scraped info to scrapy
        yield scraped_info

When I append /text() to the expressions:

def1 = response.xpath("((//*[self::ul])[1]/text())").extract()
def2 = response.xpath("((//*[self::ul])[2]/text())").extract()

the result contains only whitespace.


Answer 1

Score: 1

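This happens because the text you want is not a direct child of the <ul> tag, and /text() returns only text nodes that are direct children. The text you want to scrape sits deeper, in the descendants of <ul> (the <a> elements inside each <li>). To reach it, use //text() instead of /text(), or narrow the XPath expression, where n is the index of the definition box: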

"//*[@class='defbox'][n]//ul/li/a/text()"
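
To see the difference between direct-child text and descendant text without running the spider, here is a minimal sketch. It is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, and the SAMPLE markup is hypothetical, only imitating a <ul>/<li>/<a> nesting. The .text attribute is only roughly comparable to ul/text() (it covers just the text before the first child), while itertext() is comparable to ul//text().

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the visible text lives in <a> elements nested
# inside <li>, not directly inside <ul>.
SAMPLE = """
<ul>
  <li><a href="#">First definition.</a></li>
  <li><a href="#">Second definition.</a></li>
</ul>
""".strip()

ul = ET.fromstring(SAMPLE)

# Text directly inside <ul>, before its first child: only indentation.
direct_text = ul.text

# Text of every descendant, with whitespace-only pieces dropped.
descendant_text = [t.strip() for t in ul.itertext() if t.strip()]

print(repr(direct_text))
print(descendant_text)
```

The first print shows only a newline and spaces, which matches the "only blank spaces" symptom from the question; the second shows the two definition strings.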

This gives a cleaner list, which you can also join into a single clean string:

>>> def1 = response.xpath("//*[@class='defbox'][1]//ul/li/a/text()").getall()
>>> ' '.join(def1)
'Qui comprend l’intégrité, l’entièreté, la totalité d’une chose considérée par rapport au nombre, à l’étendue ou à l’intensité de l’énergie.\n\nS’emploie devant un nom précédé ou non d’un article, d’un démonstratif ou d’un possessif. S’emploie aussi devant un nom propre. S’emploie également devant ceci, cela, ce que, ce qui, ceux qui et celles qui. S’emploie aussi comme attribut après le verbe.'
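
If you go with the broader //text() route, or the joined string still contains stray newlines, the whitespace can be cleaned up in plain Python afterwards. The following is only a sketch: clean_text_nodes is a hypothetical helper name, and raw_nodes is made-up sample data standing in for whatever .getall() returns.

```python
def clean_text_nodes(nodes):
    """Collapse runs of whitespace in each node and drop nodes that end up empty."""
    cleaned = (" ".join(node.split()) for node in nodes)
    return [text for text in cleaned if text]


# Hypothetical result of a //text() query, including whitespace-only nodes.
raw_nodes = [
    "\n    ",
    "Qui comprend\n la totalité d’une chose.",
    "\n",
    "  S’emploie aussi comme attribut.  ",
]

def1 = " ".join(clean_text_nodes(raw_nodes))
print(def1)
```

Calling str.split() with no arguments splits on any run of whitespace, so newlines and repeated spaces inside a node collapse to single spaces when the pieces are joined again.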

Hope this helps.


huangapple
  • Posted on January 7, 2020 at 01:32:45
  • Please keep this link when reposting: https://go.coder-hub.com/59616490.html