2023-06-22 17:43:44

How to extract the text after the first h1 tag?

Question

I'm trying to write code to get and clean the text from 100 websites per day. I came across an issue with one website that has more than one h1 tag; when you scroll to the next h1 tag, the URL on the website changes, for example this website.

What I have is basically this:

response = requests.get('https://economictimes.indiatimes.com/news/international/business/volkswagen-sets-5-7-revenue-growth-target-preaches-cost-discipline/articleshow/101168014.cms', headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
soup = BeautifulSoup(response.content, 'html.parser')
if len(soup.body.find_all('h1')) > 2:    # to check if there is more than one tag
    if i.endswith(".cms"):               # to check if the website has a .cms ending (I have my doubts about this part)
        for elem in soup.next_siblings:
            if elem.name == 'h1':
                # GET THE TEXT SOMEHOW
            break

How can I get the text after the first h1 tag? (Please note that the text is in a <div> tag and not in a <p> tag.)

Answer 1

Score: 1

You had the right idea in trying to use .next_siblings, but you should keep in mind that soup.next_siblings is unlikely to generate anything as the document itself is generally not expected to have any siblings.
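A minimal sketch of that point, using a small made-up HTML string rather than the real page: iterating over the siblings of the soup object yields nothing, while the siblings of a tag inside the document can be walked as expected.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page is far larger.
html = "<html><body><h1>First</h1><div>Body text</div><h1>Second</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The BeautifulSoup object is the root of the tree, so it has no siblings.
doc_siblings = list(soup.next_siblings)

# A tag inside the document does have siblings to iterate over.
h1_siblings = [sib.name for sib in soup.find("h1").next_siblings]

print(doc_siblings)
print(h1_siblings)
```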

The following code finds the first header and then, if it doesn't have any siblings, searches up through its parents to find the nearest one with siblings. It then goes through those siblings but stops if another h1 tag is reached.

# list_of_urls = ['https://economictimes.indiatimes.com/...
# for url in list_of_urls:
    # response = requests.get(url,.....
    # soup = BeautifulSoup(response.content, 'html.parser')
    header1 = soup.find('h1')
    if not header1:
        print(f'[{response.status_code} {response.reason}] No headers at', url)
        continue
    
    if header1.next_sibling: hSibs = header1.next_siblings
    else:
        hParent = next((p for p in header1.parents if p.next_sibling), None)
        hSibs = hParent.next_siblings if hParent else []
    
    h1Sibs = []
    for ns in hSibs:
        if ns.name == 'h1' or (not isinstance(ns,str) and ns.find('h1')): break
        h1Sibs.append(ns)
    h1Sibs_text = '\n---\n'.join(ns.get_text(' ') for ns in h1Sibs)

For the site in your example, <kbd>print(h1Sibs_text)</kbd> should print

> SECTIONS Volkswagen sets 5-7% revenue growth target, preaches cost discipline Reuters Last Updated: Jun 21, 2023, 07:16 PM IST Rate Story Share Font Size Abc Small Abc Medium Abc Large Save Print Comment
> ---
> Synopsis The German carmaker has set "performance programmes" for each brand, allocating them capital and setting a specific return on sales target, but delegating responsibility to the brands for how those targets are reached, executives said in a press call on its Capital Markets Day. "If you look at how Volkswagen operated in the past, often we had a fixed cost growth and we wanted to outgrow that fixed cost," Chief Financial Officer Arno Antlitz said. Agencies Volkswagen sets 5-7% revenue growth target, preaches cost discipline Volkswagen set new financial targets on Wednesday of 5-7% annual revenue growth by 2027 and 9-11% returns by 2030, aiming to stay disciplined on investment and focus on boosting margins in the face of growing competition for market share. The German carmaker has set "performance programmes" for each brand, allocating them capital and setting a specific return on sales target, but delegating responsibility to the brands for how those targets are reached, executives said in a press call on its Capital Markets Day . "If you look at how Volkswagen operated in the past, often we had a fixed cost growth and we wanted to outgrow that fixed cost," Chief Financial Officer Arno Antlitz said. "We are convinced in the transformation we need to change that strategy to our value over volume approach, be very disciplined on fixed cost, be very disciplined on investment and rather focus on value," he added. In China , where internal combustion engine sales still provide high revenues for the carmaker, it has slightly reduced its target for battery-electric vehicle sales in the next 1-2 years and is instead focused on protecting margins, Antlitz said. 
The new revenue growth target is a marked jump from Volkswagen's performance in recent years, with revenue growing just 1.1-1.2% per year in the last two years, and 0.7% in 2018-2019 prior to the pandemic. Under the new performance programmes, each brand will have a set target for operating result, returns, net cash flow, cash conversion rate, and investment ratio, Volkswagen said in a statement, adding it would tie management incentives to meeting targets. The carmaker is planning separate capital markets days for each brand over the coming months to introduce those targets, sources close to the company told Reuters last Friday. Don’t miss out on ET Prime stories! Get your daily dose of business updates on WhatsApp. click here! Thursday, 22 Jun, 2023 Experience Your Economic Times Newspaper, The Digital Way! Read Complete Print Edition » Front Page Pure Politics ET Markets Smart Investing More Local Indices End at Record Peaks on HDFC, IT Gains India’s key stock benchmarks closed at record highs amid choppy trade on Wednesday, bucking the bearish mood in other Asian markets, as merger-bound HDFC Bank and HDFC, as well as software shares, paced the gains. Musk Meets Modi, Says Tesla to be in India Soon Tesla founder Elon Musk said he had a very good conversation with Prime Minister Narendra Modi and he is confident the company, the world’s largest electric carmaker, will be in India “as soon as humanly possible” and that it is likely to make a “significant investment” in the country. ZEE-Sony Deal on, Whether I’m CEO or Not Punit Goenka, CEO & MD of Zee Entertainment Enterprises, has said that the ZEE-Sony merger will go through whether or not he is the CEO of the merged company, as it benefits 96% of stakeholders. Read More News on volkswagen Capital Market Brands Capital Revenue antlitz arno antlitz capital markets day china (Catch all the Business News , Breaking News Events and Latest News Updates on The Economic Times .) 
Download The Economic Times News App to get Daily Market Updates & Live Business News. ... more less ETPrime stories of the day Venture capital Peak XV and Sequoia’s trek ahead has plenty of tricky troughs 11 mins read Investing Despite the INR1 lakh a share feat, MRF skids on 10 critical points investors shouldn’t overlook. 7 mins read OTT More than just claps and confetti: How IPL has transformed the in-stadium cricket viewing experience 14 mins read Subscribe to ETPrime Videos PM Modi, Joe Biden exchange gifts at White House PM signs on T-shirt of a boy as he welcomes Modi Details of PM Modi's gift to Joe Biden, Jill Biden Stock radar: Buy Grasim stock; target Rs 2080 Sensex loses over 50 pts, Nifty tests 18,850 Joe Biden, Jill Biden receive PM Modi at WH Here's what was there for PM at State Dinner Stock ideas by experts for June 22 Stocks in focus: Glenmark, LIC & more Richard Gere to Ruchira Kamboj on UN's Yoga event 1 2 3 Poll Are foreign rating agencies unfairly harsh on India? Yes No Can't say Vote Latest from ET Gritty Goenka says Sony-Zee merger is still on for larger audience Modi invites Micron to boost chip making in India Can Vedanta afford to repay debts amid profit pangs? Trending World News Nintendo Direct 2023 Pokemon Go Spotlight Hour Titanic tourist submersible Lionel Messi Britney Joy Venus Williams Boxing Summer solstice Jujutsu Kaisen Chapter 227 Wordle Today Quordle Today Mikayla Campinos Taylor Swift The Flash Box-office Plaza Wars Paxton Whitehead Summer Solstice 2023 Extraction 2 How to Watch Portugal vs Iceland Titanic Submarine >

Note that you don't have to use '\n---\n' to join the siblings' text - you can use any string as separator.
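As a self-contained sketch, the same idea can be wrapped in a function and tried on a small made-up HTML string. The function name `text_after_first_h1`, the `sep` parameter, and the sample markup are assumptions for illustration, not part of the original answer; this version also skips whitespace-only nodes, which the snippet above does not do.

```python
from bs4 import BeautifulSoup, Tag


def text_after_first_h1(html, sep="\n---\n"):
    """Return the text between the first h1 and the next h1 (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("h1")
    if header is None:
        return ""
    # Use the header's own siblings, or climb to the nearest ancestor that has some.
    if header.next_sibling is not None:
        siblings = header.next_siblings
    else:
        parent = next((p for p in header.parents if p.next_sibling is not None), None)
        siblings = parent.next_siblings if parent is not None else []
    parts = []
    for node in siblings:
        # Stop at the next h1, whether it is a sibling itself or nested inside one.
        if isinstance(node, Tag) and (node.name == "h1" or node.find("h1")):
            break
        text = node.get_text(" ", strip=True) if isinstance(node, Tag) else str(node).strip()
        if text:
            parts.append(text)
    return sep.join(parts)


# Made-up markup: the first h1 has no siblings of its own, so the search climbs to <header>.
sample = (
    "<body><header><h1>First article</h1></header>"
    "<div>First body.</div><div>More <b>first</b> body.</div>"
    "<section><h1>Second article</h1></section><div>Second body.</div></body>"
)
result = text_after_first_h1(sample, sep=" | ")
print(result)
```

Any string can be passed as `sep`; here `" | "` is used only to keep the output on one line.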

By the way, for that specific site's articles, a much simpler way would be to target the header tag specifically by its class:

    if url.startswith('https://economictimes.indiatimes.com/'): ## might need more
        h1Sibs = soup.select('*:has(>h1.artTitle)~*') 
        h1Sibs_text = '\n---\n'.join(ns.get_text(' ') for ns in h1Sibs)

<sup>NOTE: using select with the *:has(>h1.artTitle)~* selector is similar to using soup.find('h1',class_='artTitle').parent.next_siblings, but is safer than chaining find,parent,next_siblings as it will simply return an empty list instead of raising any errors if h1.artTitle is not found.</sup>
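The difference described in the note can be seen on two small made-up documents, one with an `h1.artTitle` and one without. This is only an illustrative sketch; the markup is invented and much simpler than the real page.

```python
from bs4 import BeautifulSoup

# Invented markup: one document has the targeted class, the other does not.
with_title = BeautifulSoup(
    "<div><h1 class='artTitle'>Title</h1></div><div>Body</div>", "html.parser"
)
without_title = BeautifulSoup(
    "<div><h1>Other</h1></div><div>Body</div>", "html.parser"
)

selector = "*:has(> h1.artTitle) ~ *"

# select() returns the later siblings of the element that directly contains h1.artTitle...
matched = [el.get_text() for el in with_title.select(selector)]
# ...and simply returns an empty result when there is no such h1.
unmatched = without_title.select(selector)

# The chained version fails because find() returns None when nothing matches.
try:
    without_title.find("h1", class_="artTitle").parent.next_siblings
    chain_error = None
except AttributeError as exc:
    chain_error = type(exc).__name__

print(matched, list(unmatched), chain_error)
```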

If you are scraping many different links but know the sites for most of them, you might want to break it up into if...elif... blocks for each site (or even groups of sites) and only use something generic like my first snippet for unlisted sites in the else block. You might even consider using something like this configurable parser with sets of selectors for each site.
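One possible shape for such a per-site setup is a dictionary from hostname to selector, with a generic fallback. This is a rough sketch under assumptions: `SITE_SELECTORS`, `generic_nodes`, `extract_text`, and the test URLs are hypothetical names and values, and the fallback is deliberately simpler than the first snippet (it does not climb to parents or stop at the next h1).

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical per-site configuration: hostname -> CSS selector for the nodes to keep.
SITE_SELECTORS = {
    "economictimes.indiatimes.com": "*:has(> h1.artTitle) ~ *",
}


def generic_nodes(soup):
    """Simplified fallback for unlisted sites: tags that follow the first h1."""
    header = soup.find("h1")
    return header.find_next_siblings() if header is not None else []


def extract_text(url, html, sep="\n---\n"):
    soup = BeautifulSoup(html, "html.parser")
    selector = SITE_SELECTORS.get(urlparse(url).hostname)
    nodes = soup.select(selector) if selector else generic_nodes(soup)
    return sep.join(node.get_text(" ", strip=True) for node in nodes)


# Made-up URLs and markup, used only to exercise both branches.
known = extract_text(
    "https://economictimes.indiatimes.com/made-up-path.cms",
    "<div><h1 class='artTitle'>T</h1></div><div>Known body</div>",
)
other = extract_text(
    "https://example.com/post",
    "<h1>T</h1><p>Generic body</p><p>More</p>",
    sep=" | ",
)
print(known)
print(other)
```

Keeping the selectors in data rather than in if...elif... branches means a new site can be supported by adding one dictionary entry.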

Answer 2

Score: 0

Maybe the BeautifulSoup parser would be helpful here.
https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/

`import requests
from bs4 import BeautifulSoup

response = requests.get('https://economictimes.indiatimes.com/news/international/business/volkswagen-sets-5-7-revenue-growth-target-preaches-cost-discipline/articleshow/101168014.cms', headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
soup = BeautifulSoup(response.content, 'html.parser')
h1_tags = soup.body.find_all('h1')
text_after_h1 = None  # stays None if no <p> sibling is found
if len(h1_tags) > 1:
    for sibling in h1_tags[0].next_siblings:
        if sibling.name == 'p':
            text_after_h1 = sibling.get_text(strip=True)
            break
print(text_after_h1)`

soup.body.find_all('h1') - This finds all the <h1> elements within the <body>.
- We iterate through the first h1's next siblings until we find a <p> tag (assuming that the p tag holds the text).
get_text() - grabs the text under the p tag. strip=True - removes any leading or trailing whitespace.
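Since the question says the text sits in a <div> rather than a <p>, a variation that accepts either tag might look like the sketch below. It runs on a made-up HTML string and uses `find_next_sibling` with a list of tag names in place of the manual loop; it only looks at direct siblings of the first h1, so it would find nothing if the h1 is wrapped in its own container.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the article text is in a <div>, as in the question.
html = (
    "<body><h1>First</h1><span>Byline</span>"
    "<div>Article text</div><h1>Second</h1></body>"
)
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.find("h1")
# find_next_sibling accepts a list of tag names and returns None when nothing matches.
block = first_h1.find_next_sibling(["div", "p"]) if first_h1 is not None else None
text_after_h1 = block.get_text(strip=True) if block is not None else None
print(text_after_h1)
```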

I had a similar issue once.
Hope this helps.
