How to extract the text after the first h1 tag?
Question
I'm trying to write code to fetch and clean the text from 100 websites per day. I came across an issue with one website that has more than one h1 tag; when you scroll to the next h1 tag, the URL in the browser changes (for example, this website).
What I have is basically this:
    import requests
    from bs4 import BeautifulSoup

    response = requests.get('https://economictimes.indiatimes.com/news/international/business/volkswagen-sets-5-7-revenue-growth-target-preaches-cost-discipline/articleshow/101168014.cms', headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
    soup = BeautifulSoup(response.content, 'html.parser')
    if len(soup.body.find_all('h1')) > 2:  # to check if there is more than one tag
        if i.endswith(".cms"):  # to check if the website has .cms ending (i have my doubts on this part)
            for elem in soup.next_siblings:
                if elem.name == 'h1':
                    # GET THE TEXT SOMEHOW
                    break
How can I get the text after the first h1 tag? (Please note that the text is in a <div> tag and not in a <p> tag.)
Answer 1
Score: 1
You had the right idea in trying to use `.next_siblings`, but keep in mind that `soup.next_siblings` is unlikely to generate anything, as the document itself is generally not expected to have any siblings.
The following code finds the first header and then [if it doesn't have any siblings] searches up through its parents to find the nearest one with siblings; it then goes through those siblings but stops if another `h1` tag is reached.
    # list_of_urls = ['https://economictimes.indiatimes.com/...
    # for url in list_of_urls:
        # response = requests.get(url,.....
        # soup = BeautifulSoup(response.content, 'html.parser')

        header1 = soup.find('h1')
        if not header1:
            print(f'[{response.status_code} {response.reason}] No headers at', url)
            continue

        if header1.next_sibling: hSibs = header1.next_siblings
        else:
            hParent = next((p for p in header1.parents if p.next_sibling), None)
            hSibs = hParent.next_siblings if hParent else []

        h1Sibs = []
        for ns in hSibs:
            if ns.name == 'h1' or (not isinstance(ns,str) and ns.find('h1')): break
            h1Sibs.append(ns)

        h1Sibs_text = '\n---\n'.join(ns.get_text(' ') for ns in h1Sibs)
For the site in your example, <kbd>print(h1Sibs_text)</kbd> should print
> SECTIONS Volkswagen sets 5-7% revenue growth target, preaches cost discipline Reuters Last Updated: Jun 21, 2023, 07:16 PM IST Rate Story Share Font Size Abc Small Abc Medium Abc Large Save Print Comment
> ---
> Synopsis The German carmaker has set "performance programmes" for each brand, allocating them capital and setting a specific return on sales target, but delegating responsibility to the brands for how those targets are reached, executives said in a press call on its Capital Markets Day. "If you look at how Volkswagen operated in the past, often we had a fixed cost growth and we wanted to outgrow that fixed cost," Chief Financial Officer Arno Antlitz said. Agencies Volkswagen sets 5-7% revenue growth target, preaches cost discipline Volkswagen set new financial targets on Wednesday of 5-7% annual revenue growth by 2027 and 9-11% returns by 2030, aiming to stay disciplined on investment and focus on boosting margins in the face of growing competition for market share. The German carmaker has set "performance programmes" for each brand, allocating them capital and setting a specific return on sales target, but delegating responsibility to the brands for how those targets are reached, executives said in a press call on its Capital Markets Day . "If you look at how Volkswagen operated in the past, often we had a fixed cost growth and we wanted to outgrow that fixed cost," Chief Financial Officer Arno Antlitz said. "We are convinced in the transformation we need to change that strategy to our value over volume approach, be very disciplined on fixed cost, be very disciplined on investment and rather focus on value," he added. In China , where internal combustion engine sales still provide high revenues for the carmaker, it has slightly reduced its target for battery-electric vehicle sales in the next 1-2 years and is instead focused on protecting margins, Antlitz said. The new revenue growth target is a marked jump from Volkswagen's performance in recent years, with revenue growing just 1.1-1.2% per year in the last two years, and 0.7% in 2018-2019 prior to the pandemic. 
Under the new performance programmes, each brand will have a set target for operating result, returns, net cash flow, cash conversion rate, and investment ratio, Volkswagen said in a statement, adding it would tie management incentives to meeting targets. The carmaker is planning separate capital markets days for each brand over the coming months to introduce those targets, sources close to the company told Reuters last Friday. Don’t miss out on ET Prime stories! Get your daily dose of business updates on WhatsApp. click here! Thursday, 22 Jun, 2023 Experience Your Economic Times Newspaper, The Digital Way! Read Complete Print Edition » Front Page Pure Politics ET Markets Smart Investing More Local Indices End at Record Peaks on HDFC, IT Gains India’s key stock benchmarks closed at record highs amid choppy trade on Wednesday, bucking the bearish mood in other Asian markets, as merger-bound HDFC Bank and HDFC, as well as software shares, paced the gains. Musk Meets Modi, Says Tesla to be in India Soon Tesla founder Elon Musk said he had a very good conversation with Prime Minister Narendra Modi and he is confident the company, the world’s largest electric carmaker, will be in India “as soon as humanly possible” and that it is likely to make a “significant investment” in the country. ZEE-Sony Deal on, Whether I’m CEO or Not Punit Goenka, CEO & MD of Zee Entertainment Enterprises, has said that the ZEE-Sony merger will go through whether or not he is the CEO of the merged company, as it benefits 96% of stakeholders. Read More News on volkswagen Capital Market Brands Capital Revenue antlitz arno antlitz capital markets day china (Catch all the Business News , Breaking News Events and Latest News Updates on The Economic Times .) Download The Economic Times News App to get Daily Market Updates & Live Business News. ... 
more less ETPrime stories of the day Venture capital Peak XV and Sequoia’s trek ahead has plenty of tricky troughs 11 mins read Investing Despite the INR1 lakh a share feat, MRF skids on 10 critical points investors shouldn’t overlook. 7 mins read OTT More than just claps and confetti: How IPL has transformed the in-stadium cricket viewing experience 14 mins read Subscribe to ETPrime Videos PM Modi, Joe Biden exchange gifts at White House PM signs on T-shirt of a boy as he welcomes Modi Details of PM Modi's gift to Joe Biden, Jill Biden Stock radar: Buy Grasim stock; target Rs 2080 Sensex loses over 50 pts, Nifty tests 18,850 Joe Biden, Jill Biden receive PM Modi at WH Here's what was there for PM at State Dinner Stock ideas by experts for June 22 Stocks in focus: Glenmark, LIC & more Richard Gere to Ruchira Kamboj on UN's Yoga event 1 2 3 Poll Are foreign rating agencies unfairly harsh on India? Yes No Can't say Vote Latest from ET Gritty Goenka says Sony-Zee merger is still on for larger audience Modi invites Micron to boost chip making in India Can Vedanta afford to repay debts amid profit pangs? Trending World News Nintendo Direct 2023 Pokemon Go Spotlight Hour Titanic tourist submersible Lionel Messi Britney Joy Venus Williams Boxing Summer solstice Jujutsu Kaisen Chapter 227 Wordle Today Quordle Today Mikayla Campinos Taylor Swift The Flash Box-office Plaza Wars Paxton Whitehead Summer Solstice 2023 Extraction 2 How to Watch Portugal vs Iceland Titanic Submarine
>
Note that you don't have to use `'\n---\n'` to join the siblings' text; you can use any string as a separator.
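To make the walk-up-then-scan idea easier to try offline, here is a minimal sketch that wraps the same logic in a function and runs it on a small inline document. The HTML, the `wrap` class, and the function name `text_after_first_h1` are made up for illustration; unlike the snippet above, this version also strips each piece of text and skips whitespace-only siblings.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical page: two "stories", each headed by an h1 wrapped in a div.
html = """
<html><body>
<div class="wrap"><h1>First story</h1></div>
<div>Body of the first story.</div>
<div>More of the first story.</div>
<div><h1>Second story</h1></div>
<div>Body of the second story.</div>
</body></html>
"""


def text_after_first_h1(soup, sep='\n---\n'):
    """Join the text of the nodes that follow the first h1, up to the next h1."""
    header1 = soup.find('h1')
    if header1 is None:
        return ''
    if header1.next_sibling:
        siblings = header1.next_siblings
    else:
        # The h1 is the last child of its parent: climb to the nearest
        # ancestor that does have a following sibling.
        parent = next((p for p in header1.parents if p.next_sibling), None)
        siblings = parent.next_siblings if parent else []
    parts = []
    for node in siblings:
        is_tag = isinstance(node, Tag)
        # Stop at the next h1, or at a container that holds one.
        if is_tag and (node.name == 'h1' or node.find('h1')):
            break
        text = node.get_text(' ', strip=True) if is_tag else str(node).strip()
        if text:  # skip whitespace-only strings between tags
            parts.append(text)
    return sep.join(parts)


soup = BeautifulSoup(html, 'html.parser')
print(list(soup.next_siblings))   # the document object itself has no siblings
print(text_after_first_h1(soup))
```

On this toy input the first `print` shows an empty list, and the second shows only the two body blocks of the first story, separated by `---`.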
By the way, for that specific site's articles, a much simpler way would be to target the header tag by its class:
    if url.startswith('https://economictimes.indiatimes.com/'): ## might need more
        h1Sibs = soup.select('*:has(>h1.artTitle)~*')
        h1Sibs_text = '\n---\n'.join(ns.get_text(' ') for ns in h1Sibs)
<sup>NOTE: using `select` with the `*:has(>h1.artTitle)~*` selector is similar to using `soup.find('h1',class_='artTitle').parent.next_siblings`, but is safer than chaining `find`, `parent`, and `next_siblings`, as it will simply return an empty list instead of raising any errors if `h1.artTitle` is not found.</sup>
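The difference between the selector and the chained lookup can be seen on a small inline document. This is only a sketch: the markup and every class name except `artTitle` are invented for the example, and `noSuchClass` is a deliberately missing class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the artTitle class comes from the answer above.
html = """
<div id="page">
  <div class="head"><h1 class="artTitle">Headline</h1></div>
  <div class="body">Article text.</div>
  <div class="footer">Footer text.</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Every element that follows, as a sibling, an element whose direct child is h1.artTitle.
matched = soup.select('*:has(> h1.artTitle) ~ *')
texts = [tag.get_text(' ', strip=True) for tag in matched]
print(texts)

# With a class that is absent, select() just returns an empty result.
missing = soup.select('*:has(> h1.noSuchClass) ~ *')
print(missing)

# The chained version fails instead, because find() returns None.
try:
    soup.find('h1', class_='noSuchClass').parent.next_siblings
    chained_failed = False
except AttributeError as err:
    chained_failed = True
    print('chained lookup failed:', err)
```

One practical difference worth keeping in mind: `select` returns only tags, while `.next_siblings` also yields the text strings that sit between tags.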
If you are scraping many different links, but you know the sites for most of them, you might want to break it up into `if...elif...` blocks for each site (or even groups of sites) and only use something generic like my first snippet for unlisted sites in the `else` block. You might even consider using something like this configurable parser with sets of selectors for each site.
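One possible shape for such a per-site setup is a lookup table from URL prefix to CSS selector, with a generic fallback for everything else. The sketch below is an assumption about how this could be organised, not the configurable parser mentioned above: `SITE_SELECTORS`, `generic_after_h1`, and `extract_text` are hypothetical names, and the fallback is deliberately simplified (it takes all tag siblings of the first h1 and does not stop at the next h1).

```python
from bs4 import BeautifulSoup

# Hypothetical registry: URL prefix -> selector for the nodes holding the text.
SITE_SELECTORS = {
    'https://economictimes.indiatimes.com/': '*:has(> h1.artTitle) ~ *',
}


def generic_after_h1(soup):
    """Simplified fallback: the tag siblings that follow the first h1."""
    h1 = soup.find('h1')
    return h1.find_next_siblings() if h1 else []


def extract_text(url, soup, sep='\n---\n'):
    """Use the site-specific selector if the URL is registered, else the fallback."""
    for prefix, selector in SITE_SELECTORS.items():
        if url.startswith(prefix):
            nodes = soup.select(selector)
            break
    else:  # no registered prefix matched
        nodes = generic_after_h1(soup)
    return sep.join(node.get_text(' ', strip=True) for node in nodes)


# Toy documents standing in for real responses.
known = BeautifulSoup(
    '<div><div><h1 class="artTitle">T</h1></div><div>A</div><div>B</div></div>',
    'html.parser')
unknown = BeautifulSoup(
    '<body><h1>T</h1><div>C</div><p>D</p></body>', 'html.parser')

print(extract_text('https://economictimes.indiatimes.com/x.cms', known))
print(extract_text('https://example.com/article', unknown))
```

Adding a site then only means adding one entry to the table, and the extraction logic stays in one place.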
Answer 2
Score: 0
Maybe the BeautifulSoup parser would be helpful here.
https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/
    import requests
    from bs4 import BeautifulSoup

    response = requests.get('https://economictimes.indiatimes.com/news/international/business/volkswagen-sets-5-7-revenue-growth-target-preaches-cost-discipline/articleshow/101168014.cms', headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
    soup = BeautifulSoup(response.content, 'html.parser')
    h1_tags = soup.body.find_all('h1')
    text_after_h1 = None  # stays None if no <p> sibling is found
    if len(h1_tags) > 1:
        for sibling in h1_tags[0].next_siblings:
            if sibling.name == 'p':
                text_after_h1 = sibling.get_text(strip=True)
                break
    print(text_after_h1)

`soup.body.find_all('h1')` finds all the <h1> elements within the <body>.
We iterate through the next siblings of the first <h1> until we find a <p> tag (assuming that the <p> tag holds the text).
`get_text()` grabs the text under the <p> tag; `strip=True` removes any leading or trailing whitespace.
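Since the question says the text sits in a <div> rather than a <p>, the same sibling scan can accept either tag. The following is a minimal sketch on a made-up inline document; it only helps when the text container is a direct sibling of the first h1, which may not hold for every page.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: the text after the first h1 lives in a div.
html = ('<body><h1>Title</h1>\n<div>  Text in a div.  </div>'
        '<p>Later paragraph.</p><h1>Next</h1></body>')
soup = BeautifulSoup(html, 'html.parser')

text_after_h1 = None  # stays None when nothing suitable is found
first_h1 = soup.find('h1')
if first_h1 is not None:
    for sibling in first_h1.next_siblings:
        # Skip bare strings (such as the newline) and take the first div or p.
        if isinstance(sibling, Tag) and sibling.name in ('div', 'p'):
            text_after_h1 = sibling.get_text(strip=True)
            break

print(text_after_h1)
```

Here `strip=True` removes the padding spaces inside the div, so the printed value is the bare sentence.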
Had a similar issue once.
Hope this helps.
- beautifulsoup
- content-management-system
- html
- python
- web