Find all div, scrape from span.

Question

<div class="NewsList_newsListContent__4UpiN">
<div>
<div>
<div class="NewsList_newsListItemWrap__XovMP">
<div style="display: flex;">
<div class="NewsList_newsListItem__yRAbe">
<a href="/flash-categories/Currency">
<div class="NewsList_newsListTag__TGHJ_">
<span>Currency</span>
</div></a></div></div>
<div class="NewsList_newsListContent__4UpiN">
<div class="NewsList_infoNewsListSubMobile__SPmAG">
<span>06 Jun 2023, 10:05 am </span>
</div>
<div class="NewsList_newsListText__hstO7">
<a href="/node/669947">
<span class="NewsList_newsListItemHead__dg7eK">Ringgit lower against US dollar in early session on June 6</span>
</a>
<a href="/node/669947">
<span class="NewsList_newsList__2fXyv">KUALA LUMPUR (June 6): The ringgit opened lower against   the US dollar in the early session on Tuesday (June 6), as investors remain cautious on the global outlook despite a slightly weaker greenback, an analyst said.&nbsp;At 9am, the local note fell to 4.5950/6000 versus the greenback, compared with Friday (June 2)’s closing of&nbsp;4.5745/5785.  </span>
</a>
</div>

For example, I want to scrape the headline inside the span with class "NewsList_newsListItemHead__dg7eK" above: Ringgit lower against US dollar in early session on June 6

This is my script:

import requests
from bs4 import BeautifulSoup

url = "https://theedgemalaysia.com/categories/malaysia"

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all <div> elements with class "NewsList_newsListContent__4UpiN"
container_divs = soup.find_all('div', class_='NewsList_newsListContent__4UpiN')

# Iterate over the container divs
for container_div in container_divs:
    # Find all <div> elements with class "NewsList_newsListText__hstO7" within the container
    news_text_divs = container_div.find_all('div', class_='NewsList_newsListText__hstO7')

    # Iterate over the news text divs
    for news_text_div in news_text_divs:
        # Find the <span> element with class "NewsList_newsListItemHead__dg7eK" within the news text div
        headline_span = news_text_div.find('span', class_='NewsList_newsListItemHead__dg7eK')

        # Print the text of the headline
        if headline_span:
            print(headline_span.text)

I have tried the script above, but it prints nothing and I cannot find the bug. Could anyone take a look and let me know where the problem is? I'd appreciate it a lot!

Answer 1

Score: 2

That page is rendered by JavaScript from data embedded in a script tag. Requests cannot execute JavaScript, so it won't see those titles the way you see them when visiting the page in a JS-enabled browser.
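
A quick way to check this kind of diagnosis is to search the raw response text for the headline class and for the embedded data script. The sketch below runs on two small hypothetical HTML strings (`RAW_HTML` and `RENDERED_HTML` are illustrative stand-ins, not the site's actual markup), and the helper name `diagnose` is made up for this example.

```python
# Hypothetical stand-in for what requests receives: an empty app shell
# plus the JSON data that the browser later renders with JavaScript.
RAW_HTML = (
    '<div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">{"props": {}}</script>'
)

# Hypothetical stand-in for the DOM after the browser has run the JavaScript.
RENDERED_HTML = (
    '<div class="NewsList_newsListText__hstO7">'
    '<span class="NewsList_newsListItemHead__dg7eK">Ringgit lower against '
    'US dollar in early session on June 6</span></div>'
)


def diagnose(page_html, css_class="NewsList_newsListItemHead__dg7eK"):
    """Report whether the headline class and the data script appear in the text."""
    return {
        "headline_class_in_html": css_class in page_html,
        "has_next_data": 'id="__NEXT_DATA__"' in page_html,
    }


raw_report = diagnose(RAW_HTML)
rendered_report = diagnose(RENDERED_HTML)
print(raw_report)
print(rendered_report)
```

With a live page you would pass `response.text` to `diagnose`. If the class is missing but the data script is present, the `find_all` calls in the question have nothing to match.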

Here is one way to get those titles:

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import json

# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
}

url = 'https://theedgemalaysia.com/categories/malaysia'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')

# The article data is embedded as JSON in the __NEXT_DATA__ script tag
data_script = soup.select_one('script[id="__NEXT_DATA__"]')
data = json.loads(data_script.string)

# Flatten the list of article records into a DataFrame
df = pd.json_normalize(data['props']['pageProps']['corporateData'])
print(df)

Result in terminal:

 	nid 	type 	language 	category 	options 	flash 	tags 	edited 	title 	created 	updated 	author 	source 	audio 	audioflag 	alias 	video_url 	img 	caption 	summary
0 	669998 	article 	english 	Corporate,Malaysia 	Top Stories 	Noon Market 			Bursa stays in the red at midday 	1686027040000 	1686027040000 	Bernama 	Bernama 		0 	node/669998 		https://assets.theedgemarkets.com/noon-market-... 		Bursa Malaysia stayed in the red at midday due...
1 	669997 	article 	english 	Corporate,Malaysia 					Skyworld eyes 3Q Main Market listing, inks und... 	1686026908000 	1686026908000 	Lam Jian Wyn 	theedgemalaysia.com 		0 	node/669997 		https://assets.theedgemarkets.com/SkyWorld-Dev... 		KUALA LUMPUR (June 6): SkyWorld Development Bh...
2 	669995 	article 	english 	Malaysia 	Top Stories,Politics & Government 	Parliament 			Investigation into Kedah MB over comments Pena... 	1686026473000 	1686026473000 	Hailey Chung & Chester Tay 	theedgemalaysia.com 		0 	node/669995 		https://assets.theedgemarkets.com/Kedah Sanusi... 	Kedah Menteri Besar Datuk Seri Muhammad Sanusi... 	Police have started an investigation into Keda...
3 	669984 	article 	english 	Malaysia,Economy 	Top Stories,Politics & Government 	Parliament 	mynewstv 		Anwar defends BNM’s gradual approach to moneta... 	1686025226000 	1686025226000 	Hailey Chung & Chester Tay 	theedgemalaysia.com 		0 	node/669984 		https://assets.theedgemarkets.com/Anwar 060620... 		Higher borrowing costs and the sharp depreciat...
4 	669980 	article 	english 	Malaysia,World,Economy 	Top Stories,Politics & Government 	ESG 			Global carbon markets face upheaval as nations... 	1686024746000 	1686024746000 	Natasha White & Ewa Krukowska 	Bloomberg 		0 	node/669980 		https://assets.theedgemarkets.com/398972891-fo... 		LONDON/BRUSSELS (June 6): The US$2 billion mar...
5 	669961 	article 	english 	Corporate,Malaysia 				Isabelle Francis 	CGS-CIMB starts coverage of Dayang Enterprise ... 	1686022324000 	1686022324000 	Anis Hazim 	theedgemalaysia.com 		0 	node/669961 		https://assets.theedgemarkets.com/Dayang-Enter... 		CGS-CIMB has initiated coverage of Dayang Ente...
6 	669957 	article 	english 	Malaysia 	Politics & Government 				Kit Siang expresses gratitude to Agong for 'Ta... 	1686021406000 	1686021406000 	Bernama 	Bernama 		0 	node/669957 		https://assets.theedgemarkets.com/Lim-Kit-sian... 		Veteran politician Tan Sri Lim Kit Siang expre...
7 	669956 	article 	english 	Corporate,Malaysia 			mynewstv 		1Q results came broadly below expectations, sa... 	1686020951000 	1686020951000 	Isabelle Francis 	theedgemalaysia.com 		0 	node/669956 		https://assets.theedgemarkets.com/Bursa-Malays... 		KUALA LUMPUR (June 6): Analysts said the first...
8 	669954 	article 	english 	Corporate,Management,Malaysia 	Top Stories 	ESG 	mynewstv 		24 public-listed companies still have no women... 	1686020019000 	1686020019000 	Tan Zhai Yun 	theedgemalaysia.com 		0 	node/669954 		https://assets.theedgemarkets.com/Bursa-4_2023... 		KUALA LUMPUR (June 6): As at June 1, 2023, 24 ...
9 	669953 	article 	english 	Corporate,Malaysia 		Hot Stock 	mynewstv 	Lam Jian Wyn 	Bumi Armada shares fall 20.47% on Kraken FPSO ... 	1686019041000 	1686019041000 	Anis Hazim 	theedgemalaysia.com 		0 	node/669953 		https://assets.theedgemarkets.com/Bumi-Armada-... 		KUALA LUMPUR (June 6): Shares of Bumi Armada B...
10 	669951 	article 	english 	Corporate,Malaysia,World 		Global Markets 			Asian stocks wobble as traders weigh Fed rate ... 	1686018738000 	1686018738000 	Ankur Banerjee 	Reuters 		0 	node/669951 		https://assets.theedgemarkets.com/395135636-As... 		SINGAPORE (June 6): Asian stock markets edged ...
11 	669948 	article 	english 	Malaysia,Court 	Politics & Government 			Lam Jian Wyn 	High Court dismisses Zuraida’s leave applicati... 	1686017713000 	1686017713000 	Tarani Palani 	theedgemalaysia.com 		0 	node/669948 		https://assets.theedgemarkets.com/Zuraida-Kama... 	Ampang member of Parliament Datuk Zuraida Kama... 	KUALA LUMPUR (June 6): The High Court has dism...
12 	669947 	article 	english 	Malaysia 	Top Stories 	Currency 			Ringgit lower against US dollar in early sessi... 	1686017126000 	1686017126000 	Bernama 	Bernama 		0 	node/669947 		https://assets.theedgemarkets.com/Ringgit-5_20... 		KUALA LUMPUR (June 6): The ringgit opened lowe...
13 	669945 	article 	english 	Corporate,Malaysia 		Market Open 			Bursa Malaysia marginally higher in early sess... 	1686016694000 	1686016694000 	Bernama 	Bernama 		0 	node/669945 		https://assets.theedgemarkets.com/opening-mark... 		KUALA LUMPUR (June 6): Bursa Malaysia rebounde...
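
If only the headlines are needed, the DataFrame step can be skipped. The sketch below pulls the `title` fields straight out of the embedded JSON. It assumes the same `props` / `pageProps` / `corporateData` layout used in the code above, and it uses only the standard library so it runs without a network call. `SAMPLE_HTML`, `NextDataParser` and `extract_titles` are hypothetical names for illustration, and the sample page is a hand-written stand-in, not the real response.

```python
import json
from html.parser import HTMLParser

# Hand-written stand-in for the page source (not the real response).
SAMPLE_HTML = """
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"corporateData": [
  {"nid": 669998, "title": "Bursa stays in the red at midday"},
  {"nid": 669947, "title": "Ringgit lower against US dollar in early session on June 6"}
]}}}
</script>
</body></html>
"""


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_titles(page_html):
    """Return the article titles found in the embedded JSON, or [] if absent."""
    parser = NextDataParser()
    parser.feed(page_html)
    parser.close()
    raw = "".join(parser.chunks)
    if not raw.strip():
        return []  # no __NEXT_DATA__ script on this page
    data = json.loads(raw)
    articles = data.get("props", {}).get("pageProps", {}).get("corporateData", [])
    return [article["title"] for article in articles if article.get("title")]


titles = extract_titles(SAMPLE_HTML)
for title in titles:
    print(title)
```

With BeautifulSoup already in use, `soup.select_one(...)` as in the code above does the same job as the parser class. The part worth keeping is the defensive `.get()` chain, since the key names belong to the site and may change.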

See the BeautifulSoup and pandas documentation for more details.
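
The `created` and `updated` columns in the output look like Unix timestamps in milliseconds, and `alias` matches the `href` in the question's HTML (`/node/669947`). Here is a minimal sketch of turning one record into a readable time and a full link, assuming both readings are right and that the site displays Malaysia time (UTC+8). The helper `summarize`, the constants and the `row` dict are hypothetical; the values in `row` are copied from the question and from row 12 of the output.

```python
from datetime import datetime, timedelta, timezone

MYT = timezone(timedelta(hours=8), "MYT")  # assumption: Malaysia time, UTC+8
BASE_URL = "https://theedgemalaysia.com/"  # assumption: alias is relative to the site root


def summarize(article):
    """Convert the millisecond timestamp and build the article URL."""
    created = datetime.fromtimestamp(article["created"] / 1000, tz=MYT)
    return {
        "title": article["title"],
        "created": created.isoformat(),
        "url": BASE_URL + article["alias"],
    }


# Values copied from the question and from row 12 of the output
row = {
    "nid": 669947,
    "title": "Ringgit lower against US dollar in early session on June 6",
    "created": 1686017126000,
    "alias": "node/669947",
}

summary = summarize(row)
print(summary)
```

For this record the converted time is 10:05 on June 6, 2023, which agrees with the "06 Jun 2023, 10:05 am" shown in the question's HTML, so the millisecond reading seems plausible.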

huangapple
  • Posted on June 6, 2023, 11:54:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76411348.html