Find all div, scrape from span
Question
The text your script extracts may contain HTML entities, which need to be decoded into plain characters. Here is one way to handle that:
import requests
from bs4 import BeautifulSoup
import html
url = "https://theedgemalaysia.com/categories/malaysia"
# Send a GET request to fetch the page content
response = requests.get(url)
# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all <div> elements with class "NewsList_newsListContent__4UpiN"
container_divs = soup.find_all('div', class_='NewsList_newsListContent__4UpiN')
# Iterate over the container divs
for container_div in container_divs:
    # Find the <div> elements with class "NewsList_newsListText__hstO7" inside the container
    news_text_divs = container_div.find_all('div', class_='NewsList_newsListText__hstO7')
    # Iterate over the news text divs
    for news_text_div in news_text_divs:
        # Find the <span> element with class "NewsList_newsListItemHead__dg7eK"
        headline_span = news_text_div.find('span', class_='NewsList_newsListItemHead__dg7eK')
        # Print the text of the headline
        if headline_span:
            # Decode any remaining HTML entities and print
            # (BeautifulSoup already decodes entities while parsing, so this is only a safeguard)
            decoded_text = html.unescape(headline_span.text)
            print(decoded_text)
With this change, the headline text prints with entities decoded, provided the headline spans are present in the HTML that requests receives. Hope this helps!
<div class="NewsList_newsListContent__4UpiN">
<div>
<div>
<div class="NewsList_newsListItemWrap__XovMP">
<div style="display: flex;">
<div class="NewsList_newsListItem__yRAbe">
<a href="/flash-categories/Currency">
<div class="NewsList_newsListTag__TGHJ_">
<span>Currency</span>
</div></a></div></div>
<div class="NewsList_newsListContent__4UpiN">
<div class="NewsList_infoNewsListSubMobile__SPmAG">
<span>06 Jun 2023, 10:05 am </span>
</div>
<div class="NewsList_newsListText__hstO7">
<a href="/node/669947">
<span class="NewsList_newsListItemHead__dg7eK">Ringgit lower against US dollar in early session on June 6</span>
</a>
<a href="/node/669947">
<span class="NewsList_newsList__2fXyv">KUALA LUMPUR (June 6): The ringgit opened lower against the US dollar in the early session on Tuesday (June 6), as investors remain cautious on the global outlook despite a slightly weaker greenback, an analyst said.&nbsp;At 9am, the local note fell to 4.5950/6000 versus the greenback, compared with Friday (June 2)’s closing of&nbsp;4.5745/5785. </span>
</a>
</div>
For example, I want to scrape the headline text inside the span with class NewsList_newsListItemHead__dg7eK above: Ringgit lower against US dollar in early session on June 6
This is my script:
import requests
from bs4 import BeautifulSoup
url = "https://theedgemalaysia.com/categories/malaysia"
# Send a GET request to the URL
response = requests.get(url)
# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all <div> elements with class "NewsList_newsListContent__4UpiN"
container_divs = soup.find_all('div', class_='NewsList_newsListContent__4UpiN')
# Iterate over the container divs
for container_div in container_divs:
    # Find all <div> elements with class "NewsList_newsListText__hstO7" within the container
    news_text_divs = container_div.find_all('div', class_='NewsList_newsListText__hstO7')
    # Iterate over the news text divs
    for news_text_div in news_text_divs:
        # Find the <span> element with class "NewsList_newsListItemHead__dg7eK" within the news text div
        headline_span = news_text_div.find('span', class_='NewsList_newsListItemHead__dg7eK')
        # Print the text of the headline
        if headline_span:
            print(headline_span.text)
I have tried the script above and cannot find the bug. Could anyone take a look and let me know where the problem is? I'd appreciate it a lot!
Answer 1
Score: 2
That page is built by JavaScript from data that already exists in a script tag. Requests cannot execute JavaScript, so it won't see those titles the way you see them when visiting the page in a JavaScript-enabled browser.
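A quick way to check this kind of diagnosis is to search the raw response text for the class name you are trying to match and for the id of the data script. The sketch below is only illustrative: `SAMPLE_RESPONSE_TEXT` and `check_static_html` are hypothetical names, and the sample string is a made-up stand-in for `response.text`, not the site's actual markup.

```python
# Hypothetical stand-in for response.text: an empty page shell plus embedded JSON.
SAMPLE_RESPONSE_TEXT = """
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"corporateData": [{"title": "Sample headline"}]}}}
</script>
</body></html>
"""


def check_static_html(html_text, class_name="NewsList_newsListItemHead__dg7eK"):
    """Report whether the headline class and the data script appear in the raw HTML."""
    return {
        "headline_class_in_html": class_name in html_text,
        "next_data_in_html": "__NEXT_DATA__" in html_text,
    }


report = check_static_html(SAMPLE_RESPONSE_TEXT)
print(report)
```

If the class name is missing from the raw text while the data script is present, `find_all` on the parsed HTML has nothing to match, and the embedded JSON is the better source.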
Here is one way to get those titles:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import json
# Send a browser-like User-Agent header with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
}
url = 'https://theedgemalaysia.com/categories/malaysia'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
# The article data is embedded as JSON in the script tag with id "__NEXT_DATA__"
data_script = soup.select_one('script[id="__NEXT_DATA__"]')
data = json.loads(data_script.string)
# Flatten the list of article records into a DataFrame
df = pd.json_normalize(data['props']['pageProps']['corporateData'])
print(df)
Result in terminal:
nid type language category options flash tags edited title created updated author source audio audioflag alias video_url img caption summary
0 669998 article english Corporate,Malaysia Top Stories Noon Market Bursa stays in the red at midday 1686027040000 1686027040000 Bernama Bernama 0 node/669998 https://assets.theedgemarkets.com/noon-market-... Bursa Malaysia stayed in the red at midday due...
1 669997 article english Corporate,Malaysia Skyworld eyes 3Q Main Market listing, inks und... 1686026908000 1686026908000 Lam Jian Wyn theedgemalaysia.com 0 node/669997 https://assets.theedgemarkets.com/SkyWorld-Dev... KUALA LUMPUR (June 6): SkyWorld Development Bh...
2 669995 article english Malaysia Top Stories,Politics & Government Parliament Investigation into Kedah MB over comments Pena... 1686026473000 1686026473000 Hailey Chung & Chester Tay theedgemalaysia.com 0 node/669995 https://assets.theedgemarkets.com/Kedah Sanusi... Kedah Menteri Besar Datuk Seri Muhammad Sanusi... Police have started an investigation into Keda...
3 669984 article english Malaysia,Economy Top Stories,Politics & Government Parliament mynewstv Anwar defends BNM’s gradual approach to moneta... 1686025226000 1686025226000 Hailey Chung & Chester Tay theedgemalaysia.com 0 node/669984 https://assets.theedgemarkets.com/Anwar 060620... Higher borrowing costs and the sharp depreciat...
4 669980 article english Malaysia,World,Economy Top Stories,Politics & Government ESG Global carbon markets face upheaval as nations... 1686024746000 1686024746000 Natasha White & Ewa Krukowska Bloomberg 0 node/669980 https://assets.theedgemarkets.com/398972891-fo... LONDON/BRUSSELS (June 6): The US$2 billion mar...
5 669961 article english Corporate,Malaysia Isabelle Francis CGS-CIMB starts coverage of Dayang Enterprise ... 1686022324000 1686022324000 Anis Hazim theedgemalaysia.com 0 node/669961 https://assets.theedgemarkets.com/Dayang-Enter... CGS-CIMB has initiated coverage of Dayang Ente...
6 669957 article english Malaysia Politics & Government Kit Siang expresses gratitude to Agong for 'Ta... 1686021406000 1686021406000 Bernama Bernama 0 node/669957 https://assets.theedgemarkets.com/Lim-Kit-sian... Veteran politician Tan Sri Lim Kit Siang expre...
7 669956 article english Corporate,Malaysia mynewstv 1Q results came broadly below expectations, sa... 1686020951000 1686020951000 Isabelle Francis theedgemalaysia.com 0 node/669956 https://assets.theedgemarkets.com/Bursa-Malays... KUALA LUMPUR (June 6): Analysts said the first...
8 669954 article english Corporate,Management,Malaysia Top Stories ESG mynewstv 24 public-listed companies still have no women... 1686020019000 1686020019000 Tan Zhai Yun theedgemalaysia.com 0 node/669954 https://assets.theedgemarkets.com/Bursa-4_2023... KUALA LUMPUR (June 6): As at June 1, 2023, 24 ...
9 669953 article english Corporate,Malaysia Hot Stock mynewstv Lam Jian Wyn Bumi Armada shares fall 20.47% on Kraken FPSO ... 1686019041000 1686019041000 Anis Hazim theedgemalaysia.com 0 node/669953 https://assets.theedgemarkets.com/Bumi-Armada-... KUALA LUMPUR (June 6): Shares of Bumi Armada B...
10 669951 article english Corporate,Malaysia,World Global Markets Asian stocks wobble as traders weigh Fed rate ... 1686018738000 1686018738000 Ankur Banerjee Reuters 0 node/669951 https://assets.theedgemarkets.com/395135636-As... SINGAPORE (June 6): Asian stock markets edged ...
11 669948 article english Malaysia,Court Politics & Government Lam Jian Wyn High Court dismisses Zuraida’s leave applicati... 1686017713000 1686017713000 Tarani Palani theedgemalaysia.com 0 node/669948 https://assets.theedgemarkets.com/Zuraida-Kama... Ampang member of Parliament Datuk Zuraida Kama... KUALA LUMPUR (June 6): The High Court has dism...
12 669947 article english Malaysia Top Stories Currency Ringgit lower against US dollar in early sessi... 1686017126000 1686017126000 Bernama Bernama 0 node/669947 https://assets.theedgemarkets.com/Ringgit-5_20... KUALA LUMPUR (June 6): The ringgit opened lowe...
13 669945 article english Corporate,Malaysia Market Open Bursa Malaysia marginally higher in early sess... 1686016694000 1686016694000 Bernama Bernama 0 node/669945 https://assets.theedgemarkets.com/opening-mark... KUALA LUMPUR (June 6): Bursa Malaysia rebounde...
See the BeautifulSoup and pandas documentation for more details.
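If only the headlines are needed, the DataFrame step can be skipped and the parsed JSON walked directly. The following is a minimal sketch, not the answer's method: `extract_headlines` and `sample_data` are hypothetical names, the sample record copies values from row 12 of the output above, and it assumes that `created` is a Unix timestamp in milliseconds and that the site's displayed times are in UTC+8.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample shaped like the parsed __NEXT_DATA__ JSON;
# the values are copied from row 12 of the output above.
sample_data = {
    "props": {
        "pageProps": {
            "corporateData": [
                {
                    "title": "Ringgit lower against US dollar in early session on June 6",
                    "created": 1686017126000,
                    "alias": "node/669947",
                }
            ]
        }
    }
}

# Assumption: the times shown on the site are Malaysia time, a fixed UTC+8 offset.
MYT = timezone(timedelta(hours=8))


def extract_headlines(data):
    """Return a (title, created, alias) tuple for each article in the parsed page data."""
    articles = data.get("props", {}).get("pageProps", {}).get("corporateData", [])
    headlines = []
    for article in articles:
        # Assumption: `created` is a Unix timestamp in milliseconds.
        created = datetime.fromtimestamp(int(article["created"]) / 1000, tz=MYT)
        headlines.append((article["title"], created, article["alias"]))
    return headlines


for title, created, alias in extract_headlines(sample_data):
    print(f"{created:%Y-%m-%d %H:%M} | {title} | /{alias}")
```

With the real page, `data` from the script above would be passed in place of `sample_data`. For the sample record the converted time is 10:05 in UTC+8, which agrees with the "06 Jun 2023, 10:05 am" shown in the question's HTML.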