Find all div, scrape from span.

Question

    <div class="NewsList_newsListContent__4UpiN">
      <div>
        <div>
          <div class="NewsList_newsListItemWrap__XovMP">
            <div style="display: flex;">
              <div class="NewsList_newsListItem__yRAbe">
                <a href="/flash-categories/Currency">
                  <div class="NewsList_newsListTag__TGHJ_">
                    <span>Currency</span>
                  </div>
                </a>
              </div>
            </div>
            <div class="NewsList_newsListContent__4UpiN">
              <div class="NewsList_infoNewsListSubMobile__SPmAG">
                <span>06 Jun 2023, 10:05 am </span>
              </div>
              <div class="NewsList_newsListText__hstO7">
                <a href="/node/669947">
                  <span class="NewsList_newsListItemHead__dg7eK">Ringgit lower against US dollar in early session on June 6</span>
                </a>
                <a href="/node/669947">
                  <span class="NewsList_newsList__2fXyv">KUALA LUMPUR (June 6): The ringgit opened lower against the US dollar in the early session on Tuesday (June 6), as investors remain cautious on the global outlook despite a slightly weaker greenback, an analyst said.&nbsp;At 9am, the local note fell to 4.5950/6000 versus the greenback, compared with Friday (June 2)’s closing of&nbsp;4.5745/5785. </span>
                </a>
              </div>

For example, I want to scrape the headline text inside the <span> with class "NewsList_newsListItemHead__dg7eK" above: Ringgit lower against US dollar in early session on June 6

This is my script:

    import requests
    from bs4 import BeautifulSoup
    url = "https://theedgemalaysia.com/categories/malaysia"
    # Send a GET request to the URL
    response = requests.get(url)
    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all <div> elements with class "NewsList_newsListContent__4UpiN"
    container_divs = soup.find_all('div', class_='NewsList_newsListContent__4UpiN')
    # Iterate over the container divs
    for container_div in container_divs:
        # Find all <div> elements with class "NewsList_newsListText__hstO7" within the container
        news_text_divs = container_div.find_all('div', class_='NewsList_newsListText__hstO7')
        # Iterate over the news text divs
        for news_text_div in news_text_divs:
            # Find the <span> element with class "NewsList_newsListItemHead__dg7eK" within the news text div
            headline_span = news_text_div.find('span', class_='NewsList_newsListItemHead__dg7eK')
            # Print the text of the headline
            if headline_span:
                print(headline_span.text)

I have tried the script above but cannot find the bug. Could anyone take a look and let me know where the problem is? I'd appreciate it a lot!

Answer 1

Score: 2

That page is rendered by JavaScript from data that already exists in a script tag. Requests cannot execute JavaScript, so it won't see those titles the way you see them when visiting the page in a JavaScript-enabled browser.
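One quick way to check this is to look at the raw response text rather than the DOM shown in the browser. The sketch below is only illustrative: `diagnose` and `SAMPLE_HTML` are hypothetical names, and the sample is a trimmed-down stand-in for what Requests might receive, not the site's actual markup.

```python
# Hypothetical stand-in for the raw HTML returned to requests:
# no rendered headline markup, only a JSON payload inside a script tag.
SAMPLE_HTML = """
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"corporateData": [
  {"nid": 669947, "title": "Ringgit lower against US dollar in early session on June 6"}
]}}}
</script>
</body></html>
"""


def diagnose(html_text, css_class="NewsList_newsListItemHead__dg7eK"):
    """Report whether the headline class or the embedded JSON is in the raw HTML."""
    return {
        "class_in_html": css_class in html_text,
        "has_next_data": "__NEXT_DATA__" in html_text,
    }


print(diagnose(SAMPLE_HTML))
```

With a live response, the same check would be `diagnose(requests.get(url).text)`. If the class is missing but `__NEXT_DATA__` is present, the titles most likely have to be read from the script tag instead of the rendered markup.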

Here is one way to get those titles:

    from bs4 import BeautifulSoup as bs
    import requests
    import pandas as pd
    import json
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
    }
    url = 'https://theedgemalaysia.com/categories/malaysia'
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    # The page data is embedded as JSON in the __NEXT_DATA__ script tag
    data_script = soup.select_one('script[id="__NEXT_DATA__"]')
    data = json.loads(data_script.string)
    # Flatten the list of article records into a DataFrame
    df = pd.json_normalize(data['props']['pageProps']['corporateData'])
    print(df)

Result in terminal:

    nid type language category options flash tags edited title created updated author source audio audioflag alias video_url img caption summary
    0 669998 article english Corporate,Malaysia Top Stories Noon Market Bursa stays in the red at midday 1686027040000 1686027040000 Bernama Bernama 0 node/669998 https://assets.theedgemarkets.com/noon-market-... Bursa Malaysia stayed in the red at midday due...
    1 669997 article english Corporate,Malaysia Skyworld eyes 3Q Main Market listing, inks und... 1686026908000 1686026908000 Lam Jian Wyn theedgemalaysia.com 0 node/669997 https://assets.theedgemarkets.com/SkyWorld-Dev... KUALA LUMPUR (June 6): SkyWorld Development Bh...
    2 669995 article english Malaysia Top Stories,Politics & Government Parliament Investigation into Kedah MB over comments Pena... 1686026473000 1686026473000 Hailey Chung & Chester Tay theedgemalaysia.com 0 node/669995 https://assets.theedgemarkets.com/Kedah Sanusi... Kedah Menteri Besar Datuk Seri Muhammad Sanusi... Police have started an investigation into Keda...
    3 669984 article english Malaysia,Economy Top Stories,Politics & Government Parliament mynewstv Anwar defends BNMs gradual approach to moneta... 1686025226000 1686025226000 Hailey Chung & Chester Tay theedgemalaysia.com 0 node/669984 https://assets.theedgemarkets.com/Anwar 060620... Higher borrowing costs and the sharp depreciat...
    4 669980 article english Malaysia,World,Economy Top Stories,Politics & Government ESG Global carbon markets face upheaval as nations... 1686024746000 1686024746000 Natasha White & Ewa Krukowska Bloomberg 0 node/669980 https://assets.theedgemarkets.com/398972891-fo... LONDON/BRUSSELS (June 6): The US$2 billion mar...
    5 669961 article english Corporate,Malaysia Isabelle Francis CGS-CIMB starts coverage of Dayang Enterprise ... 1686022324000 1686022324000 Anis Hazim theedgemalaysia.com 0 node/669961 https://assets.theedgemarkets.com/Dayang-Enter... CGS-CIMB has initiated coverage of Dayang Ente...
    6 669957 article english Malaysia Politics & Government Kit Siang expresses gratitude to Agong for 'Ta... 1686021406000 1686021406000 Bernama Bernama 0 node/669957 https://assets.theedgemarkets.com/Lim-Kit-sian... Veteran politician Tan Sri Lim Kit Siang expre...
    7 669956 article english Corporate,Malaysia mynewstv 1Q results came broadly below expectations, sa... 1686020951000 1686020951000 Isabelle Francis theedgemalaysia.com 0 node/669956 https://assets.theedgemarkets.com/Bursa-Malays... KUALA LUMPUR (June 6): Analysts said the first...
    8 669954 article english Corporate,Management,Malaysia Top Stories ESG mynewstv 24 public-listed companies still have no women... 1686020019000 1686020019000 Tan Zhai Yun theedgemalaysia.com 0 node/669954 https://assets.theedgemarkets.com/Bursa-4_2023... KUALA LUMPUR (June 6): As at June 1, 2023, 24 ...
    9 669953 article english Corporate,Malaysia Hot Stock mynewstv Lam Jian Wyn Bumi Armada shares fall 20.47% on Kraken FPSO ... 1686019041000 1686019041000 Anis Hazim theedgemalaysia.com 0 node/669953 https://assets.theedgemarkets.com/Bumi-Armada-... KUALA LUMPUR (June 6): Shares of Bumi Armada B...
    10 669951 article english Corporate,Malaysia,World Global Markets Asian stocks wobble as traders weigh Fed rate ... 1686018738000 1686018738000 Ankur Banerjee Reuters 0 node/669951 https://assets.theedgemarkets.com/395135636-As... SINGAPORE (June 6): Asian stock markets edged ...
    11 669948 article english Malaysia,Court Politics & Government Lam Jian Wyn High Court dismisses Zuraidas leave applicati... 1686017713000 1686017713000 Tarani Palani theedgemalaysia.com 0 node/669948 https://assets.theedgemarkets.com/Zuraida-Kama... Ampang member of Parliament Datuk Zuraida Kama... KUALA LUMPUR (June 6): The High Court has dism...
    12 669947 article english Malaysia Top Stories Currency Ringgit lower against US dollar in early sessi... 1686017126000 1686017126000 Bernama Bernama 0 node/669947 https://assets.theedgemarkets.com/Ringgit-5_20... KUALA LUMPUR (June 6): The ringgit opened lowe...
    13 669945 article english Corporate,Malaysia Market Open Bursa Malaysia marginally higher in early sess... 1686016694000 1686016694000 Bernama Bernama 0 node/669945 https://assets.theedgemarkets.com/opening-mark... KUALA LUMPUR (June 6): Bursa Malaysia rebounde...

See the BeautifulSoup documentation and the pandas documentation for more details.
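If only the headlines are needed, the parsed JSON can also be walked directly without pandas. The following is a minimal sketch under stated assumptions: `headlines`, `BASE_URL`, and `sample` are hypothetical names, `sample` mimics the shape of the `data` dict from the answer using values from the question and the output above, and `created` is assumed to be a Unix timestamp in milliseconds.

```python
from datetime import datetime, timezone
from urllib.parse import urljoin

BASE_URL = "https://theedgemalaysia.com/"  # assumed base for the relative 'alias' paths


def headlines(data):
    """Return (title, created_utc, url) tuples from the parsed __NEXT_DATA__ dict."""
    rows = []
    for item in data["props"]["pageProps"]["corporateData"]:
        # Assumption: 'created' is a Unix timestamp in milliseconds.
        created = datetime.fromtimestamp(int(item["created"]) / 1000, tz=timezone.utc)
        rows.append((item["title"], created, urljoin(BASE_URL, item["alias"])))
    return rows


# Hypothetical sample shaped like the `data` dict loaded in the answer.
sample = {"props": {"pageProps": {"corporateData": [
    {
        "nid": 669947,
        "title": "Ringgit lower against US dollar in early session on June 6",
        "created": 1686017126000,
        "alias": "node/669947",
    },
]}}}

for title, created, url in headlines(sample):
    print(created.isoformat(), title, url)
```

For the sample record the timestamp converts to 02:05 UTC on 6 June 2023, which is 10:05 am in Malaysia (UTC+8) and agrees with the time shown in the question's HTML snippet. With the live page, `headlines(data)` would take the `data` dict loaded from the script tag.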

huangapple
  • This article was posted on June 6, 2023 at 11:54:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76411348.html