Beautiful Soup: extract data from a script tag

Question

I have this code and need to extract the latitude and longitude from the script tag:

```html
<script>
var loadPoints = '/Map/Points';
var mapDetails = {"point":{"latitude":-34.023418,"longitude":18.331407,"title":"Sandy Bay","location":null,"subject":"P","link":"/Explore/South-Africa/Western-Cape/Sandy-Bay"},"bounds":null,"moveMarkerCallback":null,"changeBoundsCallback":null};
requireExploreMap(loadPoints, mapDetails);
</script>
```

I can see all of the HTML content in soup, but when I try this:

    def get_textchunk(word1, word2, text):
        if not (word1 in text and word2 in text):
            return ''
        return text.split(word1)[-1].split(word2)[0]

    lat = get_textchunk('latitude":', ',"longitude', soup.get_text(' '))

it doesn't return anything.

How can I fix it?
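
A likely cause (an assumption, not confirmed in this thread) is that recent Beautiful Soup releases leave the contents of `<script>` tags out of `get_text()`, so the text handed to `get_textchunk` never contains `latitude`. The sketch below runs the same helper on the raw markup instead; in the real scraper that would be `response.text` or `str(soup)`. The `html` string here is a shortened, made-up stand-in for the page source.

```python
def get_textchunk(word1, word2, text):
    """Return the text between the last word1 and the following word2, or ''."""
    if not (word1 in text and word2 in text):
        return ''
    return text.split(word1)[-1].split(word2)[0]


# Shortened stand-in for the raw page source (assumption: the real page
# embeds the same mapDetails assignment inside a <script> tag).
html = """
<script>
var mapDetails = {"point":{"latitude":-34.023418,"longitude":18.331407,"title":"Sandy Bay"},"bounds":null};
</script>
"""

# Search the raw markup rather than soup.get_text(), so the script body is included.
lat = get_textchunk('latitude":', ',"longitude', html)
lon = get_textchunk('longitude":', ',"title', html)
print(lat, lon)
```

Note that both values come back as strings; wrap them in `float()` if numbers are needed.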

UPDATE

This is my code:

    import json
    import re
    import time

    import requests
    from bs4 import BeautifulSoup

    with open('urls.txt', 'r', encoding="utf-8") as inf:
        with open('data2.csv', 'w', encoding="utf-8") as outf:
            outf.write('Titre,add,art,club,tel,\n')
            for row in inf:
                url = row.strip()
                response = requests.get(url)
                if response.ok:
                    print("ok")
                    soup = BeautifulSoup(response.text, 'html.parser')
                    print(soup)
                    stag = soup.find("script")
                    obj = json.loads(re.search(r"mapDetails\s*= \s*({.*});", str(stag)).group(1))
                    lat, lon = obj["point"]["latitude"], obj["point"]["longitude"]
                # Pause between requests
                time.sleep(2)

The problem is that Beautiful Soup finds the first script tag, and the information I need is not in that tag.
Thanks a lot for your help.

The page I am trying to scrape:
https://worldbeachlist.com/Explore/Australia/Victoria/Bells-Beach

Answer 1

Score: 1

Try this:

    import json
    import re

    from bs4 import BeautifulSoup

    sample_script = """
    <script>
    var loadPoints = '/Map/Points';
    var mapDetails = {"point":{"latitude":-34.023418,"longitude":18.331407,"title":"Sandy Bay","location":null,"subject":"P","link":"/Explore/South-Africa/Western-Cape/Sandy-Bay"},"bounds":null,"moveMarkerCallback":null,"changeBoundsCallback":null};
    requireExploreMap(loadPoints, mapDetails);
    </script>
    """

    # .string returns the raw JavaScript inside the <script> tag
    script_text = BeautifulSoup(sample_script, 'html.parser').find('script').string
    # Capture everything between "mapDetails = " and the first semicolon
    data = json.loads(re.search(r"mapDetails = (.+?);", script_text).group(1))
    print(json.dumps(data, indent=4))

    # Access the keys
    print(data['point']['latitude'])
    print(data['point']['longitude'])

Output:

    {
        "point": {
            "latitude": -34.023418,
            "longitude": 18.331407,
            "title": "Sandy Bay",
            "location": null,
            "subject": "P",
            "link": "/Explore/South-Africa/Western-Cape/Sandy-Bay"
        },
        "bounds": null,
        "moveMarkerCallback": null,
        "changeBoundsCallback": null
    }
    -34.023418
    18.331407
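
The non-greedy pattern `(.+?);` stops at the first semicolon, so it would cut the JSON short if any string value contained a `;`. As a hedged alternative (not part of the original answer), the sketch below finds the assignment with a regex and lets `json.JSONDecoder.raw_decode` read exactly one JSON value from that position. The helper name `extract_js_object` and the sample script are illustrative assumptions, and the approach only works when the JavaScript literal is also valid JSON, as it is in the question.

```python
import json
import re


def extract_js_object(script_text, var_name):
    """Parse the JSON literal assigned to var_name, or return None if absent."""
    match = re.search(rf"\b{re.escape(var_name)}\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ';' and more JavaScript).
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


# Illustrative script body; the title deliberately contains a semicolon.
script_text = """
var loadPoints = '/Map/Points';
var mapDetails = {"point":{"latitude":-34.023418,"longitude":18.331407,"title":"Sandy; Bay"},"bounds":null};
requireExploreMap(loadPoints, mapDetails);
"""

data = extract_js_object(script_text, "mapDetails")
print(data["point"]["latitude"], data["point"]["longitude"])
```

Because the decoder decides where the object ends, the surrounding JavaScript does not need a predictable terminator.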

Answer 2

Score: 1

I would use json.loads with re.search:

    import json
    import re

    stag = soup.find("script")
    obj = json.loads(re.search(r"mapDetails\s*=\s*({.*});", str(stag)).group(1))
    lat, lon = obj["point"]["latitude"], obj["point"]["longitude"]

Output:

    print("Latitude:", lat)   # Latitude: -34.023418
    print("Longitude:", lon)  # Longitude: 18.331407
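
`soup.find("script")` returns only the first `<script>` tag, and the question's update says the data is not in that one. A minimal sketch of one way around this is to look through every script body and parse the first one that mentions `mapDetails`. To keep the example free of network and parser dependencies it takes plain strings; with Beautiful Soup they could be collected as `[tag.string or "" for tag in soup.find_all("script")]`. The function name `find_point` and the sample strings are assumptions for illustration.

```python
import json
import re

# Greedy match from the opening brace to the last "};" on the same line.
MAP_DETAILS = re.compile(r"mapDetails\s*=\s*(\{.*\});")


def find_point(script_texts):
    """Return (latitude, longitude) from the first script defining mapDetails."""
    for text in script_texts:
        match = MAP_DETAILS.search(text)
        if match is None:
            continue  # this script does not define mapDetails; try the next one
        point = json.loads(match.group(1))["point"]
        return point["latitude"], point["longitude"]
    return None  # no script contained a mapDetails assignment


# Made-up script bodies: the first is unrelated, the second holds the data.
scripts = [
    "window.dataLayer = window.dataLayer || [];",
    """var loadPoints = '/Map/Points';
var mapDetails = {"point":{"latitude":-34.023418,"longitude":18.331407},"bounds":null};
requireExploreMap(loadPoints, mapDetails);""",
]

print(find_point(scripts))
```

Returning `None` instead of calling `.group(1)` on a failed match also avoids the `AttributeError` that would otherwise stop the scraping loop on pages without map data.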

huangapple
  • Posted on April 11, 2023, 15:35:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75983472.html