Beautiful Soup: extract data from a script tag

Question

I have this code and need to extract the latitude and longitude from the script tag:

```JavaScript
<script>
	var loadPoints = '/Map/Points';
	var mapDetails = {"point":{"latitude":-34.023418,"longitude":18.331407,"title":"Sandy Bay","location":null,"subject":"P","link":"/Explore/South-Africa/Western-Cape/Sandy-Bay"},"bounds":null,"moveMarkerCallback":null,"changeBoundsCallback":null};
	requireExploreMap(loadPoints, mapDetails);
</script>
```

I can see all of the HTML content in soup, but when I try it this way:

def get_textchunk(word1, word2, text):
    if not (word1 in text and word2 in text):
        return ''
    return text.split(word1)[-1].split(word2)[0]

lat = get_textchunk('latitude":', ',"longitude', soup.get_text(' '))

it doesn't return anything.

How can I fix it?

UPDATE

This is my code:

import json
import re
import time

import requests
from bs4 import BeautifulSoup

with open('urls.txt', 'r', encoding="utf-8") as inf:
    with open('data2.csv', 'w', encoding="utf-8") as outf:
        outf.write('Titre,add,art,club,tel,\n')

    for row in inf:
        url = row.strip()
        response = requests.get(url)

        if response.ok:
            print("ok")

            soup = BeautifulSoup(response.text, 'html.parser')
            print(soup)
            stag = soup.find("script")

            obj = json.loads(re.search(r"mapDetails\s*= \s*({.*});", str(stag)).group(1))

            lat, lon = obj["point"]["latitude"], obj["point"]["longitude"]

            # Pause between requests
            time.sleep(2)

The problem is that BeautifulSoup finds the first script tag, and the information I need is not in that tag.
Thanks a lot for your help.

The page I am trying to scrape:
https://worldbeachlist.com/Explore/Australia/Victoria/Bells-Beach

Answer 1

Score: 1

Try this:

import json
import re

from bs4 import BeautifulSoup

sample_script = """
<script>
    var loadPoints = '/Map/Points';
    var mapDetails = {"point":{"latitude":-34.023418,"longitude":18.331407,"title":"Sandy Bay","location":null,"subject":"P","link":"/Explore/South-Africa/Western-Cape/Sandy-Bay"},"bounds":null,"moveMarkerCallback":null,"changeBoundsCallback":null};
    requireExploreMap(loadPoints, mapDetails);
</script>
"""

# .string returns the inline JavaScript of the <script> tag as text
script_text = BeautifulSoup(sample_script, 'html.parser').find('script').string
data = json.loads(re.search(r"mapDetails = (.+?);", script_text).group(1))
print(json.dumps(data, indent=4))

# Access the keys
print(data['point']['latitude'])
print(data['point']['longitude'])

Output:

{
    "point": {
        "latitude": -34.023418,
        "longitude": 18.331407,
        "title": "Sandy Bay",
        "location": null,
        "subject": "P",
        "link": "/Explore/South-Africa/Western-Cape/Sandy-Bay"
    },
    "bounds": null,
    "moveMarkerCallback": null,
    "changeBoundsCallback": null
}
-34.023418
18.331407
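
The non-greedy pattern above stops at the first `;` it meets, so it would cut the object short if any string value happened to contain a semicolon. As a hedged alternative, here is a minimal sketch that uses the regex only to locate the assignment and lets `json.JSONDecoder().raw_decode` read one complete JSON value from that position. The `script_text` sample and the helper name `extract_map_details` are assumptions made for illustration; the title was changed to contain a `;` on purpose.

```python
import json
import re

# Hypothetical script text; the title contains a ';' on purpose.
script_text = """
    var loadPoints = '/Map/Points';
    var mapDetails = {"point":{"latitude":-34.023418,"longitude":18.331407,"title":"Sandy; Bay"},"bounds":null};
    requireExploreMap(loadPoints, mapDetails);
"""


def extract_map_details(text):
    """Return the object assigned to mapDetails, or None if it is absent."""
    match = re.search(r"mapDetails\s*=\s*", text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ';' and the next line).
    obj, _end = json.JSONDecoder().raw_decode(text, match.end())
    return obj


details = extract_map_details(script_text)
print(details["point"]["latitude"], details["point"]["longitude"])
```

This only works while the page assigns a plain JSON literal to `mapDetails`; if the site switched to a JavaScript expression that is not valid JSON, `raw_decode` would raise `json.JSONDecodeError`.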

Answer 2

Score: 1

I would combine json.loads with re.search:

import json
import re

# soup is the BeautifulSoup object built from the page
stag = soup.find("script")

obj = json.loads(re.search(r"mapDetails\s*=\s*({.*});", str(stag)).group(1))

lat, lon = obj["point"]["latitude"], obj["point"]["longitude"]

Output:

print("Latitude:", lat)  # Latitude: -34.023418
print("Longitude:", lon)  # Longitude: 18.331407
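
The update in the question points out that `soup.find("script")` returns only the first script tag, which on the real page is not the one holding `mapDetails`. One possible way around that, sketched below under assumptions, is to scan every tag returned by `soup.find_all("script")` and parse the first one whose text matches. The `html` sample and the helper name `find_map_details` are made up for illustration and do not reproduce the actual page. As far as I know, Beautiful Soup 4.9 and later also leave script contents out of `get_text()`, which would likely explain why the original `soup.get_text(' ')` attempt came back empty.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical page with several script tags; only the last one holds mapDetails.
html = """
<html><head><script>var analytics = true;</script></head>
<body>
<script src="/js/app.js"></script>
<script>
    var loadPoints = '/Map/Points';
    var mapDetails = {"point":{"latitude":-34.023418,"longitude":18.331407,"title":"Sandy Bay"},"bounds":null};
    requireExploreMap(loadPoints, mapDetails);
</script>
</body></html>
"""


def find_map_details(soup):
    """Scan every script tag and parse the first mapDetails object found."""
    for tag in soup.find_all("script"):
        text = tag.string or ""  # external scripts have no inline text
        match = re.search(r"mapDetails\s*=\s*({.*});", text)
        if match:
            return json.loads(match.group(1))
    return None


soup = BeautifulSoup(html, "html.parser")
obj = find_map_details(soup)
lat, lon = obj["point"]["latitude"], obj["point"]["longitude"]
print("Latitude:", lat)
print("Longitude:", lon)
```

In the scraping loop from the question, the same helper could replace the `soup.find("script")` line; checking for a `None` result before indexing would avoid a crash on pages without a map.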

huangapple
  • Posted on April 11, 2023 at 15:35:31
  • Please keep the link to this article when reposting: https://go.coder-hub.com/75983472.html