Python Scraping empty tag

Question

I'm having trouble scraping an element from this page:
https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i

Code:

    import requests
    from bs4 import BeautifulSoup

    URL = "https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i"
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')

    title = soup.find(class_="product_cart_title").text
    price = soup.find(class_="icon_main_block_price_a")
    number = soup.find(class_="product_cart_info").find_all('tr')[1].find_all('td')[1]
    description = soup.find(id="tab_a")
    print(description)

The problem is the element with id tab_a. Inside it, this div comes back empty:

    <div align="left" class="product_cart_info" id="charlong_id">
    </div>

How can I get its content?
I think JavaScript is involved. Maybe the content is filled in with some delay after the page loads?
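
One way to check the JavaScript suspicion is to look at what the parser actually received. The sketch below is only an illustration: it runs on a hard-coded snippet modelled on the empty div above (an assumption, the live page may differ), and `is_empty_tag` is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the empty container from the question
# (assumption: the real page source may look different).
SAMPLE_HTML = """
<div id="tab_a">
  <div align="left" class="product_cart_info" id="charlong_id">
  </div>
</div>
"""


def is_empty_tag(tag):
    """Return True if the tag has no child tags and only whitespace text."""
    return tag.find(True) is None and not tag.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find(id="charlong_id")
print(is_empty_tag(container))
```

If the same check returns True for the HTML fetched with requests while the browser shows content in that div, the content is most likely inserted by a script after the initial page load.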

Answer 1

Score: 2

As stated in the comments, the info is loaded via JavaScript, so BeautifulSoup doesn't see it. But if you look at the network tab in the Chrome/Firefox developer tools, you can see which requests the page makes:

    import re
    import requests
    from bs4 import BeautifulSoup

    url = 'https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i'
    ajax_url = 'https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}'

    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    print(soup.select_one('.product_cart_title').get_text(strip=True))
    print(soup.select_one('.icon_main_block_price_a').get_text(strip=True))
    # newer soupsieve releases spell this pseudo-class :-soup-contains
    print(soup.select_one('td:contains("Symbol") ~ td').get_text(strip=True))

    # the item id is passed to ajax_update_stat('...') in the page source
    item_id = re.findall(r"ajax_update_stat\('(\d+)'\)", soup.text)[0]

    # second request: fetch the description the page loads via JavaScript
    soup2 = BeautifulSoup(requests.get(ajax_url.format(item_id)).content, 'html.parser')

    print()

    # just print some info:
    for tr in soup2.select('tr'):
        print(re.sub(r' {2,}', ' ', tr.select_one('td').get_text(strip=True, separator=' ')))

Prints:

    MERCEDES W164 ML M-KLASA 05-07 BLACK LED SEQ
    1788.62 PLN
    LPMED0
    PL
    Opis
    Lampy soczewkowe ze światłem pozycyjnym LED. Z dynamicznym kierunkowskazem. 100% nowe, w komplecie (lewa i prawa). Homologacja: norma E13 - dopuszczone do ruchu.
    Szczegóły
    Światła pozycyjne: DIODY Kierunkowskaz: DIODY Światła mijania: H9 w zestawie Światła drogowe: H1 w zestawie Regulacja: elektryczna (silniczek znajduje się w komplecie).
    LED TUBE LIGHT Dynamic Turn Signal >>
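
The `re.findall(...)[0]` step above raises an IndexError if the page source ever stops containing the `ajax_update_stat('...')` call. A slightly more defensive variant might look like the sketch below. The helper name `build_ajax_url` and the sample strings are made up for illustration; only the URL template and the regular expression come from the answer.

```python
import re

# URL template and pattern taken from the answer above.
AJAX_URL = "https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}"
ITEM_ID_PATTERN = re.compile(r"ajax_update_stat\('(\d+)'\)")


def build_ajax_url(page_source):
    """Return the AJAX URL for the item id found in page_source, or None."""
    match = ITEM_ID_PATTERN.search(page_source)
    if match is None:
        return None  # the call is missing, so there is no id to request
    return AJAX_URL.format(match.group(1))


# Made-up stand-ins for real page source (the id 12345 is hypothetical).
sample_source = "<script>ajax_update_stat('12345');</script>"
print(build_ajax_url(sample_source))
print(build_ajax_url("<p>no script here</p>"))
```

Returning None lets the caller decide whether to skip the product or report an error, instead of failing in the middle of a scraping run.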

Answer 2

Score: 0

I made a small change to get the description. I don't know if it's right, so have a look at the following code:

    import re
    import requests
    from bs4 import BeautifulSoup

    url = 'https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i'
    ajax_url = 'https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}'

    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    def unwrapElements(soup, elementsToFind):
        # replace each matching descendant tag with its own contents
        elements = soup.find_all(elementsToFind)
        for element in elements:
            element.unwrap()

    print(soup.select_one('.product_cart_title').get_text(strip=True))
    print(soup.select_one('.icon_main_block_price_a').get_text(strip=True))
    print(soup.select_one('td:contains("Symbol") ~ td').get_text(strip=True))

    item_id = re.findall(r"ajax_update_stat\('(\d+)'\)", soup.text)[0]

    soup2 = BeautifulSoup(requests.get(ajax_url.format(item_id)).content, 'html.parser')

    # take the second cell of rows 2 and 4 and merge them into one element
    description = soup2.find_all('tr')[2].find_all('td')[1]
    description.append(soup2.find_all('tr')[4].find_all('td')[1])

    unwrapElements(description, "td")
    unwrapElements(description, "font")
    unwrapElements(description, "span")

    print(description)

I need just these elements of the description, in English. Will that be OK?
And anyway, thanks for the help!
Only one thing: I don't know why it didn't remove all the <td> tags.
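
A likely reason for the leftover `<td>`: `find_all` only searches the descendants of the tag it is called on, so the outer `description` cell itself is never unwrapped. The sketch below shows this on a made-up fragment (an assumption standing in for one cell of the AJAX response) and uses a renamed copy of the helper.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for one table cell of the AJAX response.
html = '<table><tr><td><font><span>Lens headlights</span></font></td></tr></table>'
cell = BeautifulSoup(html, "html.parser").find("td")


def unwrap_elements(root, name):
    """Unwrap every descendant of root with the given tag name."""
    for element in root.find_all(name):
        element.unwrap()


unwrap_elements(cell, "td")    # no effect: cell has no <td> descendants
unwrap_elements(cell, "font")
unwrap_elements(cell, "span")

with_wrapper = str(cell)               # the cell including its own <td> tag
inner_only = cell.decode_contents()    # only the markup inside the cell
text_only = cell.get_text(strip=True)  # plain text, no tags at all
print(with_wrapper, inner_only, text_only, sep="\n")
```

So printing `description.decode_contents()` or `description.get_text(strip=True)` instead of `description` would drop the remaining wrapper, depending on whether the inner markup is still needed.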

huangapple
  • Posted on January 3, 2020, 15:40:32
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59574852.html