How to scrape multiple web pages and translate them from English to Hindi using Python?

Question

I am stuck on a small issue. The code below runs without errors, but I need to figure out how to translate multiple pages of the website from English to Hindi, so that every page ends up in Hindi. So far I have only translated one specific piece of text from the main page.

    # Script that scrapes the website using the requests and BeautifulSoup libraries
    import requests
    from bs4 import BeautifulSoup

    URL = "https://www.classcentral.com/?"
    # This User-Agent is for the Edge browser on Windows 10; replace it with your own browser's User-Agent string if needed.
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}
    r = requests.get(url=URL, headers=headers)
    print(r.content)

    # Parse the HTML
    soup = BeautifulSoup(r.content, 'html.parser')

    # Find all the anchor tags and print their "href" attributes
    for link in soup.find_all('a'):
        print(link.get('href'))

    # Script that translates text into Hindi by driving the Google Translate page with Selenium
    import time
    from urllib.parse import quote

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Language code to translate the text into
    lang_code = 'hi'
    # Text to translate
    input1 = " Find your next course.Class Central aggregates courses from many providers to help you find the best courses on almost any subject, wherever they exist"

    # Launch the browser with Selenium
    browser = webdriver.Chrome() #browser = webdriver.Chrome('path of chromedriver.exe file') if the chromedriver.exe is in different folder
    # Open Google Translate with the URL-encoded text in the query string
    browser.get("https://translate.google.co.in/?sl=auto&tl=" + lang_code + "&text=" + quote(input1) + "&op=translate")
    # Wait a few seconds for the translation to appear
    time.sleep(6)
    # The element with class name 'HwtZe' contains the translated output
    output1 = browser.find_element(By.CLASS_NAME, 'HwtZe').text
    # Display the output
    print("Translated Paragraph=> " + output1)
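Going from one page to many starts with the hrefs that the first script prints. Here is a minimal sketch, using only the standard library, of turning those raw href values into a de-duplicated list of absolute URLs on the same host. The function name `same_site_urls` and the example paths are hypothetical and are not taken from the site.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def same_site_urls(base_url, hrefs):
    """Turn raw href values into unique absolute URLs on the same host as base_url."""
    base_host = urlparse(base_url).netloc
    seen = set()
    urls = []
    for href in hrefs:
        if not href:  # link.get('href') returns None for <a> tags without an href
            continue
        # Resolve relative links and drop any "#fragment" part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        # Skip mailto:, javascript:, and links that point to other hosts
        if parsed.scheme not in ("http", "https") or parsed.netloc != base_host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Hand-written example hrefs (hypothetical); in the scraping script they would come from
# [link.get('href') for link in soup.find_all('a')]
hrefs = ["/subjects", "/subjects#top", None, "mailto:a@example.com",
         "https://www.classcentral.com/providers", "https://twitter.com/classcentral"]
page_urls = same_site_urls("https://www.classcentral.com/", hrefs)
print(page_urls)
```

The resulting list can then be passed, one URL at a time, to whatever fetches and translates a page.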

Answer 1

Score: 1

Google Translate has some limitations. As far as I understand, you can't translate an arbitrarily large number of characters in a single request, so I recommend splitting the text and translating it over multiple requests.

In the code below, I use the googletrans module: after fetching the text from each page, I translate it into Hindi. You can try this as an alternative; I hope it is helpful:

    import requests
    from bs4 import BeautifulSoup
    from googletrans import Translator

    translator = Translator()

    def scrape_web_page(url):
        # Fetch the page and return its text without the HTML tags
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        text = soup.get_text()
        return text

    def language_translator(urls):
        count = 1
        for url in urls:
            newtest = str(f"From page {count}")
            print(f'.........................from page {count} ...........................................')
            text = scrape_web_page(url)
            # Translate word by word so that each request stays small
            k = text.split()
            for i in k:
                translated_text = translator.translate(i, dest='hi')
                newtest = newtest + " " + str(translated_text.text)
            count = count + 1
            print(newtest)

    urls = [
        'https://demo1/page1', 'https://demo1/page2'
    ]
    language_translator(urls)
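Translating one word per request keeps every request small, but it sends many requests and gives the translator no sentence context. A possible middle ground is to group words into chunks under a size limit and send one request per chunk. This is a rough sketch: the helper name `split_into_chunks` and the 4000-character default are assumptions, not documented Google Translate limits.

```python
def split_into_chunks(text, max_chars=4000):
    """Group whitespace-separated words into chunks of at most max_chars characters."""
    chunks = []
    current = ""
    for word in text.split():
        # Hard-split a single word that is itself longer than the limit
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


# Small demonstration with an artificially low limit
chunks = split_into_chunks("one two three four five", max_chars=9)
print(chunks)

# With the translator from the answer above, each chunk would be one request, e.g.:
# hindi = " ".join(translator.translate(chunk, dest='hi').text for chunk in chunks)
```

Splitting on sentence boundaries instead of word boundaries would probably translate better, at the cost of a slightly more involved splitter.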

NB: There are some copyright issues involved in website scraping.
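The question asks for each page to end up in Hindi rather than only being printed, so one option is to write each page's translated text to its own UTF-8 file. The sketch below assumes the translated text is already available as a string. The helper names and the file-naming scheme are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def filename_for(url):
    """Build a filesystem-safe file name from a URL (hypothetical naming scheme)."""
    without_scheme = url.split("://", 1)[-1]
    stem = re.sub(r"[^A-Za-z0-9]+", "_", without_scheme).strip("_")
    return (stem or "index") + ".hi.txt"


def save_translation(out_dir, url, hindi_text):
    """Write the translated text for one page to out_dir and return the file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename_for(url)
    # Explicit UTF-8 so Devanagari text is written correctly on any platform
    path.write_text(hindi_text, encoding="utf-8")
    return path


# Demonstration with a temporary directory and a placeholder translation
out_dir = tempfile.mkdtemp()
saved = save_translation(out_dir, "https://demo1/page1", "नमस्ते दुनिया")
print(saved.name)
```

Passing `encoding="utf-8"` explicitly matters mostly on systems whose default encoding cannot represent Hindi characters.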

huangapple
  • Posted on February 27, 2023, 11:56:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75576659.html