WebScrapping running automatically 2 times a day – Python (selenium)

Question

I've created a simple web scraper that gets some information from the CNN website and puts it into a database table.
It works properly in Python, and I'm using VS Code.
I am looking for a way to run this script automatically twice a day. Does anyone know how to do it? I tried AWS, but I was not able to get it working.

I want the code to run automatically online, with my computer off, and it has to update my CSV file.
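
Running with the local machine off means the script has to live on an always-on host, e.g. a small cloud VM. On a Linux host, a single crontab entry covers the twice-a-day cadence; the times and paths below are placeholders, not part of the original setup:

```
# min hour dom mon dow  command -- run every day at 09:00 and 21:00
0 9,21 * * * /usr/bin/python3 /home/user/WebScrapping_News/scraper.py >> /home/user/scraper.log 2>&1
```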

Some important information:

  • Since it is a web scraper, there are some files the script needs in its folder, such as chromedriver.exe and a CSV file to which each run appends a new row.
  • Here is my code:

imports:

  import pandas as pd
  from datetime import datetime, date, timedelta
  import requests
  import yfinance as yf
  from selenium import webdriver
  from selenium.webdriver.common.by import By
  from selenium.webdriver.chrome.options import Options
  import pyodbc

WebScrapping code:

  dataset = pd.read_csv(r"C:\Users\belig\OneDrive\Python\MeuProjeto\Projetos\WebScrapping_News\WebScrapping_News\dataset_news.csv", sep=";")
  # Setup Date Var
  Date = datetime.now().strftime("%d/%m/%Y, %H:%M:%S")
  # Setup WeekDay Var
  WeekDay = datetime.now().strftime("%A")
  # Setup Brazil_Ibovespa Var (last available close from the past week)
  today = date.today()
  start_day = today - timedelta(days=7)
  ticker_ibovespa = "^BVSP"
  datayf = yf.download(ticker_ibovespa, start=start_day, end=today)
  print(datayf)
  Brazil_Ibovespa = datayf['Adj Close'].iloc[-1]
  # Setup BRL_Dollar Var
  requisicao = requests.get('https://economia.awesomeapi.com.br/all/USD-BRL')
  cotacao = requisicao.json()
  BRL_Dollar = round(float(cotacao['USD']['bid']), 2)
  # Starting Driver WebScrapping (option to hide window)
  from selenium.webdriver.chrome.service import Service  # Selenium 4+ takes the driver path via a Service object
  driver_exe = r'C:\Users\belig\OneDrive\Python\MeuProjeto\Projetos\WebScrapping_News\WebScrapping_News\chromedriver.exe'
  options = Options()
  options.add_argument("--headless")
  driver = webdriver.Chrome(service=Service(driver_exe), options=options)
  # Setup Titulo_CNNBrasil Var
  driver.get('https://www.cnnbrasil.com.br/')
  headline = driver.find_element(By.XPATH, '//*[@id="block1847327"]/div/div/a/h2')
  Titulo_CNNBrasil = headline.text
  print(Titulo_CNNBrasil)
  # Setup Url_CNNBrasil Var
  headline.click()
  Url_CNNBrasil = driver.current_url
  print(Url_CNNBrasil)
  # Setup Topics_CNNBrasil Var (the tag list may be absent on some articles)
  from selenium.common.exceptions import NoSuchElementException
  try:
      Topics_CNNBrasil = driver.find_element(By.CLASS_NAME, 'tags__list').text.replace('\n', ', ')
  except NoSuchElementException:
      Topics_CNNBrasil = 'None'
  print(Topics_CNNBrasil)
  driver.quit()

Add to SQL and DataFrame:

  # Add Row to DataFrame
  new_row = pd.DataFrame({"Date": [Date], "WeekDay": [WeekDay], "Brazil_Ibovespa": [Brazil_Ibovespa], "BRL_Dollar": [BRL_Dollar], "Titulo_CNNBrasil": [Titulo_CNNBrasil], "Url_CNNBrasil": [Url_CNNBrasil], "Topics_CNNBrasil": [Topics_CNNBrasil]}, index=[0])
  print(new_row)
  dataset = pd.concat([dataset, new_row], ignore_index=True)
  print(dataset)
  dataset.to_csv(r'C:\Users\belig\OneDrive\Python\MeuProjeto\Projetos\WebScrapping_News\WebScrapping_News\dataset_news.csv', index=False, encoding="utf-8-sig", sep=';')
  # Add info to SQL Server
  dados_conexao = (
      "Driver={SQL Server};"
      "Server=Beligolli;"
      "Database=WebScrappingNews;"
      "Trusted_Connection=yes;"
      # "UID=login;"
      # "PWD=senha;"
  )
  conexao = pyodbc.connect(dados_conexao)
  cursor = conexao.cursor()
  comando = "INSERT INTO NewsDataBase (Date_Hour, WeekDay_, Brazil_Ibovespa, BRL_Dollar, Titulo_CNNBrasil, Url_CNNBrasil, Topics_CNNBrasil) VALUES (?, ?, ?, ?, ?, ?, ?)"
  valores = (Date, WeekDay, Brazil_Ibovespa, BRL_Dollar, Titulo_CNNBrasil, Url_CNNBrasil, Topics_CNNBrasil)
  cursor.execute(comando, valores)
  conexao.commit()
  cursor.close()
  conexao.close()
  print(f'Adicionado {Date} - {WeekDay} ao dataset')
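
Whatever scheduler ends up driving this, it helps to wrap the whole flow above in one function the scheduler can call. A minimal sketch of that structure (the `run_scraper` name and the stubbed values are illustrative; the real body would be the yfinance/requests/Selenium code above, and the stdlib `csv` module stands in for pandas so the sketch is self-contained):

```python
import csv
from datetime import datetime
from pathlib import Path

def run_scraper(csv_path="dataset_news.csv"):
    """One full scrape-and-append cycle; returns the row that was written."""
    # The real implementation would gather these values with yfinance,
    # requests and Selenium exactly as in the question; stub values keep
    # this sketch runnable anywhere.
    now = datetime.now()
    row = {
        "Date": now.strftime("%d/%m/%Y, %H:%M:%S"),
        "WeekDay": now.strftime("%A"),
        "Titulo_CNNBrasil": "stub headline",
    }
    path = Path(csv_path)
    write_header = not path.exists()
    # Open in append mode so each scheduled run adds one row instead of
    # rewriting the file.
    with path.open("a", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=list(row), delimiter=";")
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```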

Answer 1

Score: 1

First, I would start with the schedule library to schedule the events:

  import schedule
  import time

  def job():
      print("I'm working...")

  schedule.every(10).minutes.do(job)
  schedule.every().hour.do(job)
  schedule.every().day.at("10:30").do(job)

  while True:
      schedule.run_pending()
      time.sleep(1)

Then save the script; if you want it to keep running in the background, you can also run the app as a Windows service.
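
For the twice-a-day cadence specifically, you can register `schedule.every().day.at(...).do(job)` twice (e.g. at "09:00" and "21:00"). If you would rather avoid the extra dependency, the same wait can be computed with the standard library alone; the run times below are an assumption, not something from the question:

```python
from datetime import datetime, timedelta

RUN_TIMES = ("09:00", "21:00")  # assumed twice-a-day schedule

def seconds_until_next_run(now=None):
    """How long to sleep until the next scheduled run time."""
    now = now or datetime.now()
    candidates = []
    for t in RUN_TIMES:
        h, m = (int(x) for x in t.split(":"))
        run_at = now.replace(hour=h, minute=m, second=0, microsecond=0)
        if run_at <= now:          # already passed today -> tomorrow
            run_at += timedelta(days=1)
        candidates.append(run_at)
    return (min(candidates) - now).total_seconds()
```

A driver loop would then sleep for that many seconds, call the scraper, and repeat.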

huangapple
  • Posted on 2023-02-08 19:15:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75384974.html