Running a web scraper automatically twice a day - Python (Selenium)

Question

I've created a simple web scraper that gets some information from the CNN website and puts it into a database table.
It works properly in Python, and I'm using VS Code.
I am looking for a way to run this script automatically twice a day. Does anyone know how to do it? I tried AWS but could not get it working.

I want the code to run automatically online, even when my computer is off, and it has to update my CSV file.

Some important information:

  • Since it is a web scraper, it depends on some files in my project folder, such as chromedriver.exe and a CSV file to which each run appends a new row.
  • Here is my code:

imports:

import pandas as pd
from datetime import datetime
import requests
import json
from pandas_datareader import data as web
import yfinance as yf
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
from datetime import date, timedelta
from selenium.webdriver.chrome.options import Options
import pyodbc

Web scraping code:

# Load the existing dataset first so that `dataset` is defined before it is used
dataset = pd.read_csv(r"C:\Users\belig\OneDrive\Python\MeuProjeto\Projetos\WebScrapping_News\WebScrapping_News\dataset_news.csv", sep=";")
# Write a copy of the loaded data to dataset.csv
dataset.to_csv(r"C:\Users\belig\OneDrive\Python\MeuProjeto\Projetos\WebScrapping_News\WebScrapping_News\dataset.csv", index=False)
# Creating Variables
# %%
Date = 1
WeekDay = 2
Brazil_Ibovespa = 3
BRL_Dollar = 4
Titulo_CNNBrasil = 5
# Setup Date Var
Date = datetime.now().strftime("%d/%m/%Y, %H:%M:%S")
Date
# Setup WeekDay Var
date_now = datetime.now()
WeekDay = date_now.strftime("%A")
WeekDay
# Setup Brazil_Ibovespa Var
today = date.today()
start_day = today - timedelta(days = 7)
tickers_DowJones = "^BVSP"
datayf = yf.download(tickers_DowJones, start=start_day, end=today)
print(datayf)
datayf = datayf['Adj Close']
Brazil_Ibovespa = datayf.iloc[-1]  # last value by position; [-1] on a date index is ambiguous
Brazil_Ibovespa
# Setup BRL_Dollar Var
requisicao = requests.get('https://economia.awesomeapi.com.br/all/USD-BRL')
cotacao = requisicao.json()
BRL_Dollar = round(float(cotacao['USD']['bid']),2)
BRL_Dollar
# Start the WebDriver in headless mode (hides the browser window)
from selenium.webdriver.chrome.service import Service  # Selenium 4 style; assumes Selenium >= 4
driver_exe = r'C:\Users\belig\OneDrive\Python\MeuProjeto\Projetos\WebScrapping_News\WebScrapping_News\chromedriver.exe'
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(driver_exe), options=options)
# Setup Titulo_CNNBrasil Var
driver.get('https://www.cnnbrasil.com.br/')
Titulo_CNNBrasil = driver.find_element(By.XPATH, '//*[@id="block1847327"]/div/div/a/h2').text
print(Titulo_CNNBrasil)
# Setup Url_CNNBrasil Var
driver.find_element(By.XPATH, '//*[@id="block1847327"]/div/div/a/h2').click()
Url_CNNBrasil = driver.current_url
print(Url_CNNBrasil)
# Setup Topics_CNNBrasil Var
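try:
    Topics_CNNBrasil = driver.find_element(By.CLASS_NAME, 'tags__list').text
    Topics_CNNBrasil = Topics_CNNBrasil.replace('\n', ', ')
    print(Topics_CNNBrasil)
except Exception:
    # The article has no tag list (or the lookup failed); store a placeholder
    Topics_CNNBrasil = 'None'
    print(Topics_CNNBrasil)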

Add to SQL and DataFrame:

# Add Row to DataFrame
new_row = pd.DataFrame({"Date":[Date], "WeekDay":[WeekDay], "Brazil_Ibovespa":[Brazil_Ibovespa], "BRL_Dollar":[BRL_Dollar], "Titulo_CNNBrasil":[Titulo_CNNBrasil], "Url_CNNBrasil":[Url_CNNBrasil], "Topics_CNNBrasil":[Topics_CNNBrasil]}, index=[0])
print(new_row)
dataset = pd.concat([dataset, new_row], ignore_index=True)
# dataset = dataset.append({"Date":Date, "WeekDay": WeekDay}, ignore_index=True)
print(dataset)
dataset.to_csv(r'C:\Users\belig\OneDrive\Python\MeuProjeto\Projetos\WebScrapping_News\WebScrapping_News\dataset_news.csv', index=False, encoding="utf-8-sig", sep = ';')
# Add info to SQL Server
dados_conexao = (
    "Driver={SQL Server};"
    "Server=Beligolli;"
    "Database=WebScrappingNews;"
    'Trusted_Connection=yes;'
    # UID  = Login;
    # PWD=Senha;
)
conexao = pyodbc.connect(dados_conexao)
cursor = conexao.cursor()
comando = "INSERT INTO NewsDataBase (Date_Hour, WeekDay_, Brazil_Ibovespa, BRL_Dollar, Titulo_CNNBrasil, Url_CNNBrasil, Topics_CNNBrasil) VALUES (?, ?, ?, ?, ?, ?, ?)"
valores = (Date, WeekDay, Brazil_Ibovespa, BRL_Dollar, Titulo_CNNBrasil, Url_CNNBrasil, Topics_CNNBrasil)
cursor.execute(comando, valores)
cursor.commit()
cursor.close()
conexao.close()
print(f'Adicionado {Date} - {WeekDay} ao dataset')
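
One thing that may matter when moving the script to another machine: the absolute C:\Users\... paths above only exist on the original computer. The following is a minimal sketch, assuming the CSV and the driver sit in one project folder. The names project_paths and append_row are hypothetical and not part of the original code, and the sketch writes plain UTF-8 rather than the utf-8-sig encoding used above.

```python
import csv
from pathlib import Path


def project_paths(base_dir):
    """Return the files the scraper needs, resolved against base_dir."""
    base = Path(base_dir)
    return {
        "csv": base / "dataset_news.csv",
        "driver": base / "chromedriver.exe",
    }


def append_row(csv_path, row, fieldnames):
    """Append one row to a ';'-separated CSV, writing the header if the file is new."""
    csv_path = Path(csv_path)
    is_new = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, delimiter=";")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Inside the script, project_paths(Path(__file__).parent) would resolve both files relative to wherever the script is deployed, and append_row avoids rereading the whole CSV on every run.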

Answer 1

Score: 1

I would start with the schedule library to schedule the runs:

import schedule
import time

def job():
    print("I'm working...")

# Examples of different intervals; keep only the ones you need
schedule.every(10).minutes.do(job)
schedule.every().hour.do(job)
schedule.every().day.at("10:30").do(job)

# Check once per second whether a scheduled job is due
while True:
    schedule.run_pending()
    time.sleep(1)

Then save the script and, if you want, run the app as a Windows service.
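
For the twice-a-day case in the question, two schedule.every().day.at(...) lines with different times would be enough. The same idea can be sketched with only the standard library, which makes the "next run" logic explicit. This is an illustration under assumptions: the run times 09:00 and 18:00 and the names RUN_TIMES, next_run and run_forever are hypothetical, and like any in-process loop it only works while the machine running it stays on.

```python
import time
from datetime import datetime, timedelta, time as dtime

RUN_TIMES = (dtime(9, 0), dtime(18, 0))  # assumed run times; adjust as needed


def next_run(now, run_times=RUN_TIMES):
    """Return the first scheduled datetime strictly after `now`."""
    candidates = []
    for offset in (0, 1):  # consider today's and tomorrow's slots
        day = now.date() + timedelta(days=offset)
        candidates.extend(datetime.combine(day, t) for t in run_times)
    return min(c for c in candidates if c > now)


def run_forever(job, run_times=RUN_TIMES):
    """Sleep until each scheduled time, then call job(). Never returns."""
    while True:
        target = next_run(datetime.now(), run_times)
        time.sleep(max(0.0, (target - datetime.now()).total_seconds()))
        job()
```

Calling run_forever(job) with the scraper wrapped in a job() function would then trigger it at each of the two times.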

huangapple
  • Posted on February 8, 2023, 19:15:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75384974.html