Writing to sql database with pandas

Question
Confused. Trying to build a scraper of UK news in Python.
import feedparser
import pandas as pd

def poll_rss(rss_url):
    feed = feedparser.parse(rss_url)
    for entry in feed.entries:
        print("Title:", entry.title)
        print("Description:", entry.description)
        print("\n")

def poll_rss(rss_url):
    feed = feedparser.parse(rss_url)
    for entry in feed.entries:
        print("Title:", entry.title)
        print("Description:", entry.description)
        print("\n")
# Example usage:
feeds = [{"type": "news","title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
{"type": "news","title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},
{"type": "news","title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},
{"type": "news","title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
{"type": "news","title": "Metro UK","url": "https://metro.co.uk/feed/"},
{"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
{"type": "news","title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
{"type": "news","title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
{"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
{"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
{"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
{"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
{"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
{"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
{"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},
{"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}]
for feed in feeds:
    parsed_feed = feedparser.parse(feed['url'])
    print("Title:", feed['title'])
    print("Number of Articles:", len(parsed_feed.entries))
    print("\n")
    data = []
    for entry in parsed_feed.entries:
        title = entry.title
        url = entry.link
        print(entry.summary)
        if entry.summary:
            summary = entry.summary
            data.append(summary)
        else:
            entry.summary = "No summary available"
        if entry.published:
            date = entry.published
            data.append (data)
        else:
            data.append("No data available")
I then have a bit of code to sort out the saving.
df = pd.DataFrame(data)
df.columns = ['title', 'url', 'summary', 'date']
print("data" + df)
from sqlalchemy import create_engine
import mysql.connector
engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
df.to_sql('nationals', con = engine, if_exists = 'append', index = False)
Although the nationals table has been created and the credentials are right, why does it not save?
Answer 1
Score: 2
If the credentials are correct as you say, then the to_sql call is fine. I think the problem is the Python loop that parses the feeds. In particular, the line data.append (data) creates a recursive list that cannot be constructed into a dataframe. Also, the data list should be a nested list in which each sub-list is one entry of a parsed_feed (so that each row in the dataframe is one entry).
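To see the difference, here is a minimal sketch with made-up values (not your feeds) showing that pandas builds one row per sub-list, whereas appending a list to itself only nests a self-reference inside it:

import pandas as pd

# A nested list: each sub-list becomes one row with four columns
rows = [["title 1", "http://example.com/1", "summary 1", "Mon, 20 Feb 2023"],
        ["title 2", "http://example.com/2", "summary 2", "Tue, 21 Feb 2023"]]
print(pd.DataFrame(rows, columns=["title", "url", "summary", "date"]))

# Appending the list to itself creates a recursive structure instead of a new row
data = []
data.append("some summary")
data.append(data)        # data now contains itself
print(data[1] is data)   # True - this is what data.append (data) does in your loop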
I would write the loop as
data = [] # <---- initialize empty list here
for feed in feeds:
    parsed_feed = feedparser.parse(feed['url'])
    print("Title:", feed['title'])
    print("Number of Articles:", len(parsed_feed.entries))
    print("\n")
    for entry in parsed_feed.entries:
        title = entry.title
        url = entry.link
        print(entry.summary)
        summary = entry.summary or "No summary available" # I simplified the ternary operators here
        date = entry.published or "No data available" # I simplified the ternary operators here
        data.append([title, url, summary, date]) # <---- append data from each entry here
df = pd.DataFrame(data, columns = ['title', 'url', 'summary', 'date'])
from sqlalchemy import create_engine
import mysql.connector
engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
df.to_sql('nationals', con = engine, if_exists = 'append', index = False)
I checked it with the feed list you provided and it works fine.
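If you want to confirm that the rows actually reached MySQL, you can read the table back with pandas (a quick check assuming the same engine and table name as above):

# Count the rows now stored in the nationals table
check = pd.read_sql("SELECT COUNT(*) AS n FROM nationals", con=engine)
print(check)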
Answer 2
Score: 1
Since RSS feeds are XML files, consider pandas.read_xml and bind the data via a list comprehension, which avoids the bookkeeping of initializing a list and appending elements. Additionally, process each feed via a user-defined method, and since you are scraping web links that can change, incorporate try...except handling; it flags three problematic URLs in your post.
import pandas as pd
feeds = [
{"type": "news", "title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
{"type": "news", "title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},
{"type": "news", "title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},
{"type": "news", "title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
{"type": "news", "title": "Metro UK", "url": "https://metro.co.uk/feed/"},
{"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss"}, # FIXED URL: REMOVE .xml
{"type": "news", "title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
{"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"}, # PROBLEM URL
{"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
{"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
{"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"}, # PROBLEM URL
{"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
{"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
{"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
{"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"}, # PROBLEM URL
{"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}
]
hdr = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'
}
def proc_rss(feed):
    rss_df = None
    print("Title:", feed['title'])
    try:
        # PARSE RSS XML W/ HEADERS, KEEP SPECIFIC COLUMNS, RENAME COLUMNS
        rss_df = (
            pd.read_xml(feed["url"], xpath=".//item", storage_options=hdr)
            .reindex(["title", "link", "description", "pubDate"], axis="columns")
            .set_axis(["title", "url", "summary", "date"], axis="columns")
        )
        print("Number of Articles:", rss_df.shape[0])
    except Exception as e:
        print("Number of Articles: NONE. Reason:", e)
    print("")
    return rss_df
# BIND LIST COMPREHENSION RESULTS INTO A SINGLE DATA FRAME
rss_df = pd.concat([proc_rss(f) for f in feeds], ignore_index=True)
print(rss_df)
Output
Title: BBC
Number of Articles: 34
Title: The Economist
Number of Articles: 100
Title: The New Statesman
Number of Articles: 20
Title: The New York Times
Number of Articles: 27
Title: Metro UK
Number of Articles: 30
Title: Evening Standard
Number of Articles: 100
Title: Daily Mail
Number of Articles: 153
Title: Sky News
Number of Articles: NONE. Reason: HTTP Error 404: Not Found
Title: The Mirror
Number of Articles: 25
Title: The Sun
Number of Articles: 100
Title: Sky News
Number of Articles: NONE. Reason: HTTP Error 404: Not Found
Title: The Guardian
Number of Articles: 113
Title: The Independent
Number of Articles: 100
Title: The Telegraph
Number of Articles: 100
Title: The Times
Number of Articles: NONE. Reason: xmlParseEntityRef: no name, line 1, column 1556 (<string>, line 1)
Title: The Mirror
Number of Articles: 25
title url summary date
0 Nicola Bulley: Lancashire Police find body in ... https://www.bbc.co.uk/news/uk-england-64697300... Officers searching for the missing mother-of t... Sun, 19 Feb 2023 17:54:18 GMT
1 Baftas 2023: All Quiet on the Western Front do... https://www.bbc.co.uk/news/entertainment-arts-... Netflix's World War One epic won best film and... Sun, 19 Feb 2023 23:12:05 GMT
2 Dickie Davies, host of ITV's World of Sport fo... https://www.bbc.co.uk/news/uk-england-lancashi... The presenter anchored the five-hour live TV m... Mon, 20 Feb 2023 00:47:00 GMT
3 Son Heung-min: Tottenham condemn 'utterly repr... https://www.bbc.co.uk/sport/football/64700428?... Tottenham call for social media companies to t... Sun, 19 Feb 2023 22:25:04 GMT
4 Argentina Open: British number one Cameron Nor... https://www.bbc.co.uk/sport/tennis/64700048?at... British number one Cameron Norrie misses out o... Sun, 19 Feb 2023 21:45:24 GMT
.. ... ... ... ...
922 Nicola Bulley's family 'incredibly heartbroken... https://www.mirror.co.uk/news/uk-news/breaking... Lancashire Police has recovered a body around ... Sun, 19 Feb 2023 19:51:09 +0000
923 Shamed Matt Hancock gets 'worked like a barbec... https://www.mirror.co.uk/tv/tv-news/shamed-mat... SAS: Who Dares Wins star Rudy Reyessays shamed... Sun, 19 Feb 2023 19:35:03 +0000
924 Treasure hunter uses map left by his father to... https://www.mirror.co.uk/news/world-news/treas... Jan Glazewski dug up the silver treasure burie... Sun, 19 Feb 2023 19:19:15 +0000
925 'My husband refuses to be in the delivery room... https://www.mirror.co.uk/news/weird-news/my-hu... A first-time mum-to-be says she's now feeling ... Sun, 19 Feb 2023 19:17:34 +0000
926 Nicola Bulley search diver sends message of su... https://www.mirror.co.uk/news/uk-news/nicola-b... The expert search diver called in to assist wi... Sun, 19 Feb 2023 19:16:13 +0000
[927 rows x 4 columns]
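To finish the original task, the combined rss_df can then be written to MySQL the same way as in your question (a sketch assuming the same credentials, database, and target table):

from sqlalchemy import create_engine

# Same connection string and table as in the question
engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
rss_df.to_sql('nationals', con=engine, if_exists='append', index=False)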