Writing to SQL database with pandas

Question

Confused. Trying to build a scraper of UK news in Python.

import feedparser
import pandas as pd

def poll_rss(rss_url):
    feed = feedparser.parse(rss_url)
    for entry in feed.entries:
        print("Title:", entry.title)
        print("Description:", entry.description)
        print("\n")

# Example usage:
feeds = [{"type": "news","title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
        {"type": "news","title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},    
        {"type": "news","title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},    
        {"type": "news","title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
        {"type": "news","title": "Metro UK","url": "https://metro.co.uk/feed/"},
        {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
        {"type": "news","title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
        {"type": "news","title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
        {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
        {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
        {"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
        {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}]

for feed in feeds:
    parsed_feed = feedparser.parse(feed['url'])
    
    print("Title:", feed['title'])
    print("Number of Articles:", len(parsed_feed.entries))
    print("\n")
    data = []
    for entry in parsed_feed.entries:
        title = entry.title
        url = entry.link
        print(entry.summary)
        if entry.summary:
            summary = entry.summary
            data.append(summary)
        else:
            entry.summary = "No summary available"
        if entry.published:
            date = entry.published
            data.append (data)
        else:
            data.append("No data available")

I then have a bit of code to sort out the saving.

df = pd.DataFrame(data)
df.columns = ['title', 'url', 'summary', 'date']
print("data" + df)
from sqlalchemy import create_engine
import mysql.connector
engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')  
df.to_sql('nationals', con = engine, if_exists = 'append', index = False)

Although the nationals table has been created and the credentials are right, why does it not save?

Answer 1

Score: 2

If the credentials are correct as you say, then the to_sql call is fine. I think the problem is the Python loop that parses the feeds. In particular, the line data.append (data) creates a recursive list that cannot be constructed into a dataframe. Also, the data list should be a nested list where each sub-list is one entry from a parsed_feed (so that each row in the dataframe is one entry).
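
To see the failure mode in isolation, here is a minimal sketch (not from the original post) of what data.append(data) does:

data = []
data.append(data)   # the list now contains itself
print(data)         # prints [[...]], Python's marker for a self-referential list
# pd.DataFrame(data) then recurses into this structure instead of reading rows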

I would write the loop as

data = []                               # <---- initialize empty list here
for feed in feeds:
    parsed_feed = feedparser.parse(feed['url'])
    print("Title:", feed['title'])
    print("Number of Articles:", len(parsed_feed.entries))
    print("\n")
    for entry in parsed_feed.entries:
        title = entry.title
        url = entry.link
        print(entry.summary)
        summary = entry.summary or "No summary available" # I simplified the ternary operators here
        date = entry.published or "No data available"     # I simplified the ternary operators here
        data.append([title, url, summary, date])          # <---- append data from each entry here

df = pd.DataFrame(data, columns=['title', 'url', 'summary', 'date'])
from sqlalchemy import create_engine
import mysql.connector
engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
df.to_sql('nationals', con=engine, if_exists='append', index=False)

I checked it with the feed list you provided and it works fine.
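
If the table still looks empty after this, one quick check (a sketch, assuming the same engine and table name as above) is to read the table back:

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
# count the rows that actually landed in the table
print(pd.read_sql('SELECT COUNT(*) AS n FROM nationals', con=engine))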

Answer 2

Score: 1

Since RSS feeds are XML files, consider pandas.read_xml and bind the data via a list comprehension, which avoids the bookkeeping of initializing a list and appending elements.

Additionally, process each feed via a user-defined method, and since you are scraping web links that can change, incorporate try...except; this surfaces three problematic URLs in your post.

import pandas as pd

feeds = [
    {"type": "news", "title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
    {"type": "news", "title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},
    {"type": "news", "title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},
    {"type": "news", "title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
    {"type": "news", "title": "Metro UK", "url": "https://metro.co.uk/feed/"},
    {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss"},          # FIXED URL: REMOVE .xml
    {"type": "news", "title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
    {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},         # PROBLEM URL
    {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
    {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
    {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},         # PROBLEM URL
    {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
    {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
    {"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
    {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},        # PROBLEM URL
    {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}
]

hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

def proc_rss(feed):
    rss_df = None
    print("Title:", feed['title'])

    try:
        # PARSE RSS XML W/ HEADERS, KEEP SPECIFIC COLUMNS, RENAME COLUMNS
        rss_df = (
            pd.read_xml(feed["url"], xpath=".//item", storage_options=hdr)
              .reindex(["title", "link", "description", "pubDate"], axis="columns")
              .set_axis(["title", "url", "summary", "date"], axis="columns")
        )

        print("Number of Articles:", rss_df.shape[0])

    except Exception as e:
        print("Number of Articles: NONE. Reason:", e)

    print("")
    return rss_df

# LIST COMPREHENSION RESULTS BOUND INTO A SINGLE DATA FRAME
rss_df = pd.concat([proc_rss(f) for f in feeds], ignore_index=True)

print(rss_df)

Output

Title: BBC
Number of Articles: 34
Title: The Economist
Number of Articles: 100
Title: The New Statesman
Number of Articles: 20
Title: The New York Times
Number of Articles: 27
Title: Metro UK
Number of Articles: 30
Title: Evening Standard
Number of Articles: 100
Title: Daily Mail
Number of Articles: 153
Title: Sky News
Number of Articles: NONE. Reason: HTTP Error 404: Not Found
Title: The Mirror
Number of Articles: 25
Title: The Sun
Number of Articles: 100
Title: Sky News
Number of Articles: NONE. Reason: HTTP Error 404: Not Found
Title: The Guardian
Number of Articles: 113
Title: The Independent
Number of Articles: 100
Title: The Telegraph
Number of Articles: 100
Title: The Times
Number of Articles: NONE. Reason: xmlParseEntityRef: no name, line 1, column 1556 (<string>, line 1)
Title: The Mirror
Number of Articles: 25
                                                 title                                                url                                            summary                             date
0    Nicola Bulley: Lancashire Police find body in ...  https://www.bbc.co.uk/news/uk-england-64697300...  Officers searching for the missing mother-of t...    Sun, 19 Feb 2023 17:54:18 GMT
1    Baftas 2023: All Quiet on the Western Front do...  https://www.bbc.co.uk/news/entertainment-arts-...  Netflix's World War One epic won best film and...    Sun, 19 Feb 2023 23:12:05 GMT
2    Dickie Davies, host of ITV's World of Sport fo...  https://www.bbc.co.uk/news/uk-england-lancashi...  The presenter anchored the five-hour live TV m...    Mon, 20 Feb 2023 00:47:00 GMT
3    Son Heung-min: Tottenham condemn 'utterly repr...  https://www.bbc.co.uk/sport/football/64700428?...  Tottenham call for social media companies to t...    Sun, 19 Feb 2023 22:25:04 GMT
4    Argentina Open: British number one Cameron Nor...  https://www.bbc.co.uk/sport/tennis/64700048?at...  British number one Cameron Norrie misses out o...    Sun, 19 Feb 2023 21:45:24 GMT
..                                                 ...                                                ...                                                ...                              ...
922  Nicola Bulley's family 'incredibly heartbroken...  https://www.mirror.co.uk/news/uk-news/breaking...  Lancashire Police has recovered a body around ...  Sun, 19 Feb 2023 19:51:09 +0000
923  Shamed Matt Hancock gets 'worked like a barbec...  https://www.mirror.co.uk/tv/tv-news/shamed-mat...  SAS: Who Dares Wins star Rudy Reyes says shame...  Sun, 19 Feb 2023 19:35:03 +0000
924  Treasure hunter uses map left by his father to...  https://www.mirror.co.uk/news/world-news/treas...  Jan Glazewski dug up the silver treasure burie...  Sun, 19 Feb 2023 19:19:15 +0000
925  'My husband refuses to be in the delivery room...  https://www.mirror.co.uk/news/weird-news/my-hu...  A first-time mum-to-be says she's now feeling ...  Sun, 19 Feb 2023 19:17:34 +0000
926  Nicola Bulley search diver sends message of su...  https://www.mirror.co.uk/news/uk-news/nicola-b...  The expert search diver called in to assist wi...  Sun, 19 Feb 2023 19:16:13 +0000
[927 rows x 4 columns]
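
To complete the original goal, the combined frame can then be written to MySQL with the same to_sql call as in the question (a sketch assuming the question's credentials, database, and table):

from sqlalchemy import create_engine

# engine details taken from the original post; adjust to your setup
engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
rss_df.to_sql('nationals', con=engine, if_exists='append', index=False)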
