Writing to SQL database with pandas

Question

Confused. Trying to build a scraper of UK news in Python.

    import feedparser
    import pandas as pd

    def poll_rss(rss_url):
        feed = feedparser.parse(rss_url)
        for entry in feed.entries:
            print("Title:", entry.title)
            print("Description:", entry.description)
            print("\n")

    def poll_rss(rss_url):
        feed = feedparser.parse(rss_url)
        for entry in feed.entries:
            print("Title:", entry.title)
            print("Description:", entry.description)
            print("\n")

    # Example usage:
    feeds = [{"type": "news", "title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
             {"type": "news", "title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},
             {"type": "news", "title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},
             {"type": "news", "title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
             {"type": "news", "title": "Metro UK", "url": "https://metro.co.uk/feed/"},
             {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
             {"type": "news", "title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
             {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
             {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
             {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
             {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
             {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
             {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
             {"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
             {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},
             {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}]

    for feed in feeds:
        parsed_feed = feedparser.parse(feed['url'])
        print("Title:", feed['title'])
        print("Number of Articles:", len(parsed_feed.entries))
        print("\n")
        data = []
        for entry in parsed_feed.entries:
            title = entry.title
            url = entry.link
            print(entry.summary)
            if entry.summary:
                summary = entry.summary
                data.append(summary)
            else:
                entry.summary = "No summary available"
            if entry.published:
                date = entry.published
                data.append(data)
            else:
                data.append("No data available")

I then have a bit of code to sort out the saving.

    df = pd.DataFrame(data)
    df.columns = ['title', 'url', 'summary', 'date']
    print("data" + df)

    from sqlalchemy import create_engine
    import mysql.connector

    engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
    df.to_sql('nationals', con=engine, if_exists='append', index=False)

Although the nationals table has been created and the credentials are right, why does it not save?

Answer 1

Score: 2

If the credentials are correct as you say, then the to_sql call is fine. I think the problem is the Python loop that parses the feed. In particular, the line data.append(data) creates a recursive list that cannot be constructed into a dataframe. Also, the data list should be a nested list where each sub-list is one entry from a parsed_feed (so that each row in the dataframe is one entry).
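To illustrate the shape pandas expects, here is a minimal sketch with made-up rows (only the list-of-row-lists structure matters; the values are placeholders):

    import pandas as pd

    # each sub-list is one row, in column order
    rows = [
        ["Title A", "https://example.com/a", "Summary A", "Sun, 19 Feb 2023 17:54:18 GMT"],
        ["Title B", "https://example.com/b", "Summary B", "Mon, 20 Feb 2023 00:47:00 GMT"],
    ]
    df = pd.DataFrame(rows, columns=["title", "url", "summary", "date"])
    print(df.shape)  # (2, 4) -- two entries, four columns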

I would write the loop as:

    data = []  # <---- initialize empty list here
    for feed in feeds:
        parsed_feed = feedparser.parse(feed['url'])
        print("Title:", feed['title'])
        print("Number of Articles:", len(parsed_feed.entries))
        print("\n")
        for entry in parsed_feed.entries:
            title = entry.title
            url = entry.link
            print(entry.summary)
            summary = entry.summary or "No summary available"  # I simplified the ternary operators here
            date = entry.published or "No data available"  # I simplified the ternary operators here
            data.append([title, url, summary, date])  # <---- append data from each entry here

    df = pd.DataFrame(data, columns=['title', 'url', 'summary', 'date'])

    from sqlalchemy import create_engine
    import mysql.connector

    engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
    df.to_sql('nationals', con=engine, if_exists='append', index=False)

I checked it with the feed list you provided and it works fine.
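If you want to confirm the rows actually landed in MySQL, a quick read-back sketch (reusing the engine from the snippet above) is:

    import pandas as pd

    # count the rows now stored in the nationals table
    check = pd.read_sql("SELECT COUNT(*) AS n FROM nationals", con=engine)
    print(check)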

Answer 2

Score: 1

Since RSS feeds are XML files, consider pandas.read_xml and bind the data via a list comprehension, which avoids the bookkeeping of initializing a list and appending elements.

Additionally, process each feed via a user-defined method, and since you are scraping web links that can change, incorporate try...except, which surfaces three problematic URLs in your post.

    import pandas as pd

    feeds = [
        {"type": "news", "title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
        {"type": "news", "title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},
        {"type": "news", "title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},
        {"type": "news", "title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
        {"type": "news", "title": "Metro UK", "url": "https://metro.co.uk/feed/"},
        {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss"},  # FIXED URL: REMOVE .xml
        {"type": "news", "title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},  # PROBLEM URL
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
        {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},  # PROBLEM URL
        {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
        {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
        {"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
        {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},  # PROBLEM URL
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/rss.xml"}
    ]

    hdr = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'none',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive'
    }

    def proc_rss(feed):
        rss_df = None
        print("Title:", feed['title'])
        try:
            # PARSE RSS XML W/ HEADERS, KEEP SPECIFIC COLUMNS, RENAME COLUMNS
            rss_df = (
                pd.read_xml(feed["url"], xpath=".//item", storage_options=hdr)
                  .reindex(["title", "link", "description", "pubDate"], axis="columns")
                  .set_axis(["title", "url", "summary", "date"], axis="columns")
            )
            print("Number of Articles:", rss_df.shape[0])
        except Exception as e:
            print("Number of Articles: NONE. Reason:", e)
        print("")
        return rss_df

    # LIST COMPREHENSION BOUND INTO A SINGLE DATA FRAME
    rss_df = pd.concat([proc_rss(f) for f in feeds], ignore_index=True)
    print(rss_df)

Output

    Title: BBC
    Number of Articles: 34

    Title: The Economist
    Number of Articles: 100

    Title: The New Statesman
    Number of Articles: 20

    Title: The New York Times
    Number of Articles: 27

    Title: Metro UK
    Number of Articles: 30

    Title: Evening Standard
    Number of Articles: 100

    Title: Daily Mail
    Number of Articles: 153

    Title: Sky News
    Number of Articles: NONE. Reason: HTTP Error 404: Not Found

    Title: The Mirror
    Number of Articles: 25

    Title: The Sun
    Number of Articles: 100

    Title: Sky News
    Number of Articles: NONE. Reason: HTTP Error 404: Not Found

    Title: The Guardian
    Number of Articles: 113

    Title: The Independent
    Number of Articles: 100

    Title: The Telegraph
    Number of Articles: 100

    Title: The Times
    Number of Articles: NONE. Reason: xmlParseEntityRef: no name, line 1, column 1556 (<string>, line 1)

    Title: The Mirror
    Number of Articles: 25

                                                     title                                                url                                            summary                           date
    0    Nicola Bulley: Lancashire Police find body in ...  https://www.bbc.co.uk/news/uk-england-64697300...  Officers searching for the missing mother-of t...  Sun, 19 Feb 2023 17:54:18 GMT
    1    Baftas 2023: All Quiet on the Western Front do...  https://www.bbc.co.uk/news/entertainment-arts-...  Netflix's World War One epic won best film and...  Sun, 19 Feb 2023 23:12:05 GMT
    2    Dickie Davies, host of ITV's World of Sport fo...  https://www.bbc.co.uk/news/uk-england-lancashi...  The presenter anchored the five-hour live TV m...  Mon, 20 Feb 2023 00:47:00 GMT
    3    Son Heung-min: Tottenham condemn 'utterly repr...  https://www.bbc.co.uk/sport/football/64700428?...  Tottenham call for social media companies to t...  Sun, 19 Feb 2023 22:25:04 GMT
    4    Argentina Open: British number one Cameron Nor...  https://www.bbc.co.uk/sport/tennis/64700048?at...  British number one Cameron Norrie misses out o...  Sun, 19 Feb 2023 21:45:24 GMT
    ..                                                 ...                                                ...                                                ...                            ...
    922  Nicola Bulley's family 'incredibly heartbroken...  https://www.mirror.co.uk/news/uk-news/breaking...  Lancashire Police has recovered a body around ...  Sun, 19 Feb 2023 19:51:09 +0000
    923  Shamed Matt Hancock gets 'worked like a barbec...  https://www.mirror.co.uk/tv/tv-news/shamed-mat...  SAS: Who Dares Wins star Rudy Reyes says shame...  Sun, 19 Feb 2023 19:35:03 +0000
    924  Treasure hunter uses map left by his father to...  https://www.mirror.co.uk/news/world-news/treas...  Jan Glazewski dug up the silver treasure burie...  Sun, 19 Feb 2023 19:19:15 +0000
    925  'My husband refuses to be in the delivery room...  https://www.mirror.co.uk/news/weird-news/my-hu...  A first-time mum-to-be says she's now feeling ...  Sun, 19 Feb 2023 19:17:34 +0000
    926  Nicola Bulley search diver sends message of su...  https://www.mirror.co.uk/news/uk-news/nicola-b...  The expert search diver called in to assist wi...  Sun, 19 Feb 2023 19:16:13 +0000

    [927 rows x 4 columns]
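This answer stops at building rss_df; to persist it, the question's to_sql step still applies. A sketch, assuming the same credentials and target table as in the question:

    from sqlalchemy import create_engine

    # same connection string as in the question
    engine = create_engine('mysql+pymysql://root:password_thingbob@localhost/somedatabase')
    rss_df.to_sql('nationals', con=engine, if_exists='append', index=False)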
