August 5, 2023

Converting a list into a formatted pandas DataFrame

Question

I am trying to scrape the top 100 cities in the US from the following website:

**https://www.nationalpopularvote.com/100-biggest-cities-have-59849899-people-and-rural-areas-have-59492267-people**

I have gotten the data from the website and converted it into a list:

# Note: this is not the full list, because the full list is too long
List = ['City   Population', '1   New York, NY   8,175,133   \xa0Biggest city is 2.6% of U.S. population', '2   Los Angeles, CA   3,792,621   \xa0Top 2 cities are 3.8% of U.S. population', '3   Chicago, IL   2,695,598   \xa0Top 3 cities are 4.7% of U.S. population']

I am trying to convert that into an organized dataframe.

I have already tried this:

df = pd.DataFrame(List)
print(df)

I hoped it would be that simple, but it returns the following:

    0                                    City   Population
    1    1   New York, NY   8,175,133    Biggest city i...
    2    2   Los Angeles, CA   3,792,621    Top 2 citie...
    3    3   Chicago, IL   2,695,598    Top 3 cities ar...
    4    4   Houston, TX   2,099,451    Top 4 cities ar...
    ..                                                 ...
    97   97   Birmingham, AL   212,237
    98   98   Rochester, NY   210,565
    99   99   San Bernadino, CA   209,924
    100  100   Spokane, WA   208,916   Top 100 cities a...
    101                                 Total   59,849,899
[102 rows x 1 columns]

The problem is that it's not actually organized: everything ends up in a single column, so I can't do print(df['City']). I want this:

    0   RANK CITY           POPULATION                    
    1    1   New York, NY   8,175,133    
    2    2   Los Angeles, CA   3,792,621    
    3    3   Chicago, IL   2,695,598    
    4    4   Houston, TX   2,099,451    
    ...............................                                    
    97   97   Birmingham, AL   212,237
    98   98   Rochester, NY   210,565
    99   99   San Bernadino, CA   209,924
    100  100   Spokane, WA   208,916  
[101 rows x 3 columns]

Can someone help me with this?

Answer 1

Score: 2

You can do this in a one-liner:

Use pd.read_html with the header and index_col parameters both set to 0. The result is a list of DataFrames, one per table found on the page. This page has only one table, so we select the first element ([0]).
Use df.reset_index(drop=True) to replace the index, since the original index values (the ranks) are turned into floats by the NaN value in the final row.

import pandas as pd

url = 'https://www.nationalpopularvote.com/100-biggest-cities-have-59849899-people-and-rural-areas-have-59492267-people'
# read_html returns a list of DataFrames; take the first (and only) table
df = pd.read_html(url, header=0, index_col=0)[0].reset_index(drop=True)
df.head()

               City  Population                                Unnamed: 3
0      New York, NY     8175133   Biggest city is 2.6% of U.S. population
1   Los Angeles, CA     3792621  Top 2 cities are 3.8% of U.S. population
2       Chicago, IL     2695598  Top 3 cities are 4.7% of U.S. population
3       Houston, TX     2099451  Top 4 cities are 5.4% of U.S. population
4  Philadelphia, PA     1526006  Top 5 cities are 5.9% of U.S. population

# use `df.rename` to change the name of the third (unnamed) column
df = df.rename(columns={'Unnamed: 3': 'Comment'})
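
The question asked for a RANK column and no Total row, which the one-liner above discards and keeps respectively. The following is a minimal sketch of one way to recover both. It does not fetch the page: `raw` is a hand-built stand-in for what `pd.read_html(url, header=0, index_col=0)[0]` is assumed to return (ranks as the index, a final "Total" row whose index is NaN), and the names `raw` and `ranked` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_html(url, header=0, index_col=0)[0].
# The exact shape of the real result is an assumption.
raw = pd.DataFrame(
    {
        "City": ["New York, NY", "Los Angeles, CA", "Chicago, IL", "Total"],
        "Population": [8175133, 3792621, 2695598, 59849899],
        "Unnamed: 3": [
            "Biggest city is 2.6% of U.S. population",
            "Top 2 cities are 3.8% of U.S. population",
            "Top 3 cities are 4.7% of U.S. population",
            None,
        ],
    },
    index=[1, 2, 3, float("nan")],  # the NaN forces the whole index to float
)

# Drop the row whose index is NaN (the Total row) and name the third column.
ranked = raw[raw.index.notna()].rename(columns={"Unnamed: 3": "Comment"})

# With the NaN gone, the ranks can be converted back to integers
# and moved from the index into a regular "Rank" column.
ranked.index = ranked.index.astype(int)
ranked = ranked.rename_axis("Rank").reset_index()

print(ranked[["Rank", "City", "Population"]])
```

After this, `ranked['City']` selects the city column directly, which is what the question was trying to do.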

Answer 2

Score: 0

This should get you what you are looking for:

import pandas as pd

List = ['City   Population', '1   New York, NY   8,175,133   \xa0Biggest city is 2.6% of U.S. population', '2   Los Angeles, CA   3,792,621   \xa0Top 2 cities are 3.8% of U.S. population', '3   Chicago, IL   2,695,598   \xa0Top 3 cities are 4.7% of U.S. population']
# create a list of dictionaries with the data, skipping the header item
data = []
for item in List[1:]:
    # fields are separated by three spaces; this unpacking requires exactly four fields per row
    rank, city, population, _ = item.split('   ')
    population = int(population.replace(',', ''))
    data.append({'Rank': int(rank), 'City': city, 'Population': population})
# create a pandas dataframe from the list of dictionaries
df = pd.DataFrame(data)
# print the dataframe
print(df)

Rank	City	Population
1	New York, NY	8175133
2	Los Angeles, CA	3792621
3	Chicago, IL	2695598
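
The loop above unpacks exactly four fields per row, so it would raise a ValueError on any row that lacks the trailing comment, and on the final Total row. The printed output in the question suggests the full list contains such rows (for example rank 97), although their exact spacing is not shown. A more tolerant variant might look like the sketch below; the three-field rows and the name `rows` are assumptions made for illustration.

```python
import pandas as pd

# Sample rows taken from the question; the comment-free row and the
# Total row are assumed to contain no trailing fields.
rows = [
    'City   Population',
    '1   New York, NY   8,175,133   \xa0Biggest city is 2.6% of U.S. population',
    '2   Los Angeles, CA   3,792,621   \xa0Top 2 cities are 3.8% of U.S. population',
    '97   Birmingham, AL   212,237',
    'Total   59,849,899',
]

records = []
for row in rows:
    # Fields are separated by three spaces; strip() also removes '\xa0'.
    parts = [part.strip() for part in row.split('   ')]
    # Skip the header and the Total row: neither starts with a numeric rank.
    if len(parts) < 3 or not parts[0].isdigit():
        continue
    rank, city, population = parts[:3]  # ignore any trailing comment
    records.append({
        'Rank': int(rank),
        'City': city,
        'Population': int(population.replace(',', '')),
    })

df = pd.DataFrame(records)
print(df['City'])
```

Checking that the first field is numeric, rather than slicing off the first and last items, keeps the loop independent of where the header and Total rows happen to sit in the list.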
