Converting a List into a Formatted pandas DataFrame

Question

I am trying to scrape the top 100 cities in the US from the following website:

https://www.nationalpopularvote.com/100-biggest-cities-have-59849899-people-and-rural-areas-have-59492267-people

I have gotten the data from the website and converted it into a list:

# Note: this is not the full list, because the full list is too long
List = ['City   Population', '1   New York, NY   8,175,133   \xa0Biggest city is 2.6% of U.S. population', '2   Los Angeles, CA   3,792,621   \xa0Top 2 cities are 3.8% of U.S. population', '3   Chicago, IL   2,695,598   \xa0Top 3 cities are 4.7% of U.S. population']

I am trying to convert that into an organized dataframe.

I have already tried this:

df = pd.DataFrame(List)

print(df)

I hoped it would be that simple, but it returns the following:

    0                                    City   Population
    1    1   New York, NY   8,175,133    Biggest city i...
    2    2   Los Angeles, CA   3,792,621    Top 2 citie...
    3    3   Chicago, IL   2,695,598    Top 3 cities ar...
    4    4   Houston, TX   2,099,451    Top 4 cities ar...
    ..                                                 ...
    97   97   Birmingham, AL   212,237
    98   98   Rochester, NY   210,565
    99   99   San Bernadino, CA   209,924
    100  100   Spokane, WA   208,916   Top 100 cities a...
    101                                 Total   59,849,899
    [102 rows x 1 columns]

The problem is that it's not actually "organized": everything lands in a single column, so I can't do print(df['City']). I want this:

    0   RANK CITY           POPULATION                    
    1    1   New York, NY   8,175,133    
    2    2   Los Angeles, CA   3,792,621    
    3    3   Chicago, IL   2,695,598    
    4    4   Houston, TX   2,099,451    
    ...............................                                    
    97   97   Birmingham, AL   212,237
    98   98   Rochester, NY   210,565
    99   99   San Bernadino, CA   209,924
    100  100   Spokane, WA   208,916  

    [101 rows x 3 columns]

Can someone help me with this?

Answer 1

Score: 2

You can do this in a one-liner:

  • Use pd.read_html with the header and index_col parameters both set to 0. read_html returns a list of DataFrames; in this case there is only one, so we select the first element ([0]).
  • Use df.reset_index to reset the index, since the original index values will have been turned into floats on account of the NaN value in the final row.

import pandas as pd

url = 'https://www.nationalpopularvote.com/100-biggest-cities-have-59849899-people-and-rural-areas-have-59492267-people'

df = pd.read_html(url, header=0, index_col=0)[0].reset_index(drop=True)

df.head()

               City  Population                                Unnamed: 3
0      New York, NY     8175133   Biggest city is 2.6% of U.S. population
1   Los Angeles, CA     3792621  Top 2 cities are 3.8% of U.S. population
2       Chicago, IL     2695598  Top 3 cities are 4.7% of U.S. population
3       Houston, TX     2099451  Top 4 cities are 5.4% of U.S. population
4  Philadelphia, PA     1526006  Top 5 cities are 5.9% of U.S. population

# use `df.rename` to change the name of the 3rd (nameless) column
df = df.rename(columns={'Unnamed: 3': 'Comment'})
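
To illustrate why the reset_index step matters, here is a minimal sketch that does not touch the network. The small frame `raw` is a hypothetical stand-in for the scraped table (it is not what read_html actually returns, only the same shape of problem): the rank is used as the index and the final "Total" row has no rank. The names `raw`, `clean`, and `cities` are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped table: rank is the index,
# and the "Total" row has no rank, so its index label is NaN.
raw = pd.DataFrame(
    {
        "City": ["New York, NY", "Los Angeles, CA", "Total"],
        "Population": [8175133, 3792621, 59849899],
    },
    index=pd.Index([1, 2, np.nan], name="Rank"),
)

# A single NaN forces the whole index to a float dtype (1.0, 2.0, nan).
index_dtype_before = raw.index.dtype

# reset_index(drop=True) discards those labels and builds a fresh 0..n-1 index.
clean = raw.reset_index(drop=True)
index_dtype_after = clean.index.dtype

# Optionally drop the summary row so that only cities remain.
cities = clean[clean["City"] != "Total"]

print(index_dtype_before, index_dtype_after)
print(cities)
```

Dropping the "Total" row is an extra step that the answer above does not perform; it is shown only because the desired output in the question lists cities alone.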

Answer 2

Score: 0

This should help with what you are looking for:

import pandas as pd
List = ['City   Population', '1   New York, NY   8,175,133   \xa0Biggest city is 2.6% of U.S. population', '2   Los Angeles, CA   3,792,621   \xa0Top 2 cities are 3.8% of U.S. population', '3   Chicago, IL   2,695,598   \xa0Top 3 cities are 4.7% of U.S. population']
# create a list of dictionaries with the data
data = []
for item in List[1:]:
    rank, city, population, _ = item.split('   ')
    population = int(population.replace(',', ''))
    data.append({'Rank': int(rank), 'City': city, 'Population': population})

# create a pandas dataframe from the list of dictionaries
df = pd.DataFrame(data)

# print the dataframe
print(df)

       Rank             City  Population
    0     1     New York, NY     8175133
    1     2  Los Angeles, CA     3792621
    2     3      Chicago, IL     2695598
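
Note that the loop above unpacks exactly four fields per row, so it would raise a ValueError on rows that have no trailing note (such as the Birmingham row shown in the question's output) and on the header and "Total" rows. A possible variation, sketched here with a regular expression, skips any row that does not start with a rank. The sample list `rows` and the pattern `ROW_PATTERN` are assumptions made for this example; in particular, the exact text of the rows without a note is inferred from the printed output in the question, not from the real scraped list.

```python
import re

import pandas as pd

# Sample rows; the last two are assumed formats inferred from the question's output.
rows = [
    'City   Population',
    '1   New York, NY   8,175,133   \xa0Biggest city is 2.6% of U.S. population',
    '2   Los Angeles, CA   3,792,621   \xa0Top 2 cities are 3.8% of U.S. population',
    '97   Birmingham, AL   212,237',
    'Total   59,849,899',
]

# rank, three whitespace characters, city, three whitespace characters, population.
# Anything after the population (the optional note) is ignored.
ROW_PATTERN = re.compile(r'^(\d+)\s{3}(.+?)\s{3}([\d,]+)')

records = []
for line in rows:
    match = ROW_PATTERN.match(line)
    if match is None:
        # Header and "Total" rows do not start with a rank, so they are skipped.
        continue
    rank, city, population = match.groups()
    records.append({
        'Rank': int(rank),
        'City': city,
        'Population': int(population.replace(',', '')),
    })

df = pd.DataFrame(records)
print(df['City'])
```

With the columns named this way, print(df['City']) works as the question asks, and Population is stored as integers rather than comma-formatted strings.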

huangapple
  • Published on August 5, 2023, 01:31:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76838066.html