Converting a list into a formatted pandas DataFrame
Question
I am trying to scrape the top 100 cities in the US from the following website:
I have fetched the data from the website and converted it into a list:
# Note: this is not the full list, because the full list is too long
List = ['City Population', '1 New York, NY 8,175,133 \xa0Biggest city is 2.6% of U.S. population', '2 Los Angeles, CA 3,792,621 \xa0Top 2 cities are 3.8% of U.S. population', '3 Chicago, IL 2,695,598 \xa0Top 3 cities are 4.7% of U.S. population']
I am trying to convert that into an organized dataframe.
I have already tried this:
df = pd.DataFrame(List)
print(df)
I hoped it would be that simple, but it returns the following:
0 City Population
1 1 New York, NY 8,175,133 Biggest city i...
2 2 Los Angeles, CA 3,792,621 Top 2 citie...
3 3 Chicago, IL 2,695,598 Top 3 cities ar...
4 4 Houston, TX 2,099,451 Top 4 cities ar...
.. ...
97 97 Birmingham, AL 212,237
98 98 Rochester, NY 210,565
99 99 San Bernadino, CA 209,924
100 100 Spokane, WA 208,916 Top 100 cities a...
101 Total 59,849,899
[102 rows x 1 columns]
The problem is that it's not actually organized: everything sits in a single column, so I can't do print(df['City']). I want this:
0 RANK CITY POPULATION
1 1 New York, NY 8,175,133
2 2 Los Angeles, CA 3,792,621
3 3 Chicago, IL 2,695,598
4 4 Houston, TX 2,099,451
...............................
97 97 Birmingham, AL 212,237
98 98 Rochester, NY 210,565
99 99 San Bernadino, CA 209,924
100 100 Spokane, WA 208,916
[101 rows x 3 columns]
Can someone help me with this?
Answer 1
Score: 2
You can do this in a one-liner:
- Use `pd.read_html` with the `header` and `index_col` parameters both set to `0`. The result is a list of DataFrames. In this case there is only one, so we select the first element (`[0]`).
- Use `df.reset_index` to reset the index, since the original index values will have been turned into floats on account of the `NaN` value in the final row.
import pandas as pd
url = 'https://www.nationalpopularvote.com/100-biggest-cities-have-59849899-people-and-rural-areas-have-59492267-people'
df = pd.read_html(url, header=0, index_col=0)[0].reset_index(drop=True)
df.head()
City Population Unnamed: 3
0 New York, NY 8175133 Biggest city is 2.6% of U.S. population
1 Los Angeles, CA 3792621 Top 2 cities are 3.8% of U.S. population
2 Chicago, IL 2695598 Top 3 cities are 4.7% of U.S. population
3 Houston, TX 2099451 Top 4 cities are 5.4% of U.S. population
4 Philadelphia, PA 1526006 Top 5 cities are 5.9% of U.S. population
# use `df.rename` to change the name of the 3rd (nameless) column
df = df.rename(columns={'Unnamed: 3': 'Comment'})
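
If the rank should be kept as a regular column (as in the desired output in the question) instead of being dropped, one possible approach is sketched below. To stay runnable offline it does not call `pd.read_html`; `raw` is a hand-built stand-in for the table that call would return, and its float index with a trailing `NaN` is an assumption based on the "Total" row shown in the question. The helper name `tidy_cities` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for pd.read_html(url, header=0, index_col=0)[0].
# Assumption: the unranked "Total" row gives the index a NaN, which makes it float.
raw = pd.DataFrame(
    {
        "City": ["New York, NY", "Los Angeles, CA", "Chicago, IL", "Total"],
        "Population": [8175133, 3792621, 2695598, 59849899],
        "Unnamed: 3": [
            "Biggest city is 2.6% of U.S. population",
            "Top 2 cities are 3.8% of U.S. population",
            "Top 3 cities are 4.7% of U.S. population",
            None,
        ],
    },
    index=pd.Index([1.0, 2.0, 3.0, np.nan]),
)


def tidy_cities(table: pd.DataFrame) -> pd.DataFrame:
    """Drop unranked rows and return Rank, City and Population columns."""
    ranked = table[table.index.notna()]          # removes the "Total" row
    ranked = ranked.rename_axis("Rank").reset_index()
    ranked["Rank"] = ranked["Rank"].astype(int)  # floats back to integers
    return ranked[["Rank", "City", "Population"]]


cities = tidy_cities(raw)
print(cities)
```

With the real page, the same helper could be applied to the result of `pd.read_html(url, header=0, index_col=0)[0]`, provided the table has the layout assumed here.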
Answer 2
Score: 0
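This may help with what you are looking for. Splitting each string on spaces does not work, because city names and the trailing comment contain spaces themselves, so a regular expression is used to pick out the three fields:
import re
import pandas as pd
List = ['City Population', '1 New York, NY 8,175,133 \xa0Biggest city is 2.6% of U.S. population', '2 Los Angeles, CA 3,792,621 \xa0Top 2 cities are 3.8% of U.S. population', '3 Chicago, IL 2,695,598 \xa0Top 3 cities are 4.7% of U.S. population']
# rank, then "City, ST", then the population; any trailing comment is ignored
pattern = re.compile(r'^(\d+)\s+(.+?,\s*[A-Z]{2})\s+([\d,]+)')
# create a list of dictionaries with the data
data = []
for item in List[1:]:
    match = pattern.match(item)
    if match is None:
        continue  # skip rows without a rank, such as the 'Total' row
    rank, city, population = match.groups()
    population = int(population.replace(',', ''))
    data.append({'Rank': int(rank), 'City': city, 'Population': population})
# create a pandas dataframe from the list of dictionaries
df = pd.DataFrame(data)
# print the dataframe
print(df)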
Rank | City | Population |
---|---|---|
1 | New York, NY | 8175133 |
2 | Los Angeles, CA | 3792621 |
3 | Chicago, IL | 2695598 |
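
As an alternative to looping over the list, the same parsing can be sketched with the vectorized `Series.str.extract`, where named groups in the pattern become column names. This is a minimal sketch, not a definitive implementation: the variable name `rows` and the appended `'Total 59,849,899'` entry are illustrative, and the pattern assumes every city is written as "Name, ST" with a two-letter state code.

```python
import pandas as pd

rows = [
    'City Population',
    '1 New York, NY 8,175,133 \xa0Biggest city is 2.6% of U.S. population',
    '2 Los Angeles, CA 3,792,621 \xa0Top 2 cities are 3.8% of U.S. population',
    '3 Chicago, IL 2,695,598 \xa0Top 3 cities are 4.7% of U.S. population',
    'Total 59,849,899',
]

# Named groups become the column names of the extracted DataFrame.
# Assumption: each data row looks like "<rank> <City, ST> <population> [comment]".
pattern = r'^(?P<Rank>\d+)\s+(?P<City>.+?,\s*[A-Z]{2})\s+(?P<Population>[\d,]+)'

# Rows that do not match (the header and the total) come back as NaN and are dropped.
df = pd.Series(rows).str.extract(pattern).dropna().reset_index(drop=True)
df['Rank'] = df['Rank'].astype(int)
df['Population'] = df['Population'].str.replace(',', '', regex=False).astype(int)

print(df['City'].tolist())
```

Because the columns are now named, `df['City']` works as the question asked.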