如何使用Pandas DataFrame从NHS或ONS中选择数据的开头和结尾。

huangapple go评论161阅读模式
英文:

How to top and tail data from NHS or ONS using pandas dataframe

问题

以下是您要翻译的内容:

"I'm playing with Pandas to look at alcohol-related admissions from https://fingertips.phe.org.uk/documents/LAPE_Statistical_Tables_for_England_2021.xlsx

I've got a bit of code that sort of looks like this:

df = pd.read_excel("../LAPE_Statistical_Tables_for_England_2021.xlsx", sheet_name="1.3", skiprows=5)
df = (df.dropna(how="all", axis="columns"),
df.dropna(how="all", axis="rows")
)
df

The results look like this:

Unnamed: 0 Unnamed: 1 Admissions Number of admissions per 100,000 population6 Admissions.1 Number of admissions per 100,000 population6.1 Admissions.2 Number of admissions per 100,000 population6.2 Admissions.3 Number of admissions per 100,000 population6.3 ...
1 E92000001 ENGLAND7 976420.0 1810.0 713550.0 2810.0 262870.0 940.0 938620 1770 ... 635060.0 2570.0 235020.0 860.0 841760 1620 614050 2530 227710 840
3 NaN Unknown 5100.0 0.0 4360.0 0.0 730.0 0.0 5260 0 ... 4370.0 0.0 620.0 0.0 5220 0 4460 0 760 0
5 E12000004 EAST MIDLANDS 86990.0 1820.0 63300.0 2790.0 23690.0 970.0 81060 1720 ... 54360.0 2470.0 20760.0 870.0 67800 1480 48930 2260 18870 800
7 E07000032 Amber Valley 2630.0 1900.0 1860.0 2810.0 780.0 1110.0 2280 1670 ... 1500.0 2350.0 610.0 890.0 2040 1550 1450 2330 590 880
8 E07000170 Ashfieldⁱ 2550.0 2010.0 1830.0 3050.0 710.0 1090.0 2550 2030 ... 1650.0 2810.0 620.0 970.0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
382 Sources NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
383 ● Hospital Episode Statistics, NHS Digital - D... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
384 ● Public Health England - Attributable fractio... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
385 ● Mid-year Population Estimatesin England, The... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
387 Hospital Episode Statistics (HES) Copyright © ... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
363 rows × 29 columns

I can change the column names manually but I want to drop rows where column Unnamed: 1 contains all upper case characters.

I ideally want to drop all the attribution rows at the bottom. Is that possible with pd.read_excel()?

UPDATE: yes, it is possible to skip the footer with skipfooter in pd.read_excel."

英文:

I'm playing with Pandas to look at alcohol-related admissions from https://fingertips.phe.org.uk/documents/LAPE_Statistical_Tables_for_England_2021.xlsx

I've got a bit of code that sort of looks like this:

 df = pd.read_excel("../LAPE_Statistical_Tables_for_England_2021.xlsx", sheet_name="1.3", skiprows=5)
df = (df.dropna(how="all", axis="columns"),
      df.dropna(how="all", axis="rows")
)
df

The results look like this:

Unnamed: 0	Unnamed: 1	Admissions	Number of admissions per 100,000 population6	Admissions.1	Number of admissions per 100,000 population6.1	Admissions.2	Number of admissions per 100,000 population6.2	Admissions.3	Number of admissions per 100,000 population6.3	...	Admissions.7	Number of admissions per 100,000 population6.7	Admissions.8	Number of admissions per 100,000 population6.8	Admissions.9	Number of admissions per 100,000 population6.9	Admissions.10	Number of admissions per 100,000 population6.10	Admissions.11	Number of admissions per 100,000 population6.11
1	E92000001	ENGLAND7	976420.0	1810.0	713550.0	2810.0	262870.0	940.0	938620	1770	...	635060.0	2570.0	235020.0	860.0	841760	1620	614050	2530	227710	840
3	NaN	Unknown	5100.0	0.0	4360.0	0.0	730.0	0.0	5260	0	...	4370.0	0.0	620.0	0.0	5220	0	4460	0	760	0
5	E12000004	EAST MIDLANDS	86990.0	1820.0	63300.0	2790.0	23690.0	970.0	81060	1720	...	54360.0	2470.0	20760.0	870.0	67800	1480	48930	2260	18870	800
7	E07000032	Amber Valley	2630.0	1900.0	1860.0	2810.0	780.0	1110.0	2280	1670	...	1500.0	2350.0	610.0	890.0	2040	1550	1450	2330	590	880
8	E07000170	Ashfieldⁱ	2550.0	2010.0	1830.0	3050.0	710.0	1090.0	2550	2030	...	1650.0	2810.0	620.0	970.0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
382	Sources	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
383	● Hospital Episode Statistics, NHS Digital - D...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
384	● Public Health England - Attributable fractio...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
385	● Mid-year Population Estimatesin England, The...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
387	Hospital Episode Statistics (HES) Copyright © ...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
363 rows × 29 columns

I can change the column names manually but I want to drop rows where column Unnamed: 1 contains all upper case characters.

I ideally want to drop all the attribution rows at the bottom. Is that possible with pd.read_excel()?

UPDATE: yes, it is possible to skip the footer with skipfooter in pd.read_excel.

答案1

得分: 1

这是您可以删除值在Unnamed: 1中全部由大写字符组成的行的方法。请注意,我还调整了您的代码,删除NaN值,因为原始版本返回了一个包含DataFrames元组,而不是在同一DataFrame上应用.dropna()操作(我假设您希望在同一DataFrame上应用这两个操作)。

df = pd.read_excel(filepath, sheet_name="1.3", skiprows=5)
df = (
    df
    .dropna(how="all", axis="columns")
    .dropna(how="all", axis="rows")
)

print(df['Unnamed: 1'].head(n=5).to_markdown(index=False))
    
df = df.loc[
    df['Unnamed: 1'] != df['Unnamed: 1'].str.upper(),
    :
]

print(df['Unnamed: 1'].head(n=5).to_markdown(index=False))

请注意,这是您提供的代码的翻译部分。

英文:

Here's how you can drop the rows whose value in Unnamed: 1 consists entirely of uppercase characters. Note that I've also adjusted your code dropping NaN's as the original version returned a tuple of DataFrames instead of applying both .dropna() on the same DataFrame (I assume you wanted to apply both operations on the same DataFrame).

df = pd.read_excel(filepath, sheet_name="1.3", skiprows=5)
df = (
    df
    .dropna(how="all", axis="columns")
    .dropna(how="all", axis="rows")
)

print(df['Unnamed: 1'].head(n=5).to_markdown(index=False))

# | Unnamed: 1    |
# |:--------------|
# | ENGLAND7      |
# | Unknown       |
# | EAST MIDLANDS |
# | Amber Valley  |
# | Ashfieldⁱ     |

df = df.loc[
    df['Unnamed: 1'] != df['Unnamed: 1'].str.upper(),
    :
]

print(df['Unnamed: 1'].head(n=5).to_markdown(index=False))
# | Unnamed: 1   |
# |:-------------|
# | Unknown      |
# | Amber Valley |
# | Ashfieldⁱ    |
# | Bassetlaw    |
# | Blaby        |

huangapple
  • 本文由 发表于 2023年5月25日 19:19:51
  • 转载请务必保留本文链接:https://go.coder-hub.com/76331715.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定