2023年5月25日 19:19:51go评论166阅读模式

英文:

How to top and tail data from NHS or ONS using pandas dataframe

问题

以下是您要翻译的内容：

"I'm playing with Pandas to look at alcohol-related admissions from https://fingertips.phe.org.uk/documents/LAPE_Statistical_Tables_for_England_2021.xlsx

I've got a bit of code that sort of looks like this:

df = pd.read_excel("../LAPE_Statistical_Tables_for_England_2021.xlsx", sheet_name="1.3", skiprows=5)
df = (df.dropna(how="all", axis="columns"),
df.dropna(how="all", axis="rows")
)
df

The results look like this:

Unnamed: 0 Unnamed: 1 Admissions Number of admissions per 100,000 population6 Admissions.1 Number of admissions per 100,000 population6.1 Admissions.2 Number of admissions per 100,000 population6.2 Admissions.3 Number of admissions per 100,000 population6.3 ...
1 E92000001 ENGLAND7 976420.0 1810.0 713550.0 2810.0 262870.0 940.0 938620 1770 ... 635060.0 2570.0 235020.0 860.0 841760 1620 614050 2530 227710 840
3 NaN Unknown 5100.0 0.0 4360.0 0.0 730.0 0.0 5260 0 ... 4370.0 0.0 620.0 0.0 5220 0 4460 0 760 0
5 E12000004 EAST MIDLANDS 86990.0 1820.0 63300.0 2790.0 23690.0 970.0 81060 1720 ... 54360.0 2470.0 20760.0 870.0 67800 1480 48930 2260 18870 800
7 E07000032 Amber Valley 2630.0 1900.0 1860.0 2810.0 780.0 1110.0 2280 1670 ... 1500.0 2350.0 610.0 890.0 2040 1550 1450 2330 590 880
8 E07000170 Ashfieldⁱ 2550.0 2010.0 1830.0 3050.0 710.0 1090.0 2550 2030 ... 1650.0 2810.0 620.0 970.0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
382 Sources NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
383 ● Hospital Episode Statistics, NHS Digital - D... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
384 ● Public Health England - Attributable fractio... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
385 ● Mid-year Population Estimatesin England, The... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
387 Hospital Episode Statistics (HES) Copyright © ... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
363 rows × 29 columns

I can change the column names manually but I want to drop rows where column Unnamed: 1 contains all upper case characters.

I ideally want to drop all the attribution rows at the bottom. Is that possible with pd.read_excel()?

UPDATE: yes, it is possible to skip the footer with skipfooter in pd.read_excel."

英文:

I'm playing with Pandas to look at alcohol-related admissions from https://fingertips.phe.org.uk/documents/LAPE_Statistical_Tables_for_England_2021.xlsx

I've got a bit of code that sort of looks like this:

 df = pd.read_excel(&quot;../LAPE_Statistical_Tables_for_England_2021.xlsx&quot;, sheet_name=&quot;1.3&quot;, skiprows=5)
df = (df.dropna(how=&quot;all&quot;, axis=&quot;columns&quot;),
      df.dropna(how=&quot;all&quot;, axis=&quot;rows&quot;)
)
df

The results look like this:

Unnamed: 0	Unnamed: 1	Admissions	Number of admissions per 100,000 population6	Admissions.1	Number of admissions per 100,000 population6.1	Admissions.2	Number of admissions per 100,000 population6.2	Admissions.3	Number of admissions per 100,000 population6.3	...	Admissions.7	Number of admissions per 100,000 population6.7	Admissions.8	Number of admissions per 100,000 population6.8	Admissions.9	Number of admissions per 100,000 population6.9	Admissions.10	Number of admissions per 100,000 population6.10	Admissions.11	Number of admissions per 100,000 population6.11
1	E92000001	ENGLAND7	976420.0	1810.0	713550.0	2810.0	262870.0	940.0	938620	1770	...	635060.0	2570.0	235020.0	860.0	841760	1620	614050	2530	227710	840
3	NaN	Unknown	5100.0	0.0	4360.0	0.0	730.0	0.0	5260	0	...	4370.0	0.0	620.0	0.0	5220	0	4460	0	760	0
5	E12000004	EAST MIDLANDS	86990.0	1820.0	63300.0	2790.0	23690.0	970.0	81060	1720	...	54360.0	2470.0	20760.0	870.0	67800	1480	48930	2260	18870	800
7	E07000032	Amber Valley	2630.0	1900.0	1860.0	2810.0	780.0	1110.0	2280	1670	...	1500.0	2350.0	610.0	890.0	2040	1550	1450	2330	590	880
8	E07000170	Ashfieldⁱ	2550.0	2010.0	1830.0	3050.0	710.0	1090.0	2550	2030	...	1650.0	2810.0	620.0	970.0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
382	Sources	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
383	● Hospital Episode Statistics, NHS Digital - D...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
384	● Public Health England - Attributable fractio...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
385	● Mid-year Population Estimatesin England, The...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
387	Hospital Episode Statistics (HES) Copyright &#169; ...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
363 rows &#215; 29 columns

I can change the column names manually but I want to drop rows where column Unnamed: 1 contains all upper case characters.

I ideally want to drop all the attribution rows at the bottom. Is that possible with pd.read_excel()?

UPDATE: yes, it is possible to skip the footer with skipfooter in pd.read_excel.

答案1

得分: 1

这是您可以删除值在Unnamed: 1中全部由大写字符组成的行的方法。请注意，我还调整了您的代码，删除NaN值，因为原始版本返回了一个包含DataFrames元组，而不是在同一DataFrame上应用.dropna()操作（我假设您希望在同一DataFrame上应用这两个操作）。

df = pd.read_excel(filepath, sheet_name="1.3", skiprows=5)
df = (
    df
    .dropna(how="all", axis="columns")
    .dropna(how="all", axis="rows")
)

print(df['Unnamed: 1'].head(n=5).to_markdown(index=False))
    
df = df.loc[
    df['Unnamed: 1'] != df['Unnamed: 1'].str.upper(),
    :
]

print(df['Unnamed: 1'].head(n=5).to_markdown(index=False))

请注意，这是您提供的代码的翻译部分。

英文:

Here's how you can drop the rows whose value in Unnamed: 1 consists entirely of uppercase characters. Note that I've also adjusted your code dropping NaN's as the original version returned a tuple of DataFrames instead of applying both .dropna() on the same DataFrame (I assume you wanted to apply both operations on the same DataFrame).

df = pd.read_excel(filepath, sheet_name=&quot;1.3&quot;, skiprows=5)
df = (
    df
    .dropna(how=&quot;all&quot;, axis=&quot;columns&quot;)
    .dropna(how=&quot;all&quot;, axis=&quot;rows&quot;)
)

print(df[&#39;Unnamed: 1&#39;].head(n=5).to_markdown(index=False))

# | Unnamed: 1    |
# |:--------------|
# | ENGLAND7      |
# | Unknown       |
# | EAST MIDLANDS |
# | Amber Valley  |
# | Ashfieldⁱ     |

df = df.loc[
    df[&#39;Unnamed: 1&#39;] != df[&#39;Unnamed: 1&#39;].str.upper(),
    :
]

print(df[&#39;Unnamed: 1&#39;].head(n=5).to_markdown(index=False))
# | Unnamed: 1   |
# |:-------------|
# | Unknown      |
# | Amber Valley |
# | Ashfieldⁱ    |
# | Bassetlaw    |
# | Blaby        |

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何使用Pandas DataFrame从NHS或ONS中选择数据的开头和结尾。

问题

答案1

Pandas DataFrame 按自定义日期（季度）分组

Pandas：使用可能不存在于CSV文件中的usecols进行read_csv。

使用正则表达式在`pivot_longer`中，将具有共同分组变量的多个列集合展开。

Pandas数据框架：根据索引和条件替换列中的值。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论