如何在pandas数据帧中筛选行

huangapple go评论105阅读模式
英文:

How to filter rows in pandas dataframed

问题

以下是已翻译的内容:

df = pd.read_excel("LAPE_Statistical_Tables_for_England_2021.xlsx", sheet_name="1.3", skiprows=5, skipfooter=24)
df = (
    df.dropna(how="all", axis="columns"),
    df.dropna(how="all", axis="rows"),
    # 使用 pd.query() 过滤包含全部大写字符的 "Unnamed: 1" 列的行
    df.dropna(subset=["Unnamed: 1"], how="any")
)

df = pd.concat(df)

# 如果需要的话,重置索引
df = df.reset_index(drop=True)
df

为什么它不会删除包含 NaN 的列 1,以及为什么它不会删除列 1 包含全大写字符的行?为什么它不起作用?

英文:

Confused.

Here's a dataset.

Unnamed: 0	Unnamed: 1	Admissions	Number of admissions per 100,000 population6	Unnamed: 4	Admissions.1	Number of admissions per 100,000 population6.1	Unnamed: 7	Admissions.2	Number of admissions per 100,000 population6.2	...	Unnamed: 28	Admissions.9	Number of admissions per 100,000 population6.9	Unnamed: 31	Admissions.10	Number of admissions per 100,000 population6.10	Unnamed: 34	Admissions.11	Number of admissions per 100,000 population6.11	Unnamed: 37
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	E92000001	ENGLAND7	976420.0	1810.0	NaN	713550.0	2810.0	NaN	262870.0	940.0	...	NaN	841760	1620	NaN	614050	2530	NaN	227710	840	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	Unknown	5100.0	0.0	NaN	4360.0	0.0	NaN	730.0	0.0	...	NaN	5220	0	NaN	4460	0	NaN	760	0	NaN
4	NaN	1	2.0	3.0	4.0	5.0	6.0	7.0

My code to transform it is:

df = pd.read_excel("LAPE_Statistical_Tables_for_England_2021.xlsx", sheet_name="1.3", skiprows=5, skipfooter=24)
df = (
    df.dropna(how="all", axis="columns"),
    df.dropna(how="all", axis="rows"),
    # Filter rows where "Unnamed: 1" column contains all uppercase characters using pd.query()
    df.dropna(subset=["Unnamed: 1"], how="any")
)

df = pd.concat(df)

# Reset the index if needed
df = df.reset_index(drop=True)
df

But why does it not remove the NaNs where column 1 is clearly contains NaN. I also want to remove the rows where column 1 is all uppercase.

Why does this not work?

答案1

得分: 1

在代码中存在一些问题。括号创建了一个元组,而不是修改原始的 DataFrame df,而且不是将一个元组 (df) 连接起来,而是连接了前面操作生成的各个 DataFrame 来解决错误。

我考虑了前两列,以下是已经更正的代码:

df = pd.read_excel("test1.xlsx")
df = df.dropna(how="all", axis="columns")  # 移除所有 NaN 值的列
df = df.dropna(how="all", axis="rows")     # 移除所有 NaN 值的行

# 使用 pd.query() 过滤 "Unnamed: 1" 列中包含全部大写字符的行
df = df.dropna(subset=["Unnamed: 1"], how="any")

# 如果需要,重置索引
df = df.reset_index(drop=True)

使用你的代码输出:

df

Unnamed: 0	Unnamed: 1
0	NaN
1	E92000001
2	NaN
3	NaN
4	NaN
0	NaN
1	E92000001
2	NaN
3	NaN
4	NaN
1	E92000001

应用我的修改后的输出:

Unnamed: 0	Unnamed: 1
1	E92000001
英文:

There are couple of issues in the code. The parentheses create a tuple instead of modifying the original DataFrame df and instead of concatenating a tuple (df), concatenate the individual DataFrames resulting from the previous operations to resolve the error.

I have considered first two columns and Here's the corrected code:

df = pd.read_excel("test1.xlsx")
df = df.dropna(how="all", axis="columns")  # Remove columns with all NaN values
df = df.dropna(how="all", axis="rows")     # Remove rows with all NaN values

# Filter rows where "Unnamed: 1" column contains all uppercase characters using pd.query()
df = df.dropna(subset=["Unnamed: 1"], how="any")

# Reset the index if needed
df = df.reset_index(drop=True)

Output using your code:

df

Unnamed: 0	Unnamed: 1
0	NaN
1	E92000001
2	NaN
3	NaN
4	NaN
0	NaN
1	E92000001
2	NaN
3	NaN
4	NaN
1	E92000001

Output after applying my modifications:

Unnamed: 0	Unnamed: 1
1	E92000001

huangapple
  • 本文由 发表于 2023年5月29日 09:08:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/76354169.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定