How to filter rows in a pandas DataFrame
Question
I'm confused. Here's a dataset:
Unnamed: 0 Unnamed: 1 Admissions Number of admissions per 100,000 population6 Unnamed: 4 Admissions.1 Number of admissions per 100,000 population6.1 Unnamed: 7 Admissions.2 Number of admissions per 100,000 population6.2 ... Unnamed: 28 Admissions.9 Number of admissions per 100,000 population6.9 Unnamed: 31 Admissions.10 Number of admissions per 100,000 population6.10 Unnamed: 34 Admissions.11 Number of admissions per 100,000 population6.11 Unnamed: 37
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 E92000001 ENGLAND7 976420.0 1810.0 NaN 713550.0 2810.0 NaN 262870.0 940.0 ... NaN 841760 1620 NaN 614050 2530 NaN 227710 840 NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN Unknown 5100.0 0.0 NaN 4360.0 0.0 NaN 730.0 0.0 ... NaN 5220 0 NaN 4460 0 NaN 760 0 NaN
4 NaN 1 2.0 3.0 4.0 5.0 6.0 7.0
My code to transform it is:
df = pd.read_excel("LAPE_Statistical_Tables_for_England_2021.xlsx", sheet_name="1.3", skiprows=5, skipfooter=24)
df = (
df.dropna(how="all", axis="columns"),
df.dropna(how="all", axis="rows"),
# Filter rows where "Unnamed: 1" column contains all uppercase characters using pd.query()
df.dropna(subset=["Unnamed: 1"], how="any")
)
df = pd.concat(df)
# Reset the index if needed
df = df.reset_index(drop=True)
df
But why does it not remove the rows where column 1 clearly contains NaN? I also want to remove the rows where column 1 is all uppercase.
Why does this not work?
Answer 1
Score: 1
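There are a couple of issues in the code. The parentheses with commas create a tuple of three separate DataFrames, each computed from the original df, instead of applying the operations one after another. pd.concat then stacks those three results on top of each other, so rows dropped by one call come back from the others. To fix this, assign the result of each step back to df and drop the pd.concat call.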
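To make the tuple behaviour concrete, here is a minimal sketch. The three-row DataFrame is a made-up stand-in for the spreadsheet (an assumption, not the real data); only the column names are borrowed from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet: 3 rows, one of them entirely NaN.
df = pd.DataFrame({
    "Unnamed: 0": ["E92000001", np.nan, np.nan],
    "Unnamed: 1": ["ENGLAND7", np.nan, "Unknown"],
})

# Parentheses with commas build a tuple of three independent results;
# each dropna call starts again from the original df.
parts = (
    df.dropna(how="all", axis="columns"),
    df.dropna(how="all", axis="rows"),
    df.dropna(subset=["Unnamed: 1"], how="any"),
)
stacked = pd.concat(parts)

print(type(parts).__name__)  # tuple
print(len(stacked))          # 3 + 2 + 2 = 7 rows, NaN rows included

# Chaining the calls applies each step to the result of the previous one.
chained = (
    df.dropna(how="all", axis="columns")
      .dropna(how="all", axis="rows")
      .dropna(subset=["Unnamed: 1"], how="any")
      .reset_index(drop=True)
)
print(len(chained))          # 2 rows, no NaN left in "Unnamed: 1"
```

Method chaining inside parentheses is one option; reassigning df after each step is equivalent.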
I considered only the first two columns. Here's the corrected code:
import pandas as pd

df = pd.read_excel("test1.xlsx")
df = df.dropna(how="all", axis="columns") # Remove columns with all NaN values
df = df.dropna(how="all", axis="rows") # Remove rows with all NaN values
# Drop rows where the "Unnamed: 1" column is NaN
df = df.dropna(subset=["Unnamed: 1"], how="any")
# Reset the index if needed
df = df.reset_index(drop=True)
Output using your code:
df
Unnamed: 0 Unnamed: 1
0 NaN
1 E92000001
2 NaN
3 NaN
4 NaN
0 NaN
1 E92000001
2 NaN
3 NaN
4 NaN
1 E92000001
Output after applying my modifications:
Unnamed: 0 Unnamed: 1
1 E92000001
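The corrected code above only drops NaN rows; it does not yet remove the rows where column 1 is all uppercase, which the question also asks for. One possible way to do that, sketched here on made-up sample data (the DataFrame contents are an assumption), is a boolean mask built with the Series.str.isupper() accessor:

```python
import numpy as np
import pandas as pd

# Hypothetical sample resembling the first columns of the sheet.
df = pd.DataFrame({
    "Unnamed: 0": ["E92000001", np.nan, np.nan, np.nan],
    "Unnamed: 1": ["ENGLAND7", np.nan, "Unknown", 1],
    "Admissions": [976420.0, np.nan, 5100.0, 2.0],
})

# Drop rows where "Unnamed: 1" is NaN first.
df = df.dropna(subset=["Unnamed: 1"])

# Cast to str so non-string cells (such as the number 1) do not yield NaN
# in the mask; str.isupper() is True only when every cased character is
# uppercase and at least one cased character exists.
is_upper = df["Unnamed: 1"].astype(str).str.isupper()

# Keep only the rows that are NOT all uppercase.
result = df[~is_upper].reset_index(drop=True)
print(result["Unnamed: 1"].tolist())  # ['Unknown', 1]
```

Note that "ENGLAND7" counts as uppercase because digits are ignored by isupper(); if values like that should be kept, the mask would need a stricter condition.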