英文:
Efficient ways of handling dataframes with columns that are mutually exclusive?
问题
作为一个示例,想象一下我有一个如下的CSV文件:
id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0 |data|data|....|data|None|None|....|None|
1 |data|data|....|data|None|None|....|None|
2 |data|data|....|data|None|None|....|None|
3 |None|None|....|None|data|data|....|data|
4 |None|None|....|None|data|data|....|data|
....
我有N个a
列和M个b
列,a
列和b
列是互斥的,也就是说,如果我在a
列中有data
,那么在b
列中就不会有任何内容。在这种情况下,data
主要是字符串或浮点数值。
感觉效率不高,因为每一行都会有M或N个不包含任何内容的元素。
我可以将上面的数据分成两个不同的数据框,即:
df_a
id | a1 | a2 |....| aN |
________________________
0 |data|data|....|data|
1 |data|data|....|data|
2 |data|data|....|data|
df_b
id | b1 | b2 |....| bM |
________________________
3 |data|data|....|data|
4 |data|data|....|data|
但是这样我需要跟踪两个数据框,而不是一个数据框。
在不创建冗余数据框的情况下,保持数据在一起的最有效方法是什么?如果我有c
和d
列,这个解决方案是否适用?
一种方法是将CSV文件转换为Excel表格,并将b列放在不同的工作表中。但是这仍然对我来说有点不够灵活。
英文:
As an example imagine I have a csv below
id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0 |data|data|....|data|None|None|....|None|
1 |data|data|....|data|None|None|....|None|
2 |data|data|....|data|None|None|....|None|
3 |None|None|....|None|data|data|....|data|
4 |None|None|....|None|data|data|....|data|
....
I have N a
columns and M b
columns, and the a
columns and b
columns are mutually exclusive i.e. if I have data
in a
then I wouldn't have anything in b
. data
in this case is mostly string or float values.
It feels inefficient that I will have M or N elements for each row that don't contain anything.
I can split the above into two different dataframes i.e.
df_a
id | a1 | a2 |....| aN |
________________________
0 |data|data|....|data|
1 |data|data|....|data|
2 |data|data|....|data|
df_b
id | b1 | b2 |....| bM |
________________________
3 |data|data|....|data|
4 |data|data|....|data|
But then I have two dataframes that I have to keep track of instead of one dataframe.
What is the most efficient way of keeping the data together without creating a bloated dataframe? Would the solution work if I have c
and d
columns as well?
One thing I can do is make the csv into an excel sheet and put the b columns in a different sheet. But it's still a bit clunky for me.
答案1
得分: 1
以下是翻译好的部分:
一个可能的解决方案是使用稀疏数据结构(它不存储None
值):
txt = """
id a1 a2 a3 b1 b2 b3
0 10.0 20.0 30.0 None None None
1 40.0 50.0 60.0 None None None
2 70.0 80.0 90.0 None None None
3 None None None 100.0 200.0 300.0
4 None None None 400.0 500.0 600.0
"""
df = pd.read_csv(StringIO(txt), sep='\s+')
df = df.astype(pd.SparseDtype("float", None))
print(df.memory_usage())
输出:
Index 132
id 60
a1 36
a2 36
a3 36
b1 24
b2 24
b3 24
dtype: int64
英文:
A possible solution is to use Sparse data structures
(it does not store None
values):
txt = """
id a1 a2 a3 b1 b2 b3
0 10.0 20.0 30.0 None None None
1 40.0 50.0 60.0 None None None
2 70.0 80.0 90.0 None None None
3 None None None 100.0 200.0 300.0
4 None None None 400.0 500.0 600.0
"""
df = pd.read_csv(StringIO(txt), sep='\s+')
df = df.astype(pd.SparseDtype("float", None))
print(df.memory_usage())
Output:
Index 132
id 60
a1 36
a2 36
a3 36
b1 24
b2 24
b3 24
dtype: int64
答案2
得分: 0
@kkawabat,首先分开a和b,然后为a的数据框和b的数据框分配一个新行名为'TYPE',对于a的数据框,将'TYPE'填充为'a',对于b的数据框,将'TYPE'填充为'b',然后合并这两个数据框。在这种情况下,您不需要创建像a1、a2、b1、b2这样的数据框列,您可以只是使用col1、col2等列名,或者您还可以将'TYPE'设置为布尔值,并将True表示为'a',False表示为'b'。
英文:
@kkawabat, how about this, first separate a,b then assign both dataframes, a new row called, 'TYPE', for df of a fill TYPE with 'a' and TYPE 'b' for other and then merge both data frame, in such case u don't need to make df columns like a1, a2, b1, b2, you can just do col1,col2,.... or also u can set that TYPE to boolean and refer True as a and False as b.
答案3
得分: 0
这有点巧妙,但我最终采用的方法是:
创建一个新的数据框,包含列 id
、type
和 value_dict
,其中 type
可以是 a
或 b
,而 value_dict
的值是一个字典,其中键是所有具有值的列名。例如,
id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0 |data|data|....|data|None|None|....|None|
1 |data|data|....|data|None|None|....|None|
2 |data|data|....|data|None|None|....|None|
3 |None|None|....|None|data|data|....|data|
4 |None|None|....|None|data|data|....|data|
....
将变成:
id | type | value_dict |
___________________________________________
0 | "a" | {"a1": data, "a2": data, ... "aN": data}|
1 | "a" | {"a1": data, "a2": data, ... "aN": data}|
2 | "a" | {"a1": data, "a2": data, ... "aN": data}|
3 | "b" | {"b1": data, "b2": data, ... "bM": data}|
4 | "b" | {"b1": data, "b2": data, ... "bM": data}|
....
使用字典作为元素似乎有点奇怪,但这似乎是我能想到的最直观的解决方案。
英文:
It's a bit hacky but what I ended up doing was:
create a new dataframe with the columns id
, type
, value_dict
where type is either a
or b
and the values of value_dict
is a dictionary where the keys are all the col names with values in them. So for example,
id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0 |data|data|....|data|None|None|....|None|
1 |data|data|....|data|None|None|....|None|
2 |data|data|....|data|None|None|....|None|
3 |None|None|....|None|data|data|....|data|
4 |None|None|....|None|data|data|....|data|
....
will be:
id | type | value_dict |
___________________________________________
0 | "a" | {"a1": data, "a2": data, ... "aN": data}|
1 | "a" | {"a1": data, "a2": data, ... "aN": data}|
2 | "a" | {"a1": data, "a2": data, ... "aN": data}|
3 | "b" | {"b1": data, "b2": data, ... "bM": data}|
4 | "b" | {"b1": data, "b2": data, ... "bM": data}|
....
It's kind of weird that I'm using dictionaries as elements but this seems the most intuitive solution that I could come up with.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论