处理具有互斥列的数据框的有效方法?

huangapple go评论71阅读模式
英文:

Efficient ways of handling dataframes with columns that are mutually exclusive?

问题

作为一个示例,想象一下我有一个如下的CSV文件:

id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0  |data|data|....|data|None|None|....|None|
1  |data|data|....|data|None|None|....|None|
2  |data|data|....|data|None|None|....|None|
3  |None|None|....|None|data|data|....|data|
4  |None|None|....|None|data|data|....|data|
....

我有N个a列和M个b列,a列和b列是互斥的,也就是说,如果我在a列中有data,那么在b列中就不会有任何内容。在这种情况下,data主要是字符串或浮点数值。

感觉效率不高,因为每一行都会有M或N个不包含任何内容的元素。

我可以将上面的数据分成两个不同的数据框,即:

df_a
id | a1 | a2 |....| aN |
________________________
0  |data|data|....|data|
1  |data|data|....|data|
2  |data|data|....|data|

df_b
id | b1 | b2 |....| bM |
________________________
3  |data|data|....|data|
4  |data|data|....|data|

但是这样我需要跟踪两个数据框,而不是一个数据框。

在不创建冗余数据框的情况下,保持数据在一起的最有效方法是什么?如果我有cd列,这个解决方案是否适用?

一种方法是将CSV文件转换为Excel表格,并将b列放在不同的工作表中。但是这仍然对我来说有点不够灵活。

英文:

As an example imagine I have a csv below

id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0  |data|data|....|data|None|None|....|None|
1  |data|data|....|data|None|None|....|None|
2  |data|data|....|data|None|None|....|None|
3  |None|None|....|None|data|data|....|data|
4  |None|None|....|None|data|data|....|data|
....

I have N a columns and M b columns, and the a columns and b columns are mutually exclusive i.e. if I have data in a then I wouldn't have anything in b. data in this case is mostly string or float values.

It feels inefficient that I will have M or N elements for each row that don't contain anything.

I can split the above into two different dataframes i.e.

df_a
id | a1 | a2 |....| aN |
________________________
0  |data|data|....|data|
1  |data|data|....|data|
2  |data|data|....|data|


df_b
id | b1 | b2 |....| bM |
________________________
3  |data|data|....|data|
4  |data|data|....|data|

But then I have two dataframes that I have to keep track of instead of one dataframe.

What is the most efficient way of keeping the data together without creating a bloated dataframe? Would the solution work if I have c and d columns as well?

One thing I can do is make the csv into an excel sheet and put the b columns in a different sheet. But it's still a bit clunky for me.

答案1

得分: 1

以下是翻译好的部分:

一个可能的解决方案是使用稀疏数据结构(它不存储None值):

txt = """
id    a1    a2    a3     b1     b2     b3
0  10.0  20.0  30.0   None   None   None
1  40.0  50.0  60.0   None   None   None
2  70.0  80.0  90.0   None   None   None
3   None   None   None  100.0  200.0  300.0
4   None   None   None  400.0  500.0  600.0
"""

df = pd.read_csv(StringIO(txt), sep='\s+')

df = df.astype(pd.SparseDtype("float", None))

print(df.memory_usage())

输出:

Index    132
id        60
a1        36
a2        36
a3        36
b1        24
b2        24
b3        24
dtype: int64
英文:

A possible solution is to use Sparse data structures (it does not store None values):

txt = """
id    a1    a2    a3     b1     b2     b3
0  10.0  20.0  30.0   None   None   None
1  40.0  50.0  60.0   None   None   None
2  70.0  80.0  90.0   None   None   None
3   None   None   None  100.0  200.0  300.0
4   None   None   None  400.0  500.0  600.0
"""

df = pd.read_csv(StringIO(txt), sep='\s+')

df = df.astype(pd.SparseDtype("float", None))

print(df.memory_usage())

Output:

Index    132
id        60
a1        36
a2        36
a3        36
b1        24
b2        24
b3        24
dtype: int64

答案2

得分: 0

@kkawabat,首先分开a和b,然后为a的数据框和b的数据框分配一个新行名为'TYPE',对于a的数据框,将'TYPE'填充为'a',对于b的数据框,将'TYPE'填充为'b',然后合并这两个数据框。在这种情况下,您不需要创建像a1、a2、b1、b2这样的数据框列,您可以只是使用col1、col2等列名,或者您还可以将'TYPE'设置为布尔值,并将True表示为'a',False表示为'b'。

英文:

@kkawabat, how about this, first separate a,b then assign both dataframes, a new row called, 'TYPE', for df of a fill TYPE with 'a' and TYPE 'b' for other and then merge both data frame, in such case u don't need to make df columns like a1, a2, b1, b2, you can just do col1,col2,.... or also u can set that TYPE to boolean and refer True as a and False as b.

答案3

得分: 0

这有点巧妙,但我最终采用的方法是:

创建一个新的数据框,包含列 idtypevalue_dict,其中 type 可以是 ab,而 value_dict 的值是一个字典,其中键是所有具有值的列名。例如,

id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0  |data|data|....|data|None|None|....|None|
1  |data|data|....|data|None|None|....|None|
2  |data|data|....|data|None|None|....|None|
3  |None|None|....|None|data|data|....|data|
4  |None|None|....|None|data|data|....|data|
....

将变成:

id | type | value_dict |
___________________________________________
0  | "a"  | {"a1": data, "a2": data, ... "aN": data}|
1  | "a"  | {"a1": data, "a2": data, ... "aN": data}|
2  | "a"  | {"a1": data, "a2": data, ... "aN": data}|
3  | "b"  | {"b1": data, "b2": data, ... "bM": data}|
4  | "b"  | {"b1": data, "b2": data, ... "bM": data}|
....

使用字典作为元素似乎有点奇怪,但这似乎是我能想到的最直观的解决方案。

英文:

It's a bit hacky but what I ended up doing was:

create a new dataframe with the columns id, type, value_dict
where type is either a or b
and the values of value_dict is a dictionary where the keys are all the col names with values in them. So for example,

id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0  |data|data|....|data|None|None|....|None|
1  |data|data|....|data|None|None|....|None|
2  |data|data|....|data|None|None|....|None|
3  |None|None|....|None|data|data|....|data|
4  |None|None|....|None|data|data|....|data|
....

will be:

id | type | value_dict |
___________________________________________
0  | "a"  | {"a1": data, "a2": data, ... "aN": data}|
1  | "a"  | {"a1": data, "a2": data, ... "aN": data}|
2  | "a"  | {"a1": data, "a2": data, ... "aN": data}|
3  | "b"  | {"b1": data, "b2": data, ... "bM": data}|
4  | "b"  | {"b1": data, "b2": data, ... "bM": data}|
....

It's kind of weird that I'm using dictionaries as elements but this seems the most intuitive solution that I could come up with.

huangapple
  • 本文由 发表于 2023年6月15日 01:11:19
  • 转载请务必保留本文链接:https://go.coder-hub.com/76476031.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定