处理具有互斥列的数据框的有效方法?

huangapple go评论104阅读模式
英文:

Efficient ways of handling dataframes with columns that are mutually exclusive?

问题

作为一个示例,想象一下我有一个如下的CSV文件:

  1. id | a1 | a2 |....| aN | b1 | b2 |....| bM |
  2. ___________________________________________
  3. 0 |data|data|....|data|None|None|....|None|
  4. 1 |data|data|....|data|None|None|....|None|
  5. 2 |data|data|....|data|None|None|....|None|
  6. 3 |None|None|....|None|data|data|....|data|
  7. 4 |None|None|....|None|data|data|....|data|
  8. ....

我有N个a列和M个b列,a列和b列是互斥的,也就是说,如果我在a列中有data,那么在b列中就不会有任何内容。在这种情况下,data主要是字符串或浮点数值。

感觉效率不高,因为每一行都会有M或N个不包含任何内容的元素。

我可以将上面的数据分成两个不同的数据框,即:

  1. df_a
  2. id | a1 | a2 |....| aN |
  3. ________________________
  4. 0 |data|data|....|data|
  5. 1 |data|data|....|data|
  6. 2 |data|data|....|data|
  7. df_b
  8. id | b1 | b2 |....| bM |
  9. ________________________
  10. 3 |data|data|....|data|
  11. 4 |data|data|....|data|

但是这样我需要跟踪两个数据框,而不是一个数据框。

在不创建冗余数据框的情况下,保持数据在一起的最有效方法是什么?如果我有cd列,这个解决方案是否适用?

一种方法是将CSV文件转换为Excel表格,并将b列放在不同的工作表中。但是这仍然对我来说有点不够灵活。

英文:

As an example imagine I have a csv below

  1. id | a1 | a2 |....| aN | b1 | b2 |....| bM |
  2. ___________________________________________
  3. 0 |data|data|....|data|None|None|....|None|
  4. 1 |data|data|....|data|None|None|....|None|
  5. 2 |data|data|....|data|None|None|....|None|
  6. 3 |None|None|....|None|data|data|....|data|
  7. 4 |None|None|....|None|data|data|....|data|
  8. ....

I have N a columns and M b columns, and the a columns and b columns are mutually exclusive i.e. if I have data in a then I wouldn't have anything in b. data in this case is mostly string or float values.

It feels inefficient that I will have M or N elements for each row that don't contain anything.

I can split the above into two different dataframes i.e.

  1. df_a
  2. id | a1 | a2 |....| aN |
  3. ________________________
  4. 0 |data|data|....|data|
  5. 1 |data|data|....|data|
  6. 2 |data|data|....|data|
  7. df_b
  8. id | b1 | b2 |....| bM |
  9. ________________________
  10. 3 |data|data|....|data|
  11. 4 |data|data|....|data|

But then I have two dataframes that I have to keep track of instead of one dataframe.

What is the most efficient way of keeping the data together without creating a bloated dataframe? Would the solution work if I have c and d columns as well?

One thing I can do is make the csv into an excel sheet and put the b columns in a different sheet. But it's still a bit clunky for me.

答案1

得分: 1

以下是翻译好的部分:

一个可能的解决方案是使用稀疏数据结构(它不存储None值):

  1. txt = """
  2. id a1 a2 a3 b1 b2 b3
  3. 0 10.0 20.0 30.0 None None None
  4. 1 40.0 50.0 60.0 None None None
  5. 2 70.0 80.0 90.0 None None None
  6. 3 None None None 100.0 200.0 300.0
  7. 4 None None None 400.0 500.0 600.0
  8. """
  9. df = pd.read_csv(StringIO(txt), sep='\s+')
  10. df = df.astype(pd.SparseDtype("float", None))
  11. print(df.memory_usage())

输出:

  1. Index 132
  2. id 60
  3. a1 36
  4. a2 36
  5. a3 36
  6. b1 24
  7. b2 24
  8. b3 24
  9. dtype: int64
英文:

A possible solution is to use Sparse data structures (it does not store None values):

  1. txt = """
  2. id a1 a2 a3 b1 b2 b3
  3. 0 10.0 20.0 30.0 None None None
  4. 1 40.0 50.0 60.0 None None None
  5. 2 70.0 80.0 90.0 None None None
  6. 3 None None None 100.0 200.0 300.0
  7. 4 None None None 400.0 500.0 600.0
  8. """
  9. df = pd.read_csv(StringIO(txt), sep='\s+')
  10. df = df.astype(pd.SparseDtype("float", None))
  11. print(df.memory_usage())

Output:

  1. Index 132
  2. id 60
  3. a1 36
  4. a2 36
  5. a3 36
  6. b1 24
  7. b2 24
  8. b3 24
  9. dtype: int64

答案2

得分: 0

@kkawabat,首先分开a和b,然后为a的数据框和b的数据框分配一个新行名为'TYPE',对于a的数据框,将'TYPE'填充为'a',对于b的数据框,将'TYPE'填充为'b',然后合并这两个数据框。在这种情况下,您不需要创建像a1、a2、b1、b2这样的数据框列,您可以只是使用col1、col2等列名,或者您还可以将'TYPE'设置为布尔值,并将True表示为'a',False表示为'b'。

英文:

@kkawabat, how about this, first separate a,b then assign both dataframes, a new row called, 'TYPE', for df of a fill TYPE with 'a' and TYPE 'b' for other and then merge both data frame, in such case u don't need to make df columns like a1, a2, b1, b2, you can just do col1,col2,.... or also u can set that TYPE to boolean and refer True as a and False as b.

答案3

得分: 0

这有点巧妙,但我最终采用的方法是:

创建一个新的数据框,包含列 idtypevalue_dict,其中 type 可以是 ab,而 value_dict 的值是一个字典,其中键是所有具有值的列名。例如,

  1. id | a1 | a2 |....| aN | b1 | b2 |....| bM |
  2. ___________________________________________
  3. 0 |data|data|....|data|None|None|....|None|
  4. 1 |data|data|....|data|None|None|....|None|
  5. 2 |data|data|....|data|None|None|....|None|
  6. 3 |None|None|....|None|data|data|....|data|
  7. 4 |None|None|....|None|data|data|....|data|
  8. ....

将变成:

  1. id | type | value_dict |
  2. ___________________________________________
  3. 0 | "a" | {"a1": data, "a2": data, ... "aN": data}|
  4. 1 | "a" | {"a1": data, "a2": data, ... "aN": data}|
  5. 2 | "a" | {"a1": data, "a2": data, ... "aN": data}|
  6. 3 | "b" | {"b1": data, "b2": data, ... "bM": data}|
  7. 4 | "b" | {"b1": data, "b2": data, ... "bM": data}|
  8. ....

使用字典作为元素似乎有点奇怪,但这似乎是我能想到的最直观的解决方案。

英文:

It's a bit hacky but what I ended up doing was:

create a new dataframe with the columns id, type, value_dict
where type is either a or b
and the values of value_dict is a dictionary where the keys are all the col names with values in them. So for example,

  1. id | a1 | a2 |....| aN | b1 | b2 |....| bM |
  2. ___________________________________________
  3. 0 |data|data|....|data|None|None|....|None|
  4. 1 |data|data|....|data|None|None|....|None|
  5. 2 |data|data|....|data|None|None|....|None|
  6. 3 |None|None|....|None|data|data|....|data|
  7. 4 |None|None|....|None|data|data|....|data|
  8. ....

will be:

  1. id | type | value_dict |
  2. ___________________________________________
  3. 0 | "a" | {"a1": data, "a2": data, ... "aN": data}|
  4. 1 | "a" | {"a1": data, "a2": data, ... "aN": data}|
  5. 2 | "a" | {"a1": data, "a2": data, ... "aN": data}|
  6. 3 | "b" | {"b1": data, "b2": data, ... "bM": data}|
  7. 4 | "b" | {"b1": data, "b2": data, ... "bM": data}|
  8. ....

It's kind of weird that I'm using dictionaries as elements but this seems the most intuitive solution that I could come up with.

huangapple
  • 本文由 发表于 2023年6月15日 01:11:19
  • 转载请务必保留本文链接:https://go.coder-hub.com/76476031.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定