2023年6月15日 01:11:19go评论171阅读模式

英文:

Efficient ways of handling dataframes with columns that are mutually exclusive?

问题

作为一个示例，想象一下我有一个如下的CSV文件：

id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0  |data|data|....|data|None|None|....|None|
1  |data|data|....|data|None|None|....|None|
2  |data|data|....|data|None|None|....|None|
3  |None|None|....|None|data|data|....|data|
4  |None|None|....|None|data|data|....|data|
....

我有N个a列和M个b列，a列和b列是互斥的，也就是说，如果我在a列中有data，那么在b列中就不会有任何内容。在这种情况下，data主要是字符串或浮点数值。

感觉效率不高，因为每一行都会有M或N个不包含任何内容的元素。

我可以将上面的数据分成两个不同的数据框，即：

df_a
id | a1 | a2 |....| aN |
________________________
0  |data|data|....|data|
1  |data|data|....|data|
2  |data|data|....|data|

df_b
id | b1 | b2 |....| bM |
________________________
3  |data|data|....|data|
4  |data|data|....|data|

但是这样我需要跟踪两个数据框，而不是一个数据框。

在不创建冗余数据框的情况下，保持数据在一起的最有效方法是什么？如果我有c和d列，这个解决方案是否适用？

一种方法是将CSV文件转换为Excel表格，并将b列放在不同的工作表中。但是这仍然对我来说有点不够灵活。

英文:

As an example imagine I have a csv below

id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0  |data|data|....|data|None|None|....|None|
1  |data|data|....|data|None|None|....|None|
2  |data|data|....|data|None|None|....|None|
3  |None|None|....|None|data|data|....|data|
4  |None|None|....|None|data|data|....|data|
....

I have N a columns and M b columns, and the a columns and b columns are mutually exclusive i.e. if I have data in a then I wouldn't have anything in b. data in this case is mostly string or float values.

It feels inefficient that I will have M or N elements for each row that don't contain anything.

I can split the above into two different dataframes i.e.

df_a
id | a1 | a2 |....| aN |
________________________
0  |data|data|....|data|
1  |data|data|....|data|
2  |data|data|....|data|


df_b
id | b1 | b2 |....| bM |
________________________
3  |data|data|....|data|
4  |data|data|....|data|

But then I have two dataframes that I have to keep track of instead of one dataframe.

What is the most efficient way of keeping the data together without creating a bloated dataframe? Would the solution work if I have c and d columns as well?

One thing I can do is make the csv into an excel sheet and put the b columns in a different sheet. But it's still a bit clunky for me.

答案1

得分: 1

以下是翻译好的部分：

一个可能的解决方案是使用稀疏数据结构（它不存储None值）：

txt = &quot;&quot;&quot;
id    a1    a2    a3     b1     b2     b3
0  10.0  20.0  30.0   None   None   None
1  40.0  50.0  60.0   None   None   None
2  70.0  80.0  90.0   None   None   None
3   None   None   None  100.0  200.0  300.0
4   None   None   None  400.0  500.0  600.0
&quot;&quot;&quot;

df = pd.read_csv(StringIO(txt), sep=&#39;\s+&#39;)

df = df.astype(pd.SparseDtype(&quot;float&quot;, None))

print(df.memory_usage())

输出：

Index    132
id        60
a1        36
a2        36
a3        36
b1        24
b2        24
b3        24
dtype: int64

英文:

A possible solution is to use Sparse data structures (it does not store None values):

txt = &quot;&quot;&quot;
id    a1    a2    a3     b1     b2     b3
0  10.0  20.0  30.0   None   None   None
1  40.0  50.0  60.0   None   None   None
2  70.0  80.0  90.0   None   None   None
3   None   None   None  100.0  200.0  300.0
4   None   None   None  400.0  500.0  600.0
&quot;&quot;&quot;

df = pd.read_csv(StringIO(txt), sep=&#39;\s+&#39;)

df = df.astype(pd.SparseDtype(&quot;float&quot;, None))

print(df.memory_usage())

Output:

Index    132
id        60
a1        36
a2        36
a3        36
b1        24
b2        24
b3        24
dtype: int64

答案2

得分: 0

@kkawabat，首先分开a和b，然后为a的数据框和b的数据框分配一个新行名为'TYPE'，对于a的数据框，将'TYPE'填充为'a'，对于b的数据框，将'TYPE'填充为'b'，然后合并这两个数据框。在这种情况下，您不需要创建像a1、a2、b1、b2这样的数据框列，您可以只是使用col1、col2等列名，或者您还可以将'TYPE'设置为布尔值，并将True表示为'a'，False表示为'b'。

英文:

@kkawabat, how about this, first separate a,b then assign both dataframes, a new row called, 'TYPE', for df of a fill TYPE with 'a' and TYPE 'b' for other and then merge both data frame, in such case u don't need to make df columns like a1, a2, b1, b2, you can just do col1,col2,.... or also u can set that TYPE to boolean and refer True as a and False as b.

答案3

得分: 0

这有点巧妙，但我最终采用的方法是：

创建一个新的数据框，包含列 id、type 和 value_dict，其中 type 可以是 a 或 b，而 value_dict 的值是一个字典，其中键是所有具有值的列名。例如，

id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0  |data|data|....|data|None|None|....|None|
1  |data|data|....|data|None|None|....|None|
2  |data|data|....|data|None|None|....|None|
3  |None|None|....|None|data|data|....|data|
4  |None|None|....|None|data|data|....|data|
....

将变成：

id | type | value_dict |
___________________________________________
0  | "a"  | {"a1": data, "a2": data, ... "aN": data}|
1  | "a"  | {"a1": data, "a2": data, ... "aN": data}|
2  | "a"  | {"a1": data, "a2": data, ... "aN": data}|
3  | "b"  | {"b1": data, "b2": data, ... "bM": data}|
4  | "b"  | {"b1": data, "b2": data, ... "bM": data}|
....

使用字典作为元素似乎有点奇怪，但这似乎是我能想到的最直观的解决方案。

英文:

It's a bit hacky but what I ended up doing was:

create a new dataframe with the columns id, type, value_dict
where type is either a or b
and the values of value_dict is a dictionary where the keys are all the col names with values in them. So for example,

id | a1 | a2 |....| aN | b1 | b2 |....| bM |
___________________________________________
0  |data|data|....|data|None|None|....|None|
1  |data|data|....|data|None|None|....|None|
2  |data|data|....|data|None|None|....|None|
3  |None|None|....|None|data|data|....|data|
4  |None|None|....|None|data|data|....|data|
....

will be:

id | type | value_dict |
___________________________________________
0  | &quot;a&quot;  | {&quot;a1&quot;: data, &quot;a2&quot;: data, ... &quot;aN&quot;: data}|
1  | &quot;a&quot;  | {&quot;a1&quot;: data, &quot;a2&quot;: data, ... &quot;aN&quot;: data}|
2  | &quot;a&quot;  | {&quot;a1&quot;: data, &quot;a2&quot;: data, ... &quot;aN&quot;: data}|
3  | &quot;b&quot;  | {&quot;b1&quot;: data, &quot;b2&quot;: data, ... &quot;bM&quot;: data}|
4  | &quot;b&quot;  | {&quot;b1&quot;: data, &quot;b2&quot;: data, ... &quot;bM&quot;: data}|
....

It's kind of weird that I'm using dictionaries as elements but this seems the most intuitive solution that I could come up with.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

处理具有互斥列的数据框的有效方法？

问题

答案1

答案2

答案3

根据分布随机抽样

连接Postgresql到Django

在循环内创建一个序列的 Python 数组？

如何模拟pymysqlpool.ConnectionPool构造函数？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论