2023-05-25 23:08:37

Convert strings in numpy ndarray to nan (or even better replace by interpolated values)

Question

I have an Excel sheet that contains a matrix. Unfortunately, some random cells contain arbitrary strings. The first row and the first column are the axes (independent variables). An example:

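|     | 10  | 20   | ... | 90   | 100 |
| --- | --- | ---- | --- | ---- | --- |
| 1   | 3   | 9    | ... | blob | 27  |
| 3   | -1  | 10   | ... | blib | 12  |
| ... | ... | ...  | ... | ...  | ... |
| 15  | 0   | blub | ... | bleb | 17  |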

I read this in with pandas and convert it to a NumPy ndarray in order to work with it.

import pandas as pd

tmp = pd.read_excel('path')
y = tmp.iloc[1:,0].values    # first column: y axis
x = tmp.iloc[0,1:].values    # first row: x axis
mat = tmp.iloc[1:,1:].values # remaining cells: the matrix

My first hope was that NumPy would convert the strings automatically to some numerical representation. But I had to learn that an ndarray can hold mixed types.

Sorry, I am not even able to present a minimal example of what mat looks like.

import numpy as np
mat = np.ndarray(((3,9,'blob',27),(-1,10,'blib',12),(0,'blub','bleb',17))) # raises an error
mat = np.asarray(((3,9,'blob',27),(-1,10,'blib',12),(0,'blub','bleb',17))) # fills the ndarray with strings only (even the numbers become '3', '9' etc.)

The Excel import above yields numbers in some cells but strings in others.
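A mixed-type array like the one the import produces can be built by hand with dtype=object. This is only a minimal sketch of what mat presumably looks like; the cell values are taken from the example table, and the real sheet may of course contain floats or other types.

```python
import numpy as np

# dtype=object keeps each cell as the Python object it was given,
# so the integers stay integers and the strings stay strings.
mat = np.array(
    [[3, 9, "blob", 27],
     [-1, 10, "blib", 12],
     [0, "blub", "bleb", 17]],
    dtype=object,
)

print(mat.dtype)        # object
print(type(mat[0, 0]))  # <class 'int'>
print(type(mat[0, 2]))  # <class 'str'>
```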

Is there an efficient way to replace all those strings in mat with np.nan? I mean, without scanning through all cells and replacing based on the type? I tried to do this with the dtype argument of pd.read_excel(), but without much success.

What I finally want to achieve is to replace those original strings with linearly interpolated values. So maybe there is an even more efficient way to interpolate directly instead of first converting the strings to numbers. As you can see, the strings can also fill a complete column, so 2D interpolation is required.
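One possible way to do the replacement without an explicit loop over cells is pd.to_numeric with errors="coerce", applied column by column. The sketch below assumes a small hand-built object array standing in for mat. Note that a numeric-looking string such as "3" would be parsed to 3.0 rather than turned into NaN.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the imported matrix (an assumption, not the real sheet)
mat = np.array(
    [[3, 9, "blob", 27],
     [-1, 10, "blib", 12],
     [0, "blub", "bleb", 17]],
    dtype=object,
)

# pd.to_numeric works on 1-D input, so apply it to each column;
# errors="coerce" turns every value that cannot be parsed into NaN.
clean = pd.DataFrame(mat).apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float)

print(clean)
```

The result is a plain float array, so the NaN cells can be located afterwards with np.isnan(clean).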
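For the 2D case, one option is scipy.interpolate.griddata, which interpolates scattered points and therefore does not mind gaps in the grid. The sketch below assumes SciPy is available (it is not mentioned in the question), that the strings have already been replaced by NaN, and that mat has shape (len(y), len(x)). The helper name fill_nan_2d and the test data are made up for illustration. Cells outside the convex hull of the valid points stay NaN with method="linear".

```python
import numpy as np
from scipy.interpolate import griddata

def fill_nan_2d(x, y, mat):
    """Fill NaN cells of mat by linear interpolation over the valid cells."""
    mat = np.asarray(mat, dtype=float)
    xx, yy = np.meshgrid(np.asarray(x, dtype=float), np.asarray(y, dtype=float))
    known = ~np.isnan(mat)
    filled = mat.copy()
    # Treat the valid cells as scattered (x, y) -> value samples
    filled[~known] = griddata(
        points=np.column_stack([xx[known], yy[known]]),
        values=mat[known],
        xi=np.column_stack([xx[~known], yy[~known]]),
        method="linear",
    )
    return filled

# Synthetic planar data so the interpolated values can be checked
x = np.array([10.0, 20.0, 30.0, 40.0])
y = np.array([1.0, 3.0, 5.0])
xx, yy = np.meshgrid(x, y)
truth = 2.0 * xx + 3.0 * yy

mat = truth.copy()
mat[:, 2] = np.nan  # a complete column of invalid cells
mat[1, 1] = np.nan  # plus one isolated invalid cell

filled = fill_nan_2d(x, y, mat)
print(filled)
```

Because griddata works on coordinates rather than cell indices, unevenly spaced axes such as 1, 3, ..., 15 are handled without extra work.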

Thank you for your ideas!

Answer 1

Score: 1

You can use pandas replace on a whole dataframe in combination with a regular expression to get rid of the strings, and then use pandas interpolate to fill in the missing values.

Here is a naive example based on your post, where you seem to deal only with signed integers and simple strings (otherwise, a different regex pattern might be necessary).

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, "a", 2], "col2": [-9, 7, "b"]})

print(df)
# Output
#   col1 col2
# 0    1   -9
# 1    a    7
# 2    2    b

# Replace every string cell with NaN, then interpolate down each column
df = df.replace(regex=r"\w+", value=np.nan).interpolate(method="linear")

print(df)
# Output
#    col1  col2
# 0   1.0  -9.0
# 1   1.5   7.0
# 2   2.0   7.0
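Note that DataFrame.interpolate works down each column by default, so a column that consists only of strings has no valid values to interpolate between. A possible workaround, sketched here with plain NumPy, is to interpolate along each row instead, using the x axis values as coordinates. The helper name interpolate_rows and the sample numbers are illustrative assumptions, and the strings are assumed to have been replaced by NaN already.

```python
import numpy as np

def interpolate_rows(x, mat):
    """Fill NaN cells row by row, interpolating linearly along the x axis."""
    x = np.asarray(x, dtype=float)
    out = np.array(mat, dtype=float)  # copy, so the input stays untouched
    for row in out:                   # each row is a view into out
        bad = np.isnan(row)
        if bad.any() and not bad.all():
            # np.interp holds the edge values constant outside the valid range
            row[bad] = np.interp(x[bad], x[~bad], row[~bad])
    return out

x = [10, 20, 30, 40]
mat = [[3.0, 9.0, np.nan, 27.0],
       [-1.0, 10.0, np.nan, 12.0],
       [0.0, np.nan, np.nan, 17.0]]

filled = interpolate_rows(x, mat)
print(filled)
```

This is 1D interpolation along x only, so it copes with a fully invalid column but not with a fully invalid row.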
