2023-06-08 17:16:25

How to find if a dataframe contains any string

Question

I have imported a dataset with around 5,000 rows and 85 columns. I am trying to feed the data into an sklearn function for feature analysis, but I am running into an error that implies there is a string somewhere in the dataframe, while the function only works with float or int. I already had this issue with nan and inf values but have managed to deal with them. Now the problem is locating where the string values are in the dataframe.

I have found solutions for searching a dataframe for an exact or partial string match, but none for this problem, i.e. finding any cell that contains a string value.

I have tried df.dtypes, but it reports that all columns are of type int or float. It reported the same thing when there were nan and inf values as well.

The dataset is testing.csv from https://drive.google.com/drive/folders/1XIlVteHaHFqBXqNApYGb3RoHcBkqpRoR

Code:

import pandas as pd
from pathlib import Path
from sklearn.preprocessing import LabelEncoder
import numpy as np
from matplotlib import pyplot
# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

svicpath = Path("SVIC APT 2021/Testing.csv")

ds = pd.read_csv(svicpath)

# Replace inf and -inf with 0
ds.replace([np.inf, -np.inf], 0, inplace=True)
# Fill any nan with 0
ds = ds.fillna(0)

y = ds.iloc[:,-1:]
X = ds.iloc[:, :-1]

# Remove non numeric cols:
X = ds._get_numeric_data()

# Feature Extraction:

# configure to select all features
fs = SelectKBest(score_func=f_classif, k='8')
# learn relationship from training data
fs.fit(X, y)


# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()

Error:

> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> <ipython-input-5-ff2b2e10bd87> in <module>
>      30 fs = SelectKBest(score_func=f_classif, k='8')
>      31 # learn relationship from training data
> ---> 32 fs.fit(X, y)
>      33
>      34
>
> ~\Anaconda3\lib\site-packages\sklearn\feature_selection\_univariate_selection.py in fit(self, X, y)
>     346                 % (self.score_func, type(self.score_func)))
>     347
> --> 348         self._check_params(X, y)
>     349         score_func_ret = self.score_func(X, y)
>     350         if isinstance(score_func_ret, (list, tuple)):
>
> ~\Anaconda3\lib\site-packages\sklearn\feature_selection\_univariate_selection.py in _check_params(self, X, y)
>     509
>     510     def _check_params(self, X, y):
> --> 511         if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
>     512             raise ValueError("k should be >=0, <= n_features = %d; got %r. "
>     513                              "Use k='all' to return all features."
>
> TypeError: '<=' not supported between instances of 'int' and 'str'
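
Reading the traceback, the comparison that fails is `0 <= self.k <= X.shape[1]` inside `_check_params`, which suggests the string involved is the `k='8'` argument rather than a cell of the dataframe. The sketch below reproduces that comparison in plain Python without sklearn. The variable names and the feature count of 85 are stand-ins taken from the question's description, not values read from the dataset.

```python
# Stand-ins for the values in the question (assumptions, not read from the dataset)
n_features = 85   # column count mentioned in the question
k = '8'           # the string passed as k=... in the question's code

try:
    # Same comparison as line 511 in the traceback
    valid = (k == "all" or 0 <= k <= n_features)
except TypeError as exc:
    message = str(exc)

# With an integer k the same check goes through
k_int = 8
valid_int = (k_int == "all" or 0 <= k_int <= n_features)
```

If that reading is right, passing an integer such as `k=8` (or the literal string `'all'`) to `SelectKBest` should get past this check.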

Answer 1

Score: 3

"Now the problem is trying to locate where the string values are in the dataframe." and "I have tried df.dtypes but this reports all columns are of type int or float." are two contradictory statements.

You likely only have numbers, NaNs, or Inf.
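
As a rough way to check this, a column that holds even one Python string next to numbers is stored with dtype object, so a frame whose dtypes are all int or float cannot contain a str cell. The two frames below are made up for illustration and are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one purely numeric (with NaN and inf), one with a stray string
numeric_only = pd.DataFrame({'a': [1, 2, 3], 'b': [np.nan, np.inf, 5.0]})
with_string = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 'x', 5]})

# Columns stored as generic Python objects are the only ones that can hold a str
clean_object_columns = numeric_only.select_dtypes(include='object').columns.tolist()
object_columns = with_string.select_dtypes(include='object').columns.tolist()

print(numeric_only.dtypes)
print(with_string.dtypes)
```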

You can identify them using numpy.isfinite and numpy.where:

idx, col = np.where(~np.isfinite(df))

list(zip(df.index[idx], df.columns[col]))
# [(0, 'col2'), (1, 'col3')]

If you really have non-numbers:

idx, col = np.where(~np.isfinite(df.apply(pd.to_numeric, errors='coerce')))

list(zip(df.index[idx], df.columns[col]))

Used input:

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [np.nan, 4, 5], 'col3': [6, np.inf, 7]})
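
The coercion approach flags NaN and inf together with non-numeric values. If only the cells that fail numeric parsing are of interest, one possible variation is to compare the missing-value mask before and after coercion. This is a sketch on a made-up frame; the string `'x'` and the name `non_numeric` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one genuine string, one NaN, one inf
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [np.nan, 'x', 5], 'col3': [6, np.inf, 7]})

# Values that cannot be parsed as numbers become NaN
coerced = df.apply(pd.to_numeric, errors='coerce')

# Present in the original but NaN after coercion: the cell was not numeric
non_numeric = df.notna() & coerced.isna()

idx, col = np.where(non_numeric.to_numpy())
cells = list(zip(df.index[idx], df.columns[col]))
```

Here `cells` holds `(row label, column name)` pairs; the original NaN and the inf are left out because neither changes under coercion.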

Answer 2

Score: 1

You can use this code to check which columns contain a string:

```python
import pandas as pd

df = pd.DataFrame({'column1': [1, 2, 3], 'column2': [2, 3, 'a']})
df.applymap(lambda x: isinstance(x, str)).any()

# column1    False
# column2     True
```

column2 is True because it contains a string.
If you then remove the `.any()`, the result is a boolean dataframe that shows exactly which cells hold a string.
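
To go from the per-cell boolean frame to a list of locations, something like the following could work. It is a sketch on the same toy frame, and it uses `apply` with `Series.map` instead of `applymap`, on the assumption that you may be on a newer pandas release where `DataFrame.applymap` is deprecated. The names `is_str` and `string_cells` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'column1': [1, 2, 3], 'column2': [2, 3, 'a']})

# Element-wise isinstance check, one column (Series) at a time
is_str = df.apply(lambda s: s.map(lambda x: isinstance(x, str)))

# Collect (row label, column name) for every cell that holds a str
string_cells = [
    (row, col)
    for col in df.columns
    for row in df.index[is_str[col].to_numpy()]
]
```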
