How to find if a dataframe contains any string
Question
I have imported a dataset of around 5,000 rows and 85 columns. I am trying to feed the data into an sklearn function for feature analysis, but I am running into an error whereby somewhere in the dataframe there is a string, while the function only works with float or int. I previously had the same kind of issue with NaN and inf values, but managed to deal with those. Now the problem is locating the string values in the dataframe.
I have found solutions for searching a dataframe for an exact or partial string match, but have had no luck with this problem, i.e. finding a cell that contains any string value.
I have tried df.dtypes, but it reports that all columns are of type int or float. It reported the same thing when the NaN and inf values were still present.
Dataset is the testing.csv from https://drive.google.com/drive/folders/1XIlVteHaHFqBXqNApYGb3RoHcBkqpRoR
Code:
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import LabelEncoder
import numpy as np
# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from matplotlib import pyplot
svicpath = Path("SVIC APT 2021/Testing.csv")
ds = pd.read_csv(svicpath)
# Fill inf and -inf with 0
ds.replace([np.inf, -np.inf], 0, inplace=True)
#Fill any nan with 0
ds = ds.fillna(0)
y = ds.iloc[:,-1:]
X = ds.iloc[:, :-1]
#Remove non numeric cols:
X = ds._get_numeric_data()
#Feature Extraction:
#configure to select all features
fs = SelectKBest(score_func=f_classif, k='8')
# learn relationship from training data
fs.fit(X, y)
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
Error:
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> <ipython-input-5-ff2b2e10bd87> in <module>
> 30 fs = SelectKBest(score_func=f_classif, k='8')
> 31 # learn relationship from training data
> ---> 32 fs.fit(X, y)
> 33
> 34
>
> ~\Anaconda3\lib\site-packages\sklearn\feature_selection\_univariate_selection.py
> in fit(self, X, y)
> 346 % (self.score_func, type(self.score_func)))
> 347
> --> 348 self._check_params(X, y)
> 349 score_func_ret = self.score_func(X, y)
> 350 if isinstance(score_func_ret, (list, tuple)):
>
> ~\Anaconda3\lib\site-packages\sklearn\feature_selection\_univariate_selection.py
> in _check_params(self, X, y)
> 509
> 510 def _check_params(self, X, y):
> --> 511 if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
> 512 raise ValueError("k should be >=0, <= n_features = %d; got %r. "
> 513 "Use k='all' to return all features."
>
> TypeError: '<=' not supported between instances of 'int' and 'str'
Answer 1
Score: 3
"Now the problem is trying to locate where the string values are in the dataframe." and "I have tried df.dtypes but this reports all columns are of type int or float." are two contradictory statements.
You likely only have numbers, NaNs, or Inf.
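The traceback in the question points the same way: the comparison that fails is `0 <= self.k <= X.shape[1]` inside `_check_params`, so the `str` in the message appears to be the `k='8'` argument passed to `SelectKBest`, not a cell in the dataframe. Below is a minimal sketch that reproduces the message without sklearn. `check_k` is a hypothetical helper that only mimics the line shown in the traceback, and 85 is simply the column count mentioned in the question.

```python
def check_k(k, n_features):
    """Mimic the comparison shown in the traceback (not sklearn's actual code)."""
    return k == "all" or 0 <= k <= n_features


# Passing k as a string, as in the question's SelectKBest(..., k='8')
try:
    check_k('8', 85)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)

# Passing k as an integer passes the same check
ok = check_k(8, 85)
print(ok)
```

If this reading is right, passing `k=8` (an integer) should get past this check. Other scikit-learn versions may report an invalid `k` differently.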
You can identify the NaN and Inf cells using numpy.isfinite and numpy.where:
idx, col = np.where(~np.isfinite(df))
list(zip(df.index[idx], df.columns[col]))
# [(0, 'col2'), (1, 'col3')]
If you really have non-numbers:
idx, col = np.where(~np.isfinite(df.apply(pd.to_numeric, errors='coerce')))
list(zip(df.index[idx], df.columns[col]))
Used input:
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [np.nan, 4, 5], 'col3': [6, np.inf, 7]})
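The two snippets above can be combined into one helper. This is a sketch rather than a definitive implementation: `locate_non_finite` and the extra `col4` column are made up for illustration, and the sample frame otherwise reuses the input shown above.

```python
import numpy as np
import pandas as pd


def locate_non_finite(df):
    """Return (row label, column label) pairs for cells that are NaN, +/-inf, or not numeric."""
    # Coerce each column to numeric; anything unparseable (e.g. 'abc') becomes NaN.
    numeric = df.apply(pd.to_numeric, errors='coerce')
    # Positions of every cell that is not a finite number.
    rows, cols = np.where(~np.isfinite(numeric.to_numpy(dtype=float)))
    return list(zip(df.index[rows], df.columns[cols]))


df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [np.nan, 4, 5],
    'col3': [6, np.inf, 7],
    'col4': [8, 'abc', 9],  # hypothetical column holding a real string
})

bad_cells = locate_non_finite(df)
print(bad_cells)
```

Because the coercion turns strings into NaN, this version cannot tell a genuine NaN from a string. It only reports where something other than a finite number sits.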
Answer 2
Score: 1
You can use this code to check which columns contain a string:
```python
import pandas as pd

df = pd.DataFrame({'column1': [1, 2, 3], 'column2': [2, 3, 'a']})
# True for every column that holds at least one str value
df.applymap(lambda x: isinstance(x, str)).any()
# column1    False
# column2     True
```
column2 is True because it contains a string.
If you remove the `.any()`, the result is a boolean dataframe that shows exactly which cells hold a string.
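To go from the boolean mask to a list of offending cells, something like the following sketch may help. `find_string_cells` is a hypothetical helper name. It maps each column with `Series.map` instead of calling `DataFrame.applymap`, since `applymap` is deprecated in recent pandas releases in favor of `DataFrame.map`.

```python
import numpy as np
import pandas as pd


def find_string_cells(df):
    """Return (row label, column label, value) for every cell holding a str."""
    # Build a boolean mask column by column: True where the cell is a str.
    mask = df.apply(lambda col: col.map(lambda v: isinstance(v, str)))
    rows, cols = np.where(mask.to_numpy())
    return [(df.index[r], df.columns[c], df.iat[r, c]) for r, c in zip(rows, cols)]


df = pd.DataFrame({'column1': [1, 2, 3], 'column2': [2, 3, 'a']})
string_cells = find_string_cells(df)
print(string_cells)
```

An empty list means no cell is a Python `str`, which would be consistent with `df.dtypes` reporting only int and float columns.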