Lower DBCV Scores for Cluster Analysis using Sklearn's GridSearchCV
Question
I have a geographic dataset 'coordinates' in UTM coordinates that I am performing HDBSCAN on and would like to have sklearn's GridSearchCV validate various parameters using DBCV. While manually evaluating the parameters for HDBSCAN I got the following result, which is better than sklearn's GridSearchCV:
clusters = hdbscan.HDBSCAN(min_cluster_size=75, min_samples=60,
                           cluster_selection_method='eom', gen_min_span_tree=True,
                           prediction_data=True).fit(coordinates)
Obtained DBCV Score: 0.2580606238793024
When using sklearn's GridSearchCV it chooses model parameters that obtain a lower DBCV value, even though the manually chosen parameters are in the dictionary of parameters. As an aside, while playing around with RandomizedSearchCV I was able to obtain a DBCV value of 0.28 using a different range of parameters, but I didn't write down which parameters were used.
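(As a side note on the lost 0.28 result: every candidate that RandomizedSearchCV or GridSearchCV evaluates is recorded in its cv_results_ attribute, so the winning parameters can always be recovered after the fact. A minimal sketch, assuming a fitted search object named random_search:

import pandas as pd

# cv_results_ keeps every sampled parameter set together with its score,
# so nothing needs to be written down by hand
results = pd.DataFrame(random_search.cv_results_)
print(results.sort_values('mean_test_score', ascending=False)[['params', 'mean_test_score']].head())

)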
*Update: When I run RandomizedSearchCV and GridSearchCV, the 'best' model chosen is always the first item in the parameter grid or the first random sample drawn. For example, in the code below, it always picks the first entries for min_samples and min_cluster_size. I suspect this is because it encounters an error: when I add error_score="raise", it raises a TypeError, which is likely related to the fact that the scorer has no y to compare against, but this is unsupervised clustering with no data labels.
> TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'
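The traceback makes sense once you know how scorers are invoked: make_scorer wraps a metric with the signature (y_true, y_pred), and the resulting scorer object is called as scorer(estimator, X, y_true). With unlabeled data the search calls it without a y, so 'y_true' is missing. GridSearchCV also accepts a plain callable with the signature (estimator, X, y) as its scoring argument, which sidesteps the problem entirely. A minimal label-free sketch (an assumption, not a tested fix for this dataset; it reuses the grid and coordinates defined below):

import hdbscan
from sklearn.model_selection import GridSearchCV

def dbcv_scorer(estimator, X, y=None):
    # GridSearchCV calls this as scorer(estimator, X) when y is None,
    # so no y_true is ever required
    estimator.fit(X)  # refit on the split being scored
    return estimator.relative_validity_  # DBCV approximation; needs gen_min_span_tree=True

grid_search = GridSearchCV(hdbscan.HDBSCAN(gen_min_span_tree=True),
                           param_grid=grid, scoring=dbcv_scorer)
grid_search.fit(coordinates)

For reference, the search that triggers the error is below.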
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import hdbscan
from sklearn.metrics import make_scorer
import logging  # to further silence deprecation warnings
logging.captureWarnings(True)

### GridSearchCV Model Tuning ###
hdb = hdbscan.HDBSCAN(gen_min_span_tree=True).fit(coordinates)

# specify parameters to sample from
grid = {'min_samples': [50, 55, 60, 65, 70, 75, 80, 90, 100, 110],
        'min_cluster_size': [40, 45, 50, 55, 60, 65, 75, 80, 85, 90, 95, 100],
        'cluster_selection_method': ['eom', 'leaf'],
        'metric': ['euclidean', 'manhattan']}

validity_scorer = make_scorer(hdbscan.validity.validity_index, greater_is_better=True)

grid_search = GridSearchCV(hdb,
                           param_grid=grid,
                           scoring=validity_scorer)
grid_search.fit(coordinates)

print(f"Best Parameters {grid_search.best_params_}")
print(f"DBCV score: {grid_search.best_estimator_.relative_validity_}")
> Best Parameters {'cluster_selection_method': 'eom', 'metric': 'euclidean', 'min_cluster_size': 40, 'min_samples': 50}
> DBCV score: 0.22213170637127946
Answer 1
Score: 0
# Naive grid search implementation by Mueller and Guido, Introduction to Machine Learning with Python
best_score = 0
for min_cluster_size in [40, 45, 120, 50, 55, 130, 140, 150, 155, 160]:
    for min_samples in [40, 45, 50, 85, 55, 60, 90, 100, 110, 115, 120]:
        for cluster_selection_method in ['eom', 'leaf']:
            for metric in ['euclidean']:
                # for each combination of parameters, fit HDBSCAN
                hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples,
                                      cluster_selection_method=cluster_selection_method, metric=metric,
                                      gen_min_span_tree=True).fit(coordinates)
                # DBCV score
                score = hdb.relative_validity_
                # if we got a better DBCV, store it and the parameters
                if score > best_score:
                    best_score = score
                    best_parameters = {'min_cluster_size': min_cluster_size,
                                       'min_samples': min_samples,
                                       'cluster_selection_method': cluster_selection_method,
                                       'metric': metric}

print("Best DBCV score: {:.3f}".format(best_score))
print("Best parameters: {}".format(best_parameters))
Outputs:
> Best DBCV score: 0.414
> Best parameters: {'min_cluster_size': 150, 'min_samples': 90, 'cluster_selection_method': 'eom', 'metric': 'euclidean'}
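This naive loop succeeds where GridSearchCV stumbled because no scorer and no train/test split are involved: each candidate is fit on the full dataset and its relative_validity_ (hdbscan's fast approximation of DBCV) is read off directly. For what it's worth, the same search can be written more compactly with itertools.product; a sketch under the same assumptions (a coordinates array in scope):

from itertools import product
import hdbscan

param_grid = {'min_cluster_size': [40, 45, 120, 50, 55, 130, 140, 150, 155, 160],
              'min_samples': [40, 45, 50, 85, 55, 60, 90, 100, 110, 115, 120],
              'cluster_selection_method': ['eom', 'leaf'],
              'metric': ['euclidean']}

best_score, best_parameters = 0, None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    # fit one candidate on the full dataset and read off its DBCV estimate
    hdb = hdbscan.HDBSCAN(gen_min_span_tree=True, **params).fit(coordinates)
    if hdb.relative_validity_ > best_score:
        best_score, best_parameters = hdb.relative_validity_, params

print("Best DBCV score: {:.3f}".format(best_score))
print("Best parameters: {}".format(best_parameters))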