Lower DBCV Scores for Cluster Analysis using Sklearn’s GridSearchCV

Question

I have a geographic dataset 'coordinates' in UTM coordinates that I am performing HDBSCAN on, and I would like sklearn's GridSearchCV to validate various parameters using DBCV. While manually evaluating the parameters for HDBSCAN I got the following result, which is better than what sklearn's GridSearchCV finds:

```python
clusters = hdbscan.HDBSCAN(min_cluster_size=75, min_samples=60,
                           cluster_selection_method='eom', gen_min_span_tree=True,
                           prediction_data=True).fit(coordinates)
```

Obtained DBCV score: 0.2580606238793024
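For reference, hdbscan exposes this DBCV-style score as the `relative_validity_` attribute whenever the model is built with `gen_min_span_tree=True`; a minimal way to read it off the fit above:

```python
# DBCV-style relative validity, available because gen_min_span_tree=True
print(clusters.relative_validity_)  # 0.2580606238793024 on this dataset
```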

When using sklearn's GridSearchCV it chooses model parameters that obtain a lower DBCV value, even though the manually chosen parameters are in the parameter dictionary. As an aside, while experimenting with RandomizedSearchCV I was able to obtain a DBCV value of 0.28 using a different range of parameters, but I didn't write down which parameters were used.

*Update:* When I run RandomizedSearchCV & GridSearchCV, the 'best' model chosen is always the first item in the parameter grid or the first random sample drawn. For example, in the code below, it always picks the first entries in min_samples & min_cluster_size. I suspect this is because it encounters an error. When I add error_score="raise" it raises a TypeError, which is likely related to the fact that it can't compare to a y, but this is unsupervised clustering with no data labels.

> TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'
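That error is consistent with how `make_scorer` works: it always builds a scorer that expects ground-truth labels. As a sketch only (not from the original post): sklearn's `scoring` parameter also accepts a plain callable with signature `(estimator, X, y)`, which sidesteps the missing `y_true`; the hypothetical `dbcv_scorer` below reuses the `hdb` estimator and `grid` dictionary from the code that follows.

```python
# Sketch, not the original post's code: a callable scorer avoids make_scorer's
# y_true requirement; sklearn calls it without y when fit() receives no labels.
def dbcv_scorer(estimator, X, y=None):
    # the estimator arrives already fitted on the current split, so the
    # DBCV-style score can be read straight off it
    return estimator.relative_validity_

grid_search = GridSearchCV(hdb, param_grid=grid,
                           scoring=dbcv_scorer, error_score="raise")
grid_search.fit(coordinates)
```

Note that GridSearchCV still performs cross-validation splits, which sits awkwardly with unsupervised clustering: each candidate is scored on a model fit to a subset of `coordinates` rather than the full dataset.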

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import hdbscan
from sklearn.metrics import make_scorer
import logging  # to further silence deprecation warnings
logging.captureWarnings(True)

### GridSearch CV Model Tuning ###
hdb = hdbscan.HDBSCAN(gen_min_span_tree=True).fit(coordinates)

# specify parameters to sample from
grid = {'min_samples': [50, 55, 60, 65, 70, 75, 80, 90, 100, 110],
        'min_cluster_size': [40, 45, 50, 55, 60, 65, 75, 80, 85, 90, 95, 100],
        'cluster_selection_method': ['eom', 'leaf'],
        'metric': ['euclidean', 'manhattan']}

# validity_scorer = "hdbscan__hdbscan___HDBSCAN__validity_index"
validity_scorer = make_scorer(hdbscan.validity.validity_index, greater_is_better=True)

grid_search = GridSearchCV(hdb,
                           param_grid=grid,
                           scoring=validity_scorer)
grid_search.fit(coordinates)

print(f"Best Parameters {grid_search.best_params_}")
print(f"DBCV score :{grid_search.best_estimator_.relative_validity_}")
```

> Best Parameters {'cluster_selection_method': 'eom', 'metric': 'euclidean', 'min_cluster_size': 40, 'min_samples': 50}
> DBCV score :0.22213170637127946

Answer 1

Score: 0

```python
# Naive grid search implementation by Mueller and Guido,
# Introduction to Machine Learning with Python
import hdbscan

best_score = 0
for min_cluster_size in [40, 45, 120, 50, 55, 130, 140, 150, 155, 160]:
    for min_samples in [40, 45, 50, 85, 55, 60, 90, 100, 110, 115, 120]:
        for cluster_selection_method in ['eom', 'leaf']:
            for metric in ['euclidean']:
                # for each combination of parameters of hdbscan
                hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                      min_samples=min_samples,
                                      cluster_selection_method=cluster_selection_method,
                                      metric=metric,
                                      gen_min_span_tree=True).fit(coordinates)
                # DBCV score
                score = hdb.relative_validity_
                # if we got a better DBCV, store it and the parameters
                if score > best_score:
                    best_score = score
                    best_parameters = {'min_cluster_size': min_cluster_size,
                                       'min_samples': min_samples,
                                       'cluster_selection_method': cluster_selection_method,
                                       'metric': metric}

print("Best DBCV score: {:.3f}".format(best_score))
print("Best parameters: {}".format(best_parameters))
```

Outputs:

> Best DBCV score: 0.414
> Best parameters: {'min_cluster_size': 150, 'min_samples': 90, 'cluster_selection_method': 'eom', 'metric': 'euclidean'}
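The nested loops can also be flattened with sklearn's `ParameterGrid`, which iterates over the identical combinations; a sketch under the same assumptions (`coordinates` already loaded, same search space):

```python
from sklearn.model_selection import ParameterGrid
import hdbscan

param_grid = {'min_cluster_size': [40, 45, 120, 50, 55, 130, 140, 150, 155, 160],
              'min_samples': [40, 45, 50, 85, 55, 60, 90, 100, 110, 115, 120],
              'cluster_selection_method': ['eom', 'leaf'],
              'metric': ['euclidean']}

best_score, best_parameters = 0, None
for params in ParameterGrid(param_grid):
    # fit one HDBSCAN model per parameter combination
    hdb = hdbscan.HDBSCAN(gen_min_span_tree=True, **params).fit(coordinates)
    if hdb.relative_validity_ > best_score:
        best_score, best_parameters = hdb.relative_validity_, params

print("Best DBCV score: {:.3f}".format(best_score))
print("Best parameters: {}".format(best_parameters))
```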
