Lower DBCV Scores for Cluster Analysis using Sklearn's GridSearchCV
Question
I have a geographic dataset 'coordinates' in UTM coordinates that I am performing HDBSCAN on and would like to have sklearn's GridSearchCV validate various parameters using DBCV. While manually evaluating the parameters for HDBSCAN I got the following result, which is better than sklearn's GridSearchCV:
clusters = hdbscan.HDBSCAN(min_cluster_size=75, min_samples=60,
                           cluster_selection_method='eom', gen_min_span_tree=True,
                           prediction_data=True).fit(coordinates)
Obtained DBCV Score: 0.2580606238793024
When using sklearn's GridSearchCV it chooses model parameters that obtain a lower DBCV value, even though the manually chosen parameters are in the dictionary of parameters. As an aside, while playing around with RandomizedSearchCV I was able to obtain a DBCV value of 0.28 using a different range of parameters, but I didn't write down which parameters were used.
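(As a side note on the lost 0.28 result: every candidate that RandomizedSearchCV or GridSearchCV evaluates is recorded in its cv_results_ attribute, so the winning parameters can always be recovered after the fact. A minimal sketch, assuming a fitted search object named random_search:

import pandas as pd

# cv_results_ keeps every sampled parameter set together with its score,
# so nothing needs to be written down by hand
results = pd.DataFrame(random_search.cv_results_)
print(results.sort_values('mean_test_score', ascending=False)[['params', 'mean_test_score']].head())

)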
*Update: When I run RandomizedSearchCV and GridSearchCV, the 'best' model chosen is always the first item in the parameter grid or the first random sample drawn. For example, in the code below, it always picks the first entries for min_samples and min_cluster_size. I suspect this is because it encounters an error: when I add error_score="raise", it raises a TypeError, which is likely related to the fact that the scorer has no y to compare against, but this is unsupervised clustering with no data labels.
> TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'
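The traceback makes sense once you know how scorers are invoked: make_scorer wraps a metric with the signature (y_true, y_pred), and the resulting scorer object is called as scorer(estimator, X, y_true). With unlabeled data the search calls it without a y, so 'y_true' is missing. GridSearchCV also accepts a plain callable with the signature (estimator, X, y) as its scoring argument, which sidesteps the problem entirely. A minimal label-free sketch (an assumption, not a tested fix for this dataset; it reuses the grid and coordinates defined below):

import hdbscan
from sklearn.model_selection import GridSearchCV

def dbcv_scorer(estimator, X, y=None):
    # GridSearchCV calls this as scorer(estimator, X) when y is None,
    # so no y_true is ever required
    estimator.fit(X)  # refit on the split being scored
    return estimator.relative_validity_  # DBCV approximation; needs gen_min_span_tree=True

grid_search = GridSearchCV(hdbscan.HDBSCAN(gen_min_span_tree=True),
                           param_grid=grid, scoring=dbcv_scorer)
grid_search.fit(coordinates)

For reference, the search that triggers the error is below.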
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import hdbscan
from sklearn.metrics import make_scorer
import logging  # to further silence deprecation warnings
logging.captureWarnings(True)

### GridSearchCV Model Tuning ###
hdb = hdbscan.HDBSCAN(gen_min_span_tree=True).fit(coordinates)

# specify parameters to sample from
grid = {'min_samples': [50, 55, 60, 65, 70, 75, 80, 90, 100, 110],
        'min_cluster_size': [40, 45, 50, 55, 60, 65, 75, 80, 85, 90, 95, 100],
        'cluster_selection_method': ['eom', 'leaf'],
        'metric': ['euclidean', 'manhattan']}

validity_scorer = make_scorer(hdbscan.validity.validity_index, greater_is_better=True)

grid_search = GridSearchCV(hdb,
                           param_grid=grid,
                           scoring=validity_scorer)
grid_search.fit(coordinates)

print(f"Best Parameters {grid_search.best_params_}")
print(f"DBCV score: {grid_search.best_estimator_.relative_validity_}")
> Best Parameters {'cluster_selection_method': 'eom', 'metric': 'euclidean', 'min_cluster_size': 40, 'min_samples': 50}
> DBCV score: 0.22213170637127946
Answer 1
Score: 0
# Naive grid search implementation by Mueller and Guido, Introduction to Machine Learning with Python
best_score = 0
for min_cluster_size in [40, 45, 120, 50, 55, 130, 140, 150, 155, 160]:
    for min_samples in [40, 45, 50, 85, 55, 60, 90, 100, 110, 115, 120]:
        for cluster_selection_method in ['eom', 'leaf']:
            for metric in ['euclidean']:
                # for each combination of parameters, fit HDBSCAN
                hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples,
                                      cluster_selection_method=cluster_selection_method, metric=metric,
                                      gen_min_span_tree=True).fit(coordinates)
                # DBCV score
                score = hdb.relative_validity_
                # if we got a better DBCV, store it and the parameters
                if score > best_score:
                    best_score = score
                    best_parameters = {'min_cluster_size': min_cluster_size,
                                       'min_samples': min_samples,
                                       'cluster_selection_method': cluster_selection_method,
                                       'metric': metric}

print("Best DBCV score: {:.3f}".format(best_score))
print("Best parameters: {}".format(best_parameters))
Outputs:
> Best DBCV score: 0.414
> Best parameters: {'min_cluster_size': 150, 'min_samples': 90, 'cluster_selection_method': 'eom', 'metric': 'euclidean'}
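This naive loop succeeds where GridSearchCV stumbled because no scorer and no train/test split are involved: each candidate is fit on the full dataset and its relative_validity_ (hdbscan's fast approximation of DBCV) is read off directly. For what it's worth, the same search can be written more compactly with itertools.product; a sketch under the same assumptions (a coordinates array in scope):

from itertools import product
import hdbscan

param_grid = {'min_cluster_size': [40, 45, 120, 50, 55, 130, 140, 150, 155, 160],
              'min_samples': [40, 45, 50, 85, 55, 60, 90, 100, 110, 115, 120],
              'cluster_selection_method': ['eom', 'leaf'],
              'metric': ['euclidean']}

best_score, best_parameters = 0, None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    # fit one candidate on the full dataset and read off its DBCV estimate
    hdb = hdbscan.HDBSCAN(gen_min_span_tree=True, **params).fit(coordinates)
    if hdb.relative_validity_ > best_score:
        best_score, best_parameters = hdb.relative_validity_, params

print("Best DBCV score: {:.3f}".format(best_score))
print("Best parameters: {}".format(best_parameters))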