ML.Net stuck on pretrained model Fit() method
Question
Here is a code sample that works with a pretrained model (link to the full example page: https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.textcatalog.applywordembedding?view=ml-dotnet):
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;
namespace Samples.Dynamic
{
public static class ApplyWordEmbedding
{
public static void Example()
{
// Create a new ML context, for ML.NET operations. It can be used for
// exception tracking and logging, as well as the source of randomness.
var mlContext = new MLContext();
// Create an empty list as the dataset. The 'ApplyWordEmbedding' does
// not require training data as the estimator ('WordEmbeddingEstimator')
// created by 'ApplyWordEmbedding' API is not a trainable estimator.
// The empty list is only needed to pass input schema to the pipeline.
var emptySamples = new List<TextData>();
// Convert sample list to an empty IDataView.
var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);
// A pipeline for converting text into a 150-dimension embedding vector
// using pretrained 'SentimentSpecificWordEmbedding' model. The
// 'ApplyWordEmbedding' computes the minimum, average and maximum values
// for each token's embedding vector. Tokens in
// 'SentimentSpecificWordEmbedding' model are represented as
// 50-dimension vector. Therefore, the output is of 150-dimension [min,
// avg, max].
//
// The 'ApplyWordEmbedding' API requires vector of text as input.
// The pipeline first normalizes and tokenizes text then applies word
// embedding transformation.
var textPipeline = mlContext.Transforms.Text.NormalizeText("Text")
.Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens",
"Text"))
.Append(mlContext.Transforms.Text.ApplyWordEmbedding("Features",
"Tokens", WordEmbeddingEstimator.PretrainedModelKind
.FastTextWikipedia300D));
// Fit to data.
var textTransformer = textPipeline.Fit(emptyDataView);
// Create the prediction engine to get the embedding vector from the
// input text/string.
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
TransformedTextData>(textTransformer);
// Call the prediction API to convert the text into embedding vector.
var data = new TextData()
{
Text = "This is a great product. I would " +
"like to buy it again."
};
var prediction = predictionEngine.Predict(data);
// Print the length of the embedding vector.
Console.WriteLine($"Number of Features: {prediction.Features.Length}");
// Print the embedding vector.
Console.Write("Features: ");
foreach (var f in prediction.Features)
Console.Write($"{f:F4} ");
// Expected output:
// Number of Features: 150
// Features: -1.2489 0.2384 -1.3034 -0.9135 -3.4978 -0.1784 -1.3823 -0.3863 -2.5262 -0.8950 ...
}
private class TextData
{
public string Text { get; set; }
}
private class TransformedTextData : TextData
{
public float[] Features { get; set; }
}
}
}
So, in my case, if I use the FastTextWikipedia300D, Glove200D, or Glove100D pretrained model, the process gets stuck and does not finish even after 10 minutes when running this line:
> var textTransformer = textPipeline.Fit(emptyDataView);
I tried the workaround from https://stackoverflow.com/a/54561423/5168936, but adding the following had no effect:
> .AppendCacheCheckpoint(mlContext)
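For clarity, here is roughly where the checkpoint would go in the pipeline above (just a sketch; the exact placement in my code may have differed slightly):

var textPipeline = mlContext.Transforms.Text.NormalizeText("Text")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("Features", "Tokens",
        WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D))
    // Cache the transformed data before fitting downstream components.
    .AppendCacheCheckpoint(mlContext);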
Is there any way to understand why this happens, or am I using it incorrectly? I'd be happy to hear any ideas. Thank you!
The Microsoft.ML NuGet package version is 2.0.1.
Answer 1
Score: 0
I'm an idiot.
To resolve this "issue", you should download the word embeddings to your "..AppData\Local\mlnet-resources\WordVectors" folder.
For example, adding the downloaded file glove.6B.300d.txt to that folder made the run complete successfully, and Fit() now works correctly without getting stuck.
So I'm closing this question with my own answer.
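As a side note, and this is an assumption rather than something I tested in this exact setup: ML.NET also has an ApplyWordEmbedding overload that takes a path to a custom embedding file, so you should be able to point the pipeline directly at the downloaded file instead of relying on the auto-download folder. A minimal sketch (the file path below is just an example):

// Sketch: use the customModelFile overload to load embeddings from an explicit path
// instead of the mlnet-resources\WordVectors folder. Adjust the path to wherever
// glove.6B.300d.txt was saved.
var textPipeline = mlContext.Transforms.Text.NormalizeText("Text")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding(
        outputColumnName: "Features",
        customModelFile: @"C:\models\glove.6B.300d.txt",
        inputColumnName: "Tokens"));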