(vowpal wabbit) contextual bandit dealing with new context

Question

For the last few days I've been trying to train a contextual bandit algorithm through Vowpal Wabbit, so I'm building a toy model to help me understand how the algorithm works.

So I imagined a setting with 4 possible actions and trained my model on two different contexts. Each context has exactly one optimal action among the 4 actions.

Here is how I did it:

from vowpalwabbit import pyvw

# 4 actions, epsilon-greedy exploration; a label reads action:cost:probability
vw = pyvw.vw("--cb_explore 4 -q UA --epsilon 0.1")
vw.learn('1:-2:0.5 | 5')   # in context "5", action 1 has negative cost (best)
vw.learn('3:2:0.5 | 5')
vw.learn('1:2:0.5 | 15')
vw.learn('3:-2:0.5 | 15')  # in context "15", action 3 has negative cost (best)
vw.learn('4:2:0.5 | 5')
vw.learn('4:2:0.5 | 15')
vw.learn('2:2:0.5 | 5')
vw.learn('2:2:0.5 | 15')

So in my example, for the context whose feature equals 5 the optimal action is 1 (the only one with a negative cost), and for the other context the optimal action is 3.

When I predict on those two contexts there is no problem, since the algorithm has already seen each of them and received a reward conditioning its choice.

But when I present a new context, I expect the algorithm to give me the most relevant action, for example by taking the similarity of the context features into account.

So, for example, if I give a feature equal to 29, I expect to get action 3, since 29 is closer to 15 than to 5.
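
For concreteness, here is a minimal sketch of how that prediction might be queried with the same legacy pyvw API (my assumption: under --cb_explore, predict returns a probability distribution over the 4 actions rather than a single action):

# Sketch, reusing the vw object trained above.
probs = vw.predict('| 29')                 # PMF over the 4 actions
best_action = probs.index(max(probs)) + 1  # VW actions are 1-indexed
print(probs, best_action)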

Those are my questions right now.

Thanks!

Answer 1

Score: 2

The problem is in the way you've structured the feature. The input format for a feature is defined as name[:value], and if value is not supplied the default value is 1.0. So what you've supplied is a feature whose name is 5, or 15. Feature names are hashed and used to determine the index of the feature. So in your case, feature 5 and feature 15 are two unrelated features with different indices, each with a value of 1.0, rather than one feature taking the values 5 and 15; that is why the model cannot interpolate to an unseen value like 29.

Therefore, to fix your problem you just need to give your features a name.

vw.learn('1:-2:0.5 | my_feature_name:5')   # one named feature, value 5
vw.learn('1:2:0.5 | my_feature_name:15')   # same feature, value 15
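
Putting the fix together, here is a minimal end-to-end sketch under the assumptions above (the name my_feature_name is illustrative, the inert -q UA is dropped, and with --cb_explore the prediction is a probability distribution over the 4 actions):

from vowpalwabbit import pyvw

vw = pyvw.vw("--cb_explore 4 --epsilon 0.1")

# Same toy data, but the numeric context is now the *value* of a named feature.
vw.learn('1:-2:0.5 | my_feature_name:5')
vw.learn('3:2:0.5 | my_feature_name:5')
vw.learn('1:2:0.5 | my_feature_name:15')
vw.learn('3:-2:0.5 | my_feature_name:15')
vw.learn('4:2:0.5 | my_feature_name:5')
vw.learn('4:2:0.5 | my_feature_name:15')
vw.learn('2:2:0.5 | my_feature_name:5')
vw.learn('2:2:0.5 | my_feature_name:15')

# An unseen value such as 29 now lies on the same learned dimension.
probs = vw.predict('| my_feature_name:29')  # PMF over the 4 actions
print(probs)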

You can read more about this in the Vowpal Wabbit input-format documentation.

Also, I'd like to point out that -q UA is not doing anything in your example, as you do not have namespaces. Namespaces are specified by placing them right next to the bar. The following example has two namespaces, A and B. (Note: if more than one character is used for a namespace, only the first character is used with -q.)

1:-2:0.5 |A my_feature_name:5 |B yet_another_feature:4

In this case if we supplied -q AB, then VW would create a new feature for each pair of features in A and B at runtime. This allows you to express more complicated interactions in the representation VW learns.
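
For illustration, a small sketch of how this might look through pyvw (the namespaces A and B and the feature names are invented for the example):

from vowpalwabbit import pyvw

# -q AB crosses every feature in namespace A with every feature in namespace B.
vw = pyvw.vw("--cb_explore 4 --epsilon 0.1 -q AB")
vw.learn('1:-2:0.5 |A my_feature_name:5 |B yet_another_feature:4')
probs = vw.predict('|A my_feature_name:29 |B yet_another_feature:4')
print(probs)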
