Why do Keras / TensorFlow not see input features in random forest models if the dataset is very small?
Question
I am trying to use Keras and TensorFlow to predict a variable via random forests. I encountered unexpected behavior and managed to trace it back to the following issue. If my training dataset is too small, I get the warning "The model does not have any input features i.e. the model is constant and will always return the same prediction." even though there is a feature in the dataset. Is it a bug, or maybe a deeply undocumented feature?
Below is a minimal non-working example. The training dataset just says that the key 1 is always associated with the value 1 and the key 2 is always associated with the value 2. This information is repeated "multiplicity" times.
The correct behaviour should be that whenever we get the key "1" as the input, the probability that "0" is the correct answer is 0.0, the probability that "1" is the correct answer is 1.0, and the probability that "2" is the correct answer is 0.0. This means that my desired answer is the vector of probabilities (0.0, 1.0, 0.0). If "2" is the key, the desired answer should be (0.0, 0.0, 1.0).
The real output of the program is as follows: if the multiplicity is at most 4, TensorFlow does not see any input features; if the multiplicity is 5 or more, TensorFlow can see 1 feature. This change of behaviour may indicate a bug.
Also, the output of the prediction seems very strange. For example, for multiplicity=5 we get really crazy probabilities: for "1" we get [0., 0.61333287, 0.3866664] and for "2" we get [0., 0.35999975, 0.6399995].
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd

def train_and_predict(multiplicity):
    # `multiplicity` copies of (key=1, value=1) followed by `multiplicity` copies of (key=2, value=2)
    train_pd = pd.DataFrame(multiplicity * [{"key": 1, "value": 1}] + multiplicity * [{"key": 2, "value": 2}])
    train_tf = tfdf.keras.pd_dataframe_to_tf_dataset(train_pd, label="value")
    rf = tfdf.keras.RandomForestModel()
    rf.fit(x=train_tf)
    # Ask the model for the class probabilities of each key
    to_guess = pd.DataFrame([{"key": 1}, {"key": 2}])
    guess_tf = tfdf.keras.pd_dataframe_to_tf_dataset(to_guess)
    return rf.predict(guess_tf)

print(train_and_predict(4))
print(train_and_predict(5))
The interesting parts of the output:
[WARNING] The model does not have any input features i.e. the model is constant and will always return the same prediction.
[INFO] Model loaded with 300 root(s), 300 node(s), and 0 input feature(s).
[INFO] Engine "RandomForestGeneric" built
[INFO] Use fast generic engine
[[0. 0.6033329 0.39666638]
[0. 0.6033329 0.39666638]]
[INFO] Model loaded with 300 root(s), 452 node(s), and 1 input feature(s).
[INFO] Use fast generic engine.
[[0. 0.61333287 0.3866664 ]
[0. 0.35999975 0.6399995 ]]
I use TensorFlow version 2.11.0 on Kaggle. Can you help me figure out whether the problem is a software bug or whether I am misunderstanding something?
Answer 1
Score: 2
There are 3 questions in this; let me answer them one by one.
- By default, TensorFlow Decision Forests will not create any node with fewer than 5 examples. If you have just 8 examples (multiplicity = 4), you cannot split them into 2 nodes with at least 5 examples each, so no split is applied and the model is constant. You can control this through the min_examples hyperparameter, for example by setting it to 1: rf = tfdf.keras.RandomForestModel(min_examples=1).
- The probability vector is 3-dimensional because the labels are the integers 1 and 2, so TF-DF implicitly assumes that all integers in the range [0, 1, 2] are valid labels. This is done mostly to avoid having to compute a mapping between the label values in the data and the internal class indices. You can control this process in detail through the dataspec of the underlying C++ library.
- The reason you're getting "weird" probabilities is the definition of random forests. Random Forests (RF) are based on bagging. Each tree is trained on a different dataset, sampled randomly (with replacement) from the original one. For you, this means that Tree 1 is (always) trained on the full dataset, which gives the perfect 0/1 probabilities you probably expected. Every other tree is trained on a sample of the dataset that may not admit a valid split (see the first point), in which case the tree just predicts the most frequent class in its sample. When averaging over all trees, you end up with the probabilities you're seeing.
To get a feel for this, it can be interesting to inspect or plot the individual trees:
# Text representation of all trees
print(rf.make_inspector().extract_all_trees())
# In Colab / IPython, we have interactive plots for individual trees
tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=77)
TF-DF allows you to disable bagging by setting rf = tfdf.keras.RandomForestModel(bootstrap_training_dataset=False), but doing so completely destroys one of the main ideas of Random Forests. You can also just create a single tree with rf = tfdf.keras.RandomForestModel(num_trees=1), as the first tree does not use bagging.
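The bagging argument can be checked with a rough back-of-the-envelope calculation that does not need TF-DF at all. The sketch below is only an illustration under stated assumptions, not the library's actual algorithm: the function name `bootstrap_tree_outcomes` is made up here, each bootstrap sample is assumed to have as many rows as the training set, a split is assumed to be allowed only when both children keep at least `min_examples` rows, and a tree without a split is assumed to vote for the most frequent label in its sample.

```python
from math import comb

def bootstrap_tree_outcomes(multiplicity, min_examples=5):
    """Probability of each outcome for one tree trained on a bootstrap sample.

    Assumptions (illustrative, not taken from the TF-DF source): the sample
    has as many rows as the training set, a split on "key" is allowed only if
    both children keep at least `min_examples` rows, and a tree without a
    split votes for the most frequent label in its sample.
    """
    n = 2 * multiplicity  # rows in the training set: half key=1, half key=2
    outcomes = {"split": 0.0, "votes_1": 0.0, "votes_2": 0.0, "tie": 0.0}
    for k in range(n + 1):            # k = rows with key 1 in the sample
        p = comb(n, k) / 2 ** n       # Binomial(n, 1/2) probability of k
        if k >= min_examples and n - k >= min_examples:
            outcomes["split"] += p    # tree can separate the two keys
        elif k > n - k:
            outcomes["votes_1"] += p  # constant tree, label 1 is most frequent
        elif k < n - k:
            outcomes["votes_2"] += p  # constant tree, label 2 is most frequent
        else:
            outcomes["tie"] += p      # constant tree, both labels equally frequent
    return outcomes

m5 = bootstrap_tree_outcomes(5)
# For key=1, label 1 wins if the tree split correctly or is constant on label 1.
p_correct = m5["split"] + m5["votes_1"]
print(round(p_correct, 4))  # 0.623

m4 = bootstrap_tree_outcomes(4)
print(m4["split"])  # 0.0
```

Under these assumptions, about 62% of the trees vote for the correct label when multiplicity is 5, which is in the same range as the 0.61 and 0.64 reported in the question, and no tree can split at all when multiplicity is 4.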
Note: Generally, RFs also use feature bagging, i.e. sampling a random subset of attributes, but it is not used since the dataset only has one feature.
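On the 3-dimensional probability vector: since the answer says the columns correspond to the integers 0, 1 and 2, the predicted label can be read off as the index of the largest entry in each row. A minimal sketch using the numbers printed in the question (the helper name `most_likely_label` is hypothetical):

```python
# Predictions printed in the question for multiplicity = 5.
predictions = [
    [0.0, 0.61333287, 0.3866664],  # input key = 1
    [0.0, 0.35999975, 0.6399995],  # input key = 2
]

def most_likely_label(row):
    """Return the column index with the highest probability.

    Assumption: column i holds the probability of integer label i, as the
    answer describes for integer labels.
    """
    return max(range(len(row)), key=lambda i: row[i])

print([most_likely_label(row) for row in predictions])  # [1, 2]
```

Column 0 is always 0.0 here because the label 0 never occurs in the training data.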
Full Disclosure: I'm one of the authors of Tensorflow Decision Forests.