Do machine learning models accept cells of pandas Dataframe containing tensors or arrays?
Question
I am implementing a fraud detection approach for graph-based data, from this article: https://developer.nvidia.com/blog/optimizing-fraud-detection-in-financial-services-with-graph-neural-networks-and-nvidia-gpus/. In the "Training the GNN model" and "Using GNN embeddings for downstream tasks" sections, the article suggests translating a tabular dataset into a graph, then generating a 64-dimensional embedding for each node, before joining the embeddings to the original tabular dataset on the respective node IDs. I have generated a 64-dimensional node embedding tensor; however, I am unsure what to do next. I have thought about condensing the tensor into one dimension and appending the values based on the node IDs, but I feel like that is not what the article is suggesting. I have also thought of adding the entire tensor into a single cell, but I feel like it is not going to fit with most machine learning models. What should I do in this situation? I apologize if this does not seem like the right place to ask the question, and will remove it if that is the case.
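For concreteness, a minimal sketch of the kind of join the article seems to suggest, with hypothetical names (`node_emb` for the embedding tensor, `df` for the tabular data, `node_id` for the join key) standing in for the real data:

```python
import pandas as pd
import torch

# Hypothetical stand-ins: `node_emb` holds one 64-dimensional embedding
# per node ID; `df` is the original tabular dataset.
num_nodes = 100
node_emb = torch.randn(num_nodes, 64)
df = pd.DataFrame({"node_id": [0, 1, 2], "amount": [10.0, 25.5, 3.2]})

# Spread the tensor into 64 scalar columns keyed by node ID, then join
# it onto the tabular data with an ordinary merge.
emb_df = pd.DataFrame(
    node_emb.numpy(), columns=[f"emb_{i}" for i in range(64)]
)
emb_df["node_id"] = range(num_nodes)

enriched = df.merge(emb_df, on="node_id", how="left")
print(enriched.shape)  # (3, 2 + 64)
```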
Answer 1
Score: 0
To understand what kind of embeddings the authors of this NVIDIA blog use to enrich the dataset, let's rephrase their approach:

- They construct a graph where card IDs and merchants represent `nodes`, and transactions represent `edges`. I've visualized an example below to make it clear what are `nodes` and what are `edges` (it also proves why I preferred hard sciences to art school); see also the edge-list sketch after this list.
- Subsequently, they create `node embeddings` using a `link prediction` task, that is to say they train a model which predicts the probability that a `card ID` and a `merchant` are connected. **Note:** This is the 64-dimensional embedding that the OP referred to.
- Now, they join these generated embeddings back into the dataset. **Note:** Since the purpose of the embeddings is to encode the probability that a given `merchant` and a given `card ID` are connected, you only need one embedding per transaction, since every transaction has exactly one merchant and one card.
- Last, they fit an XGBoost algorithm on the enriched dataset to predict whether a transaction is fraudulent. Although they do not mention it explicitly, the 64-dimensional embedding is joined back into the dataset as `64 features`: XGBoost cannot handle a single feature containing a 64-dimensional vector, which is why 64 features each containing a single number is the only valid option.
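A minimal sketch of the bipartite structure described in the first bullet, with made-up names and data:

```python
import pandas as pd

# Hypothetical transaction table: each row is one transaction, i.e. one
# edge between a card node and a merchant node.
tx = pd.DataFrame({
    "card_id": ["c1", "c1", "c2"],
    "merchant": ["m1", "m2", "m1"],
    "amount": [10.0, 25.5, 3.2],
})

# Card IDs and merchants are the nodes; each row contributes one edge.
nodes = set(tx["card_id"]) | set(tx["merchant"])
edges = list(tx[["card_id", "merchant"]].itertuples(index=False, name=None))
print(nodes)  # {'c1', 'c2', 'm1', 'm2'}
print(edges)  # [('c1', 'm1'), ('c1', 'm2'), ('c2', 'm1')]
```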
Visualization of the type of graph created. For the sake of simplicity, the visualization assumes that every person owns only one card. Since the dataset does not contain a `transaction ID` and every row in the dataset contains one transaction, you can consider the `row index` a `transaction ID`.
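And a minimal sketch of the final two steps, using synthetic data and hypothetical column names; the point is only that each embedding dimension becomes its own numeric column before XGBoost sees it:

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 5000

# Hypothetical enriched dataset: the original transaction features plus
# the 64 embedding dimensions joined back in as 64 separate columns.
enriched = pd.DataFrame(
    rng.normal(size=(n, 64)), columns=[f"emb_{i}" for i in range(64)]
)
enriched["amount"] = rng.uniform(1, 500, size=n)
enriched["is_fraud"] = rng.integers(0, 2, size=n)  # synthetic labels

# Each embedding dimension is an ordinary numeric feature, so XGBoost
# can consume it directly; a 64-vector stored in one cell could not be.
X = enriched.drop(columns=["is_fraud"])
y = enriched["is_fraud"]
model = XGBClassifier(n_estimators=100, max_depth=6)
model.fit(X, y)
```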