Do machine learning models accept pandas DataFrame cells containing tensors or arrays?

Question

I am implementing a fraud detection approach for graph-based data, from this article: https://developer.nvidia.com/blog/optimizing-fraud-detection-in-financial-services-with-graph-neural-networks-and-nvidia-gpus/. In the "Training the GNN model" and "Using GNN embeddings for downstream tasks" sections, the article suggests translating a tabular dataset into a graph, generating a 64-width embedding for each node, and then joining the embeddings to the original tabular dataset on the respective node IDs. I have generated a 64-width node embedding tensor, but I am unsure what to do next. I have thought about condensing the tensor into one dimension and appending the values based on the node IDs, but I feel that is not what the article is suggesting. I have also thought of storing the entire tensor in a single cell, but I doubt that would work with most machine learning models. What should I do in this situation? I apologize if this is not the right place to ask this question, and will remove it if that is the case.

Answer 1

Score: 0

To understand what kind of embeddings the authors of this NVIDIA blog post use to enrich the dataset, let's restate their approach:

  • They construct a graph in which card IDs and merchants are nodes and transactions are edges. I've visualized an example below to make clear which are nodes and which are edges (it also shows why I preferred hard sciences to art school).
  • Next, they create node embeddings using a link prediction task; that is, they train a model that predicts the probability that a card ID and a merchant are connected. Note: This is the 64-dimensional embedding the OP refers to.
  • They then join these generated embeddings back into the dataset. Note: Since the purpose of the embeddings is to encode the probability that a given merchant and a given card ID are connected, you only need one embedding per transaction, because every transaction has exactly one merchant and one card.
  • Finally, they fit an XGBoost model on the enriched dataset to predict whether a transaction is fraudulent. Although they do not say so explicitly, the 64-dimensional embedding is joined back into the dataset as 64 features. XGBoost cannot handle a single feature containing a 64-dimensional vector, so 64 features, each containing a single number, is the only valid option.
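The join described in the last two steps could be sketched roughly as follows. This is a minimal illustration, not the blog's actual code: the column names (`node_id`, `amount`, `emb_0` ... `emb_63`), the toy table, and the random stand-in for the GNN output are all assumptions, and which node ID you join on depends on how your graph was built.

```python
import numpy as np
import pandas as pd

EMB_DIM = 64

# Hypothetical tabular data: one row per transaction, keyed by a node ID.
table = pd.DataFrame({
    "node_id": [10, 11, 10, 12],
    "amount": [25.0, 310.5, 7.9, 42.0],
})

# Stand-in for the GNN output: one 64-wide row per node (random values here).
# A PyTorch tensor would first be converted, e.g. tensor.detach().cpu().numpy().
node_ids = np.array([10, 11, 12])
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(node_ids), EMB_DIM))

# Spread the embedding over 64 scalar columns, one number per cell.
emb_cols = [f"emb_{i}" for i in range(EMB_DIM)]
emb_df = pd.DataFrame(embeddings, columns=emb_cols)
emb_df.insert(0, "node_id", node_ids)

# Join the 64 features back onto the original rows by node ID.
enriched = table.merge(emb_df, on="node_id", how="left")
print(enriched.shape)

# For contrast: packing each whole vector into one cell gives an object column,
# which numeric estimators generally cannot consume directly.
packed = pd.DataFrame({"node_id": node_ids, "emb": list(embeddings)})
print(packed["emb"].dtype)
```

The resulting `enriched` frame contains only numeric columns, so it can be passed to a tabular model in the usual way, whereas the `packed` frame would need to be expanded first.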

Visualization of the type of graph created. For simplicity, the visualization assumes that every person owns only one card. Since the dataset does not contain a transaction ID and every row in the dataset contains one transaction, you can treat the row index as the transaction ID.
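As a rough sketch of how such a graph might be derived from the table, the snippet below turns each row into one edge between a card node and a merchant node, using the row index as the transaction ID. The column names (`card_id`, `merchant`) and the toy data are assumptions for illustration only; the blog's own preprocessing may differ.

```python
import pandas as pd

# Hypothetical transactions table; column names are assumptions.
tx = pd.DataFrame({
    "card_id": ["c1", "c2", "c1", "c3"],
    "merchant": ["m1", "m1", "m2", "m2"],
})

# Map each distinct card and merchant to an integer node index.
card_codes, card_labels = pd.factorize(tx["card_id"])
merchant_codes, merchant_labels = pd.factorize(tx["merchant"])

# Offset merchant indices so cards and merchants share one index space.
merchant_codes = merchant_codes + len(card_labels)

# One edge per row; the row index serves as the transaction (edge) ID.
edges = pd.DataFrame({
    "transaction_id": tx.index,
    "src": card_codes,      # card node
    "dst": merchant_codes,  # merchant node
})
print(edges)
```

Keeping `card_labels` and `merchant_labels` around makes it possible to map node indices back to the original IDs when the embeddings are later joined to the table.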

huangapple
  • Posted on May 21, 2023, 01:36:33
  • When reposting, please keep the link to this post: https://go.coder-hub.com/76296550.html