Do machine learning models accept cells of pandas Dataframe containing tensors or arrays?
Question
I am implementing a fraud detection approach for graph-based data, from this article: https://developer.nvidia.com/blog/optimizing-fraud-detection-in-financial-services-with-graph-neural-networks-and-nvidia-gpus/. In the "Training the GNN model" and "Using GNN embeddings for downstream tasks" sections, the article suggests translating a tabular dataset into a graph, then generating a 64-dimensional embedding for each node, before joining the embeddings to the original tabular dataset on the respective node IDs. I have generated a 64-dimensional node embedding tensor; however, I am unsure what to do next. I have thought about condensing the tensor into one dimension and appending the values based on the node IDs, but I feel like that is not what the article is suggesting. I have also thought of adding the entire tensor into a single cell, but I feel like it is not going to fit with most machine learning models. What should I do in this situation? I apologize if this does not seem like the right place to ask the question, and will remove it if that is the case.
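For concreteness, a minimal sketch of the kind of join the article seems to suggest, with hypothetical names (`node_emb` for the embedding tensor, `df` for the tabular data, `node_id` for the join key) standing in for the real data:

```python
import pandas as pd
import torch

# Hypothetical stand-ins: `node_emb` holds one 64-dimensional embedding
# per node ID; `df` is the original tabular dataset.
num_nodes = 100
node_emb = torch.randn(num_nodes, 64)
df = pd.DataFrame({"node_id": [0, 1, 2], "amount": [10.0, 25.5, 3.2]})

# Spread the tensor into 64 scalar columns keyed by node ID, then join
# it onto the tabular data with an ordinary merge.
emb_df = pd.DataFrame(
    node_emb.numpy(), columns=[f"emb_{i}" for i in range(64)]
)
emb_df["node_id"] = range(num_nodes)

enriched = df.merge(emb_df, on="node_id", how="left")
print(enriched.shape)  # (3, 2 + 64)
```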
Answer 1
Score: 0
To understand what kind of embeddings the authors of this NVIDIA blog use to enrich the dataset, let's rephrase their approach:

- They construct a graph where card IDs and merchants represent `nodes`, and transactions represent `edges`. I've visualized an example below to make it clear what are `nodes` and what are `edges` (it also proves why I preferred hard sciences to art school); see also the edge-list sketch after this list.
- Subsequently, they create `node embeddings` using a `link prediction` task, that is to say they train a model which predicts the probability that a `card ID` and a `merchant` are connected. **Note:** This is the 64-dimensional embedding that the OP referred to.
- Now, they join these generated embeddings back into the dataset. **Note:** Since the purpose of the embeddings is to encode the probability that a given `merchant` and a given `card ID` are connected, you only need one embedding per transaction, since every transaction has exactly one merchant and one card.
- Last, they fit an XGBoost algorithm on the enriched dataset to predict whether a transaction is fraudulent. Although they do not mention it explicitly, the 64-dimensional embedding is joined back into the dataset as `64 features`: XGBoost cannot handle a single feature containing a 64-dimensional vector, which is why 64 features each containing a single number is the only valid option.
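A minimal sketch of the bipartite structure described in the first bullet, with made-up names and data:

```python
import pandas as pd

# Hypothetical transaction table: each row is one transaction, i.e. one
# edge between a card node and a merchant node.
tx = pd.DataFrame({
    "card_id": ["c1", "c1", "c2"],
    "merchant": ["m1", "m2", "m1"],
    "amount": [10.0, 25.5, 3.2],
})

# Card IDs and merchants are the nodes; each row contributes one edge.
nodes = set(tx["card_id"]) | set(tx["merchant"])
edges = list(tx[["card_id", "merchant"]].itertuples(index=False, name=None))
print(nodes)  # {'c1', 'c2', 'm1', 'm2'}
print(edges)  # [('c1', 'm1'), ('c1', 'm2'), ('c2', 'm1')]
```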
Visualization of the type of graph created. For the sake of simplicity, the visualization assumes that every person owns only one card. Since the dataset does not contain a `transaction ID` and every row in the dataset contains one transaction, you can consider the `row index` a `transaction ID`.
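And a minimal sketch of the final two steps, using synthetic data and hypothetical column names; the point is only that each embedding dimension becomes its own numeric column before XGBoost sees it:

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 5000

# Hypothetical enriched dataset: the original transaction features plus
# the 64 embedding dimensions joined back in as 64 separate columns.
enriched = pd.DataFrame(
    rng.normal(size=(n, 64)), columns=[f"emb_{i}" for i in range(64)]
)
enriched["amount"] = rng.uniform(1, 500, size=n)
enriched["is_fraud"] = rng.integers(0, 2, size=n)  # synthetic labels

# Each embedding dimension is an ordinary numeric feature, so XGBoost
# can consume it directly; a 64-vector stored in one cell could not be.
X = enriched.drop(columns=["is_fraud"])
y = enriched["is_fraud"]
model = XGBClassifier(n_estimators=100, max_depth=6)
model.fit(X, y)
```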