May 22, 2023, 22:06:19

Suggestions on the best way to work with NLP mixed with some numerical and categorical features

Question

I'm working with a dataset of medicinal products across different countries, with each country having its own data source. As a result, the data is not always 'standardized' (for lack of a better word), so one of the problems I'm trying to solve is getting the dosage into the same format across all countries. I've been doing this 'manually' for each country using regex, while taking into account some criteria that I want to use as features in the model. For example: the number of active substances in the product, the pharmaceutical form, and whether some specific active substance is present in the product. By doing this 'manually' for about a third of the countries, I've got a reasonable number of records to train a model.

Name   ActiveSubstances   NumberOfActSubst   PharmaceuticalForm   Dosage                  DosageFinal
X      ['Y','Z']          2                  Tablet               '20mg/5mg'              '20 mg + 5 mg'
A      ['B']              1                  Tablet               '(50 microg+10mg)/ml'   '50 µg/ml + 10mg/ml'

I want this DosageFinal field to be filled automatically. What would be the best way to approach this task? I looked into parallel networks: the idea would be to use one NN to get the embeddings of the text variables and another NN to embed the single numeric feature, and then concatenate the two embeddings. Am I overcomplicating it?
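
For reference, the kind of per-country regex normalization described above might look roughly like the sketch below. It is only an illustration built from the two example rows: the function name `normalize_dosage`, the `UNIT_ALIASES` table, and the rule that a `/` followed by a digit separates components are all assumptions, not the actual rules used on the dataset. Note that it always puts a space between amount and unit, so the second row comes out as '50 µg/ml + 10 mg/ml'.

```python
import re

# Hypothetical spelling variants mapped to a canonical unit (an assumption).
UNIT_ALIASES = {"microg": "\u00b5g", "mcg": "\u00b5g", "ug": "\u00b5g"}

# '(a + b)/unit': a bracketed group sharing one denominator.
_GROUPED = re.compile(r"\((.+)\)\s*/\s*([A-Za-z\u00b5]+)")
# One component: amount, unit, and an optional '/unit' denominator.
_COMPONENT = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*([A-Za-z\u00b5]+)((?:\s*/\s*[A-Za-z\u00b5]+)?)"
)


def _canon_unit(unit):
    """Return the canonical spelling of a unit, or the unit unchanged."""
    return UNIT_ALIASES.get(unit.lower(), unit)


def normalize_dosage(raw):
    """Return the dosage as 'a unit + b unit', or None if it does not parse."""
    text = raw.strip()
    denominator = ""
    grouped = _GROUPED.fullmatch(text)
    if grouped:
        # Distribute the shared denominator over every component.
        text = grouped.group(1)
        denominator = "/" + _canon_unit(grouped.group(2))
    # '+' always separates components; '/' only when a digit follows it.
    parts = re.split(r"\s*\+\s*|\s*/\s*(?=\d)", text)
    normalized = []
    for part in parts:
        match = _COMPONENT.fullmatch(part.strip())
        if match is None:
            return None  # Leave unparsed values for manual review.
        amount, unit, per = match.groups()
        if per.strip():
            per = "/" + _canon_unit(per.strip().lstrip("/").strip())
        else:
            per = denominator
        normalized.append(f"{amount} {_canon_unit(unit)}{per}")
    return " + ".join(normalized)
```

Returning `None` for anything that does not match keeps the unparsed records visible, which is where a learned model would have to take over.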

Answer 1

Score: 1

You would use embeddings to understand the semantic meaning of the text.

For your situation, I would recommend treating this as a translation task or as simple text generation.

Generation

Use any decoder-style language model to generate the text in the right format.
Put a few input/output examples in the prompt (few-shot learning) so the model can pick up the pattern.

Do a quick test: go to any free AI chat platform (e.g. HFchat, ChatGPT, etc.), give it a few examples, and check whether it returns the right answers.
If you build the prompt carefully, you can get SOTA-quality answers.

Some ideas to help the model: transform each country independently, or each medication.
The better the few-shot examples in the prompt, the better it will do.
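
As a rough illustration of the few-shot idea, the sketch below assembles a prompt from already-labelled rows. The function name `build_few_shot_prompt`, the instruction wording, and the prompt layout are assumptions made for the example. The column names come from the question. Sending the prompt to a model is left out on purpose, since that depends on the platform you pick.

```python
def build_few_shot_prompt(examples, query, max_examples=5):
    """Build a few-shot prompt from labelled rows plus one unlabelled row.

    `examples` are dicts with 'PharmaceuticalForm', 'Dosage' and 'DosageFinal';
    `query` is a dict with 'PharmaceuticalForm' and 'Dosage' only.
    """
    lines = ["Rewrite each dosage in the standard format.", ""]
    for row in examples[:max_examples]:
        lines.append(f"Form: {row['PharmaceuticalForm']}")
        lines.append(f"Dosage: {row['Dosage']}")
        lines.append(f"DosageFinal: {row['DosageFinal']}")
        lines.append("")
    # The prompt ends on an open 'DosageFinal:' for the model to complete.
    lines.append(f"Form: {query['PharmaceuticalForm']}")
    lines.append(f"Dosage: {query['Dosage']}")
    lines.append("DosageFinal:")
    return "\n".join(lines)


# Labelled rows taken from the table in the question.
EXAMPLES = [
    {"PharmaceuticalForm": "Tablet", "Dosage": "20mg/5mg",
     "DosageFinal": "20 mg + 5 mg"},
    {"PharmaceuticalForm": "Tablet", "Dosage": "(50 microg+10mg)/ml",
     "DosageFinal": "50 \u00b5g/ml + 10mg/ml"},
]
```

To follow the per-country suggestion, you could pass only examples from the same country as the query, so the model sees that source's formatting quirks.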

Translation

If you have enough data samples to fine-tune a language model, try a sequence-to-sequence model such as BART or T5.
You might then be able to create a model that generates these texts for you.
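
One possible way to prepare the data for such a model is to serialize the numeric and categorical columns into the input text, so that a single sequence-to-sequence model sees everything and no parallel network is needed. The sketch below only builds the source/target pairs and a train/validation split. The task prefix, the field layout, and the function names are assumptions, and the fine-tuning step itself is not shown.

```python
import random


def to_seq2seq_pair(row):
    """Flatten one labelled row into a text-to-text training pair."""
    source = (
        "normalize dosage: "
        f"form={row['PharmaceuticalForm']} | "
        f"n_substances={row['NumberOfActSubst']} | "
        f"substances={', '.join(row['ActiveSubstances'])} | "
        f"dosage={row['Dosage']}"
    )
    return {"source": source, "target": row["DosageFinal"]}


def split_pairs(rows, valid_fraction=0.1, seed=0):
    """Shuffle the pairs reproducibly and hold some out for validation."""
    pairs = [to_seq2seq_pair(row) for row in rows]
    random.Random(seed).shuffle(pairs)
    # Keep at least one validation pair whenever there is more than one row.
    n_valid = max(1, int(len(pairs) * valid_fraction)) if len(pairs) > 1 else 0
    return pairs[n_valid:], pairs[:n_valid]


# Rows taken from the table in the question.
ROWS = [
    {"Name": "X", "ActiveSubstances": ["Y", "Z"], "NumberOfActSubst": 2,
     "PharmaceuticalForm": "Tablet", "Dosage": "20mg/5mg",
     "DosageFinal": "20 mg + 5 mg"},
    {"Name": "A", "ActiveSubstances": ["B"], "NumberOfActSubst": 1,
     "PharmaceuticalForm": "Tablet", "Dosage": "(50 microg+10mg)/ml",
     "DosageFinal": "50 \u00b5g/ml + 10mg/ml"},
]
```

Holding out whole countries instead of random rows would give a more honest estimate of how the model handles a data source it has never seen.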

Good luck.
