Suggestions on the best way to work with NLP data that mixes numerical and categorical features

Question


I'm working with a dataset of medicinal products across different countries, with each country having its own data source. As a result, the data is not always 'standardized' (for lack of a better word), so one of the problems I'm trying to solve is getting the dosage into the same format across all countries. I've been doing it 'manually' for each country using regex, while taking into account some criteria that I want to use as features in the model. For example: the number of active substances in the product, the pharmaceutical form, and whether some specific active substance is present in the product. By doing this 'manually' for about a third of the countries, I've got a reasonable number of records to train a model.

Name   ActiveSubstances   NumberOfActSubst   PharmaceuticalForm   Dosage                  DosageFinal
X      ['Y','Z']          2                  Tablet               '20mg/5mg'              '20 mg + 5 mg'
A      ['B']              1                  Tablet               '(50 microg+10mg)/ml'   '50 µg/ml + 10mg/ml'
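To make the regex-based cleanup described above concrete, here is a minimal sketch in Python. It is not the asker's actual code: the function names, the `UNIT_ALIASES` table and the two patterns are assumptions chosen for illustration, and it ignores the other features (pharmaceutical form, number of active substances). It always puts a space between value and unit, so for the second row it yields '50 µg/ml + 10 mg/ml', which differs slightly in spacing from the value shown in the table.

```python
import re

# Hypothetical unit aliases; extend with the spellings seen in each data source.
UNIT_ALIASES = {"microg": "µg", "mcg": "µg", "ug": "µg"}

# One strength component: number, unit and an optional denominator,
# e.g. "20mg", "50 microg" or "10mg/ml".
COMPONENT = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*([A-Za-zµ]+)(?:\s*/\s*([A-Za-zµ]+))?"
)

# A parenthesised group sharing one denominator, e.g. "(50 microg+10mg)/ml".
SHARED_DENOM = re.compile(r"^\((.+)\)\s*/\s*([A-Za-zµ]+)$")


def format_component(match, shared_denominator=None):
    """Render one regex match as '<value> <unit>' or '<value> <unit>/<denominator>'."""
    value, unit, own_denominator = match.groups()
    unit = UNIT_ALIASES.get(unit.lower(), unit)
    denominator = own_denominator or shared_denominator
    text = f"{value} {unit}"
    return f"{text}/{denominator}" if denominator else text


def normalize_dosage(raw):
    """Return the components joined by ' + ', or None if nothing is recognised."""
    raw = raw.strip().strip("'")
    shared = SHARED_DENOM.match(raw)
    if shared:
        inner, denominator = shared.groups()
        parts = [format_component(m, denominator) for m in COMPONENT.finditer(inner)]
    else:
        parts = [format_component(m) for m in COMPONENT.finditer(raw)]
    return " + ".join(parts) if parts else None


print(normalize_dosage("20mg/5mg"))
print(normalize_dosage("(50 microg+10mg)/ml"))
```

Returning `None` for unrecognised strings keeps the rule-based output separate from the rows that still need a model or manual review.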

I want this DosageFinal field to be filled automatically. What would be the best way to approach this task? I looked into parallel networks: the idea would be to use one NN to get the embeddings of the text variables and another NN to get an embedding of the only numeric feature, and then concatenate the embeddings. Am I overcomplicating it?

Answer 1

Score: 1


You would use embeddings to understand the semantic meaning of the text.

For your situation, I would recommend treating this as a translation task or as simple text generation.

Generation

Use any decoder model to generate the text in the right format.
Put a few examples in the prompt (few-shot learning), and the model will usually pick up the pattern.

For a quick test, go to any free AI chat platform (e.g. HFchat, ChatGPT, etc.), give it a few examples, and check whether it returns the right answers.
If you build the prompt well, you can get answers of state-of-the-art quality.

Some ideas to help the model: transform each country independently, or each medication.
Also, the better the few-shot examples in your prompt, the better it will do.
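As an illustration of the few-shot idea, the sketch below builds a prompt from already-labeled rows plus one unlabeled row. The field layout, the instruction wording and the helper names (`format_record`, `build_prompt`) are assumptions made for this example, and the query row is invented. The resulting string could be pasted into any chat model; no specific model API is assumed.

```python
# Labeled rows taken from the table in the question.
EXAMPLES = [
    {"form": "Tablet", "substances": ["Y", "Z"],
     "dosage": "20mg/5mg", "final": "20 mg + 5 mg"},
    {"form": "Tablet", "substances": ["B"],
     "dosage": "(50 microg+10mg)/ml", "final": "50 µg/ml + 10mg/ml"},
]

INSTRUCTION = "Rewrite each dosage in the standard format, following the examples."


def format_record(record, include_answer):
    """Render one product as a small block of 'Field: value' lines."""
    answer = f" {record['final']}" if include_answer else ""
    lines = [
        f"PharmaceuticalForm: {record['form']}",
        f"ActiveSubstances: {', '.join(record['substances'])}",
        f"Dosage: {record['dosage']}",
        "DosageFinal:" + answer,
    ]
    return "\n".join(lines)


def build_prompt(examples, query):
    """Instruction, then the solved examples, then the query left open."""
    blocks = [INSTRUCTION]
    blocks += [format_record(example, True) for example in examples]
    blocks.append(format_record(query, False))
    return "\n\n".join(blocks)


# Hypothetical unlabeled row used only to show the prompt shape.
query = {"form": "Capsule", "substances": ["C"], "dosage": "100mg"}
print(build_prompt(EXAMPLES, query))
```

Ending the prompt on an open `DosageFinal:` line nudges the model to answer with just the normalized value. Picking the examples from the same country or the same medication as the query is one way to apply the per-country idea above.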

Translation

If you have enough data samples to train a language model, try fine-tuning a sequence-to-sequence model such as BART or T5.
You might then be able to create a model that generates these texts for you.
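For the translation route, the labeled rows first have to become (source text, target text) pairs. The sketch below shows one possible way to do that, by writing the numerical and categorical features into the source string so that a text-to-text model sees them alongside the raw dosage. The `normalize dosage:` prefix, the `|` separator and the function names are assumptions for illustration only; the actual fine-tuning step is not shown.

```python
import random


def to_source_text(row):
    """Serialize the raw dosage and the extra features into one input string."""
    return (
        f"normalize dosage: {row['Dosage']}"
        f" | form: {row['PharmaceuticalForm']}"
        f" | substances: {row['NumberOfActSubst']}"
    )


def make_pairs(rows):
    """Keep only rows that already have a DosageFinal label."""
    return [(to_source_text(r), r["DosageFinal"]) for r in rows if r.get("DosageFinal")]


def split_pairs(pairs, valid_fraction=0.1, seed=0):
    """Shuffle reproducibly and hold out a validation set of at least one pair."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


rows = [
    {"Dosage": "20mg/5mg", "PharmaceuticalForm": "Tablet",
     "NumberOfActSubst": 2, "DosageFinal": "20 mg + 5 mg"},
    {"Dosage": "(50 microg+10mg)/ml", "PharmaceuticalForm": "Tablet",
     "NumberOfActSubst": 1, "DosageFinal": "50 µg/ml + 10mg/ml"},
    # Hypothetical row from a country that has not been labeled yet.
    {"Dosage": "100mg", "PharmaceuticalForm": "Capsule",
     "NumberOfActSubst": 1, "DosageFinal": None},
]

pairs = make_pairs(rows)
train_pairs, valid_pairs = split_pairs(pairs)
print(pairs[0])
```

Serializing the features as text sidesteps the separate numeric branch from the question; whether that is enough for this dataset would need to be checked on the held-out pairs.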

Good luck.

huangapple
  • Posted on 2023-05-22 22:06:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76307031.html