How to find a list of product names associated with a word in R

Question

I have a large data set of product complaint reports submitted to the FDA for foods, dietary supplements, and cosmetics, including information about the person behind each report. The data is already cleaned up, and from it I create a matrix that contains 0s and 1s:

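# Split each comma-separated symptom string into a character vector
syms <- strsplit(dat$symptoms, ", ")

# One row per report, one column per distinct symptom
tm <- matrix(0, nrow = nrow(dat), ncol = length(unique(unlist(syms))))
colnames(tm) <- unique(unlist(syms))

# Mark the symptoms that occur in each report
for (i in seq_along(syms)) {
  tm[i, syms[[i]]] <- 1
}
dat$symptoms <- NULL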

'dat' contains the patients' complaint data:

received   id               ...  product                   outcome
9/30/2022  2022-CFS-014640  ...  centrum silver men's 50+  other outcome
9/30/2022  2022-CFS-014637  ...  liquid collagen shot      life threatening

and 'tm' holds the matrix of symptoms:

diarrhoea  vomiting  cancer
0          1         0
1          0         0
...        ...       1

I need to find the list of products that a person should avoid if they don't want to get cancer. I tried this:

# Find rows in tm matrix where the "cancer" symptom is present
cancer_rows <- which(tm[, "cancer"] == 1)

# Create a vector of product names associated with "cancer" symptoms
products_to_avoid <- unique(dat$product[cancer_rows])

but this doesn't work for me. Does anyone have an idea how to write it properly?

Answer 1

Score: 1

You can filter by symptoms using regex without making a variable for each symptom (note that this only works before you set dat$symptoms to NULL):

unique(dat$product[grepl("cancer", dat$symptoms)])
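
As a minimal, self-contained sketch of that one-liner: the toy `dat` below is invented for illustration (only the column names `product` and `symptoms` come from the question), and `ignore.case = TRUE` is an assumption added here in case symptom spellings vary in case. Note that `grepl()` matches substrings, so a symptom such as "Breast Cancer" is picked up as well.

```r
# Toy stand-in for the real data; the rows are invented for illustration.
dat <- data.frame(
  product  = c("product a", "product b", "product a", "product c"),
  symptoms = c("vomiting, cancer", "diarrhoea", "Breast Cancer", NA),
  stringsAsFactors = FALSE
)

# grepl() returns FALSE for NA inputs, so reports without symptoms are skipped.
has_cancer <- grepl("cancer", dat$symptoms, ignore.case = TRUE)

# Unique product names among the matching reports.
products_to_avoid <- unique(dat$product[has_cancer])
products_to_avoid
```

If only exact symptom names should match, a stricter pattern or the 0/1 matrix lookup may be the better fit.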

For extracting symptoms, you could also use a tidyverse approach to easily keep it within the same data frame. For example:

library(dplyr)
library(tidyr)
library(tibble)

dat_syms <-
  dat %>%
  mutate(
    # Per report: tabulate the symptoms into a small Var1/Freq data frame
    syms = symptoms %>%
      strsplit(", ") %>%
      lapply(table) %>%
      lapply(as.data.frame)
  ) %>%
  # One row per report/symptom pair, then one count column per symptom
  unnest(syms) %>%
  spread(Var1, Freq, fill = 0)

unique(dat_syms$product[dat_syms$cancer == 1])
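
The same wide 0/1 layout can also be sketched in base R, staying close to the matrix from the question. This is only an illustration under assumptions: the toy `dat` is invented, and the check that a "cancer" column exists is an added guard, since indexing a matrix by a column name that is not present raises an error.

```r
# Toy stand-in for the real data; the rows are invented for illustration.
dat <- data.frame(
  product  = c("product a", "product b", "product c"),
  symptoms = c("vomiting, cancer", "diarrhoea", "cancer"),
  stringsAsFactors = FALSE
)

syms     <- strsplit(dat$symptoms, ", ", fixed = TRUE)
all_syms <- sort(unique(unlist(syms)))

# One row per report, one column per symptom: 1 if the report mentions it.
tm <- do.call(rbind, lapply(syms, function(s) as.numeric(all_syms %in% s)))
colnames(tm) <- all_syms

# Guard: stop early if no report uses exactly the name "cancer".
stopifnot("cancer" %in% colnames(tm))

# Rows of tm line up with rows of dat, so the logical index can be reused.
products <- unique(dat$product[tm[, "cancer"] == 1])
products
```

Because `tm` is built row by row from `dat`, the lookup only stays valid as long as `dat` is not reordered or filtered afterwards.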

However, it is important to note that while this lists products where customers complained about cancer, it is likely not very informative about whether those products should be avoided. To be informative, you would have to make very strong assumptions about the data, e.g. that customers who complain actually know that it was that product which caused their cancer, which obviously is not true.
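
To put the product list in context, one option is to tabulate how many reports each product has in total next to how many of them mention cancer. The sketch below is purely descriptive and uses invented toy data; the names `report_counts` and `cancer_share` are hypothetical, and none of these numbers says anything about causation.

```r
# Toy stand-in for the real data; the rows are invented for illustration.
dat <- data.frame(
  product  = c("product a", "product a", "product a", "product b"),
  symptoms = c("cancer", "vomiting", "diarrhoea", "cancer"),
  stringsAsFactors = FALSE
)

has_cancer <- grepl("cancer", dat$symptoms, ignore.case = TRUE)

# Per product: number of reports and number of reports mentioning cancer.
total_reports  <- tapply(has_cancer, dat$product, length)
cancer_reports <- tapply(has_cancer, dat$product, sum)

report_counts <- data.frame(
  product        = names(total_reports),
  total_reports  = as.integer(total_reports),
  cancer_reports = as.integer(cancer_reports),
  stringsAsFactors = FALSE
)

# Share of a product's reports that mention cancer (descriptive only).
report_counts$cancer_share <-
  report_counts$cancer_reports / report_counts$total_reports
report_counts
```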

Posted by huangapple on March 9, 2023, 14:44:11
Link to this post: https://go.coder-hub.com/75681224.html