How to find a list of product names associated with a word in R
Question
I have a large data set of product complaint reports submitted to the FDA for foods, dietary supplements, and cosmetics, including information about the person behind each report. After cleaning the data, I create a matrix of 0s and 1s:
syms <- strsplit(dat$symptoms, ", ")
tm <- matrix(0, nrow=nrow(dat), ncol=length(unique(unlist(syms))))
colnames(tm) <- unique(unlist(syms))
for(i in 1:length(syms)) {
tm[i, syms[[i]]] <- 1
}
dat$symptoms <- NULL
'dat' contains the patients' complaint data:
received | id | ... | product | outcome |
---|---|---|---|---|
9/30/2022 | 2022-CFS-014640 | ... | centrum silver men's 50+ | other outcome |
9/30/2022 | 2022-CFS-014637 | ... | liquid collagen shot | life threatening |
and 'tm' holds the symptom matrix:
diarrhoea | vomiting | cancer |
---|---|---|
0 | 1 | 0 |
1 | 0 | 0 |
... | ... | 1 |
I need to find the list of products a person should avoid if they don't want to get cancer. I tried this:
# Find rows in tm matrix where the "cancer" symptom is present
cancer_rows <- which(tm[, "cancer"] == 1)
# Create a vector of product names associated with "cancer" symptoms
products_to_avoid <- unique(dat$product[cancer_rows])
but this doesn't work for me. Does anyone have an idea how to write it properly?
Answer 1
Score: 1
You can filter by symptoms using a regular expression without creating a variable for each symptom (note that this only works before you set dat$symptoms to NULL):
unique(dat$product[grepl("cancer", dat$symptoms)])
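One caveat with a plain regex: grepl("cancer", ...) matches any symptom string that merely contains that substring, for example "skin cancer". If only the exact symptom name should match, a minimal sketch is to split the symptoms and test membership. The small `dat` below is made up for illustration, since the real data is not shown:

```r
# Toy data standing in for the real `dat` (values are hypothetical)
dat <- data.frame(
  product  = c("product a", "product b", "product c", "product a"),
  symptoms = c("vomiting, cancer", "diarrhoea", "skin cancer, vomiting", "cancer"),
  stringsAsFactors = FALSE
)

# Split each report's symptom string into a character vector of symptoms
syms <- strsplit(dat$symptoms, ", ", fixed = TRUE)

# TRUE for reports whose symptom list contains exactly "cancer"
has_cancer <- vapply(syms, function(s) "cancer" %in% s, logical(1))

# Products with at least one report listing exactly "cancer"
exact_products <- unique(dat$product[has_cancer])

# Substring match for comparison: this also picks up "skin cancer"
regex_products <- unique(dat$product[grepl("cancer", dat$symptoms)])
```

Which of the two behaviours is wanted depends on how the symptom terms are coded in the real data.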
For extracting symptoms, you could also use a tidyverse approach to easily keep it within the same data frame. For example:
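library(dplyr)
library(tidyr)
library(tibble)
dat_syms <-
dat %>%
mutate(
# For each report: split the symptom string, count each symptom,
# and store the counts as a small data frame (columns Var1 and Freq)
syms = symptoms %>%
strsplit(", ") %>%
lapply(table) %>%
lapply(as.data.frame)
) %>%
# One row per (report, symptom) pair
unnest(syms) %>%
# One column per symptom, filled with 0 where the symptom is absent
spread(Var1, Freq, fill = 0)
# Products with at least one report that lists "cancer"
unique(dat_syms$product[dat_syms$cancer == 1])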
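In current tidyr, spread() is superseded by pivot_wider(). As a hedged alternative sketch of the same idea, the symptoms can be split into rows with separate_rows() and then widened. The toy `dat`, its `id` values, and the helper column `present` are assumptions made for illustration:

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the real `dat` (values are hypothetical)
dat <- data.frame(
  id       = c("r1", "r2", "r3"),
  product  = c("product a", "product b", "product a"),
  symptoms = c("vomiting, cancer", "diarrhoea", "cancer"),
  stringsAsFactors = FALSE
)

dat_syms <- dat %>%
  # One row per (report, symptom) pair
  separate_rows(symptoms, sep = ", ") %>%
  # Guard against a symptom listed twice in the same report
  distinct() %>%
  # Indicator value that becomes the cell content after widening
  mutate(present = 1) %>%
  # One column per symptom; 0 where a report does not list it
  pivot_wider(names_from = symptoms, values_from = present, values_fill = 0)

# Products with at least one report that lists "cancer"
cancer_products <- unique(dat_syms$product[dat_syms$cancer == 1])
```

This keeps one row per report, so the symptom columns stay aligned with the product column by construction.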
However, it is important to note that while this lists products for which customers complained about cancer, it is likely not very informative about whether those products should be avoided. To be informative, you would have to make very strong assumptions about the data, e.g. that customers who complain actually know that the product caused their cancer, which is obviously not true.
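To make that caveat concrete, here is a rough sketch on made-up data that puts the number of cancer-related reports per product next to the total number of reports. It only shows how often a complaint was filed, not whether the product caused anything. The toy `dat` and the name `summary_tab` are assumptions for illustration:

```r
# Toy data standing in for the real `dat` (values are hypothetical)
dat <- data.frame(
  product  = c("product a", "product a", "product a", "product b", "product b"),
  symptoms = c("cancer", "vomiting", "diarrhoea", "cancer", "cancer, vomiting"),
  stringsAsFactors = FALSE
)

# Flag reports that mention cancer anywhere in the symptom string
has_cancer <- grepl("cancer", dat$symptoms)

# Per product: total number of reports and number of reports mentioning cancer
reports        <- tapply(has_cancer, dat$product, length)
cancer_reports <- tapply(has_cancer, dat$product, sum)

summary_tab <- data.frame(
  product        = names(reports),
  reports        = as.vector(reports),
  cancer_reports = as.vector(cancer_reports)
)
summary_tab
```

Even with such counts, a product with many cancer reports may simply be a widely used one, so the table describes the complaints rather than any risk.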