2023-03-09 14:44:11

How to find a list of product names associated with word in R

Question

I have a large data set of product complaint reports submitted to the FDA for foods, dietary supplements, and cosmetics, including information about the person who filed each report. After cleaning the data, I create a matrix of 0s and 1s:

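# Split each report's symptom string into individual symptoms
syms <- strsplit(dat$symptoms, ", ")

# One row per report, one column per distinct symptom
tm   <- matrix(0, nrow=nrow(dat), ncol=length(unique(unlist(syms))))
colnames(tm) <- unique(unlist(syms))

# Mark the symptoms present in each report
for(i in seq_along(syms)) {
  tm[i, syms[[i]]] <- 1
}
dat$symptoms <- NULL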

'dat' contains the patients' complaint data:

received	id	...	product	outcome
9/30/2022	2022-CFS-014640	...	centrum silver men's 50+	other outcome
9/30/2022	2022-CFS-014637	...	liquid collagen shot	life threatening

and 'tm' holds the symptom matrix:

diarrhoea	vomiting	cancer
0	1	0
1	0	0
...	...	1

I need to find the list of products that a person should avoid if they don't want to get cancer. I tried this:

# Find rows in tm matrix where the "cancer" symptom is present
cancer_rows <- which(tm[, "cancer"] == 1)

# Create a vector of product names associated with "cancer" symptoms
products_to_avoid <- unique(dat$product[cancer_rows])

but this doesn't work for me. Does anyone have an idea how to write it properly?

Answer 1

Score: 1

You can filter by symptoms using regex without making a variable for each symptom (note that this only works before you set dat$symptoms to NULL):

unique(dat$product[grepl("cancer", dat$symptoms)])
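
One caveat with the pattern match above: `grepl("cancer", ...)` matches any symptom string that contains "cancer" as a substring, so a symptom such as "skin cancer" would be picked up as well. If only the exact symptom is wanted, one option is to split the string and test membership. The following is a minimal sketch on toy data; the product names and symptom strings are invented for illustration and are not taken from the FDA data set.

```r
# Toy data, invented for illustration; column names follow the question.
dat <- data.frame(
  product  = c("product a", "product b", "product c", "product a"),
  symptoms = c("vomiting, cancer", "diarrhoea", "skin cancer, vomiting", "cancer"),
  stringsAsFactors = FALSE
)

# Substring match: also picks up the report that lists "skin cancer".
substring_hits <- unique(dat$product[grepl("cancer", dat$symptoms, fixed = TRUE)])

# Exact match: split each report into its symptoms and test membership.
syms <- strsplit(dat$symptoms, ", ", fixed = TRUE)
has_cancer <- vapply(syms, function(s) "cancer" %in% s, logical(1))
exact_hits <- unique(dat$product[has_cancer])

print(substring_hits)
print(exact_hits)
```

Which of the two behaviours is appropriate depends on how the symptoms are coded in the real data.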

For extracting symptoms, you could also use a tidyverse approach to easily keep it within the same data frame. For example:

library(dplyr)
library(tidyr)
library(tibble)

dat_syms <-
  dat %>%
  mutate(
    syms = symptoms %>%
      strsplit(", ") %>%
      lapply(table) %>%
      lapply(as.data.frame)
  ) %>%
  unnest(syms) %>%
  spread(Var1, Freq, fill = 0)

unique(dat_syms$product[dat_syms$cancer == 1])
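
The tidyr documentation marks `spread()` as superseded by `pivot_wider()`, so the same reshaping could be sketched as below. This is a variant under stated assumptions and not the answer's original code: it assumes tidyr 1.0.0 or later, a report `id` column that uniquely identifies each complaint, and toy rows invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data, invented for illustration; column names follow the question.
dat <- data.frame(
  id       = c("r1", "r2", "r3"),
  product  = c("product a", "product b", "product c"),
  symptoms = c("vomiting, cancer", "diarrhoea", "cancer"),
  stringsAsFactors = FALSE
)

dat_syms <- dat %>%
  separate_rows(symptoms, sep = ", ") %>%  # one row per report/symptom pair
  distinct() %>%                           # guard against a symptom listed twice
  mutate(present = 1) %>%
  pivot_wider(
    names_from  = symptoms,                # one indicator column per symptom
    values_from = present,
    values_fill = 0                        # symptom absent from the report
  )

cancer_products <- unique(dat_syms$product[dat_syms$cancer == 1])
print(cancer_products)
```

The indicator columns stay in the same data frame as `product`, so rows cannot fall out of alignment the way a separate matrix and data frame can.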

However, it is important to note that while this lists products where customers complained about cancer, it is likely not very informative about whether or not those products should be avoided. To be informative, you would have to make very strong assumptions about the data, e.g. that customers who complain actually know that it was indeed that product which caused their cancer, which obviously is not true.
