May 29, 2023, 10:39:33

Replace all values using a data dictionary, where the column name matches the rows in the dictionary (R language)

Question

I have an encoded survey file and data dictionary that describes each coded value for each column. Example of inputs and desired output here.

The "Field" column in the data dictionary lists every column in the survey data. The values have different descriptions depending on the column name. For example, "1" can mean "Very satisfied" if it's in the Satisfaction column, but "Metro" if in the Region column.

I've gotten this code to work in Python, but I need it to work in R.

I am open to alternative approaches; this code is just what I've come up with. It runs in ~5 seconds across 150 columns and 40,000 rows.

import pandas as pd
# Read the data dictionary and raw survey data
dictionary_df = pd.read_excel(filepath)
dictionary_df["Value"] = dictionary_df["Value"].astype(int)
survey_df = pd.read_excel(filepath)
# Replaces the original unlabelled survey_df with a labelled survey_df.
lookup_dict = {}
cols = list(dictionary_df.Field.unique())
for i in cols:
    subset = dictionary_df.loc[dictionary_df.Field == i]
    lookup_dict_vals = dict(zip(subset['Value'], subset['Description']))
    lookup_dict[i] = lookup_dict_vals
# This try/except is necessary where columns that are in the raw file are not in the data dictionary. These columns will remain untouched, but columns to the right will get labelled.
for i in cols:
    try:
        survey_df[i] = survey_df[i].map(lookup_dict[i])
    except KeyError:
        continue

I have tried to replicate this in R but get a dataframe full of NAs. Is there a way to fix the code below?

library(dplyr)
# Read the data dictionary and raw survey data
dictionary_df <- readxl::read_excel(filepath)
dictionary_df$Value <- as.integer(dictionary_df$Value)
survey_df <- readxl::read_excel(filepath)
# Replaces the original unlabelled survey_df with a labelled survey_df.
lookup_dict <- list()
cols <- unique(dictionary_df$Field)
for (i in cols) {
  subset <- filter(dictionary_df, Field == i)
  lookup_dict_vals <- setNames(subset$Description, subset$Value)
  lookup_dict[[i]] <- lookup_dict_vals
}
# This try/except is necessary where columns that are in the raw file are not in the data dictionary. These columns will remain untouched, but columns to the right will get labelled.
for (i in colnames(survey_df)) {
  tryCatch({
    survey_df <- survey_df %>% mutate({{i}} := lookup_dict[[i]][{{i}}])
  }, error = function(e) {
    next
  })
}

Answer 1

Score: 0

Since the dictionary could store entries as

0     1
Rural Metro

(the codes are not necessarily 1, 2, 3, ..., so a code is not the same as its position in the vector), we have to call names and which in a loop to find the index of each coded value, then subset the dictionary by those indices. This only takes a little overhead code involving names and the unlist function, which turns a single-column data frame into a vector.

for (i in colnames(survey_df)) {
  tryCatch({
    k <- c()
    for (j in unlist(survey_df[{{i}}])) { # for each coded value
      k <- c(k, which(names(lookup_dict[[i]]) == j)) # find the definition's slot
    }
    survey_df <- survey_df %>% mutate({{i}} := lookup_dict[[i]][k]) # take those names
  }, error = function(e) {
    next
  })
}
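The same lookup-by-name idea can also be written without the inner loop, because indexing a named vector with character strings selects by name while indexing with numbers selects by position. The following is a minimal base R sketch, not the answerer's code: the toy `dictionary_df` and `survey_df` are made up for illustration (the real ones come from the Excel files), and keeping the original code where the dictionary has no entry is an assumption about the desired behaviour.

```r
# Toy inputs (hypothetical; the real data comes from the Excel files)
dictionary_df <- data.frame(
  Field = c("Region", "Region", "Satisfaction", "Satisfaction"),
  Value = c(0L, 1L, 1L, 2L),
  Description = c("Rural", "Metro", "Very satisfied", "Satisfied"),
  stringsAsFactors = FALSE
)
survey_df <- data.frame(
  ID = c(101L, 102L, 103L),          # not in the dictionary, stays untouched
  Region = c(1L, 0L, 1L),
  Satisfaction = c(2L, 1L, 9L)       # 9 has no dictionary entry
)

# One named character vector per field: names are the codes, values the labels
lookup_dict <- lapply(
  split(dictionary_df, dictionary_df$Field),
  function(d) setNames(d$Description, d$Value)
)

labelled_df <- survey_df
# Only touch columns that exist in both the survey and the dictionary
for (col in intersect(names(labelled_df), names(lookup_dict))) {
  codes <- as.character(labelled_df[[col]])
  # Character indexing looks up by name, not by position
  labels <- unname(lookup_dict[[col]][codes])
  # Assumption: keep the original code where the dictionary has no entry
  labelled_df[[col]] <- ifelse(is.na(labels), codes, labels)
}

print(labelled_df)
```

Looping over `intersect()` of the two sets of names replaces the `tryCatch`, since columns missing from the dictionary are simply never visited.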

Answer 2

Score: 0

I'm curious how this compares performance-wise. The pivoting steps might be costly, but the join should be very fast.

library(tidyverse)
survey_df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "Field", values_to = "Value") %>%
  # left_join(dictionary_df) %>% # won't work for values not in dictionary
  # EDIT -- I think this will keep existing unmatched values
  rows_update(dictionary_df, unmatched = "ignore") %>%
  select(-Value) %>%
  pivot_wider(names_from = Field, values_from = Description)
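One way to check the performance question is to time an approach on synthetic data of the size mentioned in the question (150 columns, 40,000 rows). The following is a rough base R sketch: the `Q1`..`Q150` column names, the five codes per column, and the `label_by_lookup` helper are all made up for illustration. It times only a `match()`-based lookup; the pivot-and-join pipeline could be wrapped in `system.time()` the same way for comparison.

```r
set.seed(1)
n_rows <- 40000
n_cols <- 150
fields <- paste0("Q", seq_len(n_cols))   # hypothetical column names

# Synthetic dictionary: five codes (1..5) per field
dictionary_df <- data.frame(
  Field = rep(fields, each = 5),
  Value = rep(1:5, times = n_cols),
  Description = paste("Label", rep(1:5, times = n_cols)),
  stringsAsFactors = FALSE
)

# Synthetic survey: random codes in every column
survey_df <- as.data.frame(
  matrix(sample(1:5, n_rows * n_cols, replace = TRUE),
         nrow = n_rows, dimnames = list(NULL, fields))
)

# Hypothetical helper: label each column by matching codes against the dictionary
label_by_lookup <- function(survey_df, dictionary_df) {
  lookup <- split(dictionary_df, dictionary_df$Field)
  for (col in intersect(names(survey_df), names(lookup))) {
    d <- lookup[[col]]
    survey_df[[col]] <- d$Description[match(survey_df[[col]], d$Value)]
  }
  survey_df
}

timing <- system.time(
  labelled_df <- label_by_lookup(survey_df, dictionary_df)
)
print(timing)
```

Timings depend on the machine and on how many distinct codes each column has, so the numbers are only meaningful relative to another approach run on the same data.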

Answer 3

Score: 0

I found that some minor changes to the code retain the original coded survey values where no match is found for the Field and Value in the dictionary. This is the code I used, thanks to Jon Spring's advice. By creating a helper column in the skinny table, I could merge the Description with the coded survey Values so that any NAs from the left join stay as the original coded values.

```
library(tidyverse)
library(writexl)  # provides write_xlsx()

survey_df %>%
  mutate(row = row_number()) %>%
  # Reshape to one row per (row, Field, Value) so the dictionary can be joined
  pivot_longer(-row, names_to = "Field", values_to = "Value") %>%
  left_join(dictionary_df, by = c("Field", "Value")) %>%
  mutate(Value = as.character(Value)) %>%
  # Fall back to the original coded value where no Description was found
  mutate(Description2 = coalesce(Description, Value)) %>%
  select(-Value, -Description) %>%
  pivot_wider(names_from = Field, values_from = Description2) %>%
  write_xlsx(filepath)
```
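To see the pivot, join, and coalesce steps end to end, here is a small self-contained sketch on toy data. The two tibbles are made up for illustration, and two details are assumptions rather than part of the answer: the dictionary's `Value` is stored as character, and `values_transform` converts the survey codes to character while pivoting so that the join keys have the same type.

```r
library(dplyr)
library(tidyr)

# Toy inputs (hypothetical); Value is kept as character on both sides
dictionary_df <- tibble(
  Field = c("Region", "Region", "Satisfaction", "Satisfaction"),
  Value = c("0", "1", "1", "2"),
  Description = c("Rural", "Metro", "Very satisfied", "Satisfied")
)
survey_df <- tibble(
  Region = c(1, 0, 1),
  Satisfaction = c(2, 1, 9)   # 9 has no dictionary entry
)

labelled_df <- survey_df %>%
  mutate(row = row_number()) %>%
  # Long format: one row per (row, Field, Value), codes converted to character
  pivot_longer(-row, names_to = "Field", values_to = "Value",
               values_transform = list(Value = as.character)) %>%
  left_join(dictionary_df, by = c("Field", "Value")) %>%
  # Unmatched codes keep their original value instead of becoming NA
  mutate(Label = coalesce(Description, Value)) %>%
  select(row, Field, Label) %>%
  pivot_wider(names_from = Field, values_from = Label) %>%
  select(-row)   # drop the helper column again

print(labelled_df)
```

Here `Region` becomes Metro, Rural, Metro, and the unmatched `9` in `Satisfaction` is kept as the string "9".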
