Replace all values using a data dictionary, where the column name matches the rows in the dictionary (R language)

Question

I have an encoded survey file and data dictionary that describes each coded value for each column. Example of inputs and desired output here.

The "Field" column in the data dictionary lists every column in the survey data. The values have different descriptions depending on the column name. For example, "1" can mean "Very satisfied" if it's in the Satisfaction column, but "Metro" if in the Region column.
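To make the layout concrete, here is a small made-up example of the two tables. The column names Field, Value, and Description are the ones used in the code in this question; the individual rows, the RespondentID column, and the helper label_for are invented for illustration only.

```r
# Toy data dictionary: one row per (Field, Value) pair.
dictionary_df <- data.frame(
  Field       = c("Satisfaction", "Satisfaction", "Region", "Region"),
  Value       = c(1L, 2L, 0L, 1L),
  Description = c("Very satisfied", "Satisfied", "Rural", "Metro"),
  stringsAsFactors = FALSE
)

# Toy coded survey: each column holds integer codes.
# RespondentID is a hypothetical column with no dictionary entry.
survey_df <- data.frame(
  RespondentID = c(101L, 102L, 103L),
  Satisfaction = c(1L, 2L, 1L),
  Region       = c(1L, 0L, 1L)
)

# Hypothetical helper: the same code means different things per column.
label_for <- function(field, value) {
  dictionary_df$Description[dictionary_df$Field == field &
                              dictionary_df$Value == value]
}
label_for("Satisfaction", 1L)  # "Very satisfied"
label_for("Region", 1L)        # "Metro"
```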

I've gotten this code to work in Python, but I need it to work in R.

I am open to alternative approaches; this code is just what I came up with. The Python version runs in about 5 seconds across 150 columns and 40,000 rows.

import pandas as pd

# Read the data dictionary and raw survey data
dictionary_df = pd.read_excel(filepath)
dictionary_df["Value"] = dictionary_df["Value"].astype(int)

survey_df = pd.read_excel(filepath)

# Replaces the original unlabelled survey_df with a labelled survey_df.

lookup_dict = {}
cols = list(dictionary_df.Field.unique())
for i in cols:
    subset = dictionary_df.loc[dictionary_df.Field==i]
    lookup_dict_vals = dict(zip(subset['Value'], subset['Description']))
    lookup_dict[i] = lookup_dict_vals

# This try/except is necessary where columns that are in the raw file are not in the data dictionary. These columns will remain untouched, but columns to the right will get labelled.

for i in cols:
    try:
        survey_df[i] = survey_df[i].map(lookup_dict[i])
    except:
        continue

I have tried to replicate this in R but get a dataframe full of NAs. Is there a way to fix the code below?

library(dplyr)

# Read the data dictionary and raw survey data
dictionary_df <- readxl::read_excel(filepath)
dictionary_df$Value <- as.integer(dictionary_df$Value)

survey_df <- readxl::read_excel(filepath)

# Replaces the original unlabelled survey_df with a labelled survey_df.

lookup_dict <- list()
cols <- unique(dictionary_df$Field)

for (i in cols) {
  subset <- filter(dictionary_df, Field == i)
  lookup_dict_vals <- setNames(subset$Description, subset$Value)
  lookup_dict[[i]] <- lookup_dict_vals
}

# This try/except is necessary where columns that are in the raw file are not in the data dictionary. These columns will remain untouched, but columns to the right will get labelled.

for (i in colnames(survey_df)) {
  tryCatch({
    survey_df <- survey_df %>% mutate({{i}} := lookup_dict[[i]][{{i}}])
  }, error = function(e) {
    next
  })
}

Answer 1

Score: 0

Since the dictionary could store entries as

0     1
Rural Metro

(not necessarily using the codes 1, 2, 3, ...), indexing the lookup vector with a number would select by position rather than by name. We therefore have to use names and which in a loop to find the position of each coded value, then subset the dictionary at those positions. This only adds a little overhead code involving names and unlist, which turns a single-column data frame into a vector.

for (i in colnames(survey_df)) {
  if (is.null(lookup_dict[[i]])) next # column not in the dictionary: leave it untouched
  k <- c()
  for (j in unlist(survey_df[i])) { # for each coded value
    slot <- which(names(lookup_dict[[i]]) == j) # find the definition's slot
    k <- c(k, if (length(slot) > 0) slot[1] else NA_integer_) # NA when the code is undefined
  }
  survey_df[[i]] <- unname(lookup_dict[[i]][k]) # take the descriptions at those slots
}
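The inner loop above grows k one element at a time, which may be slow on 40,000 rows. As a sketch of the same idea (look codes up by name, not by position), base R's match() can find all positions in one call. The helper name label_column is hypothetical, and the small lookup_dict and survey_df here are made up for the example.

```r
# Hypothetical helper: map coded values to descriptions via the names of a lookup vector.
label_column <- function(codes, lookup) {
  # match() gives, for each code, its position among the lookup names (NA if absent)
  pos <- match(as.character(codes), names(lookup))
  unname(lookup[pos])
}

# Made-up lookup in the same shape as lookup_dict[[i]]: names are codes, values are labels.
lookup_dict <- list(
  Region = c("0" = "Rural", "1" = "Metro")
)
survey_df <- data.frame(
  RespondentID = c(101L, 102L, 103L),  # not in the dictionary, stays untouched
  Region       = c(1L, 0L, 2L)         # 2 has no definition and becomes NA
)

# Only relabel columns that have a dictionary entry.
for (i in intersect(colnames(survey_df), names(lookup_dict))) {
  survey_df[[i]] <- label_column(survey_df[[i]], lookup_dict[[i]])
}
survey_df$Region
```

Looping over the intersection of column names and dictionary fields removes the need for tryCatch: columns without a dictionary entry are simply never visited.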

Answer 2

Score: 0

I'm curious how this compares performance-wise. The pivoting steps might be costly, but the join should be very fast.

library(tidyverse)
survey_df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "Field", values_to = "Value") %>%
  # left_join(dictionary_df) %>% # won't work for values not in dictionary
  # EDIT -- I think this will keep existing unmatched values
  rows_update(dictionary_df, unmatched = "ignore") %>%
  select(-Value) %>%
  pivot_wider(names_from = Field, values_from = Description)

Answer 3

Score: 0

I found that some minor changes to the code retain the original coded survey values where no match is found for the Field and Value in the dictionary. This is the code I used, thanks to Jon Spring's advice. By creating a helper column in the long table, I could merge the Description with the coded survey Value so that any NAs from the left join fall back to the original coded values.

```r
library(tidyverse)
library(writexl)  # provides write_xlsx()

survey_df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "Field", values_to = "Value") %>%
  left_join(dictionary_df, by = c("Field", "Value")) %>%
  mutate(Value = as.character(Value)) %>%
  # keep the description where one exists, otherwise the original code
  mutate(Description2 = coalesce(Description, Value)) %>%
  select(-Value, -Description) %>%
  pivot_wider(names_from = Field, values_from = Description2) %>%
  write_xlsx(filepath)
```
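As a small self-contained check of the join-and-coalesce approach above, here is a sketch on made-up data. The tables, the Age column, and the name labelled_df are assumptions for illustration; the final select(-row) is an addition that drops the helper column.

```r
library(dplyr)
library(tidyr)

# Made-up dictionary and survey; Age has no dictionary entry,
# and Region contains one code (2) that the dictionary does not define.
dictionary_df <- tibble(
  Field       = c("Region", "Region"),
  Value       = c(0L, 1L),
  Description = c("Rural", "Metro")
)
survey_df <- tibble(
  Age    = c(34L, 51L, 28L),
  Region = c(1L, 0L, 2L)
)

labelled_df <- survey_df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "Field", values_to = "Value") %>%
  left_join(dictionary_df, by = c("Field", "Value")) %>%
  # fall back to the original code when no description was joined
  mutate(Description2 = coalesce(Description, as.character(Value))) %>%
  select(-Value, -Description) %>%
  pivot_wider(names_from = Field, values_from = Description2) %>%
  select(-row)

labelled_df
```

One consequence of this design is that every column comes back as character, including columns such as Age that were never in the dictionary, because coalesce() needs both inputs to share a type. Convert such columns back afterwards if their numeric type matters.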



Posted by huangapple on 2023-05-29 10:39:33
Source: https://go.coder-hub.com/76354392.html