在同一列中进行字符串数据匹配 – R

huangapple go评论120阅读模式
英文:

String data matching within the same column - R

问题

以下是您的翻译结果:

我有一份有关个人工作的数据集,其中包含某些职业薪水信息,我正在尝试创建一个子集,通过模糊匹配来标准化职位名称。具体来说,一个名为“Cost Accountant”的职位,月薪为4000美元,以及一个名为“Financial Accountant”的职位,月薪为5000美元,将在一个名为“Accountant”的新列下匹配,该列计算具有相似名称的工作的平均值。

以下是我的代码:
#上传包
```{r setup, include=FALSE}
library(stringr)
library(dplyr)
# 打印具有特定列的数据示例
dput(job_posts[1:20,c(4,27)])

输出:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless", "Accountant", 
"Calls Admin", "Financial Accountant", "Sales Representative", 
"Procurement Assistant", "Water Quality Analyst", "Resident Engineer", 
"Cost Accountant", "Product Specilaist-2", "Operations Coordinator"
), monthly_income = c(NA, 8500, NA, 20000, 15000, NA, 3500, NA, 
NA, 4000, NA, 500, NA, 5000, NA, 8500, 20000, 9000, 4100, 4500)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

我已按照这里的说明操作,这让我有了一个良好的开始,因为它可以标记已匹配的其他行/观察结果,但我无法像我之前的示例中所解释的那样标准化职位名称。

# 对职位名称进行模糊匹配,以便将相似的工作存储在一个数据框中
job_posts$matched <- sapply(job_posts$jobtitle,agrep,job_posts$jobtitle)
# 打印具有特定列的数据示例
dput(job_posts[1:10,c(4,27,28)])

输出:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless"), monthly_income = c(NA, 
8500, NA, 20000, 15000, NA, NA, NA, NA, NA), matched = list(`PE Teacher` = c(1L, 
1111L), `Accountant` = 2L, 
    `Dewatering Supervisor` = 3L, `sales account manager` = c(4L, 
    1242L, 1309L, 1524L, 1783L), `Sales Lead` = c(5L, 1984L), 
    `Assistant Housekeeping Manager` = 6L, `Quality Manager` = c(7L, 
    196L, 650L, 1856L, 2330L), `Approval Officer` = 8L, Logistics = c(9L, 
    71L, 129L, 176L, 362L, 444L, 446L, 587L, 655L, 935L, 1413L, 
    1508L, 1835L, 2176L, 2300L, 2370L, 2657L, 2685L, 2770L), 
    `Systems Engineer - Networking/Wireless` = 10L)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

当前数据框如下所示:

jobtitle                 avg_wage
Financial Accountant     $5000   
Cost Accountant          $4000
Retail Accountant        $4000

期望的结果如下,其中平均工资是基于所有会计工作的均值,并且不再是“成本会计”或“财务会计”,而是所有会计工作都会成为“会计师”之类的名称:

jobtitle       avg_wage
Accountant     $4333  

<details>
<summary>英文:</summary>

I have a dataset of jobs for individuals along with some information on salaries for certain occupations, and I am trying to create a subset that standardizes job names through fuzzy matching. Specifically, a job title called &quot;Cost Accountant&quot; with monthly wage of $4000 and &quot;Financial Accountant&quot; with $5000 would be matched under a new column called &quot;Accountant&quot; that computes the average of the jobs with similar names.

Here is my code thus far:
#upload packages
```{r setup, include=FALSE}
library(stringr)
library(dplyr)
# Print data example with specific columns
dput(job_posts[1:20,c(4,27)])

output:

structure(list(jobtitle = c(&quot;PE Teacher&quot;, &quot;Accountant&quot;, 
&quot;Dewatering Supervisor&quot;, &quot;sales account manager&quot;, &quot;Sales Lead&quot;, 
&quot;Assistant Housekeeping Manager&quot;, &quot;Quality Manager&quot;, &quot;Approval Officer&quot;, 
&quot;Logistics&quot;, &quot;Systems Engineer - Networking/Wireless&quot;, &quot;Accountant&quot;, 
&quot;Calls Admin&quot;, &quot;Financial Accountant&quot;, &quot;Sales Representative&quot;, 
&quot;Procurement Assistant&quot;, &quot;Water Quality Analyst&quot;, &quot;Resident Engineer&quot;, 
&quot;Cost Accountant&quot;, &quot;Product Specilaist-2&quot;, &quot;Operations Coordinator&quot;
), monthly_income = c(NA, 8500, NA, 20000, 15000, NA, 3500, NA, 
NA, 4000, NA, 500, NA, 5000, NA, 8500, 20000, 9000, 4100, 4500)), row.names = c(NA, 
-20L), class = c(&quot;tbl_df&quot;, &quot;tbl&quot;, &quot;data.frame&quot;))

I have followed the instructions here which gave me a good start because it flags other rows/observations that have been matched, but I am not able to standardize job titles as I explained in the example earlier.

# fuzzy matching for job titles, so that similar jobs are stored in one df
job_posts$matched &lt;- sapply(job_posts$jobtitle,agrep,job_posts$jobtitle)
# Print data example with specific columns
dput(job_posts[1:10,c(4,27,28)])

output:

structure(list(jobtitle = c(&quot;PE Teacher&quot;, &quot;Accountant&quot;, 
&quot;Dewatering Supervisor&quot;, &quot;sales account manager&quot;, &quot;Sales Lead&quot;, 
&quot;Assistant Housekeeping Manager&quot;, &quot;Quality Manager&quot;, &quot;Approval Officer&quot;, 
&quot;Logistics&quot;, &quot;Systems Engineer - Networking/Wireless&quot;), monthly_income = c(NA, 
8500, NA, 20000, 15000, NA, NA, NA, NA, NA), matched = list(`PE Teacher` = c(1L, 
1111L), `Accountant` = 2L, 
    `Dewatering Supervisor` = 3L, `sales account manager` = c(4L, 
    1242L, 1309L, 1524L, 1783L), `Sales Lead` = c(5L, 1984L), 
    `Assistant Housekeeping Manager` = 6L, `Quality Manager` = c(7L, 
    196L, 650L, 1856L, 2330L), `Approval Officer` = 8L, Logistics = c(9L, 
    71L, 129L, 176L, 362L, 444L, 446L, 587L, 655L, 935L, 1413L, 
    1508L, 1835L, 2176L, 2300L, 2370L, 2657L, 2685L, 2770L), 
    `Systems Engineer - Networking/Wireless` = 10L)), row.names = c(NA, 
-10L), class = c(&quot;tbl_df&quot;, &quot;tbl&quot;, &quot;data.frame&quot;))

The current df looks as follows:

jobtitle                 avg_wage
Financial Accountant     $5000   
Cost Accountant          $4000
Retail Accountant        $4000

The desired outcome is as follows, where the average wage is based on a mean of all accounting wages and instead of "cost accountant" or "financial accountant", all accounting jobs would be something like "Accountant"

jobtitle       avg_wage
Accountant     $4333  

答案1

得分: 1

以下是您要翻译的部分:

library(tidyverse)

# 与您提供的最小示例数据框相同,但多了一个不相关的行用于演示
data <- data.frame(
  jobtitle = c("Financial Accountant", "Cost Accountant", "Retail Accountant", "Instagram Influencer"),
  avg_wage = c("$5000", "$4000", "$4000", "$1000")
)

# 与此相同
job_groups <- c("Accountant", "Butcher", "Baker", "Candlestick Maker")

# 基本上这里发生的是,我们正在查找每个职位标题中的职位组,删除NA值,然后如果标题中没有职位组,我们返回NA,否则返回职位标题
mutate(data, grp = map_chr(jobtitle, ~ str_extract(.x, job_groups) %>% {.[!is.na(.)]} %>% if (length(.) == 0) NA_character_ else .))

输出:

              jobtitle avg_wage        grp
1 Financial Accountant    $5000 Accountant
2      Cost Accountant    $4000 Accountant
3    Retail Accountant    $4000 Accountant
4 Instagram Influencer    $1000       <NA>

<details>
<summary>英文:</summary>

I think this is what you want? Though I&#39;m not entirely sure:

library(tidyverse)

the same as the smallest example dataframe you gave, with an extra irrelevant row for demonstration

data <- data.frame(
jobtitle = c("Financial Accountant", "Cost Accountant", "Retail Accountant", "Instagram Influencer"),
avg_wage = c("$5000", "$4000", "$4000", "$1000")
)

same with this

job_groups <- c("Accountant", "Butcher", "Baker", "Candlestick Maker")

basically what's happening here is we're looking for the job group in each job title, removing NA values, then if there's no job group in the title, we're returning NA, else returning the job title(s)

mutate(data, grp = map_chr(jobtitle, ~ str_extract(.x, job_groups) %>% {.[!is.na(.)]} %>% if (length(.) == 0) NA_character_ else .))

Output:
          jobtitle avg_wage        grp

1 Financial Accountant $5000 Accountant
2 Cost Accountant $4000 Accountant
3 Retail Accountant $4000 Accountant
4 Instagram Influencer $1000 <NA>


</details>



huangapple
  • 本文由 发表于 2023年8月10日 23:48:46
  • 转载请务必保留本文链接:https://go.coder-hub.com/76877371.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定