August 10, 2023, 23:48:46

String data matching within the same column - R

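Question

I have a dataset of job postings along with salary information for certain occupations, and I am trying to create a subset that standardizes job titles through fuzzy matching. Specifically, a job titled "Cost Accountant" with a monthly wage of $4000 and a "Financial Accountant" with $5000 would both be grouped under a standardized title, "Accountant", whose wage is the average of the jobs with similar names.
Here is my code so far: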
```{r setup, include=FALSE}
# Load packages
library(stringr)
library(dplyr)

# Print a sample of the data with specific columns
dput(job_posts[1:20, c(4, 27)])
```

Output:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless", "Accountant", 
"Calls Admin", "Financial Accountant", "Sales Representative", 
"Procurement Assistant", "Water Quality Analyst", "Resident Engineer", 
"Cost Accountant", "Product Specilaist-2", "Operations Coordinator"
), monthly_income = c(NA, 8500, NA, 20000, 15000, NA, 3500, NA, 
NA, 4000, NA, 500, NA, 5000, NA, 8500, 20000, 9000, 4100, 4500)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

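I followed the instructions here, which gave me a good start because they flag the other rows/observations that were matched, but I am not able to standardize the job titles as described in the example above.

# Fuzzy matching on job titles, so that similar jobs are stored in one data frame
job_posts$matched <- sapply(job_posts$jobtitle, agrep, job_posts$jobtitle)

# Print a sample of the data with specific columns
dput(job_posts[1:10, c(4, 27, 28)])

Output: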

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless"), monthly_income = c(NA, 
8500, NA, 20000, 15000, NA, NA, NA, NA, NA), matched = list(`PE Teacher` = c(1L, 
1111L), `Accountant` = 2L, 
    `Dewatering Supervisor` = 3L, `sales account manager` = c(4L, 
    1242L, 1309L, 1524L, 1783L), `Sales Lead` = c(5L, 1984L), 
    `Assistant Housekeeping Manager` = 6L, `Quality Manager` = c(7L, 
    196L, 650L, 1856L, 2330L), `Approval Officer` = 8L, Logistics = c(9L, 
    71L, 129L, 176L, 362L, 444L, 446L, 587L, 655L, 935L, 1413L, 
    1508L, 1835L, 2176L, 2300L, 2370L, 2657L, 2685L, 2770L), 
    `Systems Engineer - Networking/Wireless` = 10L)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))
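
To make the shape of the `matched` column easier to see, the same `sapply()`/`agrep()` call can be run on a tiny vector. This is only an illustrative sketch: the `titles` vector is invented and is not taken from `job_posts`. Each list element is named after the title used as the pattern and holds the positions of all titles that approximately contain it.

```r
# Hypothetical mini-example, not the real job_posts data
titles <- c("Accountant", "Cost Accountant", "Financial Accountant", "PE Teacher")

# Use each title in turn as the pattern and look for it (approximately)
# inside every element of `titles`
matched <- sapply(titles, agrep, titles)

# A named list of integer index vectors, one per title
str(matched)
```

Because a short title such as "Accountant" is found inside the longer ones, its entry lists several positions, while the longer titles mostly match only themselves. The result therefore flags related rows but does not by itself pick one standard name per group.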

The current data frame looks like this:

jobtitle                 avg_wage
Financial Accountant     $5000   
Cost Accountant          $4000
Retail Accountant        $4000

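The desired outcome is shown below: the average wage is the mean over all accounting jobs, and instead of "Cost Accountant" or "Financial Accountant", every accounting job gets a single title such as "Accountant":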

jobtitle       avg_wage
Accountant     $4333


Answer 1

Score: 1

I think this is what you want, though I'm not entirely sure:

library(tidyverse)
# Same as the smallest example data frame you gave, plus one unrelated row for demonstration
data <- data.frame(
  jobtitle = c("Financial Accountant", "Cost Accountant", "Retail Accountant", "Instagram Influencer"),
  avg_wage = c("$5000", "$4000", "$4000", "$1000")
)
# Likewise, an example list of job groups
job_groups <- c("Accountant", "Butcher", "Baker", "Candlestick Maker")
# For each job title, try to extract every job group from it and drop the NA values;
# return NA if no group was found, otherwise the matched group
# (map_chr() expects at most one group to match per title)
mutate(data, grp = map_chr(jobtitle, ~ str_extract(.x, job_groups) %>% {.[!is.na(.)]} %>% if (length(.) == 0) NA_character_ else .))

Output:

              jobtitle avg_wage        grp
1 Financial Accountant    $5000 Accountant
2      Cost Accountant    $4000 Accountant
3    Retail Accountant    $4000 Accountant
4 Instagram Influencer    $1000       <NA>
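
To get from the `grp` column to the averaged table the question asks for, one possible follow-up is to convert `avg_wage` to numbers and summarise by group. This is a sketch rather than part of the original answer: it assumes wages are strings such as "$5000", and the names `first_group`, `wage`, and `result` are made up for illustration.

```r
library(dplyr)
library(stringr)

data <- data.frame(
  jobtitle = c("Financial Accountant", "Cost Accountant", "Retail Accountant", "Instagram Influencer"),
  avg_wage = c("$5000", "$4000", "$4000", "$1000"),
  stringsAsFactors = FALSE
)
job_groups <- c("Accountant", "Butcher", "Baker", "Candlestick Maker")

# Hypothetical helper: first job group found in a title, or NA if none matches
first_group <- function(title) {
  hits <- job_groups[str_detect(title, job_groups)]
  if (length(hits) == 0) NA_character_ else hits[1]
}

result <- data %>%
  mutate(
    grp  = vapply(jobtitle, first_group, character(1), USE.NAMES = FALSE),
    wage = as.numeric(str_remove_all(avg_wage, "[^0-9.]"))  # strip "$" and other non-numeric characters
  ) %>%
  filter(!is.na(grp)) %>%   # drop titles that match no group
  group_by(grp) %>%
  summarise(avg_wage = mean(wage, na.rm = TRUE), .groups = "drop")

result
```

With the example data this leaves a single "Accountant" row whose average is about 4333, the mean of 5000, 4000, and 4000. Taking only the first hit is a design choice that avoids the error `map_chr()` would raise if a title matched more than one group.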


