Spark Scala [for loop embedded with if-else] how can I not receive duplicate array
Question
I'm trying to count occurrences of certain words at the RDD level. I'm about halfway there, but the result is not exactly what I'm looking for.
I'm dealing with wine review comments like:
var aa = dataset.map(c => c(2))
>`Array[String] = Array("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, "Ripe aromas of fig, "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, "This spent 20 months in 30% new French oak, "This is the top wine from La Bégude, "Deep, `
I'm trying to count how many words from a list appear in each review:
var positive_list = List("tremendously", "delicious")
var sum = 0
var rr = aa.map(column =>
  for (i <- positive_list) yield {
    if (column.contains(i)) {
      sum = sum + 1
      (column, sum)
    } else {
      (column, 0)
    }
  })
rr.take(50)
Result:
>`Array(List(("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate,0), ("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate,0)), List(("Ripe aromas of fig,0), ("Ripe aromas of fig,0)), List(("Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious,1), ("Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious,2))`
As you can see, there are duplicate entries that I don't need.
I know this happens because [yield] returns a result on every iteration of the loop, but if I remove it I get nothing back in the list.
Is there a way to get just one result per review?
Answer 1
Score: 1
For each element in positive_list you are creating a record with the for loop, which is why every review shows up once per word. I assume that you want to map each review to the number of positive words it contains (so just one record per review). You can do that by using count on positive_list:
var rr = aa.map(column => column -> positive_list.count(column.contains))
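To make the difference concrete, here is a minimal sketch of the same idea. It uses plain Scala collections rather than a Spark RDD so that it runs without a cluster. The names `PositiveWordCount`, `countPositive` and `perWordFlags` are hypothetical and exist only for this illustration, and the two sample reviews are copied from the question.

```scala
object PositiveWordCount {
  // Word list from the question.
  val positiveList: List[String] = List("tremendously", "delicious")

  // What the for/yield in the question effectively does: one element per
  // word in the list, so every review produces words.length results.
  def perWordFlags(review: String, words: List[String]): List[Boolean] =
    for (w <- words) yield review.contains(w)

  // One (review, count) pair per review. `contains` is a substring match,
  // so "tremendous" does not match "tremendously".
  def countPositive(review: String, words: List[String]): (String, Int) =
    review -> words.count(w => review.contains(w))

  def main(args: Array[String]): Unit = {
    val reviews = Seq(
      "Ripe aromas of fig",
      "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious"
    )
    // Same shape as aa.map(...) in the answer, on a local Seq.
    val counted = reviews.map(r => countPositive(r, positiveList))
    counted.foreach(println)
  }
}
```

Assuming `aa` is an `RDD[String]` (or an `Array[String]`), the same function should be usable as `aa.map(r => PositiveWordCount.countPositive(r, positive_list))`. It also avoids the shared `var sum`, which keeps accumulating across reviews in the original code and, inside a Spark closure, is not a reliable way to carry state.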