Spark Scala [for loop with an embedded if-else]: how can I avoid receiving duplicate arrays?

Question

I'm trying to count certain words in an array of strings at the RDD level. It is about halfway done, but the result is not exactly what I'm looking for.

I'm dealing with wine review comments like these:

var aa = dataset.map(c => c(2))

>`Array[String] = Array("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, "Ripe aromas of fig, "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, "This spent 20 months in 30% new French oak, "This is the top wine from La Bégude, "Deep, `

I'm trying to count how many of the words in a list appear in each review:

var positive_list = List("tremendously", "delicious")
var sum = 0

var rr = aa.map(column =>
  for (i <- positive_list) yield {
    if (column.contains(i)) {
      sum = sum + 1
      (column, sum)
    } else {
      (column, 0)
    }
  })

rr.take(50)



Result:
>`Array(List(("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate,0), ("This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate,0)), List(("Ripe aromas of fig,0), ("Ripe aromas of fig,0)), List(("Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious,1), ("Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious,2))`

As you can see, each review appears in the list once per word, and I don't need those duplicate entries.
I know this happens because yield returns a result on every iteration of the loop, but I can't remove it; otherwise I get nothing in the list.

Is there a way to avoid this?


Answer 1

Score: 1

For each element in positive_list you are creating a record with the for loop. I assume that you want to map your review to the number of positive words it contains (so just one record per review). You can do it by using count on positive_list:

var rr = aa.map(column => column -> positive_list.count(column.contains))
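
To make the difference concrete, here is a minimal sketch that runs on plain Scala lists instead of an RDD, so it needs no Spark setup. The object name `PositiveWordCount`, the shortened sample reviews, and the names `perWord` and `perReview` are made up for illustration; the function passed to `map` has the same shape as the one you would pass to `aa.map` on the RDD.

```scala
object PositiveWordCount {
  // Shortened stand-ins for the reviews in the question (hypothetical sample data).
  val reviews: List[String] = List(
    "Ripe aromas of fig",
    "made by his mother in this tremendously delicious"
  )

  val positiveList: List[String] = List("tremendously", "delicious")

  // for/yield over positiveList produces one element per word,
  // so every review maps to a list with positiveList.size entries.
  val perWord: List[List[(String, Int)]] = reviews.map { review =>
    for (word <- positiveList)
      yield (review, if (review.contains(word)) 1 else 0)
  }

  // count collapses the word list into a single number,
  // so every review maps to exactly one (review, count) pair.
  val perReview: List[(String, Int)] =
    reviews.map(review => review -> positiveList.count(word => review.contains(word)))

  def main(args: Array[String]): Unit =
    perReview.foreach(println)
}
```

Running `main` prints `(Ripe aromas of fig,0)` and `(made by his mother in this tremendously delicious,2)`. The count version also needs no shared `var sum`, so no mutable state is captured by the function passed to `map`.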

huangapple
  • Posted on January 6, 2020, 17:41:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59609756.html