Optimizing neo4j cypher query for recommendation

huangapple go评论77阅读模式
英文:

Optimazing neo4j cypher query for recommendation

问题

这是我的neo4j数据库架构:

在数据库中有大约250万个文章节点、50万个命名实体节点和数千个趋势节点。文章具有发布日期,而且这些文章都是来自过去两年左右的。作为用户的输入,我会获得一组命名实体ID。我想要查询将在我的数据库中找到与这些输入命名实体ID和趋势之间有最佳连接的文章的查询。查询非常长,因为我正在进行一些评分,但我想将整个查询放在这里。

daysDelta通常为最近的7-14天。这个查询速度相当慢。根据输入的命名实体ID不同,查询可能需要几秒钟到几分钟。我尝试使用PROFILE进行调试,以下是结果。

从我所了解的信息来看,您应该减少基数。但我不知道如何在我的查询中执行这一操作,因为我必须进行所有这些评分。而且在我的查询中,我永远不需要获取超过30个结果。

英文:

This is my neo4j db schema:

Optimizing neo4j cypher query for recommendation

There is around 2.5 millions of Article nodes, 0.5 million of NamedEntityNodes and few thousand of Trend nodes. Articles have publication datetime and they are from about last two years.
As an input from a user I get list of NamedEntitiesIds. And I want to have query which will find articles with best connections between those input NamedEntitiesIds and Trends from my database. Query is quite long because I am doing some scoring but I want to put it whole here.

  1. MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
  2. WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", &quotyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
  3. WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
  4. sum(
  5. CASE
  6. WHEN n.category in ['persName', 'orgName'] THEN 2*v.amount
  7. WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5*v.amount
  8. WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5*v.amount
  9. ELSE 1.0*v.amount
  10. END) as linkSum,
  11. sum(
  12. CASE
  13. WHEN x.category in ['persName', 'orgName'] THEN 2*xv.amount
  14. WHEN x.category in ['persName_surname', 'persName_addName'] THEN 1.5*xv.amount
  15. WHEN x.category in ['date', 'time', 'persName_forename'] THEN 0.5*xv.amount
  16. ELSE 1.0*xv.amount
  17. END) as trendSum
  18. WITH a,t, linkSum, trendSum,
  19. reduce(total=0, ne in distinctLinkNes |
  20. total +
  21. CASE
  22. WHEN ne.category in ['persName', 'orgName'] THEN 2
  23. WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
  24. WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
  25. ELSE 1.0
  26. END) as distinctLinkNesAmount,
  27. reduce(total=0, ne in distinctTrendNes |
  28. total +
  29. CASE
  30. WHEN ne.category in ['persName', 'orgName'] THEN 2
  31. WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
  32. WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
  33. ELSE 1.0
  34. END) as distinctTrendNesAmount
  35. WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
  36. (3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore,
  37. (3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
  38. WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
  39. (articleScore + trendScore) as score
  40. ORDER BY score DESC
  41. RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
  42. LIMIT 30

daysDelta will be usually latest 7-14 days. This query is pretty slow. Depending on the input NamedEntitiesId i takes from few seconds up to few minutes. I tried to debug this using PROFILE here it is the result:
Optimizing neo4j cypher query for recommendation

From what I had read I should decrease cardinality. (or maybe something else I am happy for suggestions). But I have no idea how to do it in my query. When I have to do all this scoring. And in my query I will never have to get more then 30 results.

答案1

得分: 2

我建议在你的模式中进行一些微小的调整。在NamedEntity节点中,存储一个额外的属性,名为multiplicationFactor,它将存储你在case语句中使用的值:2, 1.5, 1.0, 0.5。在分析了概要图之后,我注意到聚合操作比图遍历更昂贵。因此,这个变化应该会有很大帮助。使用以下查询来设置新属性:

  1. MATCH (n:NamedEntity)
  2. WITH n, CASE
  3. WHEN n.category in ['persName', 'orgName'] THEN 2
  4. WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5
  5. WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5
  6. ELSE 1.0
  7. END AS multiplicationFactor
  8. SET n.multiplicationFactor = multiplicationFactor

你的推荐查询现在应该是这样的:

  1. MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
  2. WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
  3. WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
  4. sum(n.multiplicationFactor * v.amount) as linkSum,
  5. sum(x.multiplicationFactor * xv.amount) as trendSum
  6. WITH a,t, linkSum, trendSum,
  7. reduce(total=0, ne in distinctLinkNes |
  8. total + ne.multiplicationFactor) as distinctLinkNesAmount,
  9. reduce(total=0, ne in distinctTrendNes |
  10. total + ne.multiplicationFactor) as distinctTrendNesAmount
  11. WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
  12. (3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore,
  13. (3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
  14. WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
  15. (articleScore + trendScore) as score
  16. ORDER BY score DESC
  17. RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
  18. LIMIT 30
英文:

I would suggest a minor tweak in your schema. In the NamedEntity node, store an additional property named multiplicationFactor, which will store the values 2, 1.5, 1.0, 0.5 that you are using in your case statements. After analyzing the profile graph, I have noticed that the aggregation operations are more expensive than the graph traversals. So this one change should help a lot. Set the new property using this query:

  1. MATCH (n:NamedEntity)
  2. WITH n, CASE
  3. WHEN n.category in ['persName', 'orgName'] THEN 2
  4. WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5
  5. WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5
  6. ELSE 1.0
  7. END AS multiplicationFactor
  8. SET n.multiplicationFactor = multiplicationFactor

Your recommendation query will now become this:

  1. MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
  2. WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", &quotyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
  3. WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
  4. sum(n.multiplicationFactor * v.amount) as linkSum,
  5. sum(x.multiplicationFactor * xv.amount) as trendSum
  6. WITH a,t, linkSum, trendSum,
  7. reduce(total=0, ne in distinctLinkNes |
  8. total + ne.multiplicationFactor) as distinctLinkNesAmount,
  9. reduce(total=0, ne in distinctTrendNes |
  10. total + ne.multiplicationFactor) as distinctTrendNesAmount
  11. WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
  12. (3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore,
  13. (3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
  14. WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
  15. (articleScore + trendScore) as score
  16. ORDER BY score DESC
  17. RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
  18. LIMIT 30

huangapple
  • 本文由 发表于 2023年4月13日 19:49:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/76005068.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定