Optimizing neo4j cypher query for recommendation

huangapple go评论47阅读模式
英文:

Optimazing neo4j cypher query for recommendation

问题

这是我的neo4j数据库架构:

在数据库中有大约250万个文章节点、50万个命名实体节点和数千个趋势节点。文章具有发布日期,而且这些文章都是来自过去两年左右的。作为用户的输入,我会获得一组命名实体ID。我想要查询将在我的数据库中找到与这些输入命名实体ID和趋势之间有最佳连接的文章的查询。查询非常长,因为我正在进行一些评分,但我想将整个查询放在这里。

daysDelta通常为最近的7-14天。这个查询速度相当慢。根据输入的命名实体ID不同,查询可能需要几秒钟到几分钟。我尝试使用PROFILE进行调试,以下是结果。

从我所了解的信息来看,您应该减少基数。但我不知道如何在我的查询中执行这一操作,因为我必须进行所有这些评分。而且在我的查询中,我永远不需要获取超过30个结果。

英文:

This is my neo4j db schema:

Optimizing neo4j cypher query for recommendation

There is around 2.5 millions of Article nodes, 0.5 million of NamedEntityNodes and few thousand of Trend nodes. Articles have publication datetime and they are from about last two years.
As an input from a user I get list of NamedEntitiesIds. And I want to have query which will find articles with best connections between those input NamedEntitiesIds and Trends from my database. Query is quite long because I am doing some scoring but I want to put it whole here.

MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
        WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
        WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
            sum(
                CASE
                    WHEN n.category in ['persName', 'orgName'] THEN 2*v.amount
                    WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5*v.amount
                    WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5*v.amount
                    ELSE 1.0*v.amount
                END) as linkSum,
            sum(
                CASE
                    WHEN x.category in ['persName', 'orgName'] THEN 2*xv.amount
                    WHEN x.category in ['persName_surname', 'persName_addName'] THEN 1.5*xv.amount
                    WHEN x.category in ['date', 'time', 'persName_forename'] THEN 0.5*xv.amount
                    ELSE 1.0*xv.amount
                END) as trendSum
        WITH a,t, linkSum, trendSum,
            reduce(total=0, ne in distinctLinkNes |
            total + 
            CASE
                WHEN ne.category in ['persName', 'orgName'] THEN 2
                WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
                WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
                ELSE 1.0
            END) as distinctLinkNesAmount,
            reduce(total=0, ne in distinctTrendNes |
            total + 
            CASE
                WHEN ne.category in ['persName', 'orgName'] THEN 2
                WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
                WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
                ELSE 1.0
            END) as distinctTrendNesAmount
        WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
            (3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore, 
            (3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
        WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
            (articleScore + trendScore) as score
        ORDER BY score DESC
        RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
        LIMIT 30

daysDelta will be usually latest 7-14 days. This query is pretty slow. Depending on the input NamedEntitiesId i takes from few seconds up to few minutes. I tried to debug this using PROFILE here it is the result:
Optimizing neo4j cypher query for recommendation

From what I had read I should decrease cardinality. (or maybe something else I am happy for suggestions). But I have no idea how to do it in my query. When I have to do all this scoring. And in my query I will never have to get more then 30 results.

答案1

得分: 2

我建议在你的模式中进行一些微小的调整。在NamedEntity节点中,存储一个额外的属性,名为multiplicationFactor,它将存储你在case语句中使用的值:2, 1.5, 1.0, 0.5。在分析了概要图之后,我注意到聚合操作比图遍历更昂贵。因此,这个变化应该会有很大帮助。使用以下查询来设置新属性:

MATCH (n:NamedEntity)
WITH n, CASE
            WHEN n.category in ['persName', 'orgName'] THEN 2
            WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5
            WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5
            ELSE 1.0
        END AS multiplicationFactor
SET n.multiplicationFactor = multiplicationFactor

你的推荐查询现在应该是这样的:

MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
            sum(n.multiplicationFactor * v.amount) as linkSum,
            sum(x.multiplicationFactor * xv.amount) as trendSum
WITH a,t, linkSum, trendSum,
            reduce(total=0, ne in distinctLinkNes |
            total + ne.multiplicationFactor) as distinctLinkNesAmount,
            reduce(total=0, ne in distinctTrendNes |
            total + ne.multiplicationFactor) as distinctTrendNesAmount
WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
            (3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore, 
            (3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
            (articleScore + trendScore) as score
ORDER BY score DESC
RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
LIMIT 30
英文:

I would suggest a minor tweak in your schema. In the NamedEntity node, store an additional property named multiplicationFactor, which will store the values 2, 1.5, 1.0, 0.5 that you are using in your case statements. After analyzing the profile graph, I have noticed that the aggregation operations are more expensive than the graph traversals. So this one change should help a lot. Set the new property using this query:

MATCH (n:NamedEntity)
WITH n, CASE
            WHEN n.category in ['persName', 'orgName'] THEN 2
            WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5
            WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5
            ELSE 1.0
        END AS multiplicationFactor
SET n.multiplicationFactor = multiplicationFactor

Your recommendation query will now become this:

MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
            sum(n.multiplicationFactor * v.amount) as linkSum,
            sum(x.multiplicationFactor * xv.amount) as trendSum
WITH a,t, linkSum, trendSum,
            reduce(total=0, ne in distinctLinkNes |
            total + ne.multiplicationFactor) as distinctLinkNesAmount,
            reduce(total=0, ne in distinctTrendNes |
            total + ne.multiplicationFactor) as distinctTrendNesAmount
WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
            (3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore, 
            (3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
            (articleScore + trendScore) as score
ORDER BY score DESC
RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
LIMIT 30

huangapple
  • 本文由 发表于 2023年4月13日 19:49:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/76005068.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定