Optimizing neo4j cypher query for recommendation
Question
This is my neo4j db schema:
There are around 2.5 million Article nodes, 0.5 million NamedEntity nodes, and a few thousand Trend nodes. Articles have a publication datetime and cover roughly the last two years.
As input from a user I get a list of NamedEntity ids, and I want a query that finds the articles with the best connections between those input NamedEntity ids and the Trends in my database. The query is quite long because I am doing some scoring, but I want to include it here in full.
MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
sum(
CASE
WHEN n.category in ['persName', 'orgName'] THEN 2*v.amount
WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5*v.amount
WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5*v.amount
ELSE 1.0*v.amount
END) as linkSum,
sum(
CASE
WHEN x.category in ['persName', 'orgName'] THEN 2*xv.amount
WHEN x.category in ['persName_surname', 'persName_addName'] THEN 1.5*xv.amount
WHEN x.category in ['date', 'time', 'persName_forename'] THEN 0.5*xv.amount
ELSE 1.0*xv.amount
END) as trendSum
WITH a,t, linkSum, trendSum,
reduce(total=0, ne in distinctLinkNes |
total +
CASE
WHEN ne.category in ['persName', 'orgName'] THEN 2
WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
ELSE 1.0
END) as distinctLinkNesAmount,
reduce(total=0, ne in distinctTrendNes |
total +
CASE
WHEN ne.category in ['persName', 'orgName'] THEN 2
WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
ELSE 1.0
END) as distinctTrendNesAmount
WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
(3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore,
(3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
(articleScore + trendScore) as score
ORDER BY score DESC
RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
LIMIT 30
daysDelta will usually be the latest 7-14 days. This query is pretty slow: depending on the input NamedEntity ids it takes from a few seconds up to a few minutes. I tried to debug this using PROFILE; here is the result:
From what I have read, I should decrease cardinality (or maybe something else, I am happy for suggestions), but I have no idea how to do that in my query when I have to do all this scoring. Also, in my query I will never need to get more than 30 results.
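To make the scoring concrete, here is a purely illustrative calculation with made-up numbers (they do not come from my data); it only shows how the two partial scores are combined:
// Hypothetical values: distinctTrendNesAmount = 2, trendSum = 6, t.hits = 500,
// distinctLinkNesAmount = 3.5, linkSum = 12, the article is 3 days old, $daysDelta = 14
RETURN (3*2 + 6) * 500 / 1000 AS trendScore,                      // = 6
       (3*3.5 + 12) * (14 - 3) AS articleScore,                   // = 247.5
       (3*2 + 6) * 500 / 1000 + (3*3.5 + 12) * (14 - 3) AS score  // = 253.5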
Answer 1
Score: 2
I would suggest a minor tweak in your schema. In the NamedEntity node, store an additional property named multiplicationFactor, which will store the values 2, 1.5, 1.0, 0.5 that you are using in your CASE statements. After analyzing the profile graph, I have noticed that the aggregation operations are more expensive than the graph traversals, so this one change should help a lot. Set the new property using this query:
MATCH (n:NamedEntity)
WITH n, CASE
WHEN n.category in ['persName', 'orgName'] THEN 2
WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5
WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5
ELSE 1.0
END AS multiplicationFactor
SET n.multiplicationFactor = multiplicationFactor
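If setting the property on roughly half a million NamedEntity nodes in a single transaction is too heavy, the update can also be batched. This is only a sketch and assumes Neo4j 4.4 or later, where CALL { ... } IN TRANSACTIONS is available and has to be run as an auto-commit query (for example with :auto in Neo4j Browser):
// Same CASE mapping as above, applied in batches of 10,000 nodes per transaction
MATCH (n:NamedEntity)
CALL {
  WITH n
  SET n.multiplicationFactor =
    CASE
      WHEN n.category IN ['persName', 'orgName'] THEN 2
      WHEN n.category IN ['persName_surname', 'persName_addName'] THEN 1.5
      WHEN n.category IN ['date', 'time', 'persName_forename'] THEN 0.5
      ELSE 1.0
    END
} IN TRANSACTIONS OF 10000 ROWS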
Your recommendation query will now become this:
MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
sum(n.multiplicationFactor * v.amount) as linkSum,
sum(x.multiplicationFactor * xv.amount) as trendSum
WITH a,t, linkSum, trendSum,
reduce(total=0, ne in distinctLinkNes |
total + ne.multiplicationFactor) as distinctLinkNesAmount,
reduce(total=0, ne in distinctTrendNes |
total + ne.multiplicationFactor) as distinctTrendNesAmount
WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
(3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore,
(3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
(articleScore + trendScore) as score
ORDER BY score DESC
RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
LIMIT 30
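To compare the new plan against the old one, the rewritten query can be profiled with the same parameters. The values below are only placeholders for $currentDay and $daysDelta, set here with the Neo4j Browser :param command before prefixing the query above with PROFILE:
:param currentDay => '2023-01-09T00:00:00'
:param daysDelta => 14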
Comments