Optimizing neo4j cypher query for recommendation
Question
This is my neo4j db schema:
There are around 2.5 million Article nodes, 0.5 million NamedEntity nodes, and a few thousand Trend nodes. Articles have a publication datetime and cover roughly the last two years.
As input from a user I get a list of NamedEntity ids, and I want a query that finds the articles with the best connections between those input NamedEntity ids and the Trends in my database. The query is quite long because I am doing some scoring, but I want to include it here in full.
MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
sum(
CASE
WHEN n.category in ['persName', 'orgName'] THEN 2*v.amount
WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5*v.amount
WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5*v.amount
ELSE 1.0*v.amount
END) as linkSum,
sum(
CASE
WHEN x.category in ['persName', 'orgName'] THEN 2*xv.amount
WHEN x.category in ['persName_surname', 'persName_addName'] THEN 1.5*xv.amount
WHEN x.category in ['date', 'time', 'persName_forename'] THEN 0.5*xv.amount
ELSE 1.0*xv.amount
END) as trendSum
WITH a,t, linkSum, trendSum,
reduce(total=0, ne in distinctLinkNes |
total +
CASE
WHEN ne.category in ['persName', 'orgName'] THEN 2
WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
ELSE 1.0
END) as distinctLinkNesAmount,
reduce(total=0, ne in distinctTrendNes |
total +
CASE
WHEN ne.category in ['persName', 'orgName'] THEN 2
WHEN ne.category in ['persName_surname', 'persName_addName'] THEN 1.5
WHEN ne.category in ['date', 'time', 'persName_forename'] THEN 0.5
ELSE 1.0
END) as distinctTrendNesAmount
WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
(3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore,
(3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
(articleScore + trendScore) as score
ORDER BY score DESC
RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
LIMIT 30
daysDelta will usually be the latest 7-14 days. This query is pretty slow: depending on the input NamedEntity ids it takes from a few seconds up to a few minutes. I tried to debug this using PROFILE; here is the result:
From what I have read, I should decrease cardinality (or maybe something else, I am happy for suggestions), but I have no idea how to do that in my query when I have to do all this scoring. Also, in my query I will never need to get more than 30 results.
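To make the scoring concrete, here is a purely illustrative calculation with made-up numbers (they do not come from my data); it only shows how the two partial scores are combined:
// Hypothetical values: distinctTrendNesAmount = 2, trendSum = 6, t.hits = 500,
// distinctLinkNesAmount = 3.5, linkSum = 12, the article is 3 days old, $daysDelta = 14
RETURN (3*2 + 6) * 500 / 1000 AS trendScore,                      // = 6
       (3*3.5 + 12) * (14 - 3) AS articleScore,                   // = 247.5
       (3*2 + 6) * 500 / 1000 + (3*3.5 + 12) * (14 - 3) AS score  // = 253.5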
Answer 1
Score: 2
I would suggest a minor tweak in your schema. In the NamedEntity node, store an additional property named multiplicationFactor, which will store the values 2, 1.5, 1.0, 0.5 that you are using in your CASE statements. After analyzing the profile graph, I have noticed that the aggregation operations are more expensive than the graph traversals, so this one change should help a lot. Set the new property using this query:
MATCH (n:NamedEntity)
WITH n, CASE
WHEN n.category in ['persName', 'orgName'] THEN 2
WHEN n.category in ['persName_surname', 'persName_addName'] THEN 1.5
WHEN n.category in ['date', 'time', 'persName_forename'] THEN 0.5
ELSE 1.0
END AS multiplicationFactor
SET n.multiplicationFactor = multiplicationFactor
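If setting the property on roughly half a million NamedEntity nodes in a single transaction is too heavy, the update can also be batched. This is only a sketch and assumes Neo4j 4.4 or later, where CALL { ... } IN TRANSACTIONS is available and has to be run as an auto-commit query (for example with :auto in Neo4j Browser):
// Same CASE mapping as above, applied in batches of 10,000 nodes per transaction
MATCH (n:NamedEntity)
CALL {
  WITH n
  SET n.multiplicationFactor =
    CASE
      WHEN n.category IN ['persName', 'orgName'] THEN 2
      WHEN n.category IN ['persName_surname', 'persName_addName'] THEN 1.5
      WHEN n.category IN ['date', 'time', 'persName_forename'] THEN 0.5
      ELSE 1.0
    END
} IN TRANSACTIONS OF 10000 ROWS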
Your recommendation query will now become this:
MATCH (t:Trend)--(x:NamedEntity)-[xv:OCCUR]-(a:Article)-[v:OCCUR]-(n:NamedEntity)
WHERE n.id IN ["polski związek narciarski_orgName", "Polska_placeName_country", "Kamila Stoch_persName", "Kamila_persName_surname", "Stoch_persName_surname", "Innsbruck_placeName_settlement", "Bischofshofen_placeName_settlement", "niemiecki_placeName_country", "Oberstdorfie_placeName_settlement", "47_placeName_settlement", "Garmisch_placeName_settlement", "Partenkirchen.nTo_placeName_settlement", "Stoch_persName", "katowicki_placeName_settlement", "AWF.nTCS_orgName", "polski_placeName_country", "Polak_placeName_country", "Adam Małysz_persName", "Adam_persName_forename", "Małysz_persName_surname", "Kamil Stoch_persName", "Kamil_persName_forename", "Piotr Żyła_persName", "Piotr_persName_forename", "żyć_persName_surname", "Stoch_persName_addName", "Kaczmarski_persName", "Kaczmarski_persName_surname"] and t.date > date(datetime($currentDay) - duration({days: $daysDelta})) and a.publication_datetime > datetime($currentDay) - duration({days: $daysDelta})
WITH a,t, collect(distinct n) as distinctLinkNes, collect(distinct x) as distinctTrendNes,
sum(n.multiplicationFactor * v.amount) as linkSum,
sum(x.multiplicationFactor * xv.amount) as trendSum
WITH a,t, linkSum, trendSum,
reduce(total=0, ne in distinctLinkNes |
total + ne.multiplicationFactor) as distinctLinkNesAmount,
reduce(total=0, ne in distinctTrendNes |
total + ne.multiplicationFactor) as distinctTrendNesAmount
WITH a, t, distinctTrendNesAmount, trendSum, distinctLinkNesAmount, linkSum,
(3*distinctTrendNesAmount + trendSum) * t.hits / 1000 as trendScore,
(3*distinctLinkNesAmount + linkSum) * ($daysDelta - duration.between(a.publication_datetime, date($currentDay)).days) as articleScore
WITH a, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore,
(articleScore + trendScore) as score
ORDER BY score DESC
RETURN a as article, t, distinctTrendNesAmount, trendSum, trendScore, distinctLinkNesAmount, linkSum, articleScore, score
LIMIT 30
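To compare the new plan against the old one, the rewritten query can be profiled with the same parameters. The values below are only placeholders for $currentDay and $daysDelta, set here with the Neo4j Browser :param command before prefixing the query above with PROFILE:
:param currentDay => '2023-01-09T00:00:00'
:param daysDelta => 14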
Comments