Multi-parameter ranking using Firebase

Question

How can I leverage Firebase to develop a mixer algorithm similar to Twitter, which retrieves and ranks discussions from Firestore based on weight and created_at parameters?

I have a discussion collection with the following structure:

interface Discussion {
    weight: number;
    created_at: Timestamp; // Timestamp from firebase/firestore
}

Challenge:

In Firestore, ordering data by a single field poses a limitation. For example, if we order discussions solely by weight, new posts will never have the opportunity to rise up in the ranking.

If I attempt to order discussions separately by weight and created_at, how can I handle deduplication effectively?

Note that the number of discussion documents can range from 0 to 1 million, so I prefer a solution that avoids loading all the documents on the client side. Additionally, the results must be reactive and use the onSnapshot method for real-time updates.

Example Scenario:


import { collection, getFirestore, onSnapshot, orderBy, query, Timestamp } from "firebase/firestore";

interface Discussion {
    weight: number;
    created_at: Timestamp;
}

// Not async: the caller needs the unsubscribe function itself, not a Promise of it.
function queryDiscussionFromFireStore () {
   const col_ref = collection(getFirestore(), "discussion")
   // query top discussions (highest weight first)
   const topPost_unSub = onSnapshot(query(col_ref, orderBy("weight", "desc")),
   (snapShot) => {
       // setState is the component's state setter, defined elsewhere
       setState(snapShot.docs.map(d => d.data() as Discussion))
   })

   // query recent discussions (newest first)
   const recentPost_unSub = onSnapshot(query(col_ref, orderBy("created_at", "desc")),
   (snapShot) => {
       setState(snapShot.docs.map(d => d.data() as Discussion))
   })

   // stop both listeners
   return () => {
     recentPost_unSub()
     topPost_unSub()
   };
}

queryDiscussionFromFireStore is working fine, but I'm not able to figure out how to handle duplicate data.

Suppose we have the following data:

[
    {
        weight: 5,
        created_at: today_date
    },
    {
        weight: 3,
        created_at: today_date
    },
]

In this case, both snapshots will return the same data.

Explanation

In the provided code example, the queryDiscussionFromFireStore function retrieves discussions from Firestore with two queries, one ordered by weight and one ordered by created_at. The function uses the onSnapshot method to listen for real-time updates on the queried discussions.

However, there is a concern regarding duplicate data. Because both queries read from the same collection, any discussion that appears in both the "top discussions" results (ordered by weight) and the "recent discussions" results (ordered by creation time) is returned twice, once by each query.

For instance, consider the following example data:

[
    {
        weight: 5,
        created_at: today_date
    },
    {
        weight: 3,
        created_at: today_date
    },
]

In this case, both onSnapshot callbacks for the "top discussions" and "recent discussions" queries will receive the same data, which results in duplicate entries being processed.

Answer 1

Score: 3

From the Firestore documentation on its query limitations:

> In a compound query, range (<, <=, >, >=) and not equals (!=, not-in) comparisons must all filter on the same field.

So each query can only have range filters on a single field, and there is no way to rank the top results by two independent fields in a single query (chaining a second orderBy only breaks ties on the first). You will have to perform multiple queries and deduplicate the results in your application code.
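A minimal sketch of that client-side deduplication, assuming each result is mapped to an object that carries its Firestore document ID (d.data() alone omits the ID). The names RankedDiscussion, mergeById, and FeedMixer are hypothetical, and created_at is simplified to epoch milliseconds so the example runs without the Firebase SDK.

```typescript
// Hypothetical shape: a discussion plus its Firestore document ID.
interface RankedDiscussion {
  id: string;
  weight: number;
  created_at: number; // epoch milliseconds (a simplification for this sketch)
}

// Merge several result lists, keeping the first occurrence of each document ID.
function mergeById(...lists: RankedDiscussion[][]): RankedDiscussion[] {
  const seen = new Map<string, RankedDiscussion>();
  for (const list of lists) {
    for (const doc of list) {
      if (!seen.has(doc.id)) {
        seen.set(doc.id, doc);
      }
    }
  }
  // A Map iterates in insertion order, so first-seen order is preserved.
  return Array.from(seen.values());
}

// Holds the latest result of each listener and re-emits the merged feed
// whenever either one changes, so UI state is set from a single place.
class FeedMixer {
  private top: RankedDiscussion[] = [];
  private recent: RankedDiscussion[] = [];
  private readonly onChange: (feed: RankedDiscussion[]) => void;

  constructor(onChange: (feed: RankedDiscussion[]) => void) {
    this.onChange = onChange;
  }

  setTop(docs: RankedDiscussion[]): void {
    this.top = docs;
    this.emit();
  }

  setRecent(docs: RankedDiscussion[]): void {
    this.recent = docs;
    this.emit();
  }

  private emit(): void {
    this.onChange(mergeById(this.top, this.recent));
  }
}
```

Under these assumptions, each onSnapshot callback would map its documents to { id: d.id, ...d.data() } and pass them to setTop or setRecent, and the callback given to FeedMixer would be the only place that calls setState. Both queries would also need a limit so that neither listener loads the whole collection.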

That also means that there is no way to prevent the extra reads. Theoretically, you could find a way to merge created_at and weight into a single value/property that you can filter on to meet your requirements, but the only real example of something like that which I know of is geohashes (which combine the lat/lon values of a point into a single string value that you can filter on to find documents in a region), and I personally don't see an equivalent here.
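For illustration only, one way to picture "merging created_at and weight into a single value" is a precomputed score in the spirit of time-decay rankings such as Reddit's published "hot" formula: the logarithm of the weight plus a term that grows with the creation time. This is an assumption, not something taken from the answer or the Firestore documentation. The function hotScore, the DECAY_SECONDS constant, and the idea of storing the result in an extra field are all hypothetical, and the constant would need tuning.

```typescript
// Hypothetical constant: how many seconds of recency are worth one
// order of magnitude (10x) of weight. 45000 s is 12.5 hours.
const DECAY_SECONDS = 45000;

// Combine weight and creation time into one sortable number.
// Newer posts get a larger time term; heavier posts get a larger log term.
function hotScore(weight: number, createdAtMs: number): number {
  // Clamp so that weights of 1 or less contribute 0 instead of a negative or -Infinity.
  const order = Math.log10(Math.max(weight, 1));
  const seconds = createdAtMs / 1000;
  return order + seconds / DECAY_SECONDS;
}
```

Because the score depends only on stored values, it would need recomputing only when weight changes, not as time passes. If such a field were written alongside each document, a single query ordered by it, with a limit, could serve one onSnapshot listener. Whether that ordering matches the ranking you want is a product decision this sketch does not settle.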

huangapple
  • Published on July 7, 2023, 02:42:32
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76631702.html