Converting between two sets scaled between 0 and 100 using overlap
Question
I am attempting to scale Google Trends data that is received at minute granularity every 10 minutes. If you are not familiar with Google Trends, every response is scaled between 0 and 100 based on the minimum and maximum in the current response. So two different requests for different, but overlapping, time intervals can have different values for the same time (i.e. a request for 4:30-5:30 and a request for 5:00-6:00 may have different values for 5:00).
What I am attempting to do is scale all values relative to the first 4-hour interval for which I collect trend data. Every 10 minutes, a new 4-hour chunk will be collected, meaning most of its time range will overlap with the previous chunk. Is it possible to exploit this overlap to scale all new values relative to the first interval?
Note: it's OK for new values to be greater than 100.
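To make the scaling behaviour concrete, here is a small sketch (the raw search counts and timestamps are made up for illustration, and the scaling is simplified to divide by each window's own maximum):

```python
# Each window is scaled to 0-100 by its own peak, so the same minute ends
# up with different displayed values in two overlapping windows.
raw = {"4:30": 40, "5:00": 80, "5:30": 100, "6:00": 200}  # hidden raw counts


def scale_window(minutes):
    """Scale a window of raw counts to 0-100 by the window's own maximum."""
    peak = max(raw[m] for m in minutes)
    return {m: round(raw[m] / peak * 100) for m in minutes}


print(scale_window(["4:30", "5:00", "5:30"]))  # 5:00 displays as 80
print(scale_window(["5:00", "5:30", "6:00"]))  # 5:00 displays as 40
```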
Answer 1
Score: 2
Let's say that your initial four-hour window, and any data that has already gone through a scaling process, are 'good.'
Let's say our good data ends at time T, and we have a new 4-hour window of data that ends at time T+10.
The only difference between the data in our new window and good data is the scaling factor. Every minute that the new window has in common with good data can generate a vote for the scaling factor we need to make the new data 'good': scaling factor = (good value) / (new value).
Normally I'd use the median of the votes for something like this, but because the data is so coarse, you're at risk of there being 'cliffs' in the data, and in particular the median might sit right next to a significantly larger or smaller number. For that reason, I suggest generating a scaling factor from the votes by eliminating the k outliers in each direction, then taking the average of the remaining votes.
If you want even more votes, you can get them off non-adjacent 4-hour blocks (though obviously with limited returns).
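A minimal sketch of this voting scheme in Python, assuming the two windows are held as dicts keyed by minute timestamp (the helper names, the zero-value filtering, and the k=2 default are my own choices, not prescribed above):

```python
def scaling_factor(good, new, k=2):
    """Estimate the factor that puts `new` on the same scale as `good`.

    Each minute present in both series (with nonzero values) casts a vote
    good/new; the k smallest and k largest votes are dropped, and the rest
    are averaged.
    """
    votes = sorted(
        good[t] / new[t]
        for t in good.keys() & new.keys()
        if good[t] > 0 and new[t] > 0
    )
    if not votes:
        raise ValueError("no usable overlap between the two windows")
    trimmed = votes[k:len(votes) - k] if len(votes) > 2 * k else votes
    return sum(trimmed) / len(trimmed)


def rescale(good, new, k=2):
    """Return `new` converted to the 'good' scale (values may exceed 100)."""
    factor = scaling_factor(good, new, k)
    return {t: v * factor for t, v in new.items()}
```

In practice you would apply `rescale` every 10 minutes, fold the converted values into the 'good' series, and reuse that growing series as `good` for the next chunk.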
--- Example ---
Say that in the initial window, the peak search count is 1000. That means the scaling factor for that window is 0.10, which results in the peak Google displays to us being 100.
In the next window we have a new peak of 2000. Now, these peaks are invisible to us, but what we do see is that each of the points that exists in both windows has half the value in the new window versus the old window. Since votes (described above) are (good value) / (new value), we get a bunch of votes close to 2.0 (close, but not exact, due to coarseness and rounding).
So we multiply each of our 10 new values by 2.0 to convert them to the good scale. A value of zero is unchanged since no searches is no searches whatever the scale.
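Plugging the example's numbers into the sketch above (the timestamps and displayed values below are invented purely for illustration):

```python
# Displayed values: the old window was scaled by a hidden peak of 1000, the
# new one by a hidden peak of 2000, so the overlapping minutes are halved.
good = {"09:00": 50, "09:01": 80, "09:02": 100}  # already on the 'good' scale
new = {"09:01": 40, "09:02": 50, "09:03": 90}    # overlaps good at 09:01-09:02

# Only two votes here (both 2.0), so there is nothing to trim: k=0.
print(rescale(good, new, k=0))
# {'09:01': 80.0, '09:02': 100.0, '09:03': 180.0} -- 09:03 exceeds 100,
# which the question explicitly allows.
```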
Comments