Best algorithm to detect multiple linear trends in data
Question
I'm trying to detect linear trends in data, see the example plot:
The data may be horizontal or increasing/decreasing.
I've tried kmeans clustering and dbscan with reasonable results but cannot help feeling there is probably a better algorithm out there. One of the drawbacks of using clustering is that the number of clusters has to be pre-defined, and that requires prior knowledge of the data. That will not be the case in this application as it is intended to be automated and new data trends can appear at random.
Does any one have any other suggestions to try?
Answer 1
Score: 0
You can use kmeans with some fit criteria to determine the number of clusters.
Here is an example just using kmeans and a known number of clusters based on how the data is generated:
rng(1); % Reset the random number generator
N = 5; % Number of noisy lines to generate
y = rand( 1, N ) * 5; % Initial line y coordinates
n = randi([5,100], 1, N ); % Number of points in each line
y = repelem( y, 1, n ); % Generate a full y values array
y = y + rand(size(y))*0.1; % Add noise
y = y( randsample(numel(y), numel(y)) ); % shuffle
x = linspace( 0, 1, numel(y) ); % Generate x coords
% Cluster based on the y coordinates only
k = kmeans( y(:), N );
% Visualise the result, plot a horizontal line for each cluster
figure(99); clf;
for ii = 1:N
yb = mean( y(k==ii) );
line( [min(x), max(x)], [yb, yb], 'color', 'k', 'linestyle', '-' );
end
line( x, y, 'marker', 'o', 'linestyle', 'none' );
ylim( [min(y), max(y)] + [-0.5, 0.5] );
This could be extended to increase the number of clusters until some goodness-of-fit criterion is met:
% Cluster based on the y coordinates only
NMax = 10; % Max number of clusters
thresh = 0.25; % Max error
for n = 1:NMax
k = kmeans( y(:), n );
err = arrayfun( @(nn) norm(y(k==nn) - mean(y(k==nn))), 1:n );
if max(err) < thresh
% This is enough clusters to get a decent fit
break
end
end
The loop exits with the smallest n that gives an adequate fit. In this example, thresh=0.5 gives 4 lines, while thresh=0.25 gives 5.
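The same grow-the-cluster-count-until-fit loop can be sketched in Python with scikit-learn (an assumption on my part; the answer's code is MATLAB). The synthetic data here uses fixed, well-separated levels rather than the answer's random ones, so the loop's stopping point is predictable:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic data: 5 horizontal noisy lines at well-separated levels
levels = np.arange(5.0)
y = np.concatenate([lvl + rng.uniform(0, 0.05, 30) for lvl in levels])
rng.shuffle(y)

# Grow the number of clusters until every cluster's residual norm
# falls below the threshold
n_max, thresh = 10, 0.25
for k in range(1, n_max + 1):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        y.reshape(-1, 1)
    )
    errs = [np.linalg.norm(y[labels == c] - y[labels == c].mean())
            for c in range(k)]
    if max(errs) < thresh:
        break

print(k)  # smallest cluster count meeting the fit criterion
```

As in the MATLAB version, any cluster that straddles two lines has a large residual norm, so the loop only stops once each line has its own cluster.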
You could use kmedoids and a custom distance function to more heavily weight the y coordinate, instead of completely ignoring the x coordinate, if the lines aren't exclusively horizontal:
w = 2; % weighting in favour of close vertical distance
% Define a distance function which more heavily weights the y coord
fdist = @(zi,zj) (zi(1) - zj(:,1)).^2 + w*(zi(2) - zj(:,2)).^2;
k = kmedoids( [x(:),y(:)], n, 'distance', fdist );
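A Python counterpart to the weighted distance (again an assumption; the answer uses MATLAB's kmedoids): scikit-learn's KMeans accepts no custom metric, but rescaling y by sqrt(w) before clustering reproduces the weighted squared distance dx^2 + w*dy^2:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two gently sloping, vertically offset noisy lines
x = np.tile(np.linspace(0, 1, 50), 2)
y = 0.2 * x + np.repeat([0.0, 1.0], 50) + rng.normal(0, 0.02, 100)

# Scaling y by sqrt(w) is equivalent to using the weighted metric
# dx^2 + w*dy^2 with ordinary (Euclidean) k-means
w = 4.0
Z = np.column_stack([x, np.sqrt(w) * y])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```

With the y coordinate up-weighted, the two sloping lines end up in separate clusters even though their x ranges overlap completely.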