Best algorithm to detect multiple linear trends in data
Question
I'm trying to detect linear trends in data, see the example plot:
The data may be horizontal or increasing/decreasing.
I've tried kmeans clustering and dbscan with reasonable results but cannot help feeling there is probably a better algorithm out there. One of the drawbacks of using clustering is that the number of clusters has to be pre-defined, and that requires prior knowledge of the data. That will not be the case in this application as it is intended to be automated and new data trends can appear at random.
Does any one have any other suggestions to try?
Answer 1
Score: 0
You can use kmeans with some fit criteria to determine the number of clusters.
Here is an example just using kmeans and a known number of clusters based on how the data is generated:
rng(1); % Reset the random number generator
N = 5; % Number of noisy lines to generate
y = rand( 1, N ) * 5; % Initial line y coordinates
n = randi([5,100], 1, N ); % Number of points in each line
y = repelem( y, 1, n ); % Generate a full y values array
y = y + rand(size(y))*0.1; % Add noise
y = y( randsample(numel(y), numel(y)) ); % shuffle
x = linspace( 0, 1, numel(y) ); % Generate x coords
% Cluster based on the y coordinates only
k = kmeans( y(:), N );
% Visualise the result, plot a horizontal line for each cluster
figure(99); clf;
for ii = 1:N
yb = mean( y(k==ii) );
line( [min(x), max(x)], [yb, yb], 'color', 'k', 'linestyle', '-' );
end
line( x, y, 'marker', 'o', 'linestyle', 'none' );
ylim( [min(y), max(y)] + [-0.5, 0.5] );
This could be extended to increase the number of clusters until some goodness-of-fit criterion is met:
% Cluster based on the y coordinates only
NMax = 10; % Max number of clusters
thresh = 0.25; % Max error
for n = 1:NMax
k = kmeans( y(:), n );
err = arrayfun( @(nn) norm(y(k==nn) - mean(y(k==nn))), 1:n );
if max(err) < thresh
% This is enough clusters to get a decent fit
break
end
end
The loop exits with the smallest n that gives an adequate fit. In this example, thresh=0.5 gives 4 lines, while thresh=0.25 gives 5.
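The same grow-the-cluster-count-until-fit loop can be sketched in Python with scikit-learn (an assumption on my part; the answer's code is MATLAB). The synthetic data here uses fixed, well-separated levels rather than the answer's random ones, so the loop's stopping point is predictable:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic data: 5 horizontal noisy lines at well-separated levels
levels = np.arange(5.0)
y = np.concatenate([lvl + rng.uniform(0, 0.05, 30) for lvl in levels])
rng.shuffle(y)

# Grow the number of clusters until every cluster's residual norm
# falls below the threshold
n_max, thresh = 10, 0.25
for k in range(1, n_max + 1):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        y.reshape(-1, 1)
    )
    errs = [np.linalg.norm(y[labels == c] - y[labels == c].mean())
            for c in range(k)]
    if max(errs) < thresh:
        break

print(k)  # smallest cluster count meeting the fit criterion
```

As in the MATLAB version, any cluster that straddles two lines has a large residual norm, so the loop only stops once each line has its own cluster.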
You could use kmedoids and a custom distance function to more heavily weight the y coordinate, instead of completely ignoring the x coordinate, if the lines aren't exclusively horizontal:
w = 2; % weighting in favour of close vertical distance
% Define a distance function which more heavily weights the y coord
fdist = @(zi,zj) (zi(1) - zj(:,1)).^2 + w*(zi(2) - zj(:,2)).^2;
k = kmedoids( [x(:),y(:)], n, 'distance', fdist );
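A Python counterpart to the weighted distance (again an assumption; the answer uses MATLAB's kmedoids): scikit-learn's KMeans accepts no custom metric, but rescaling y by sqrt(w) before clustering reproduces the weighted squared distance dx^2 + w*dy^2:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two gently sloping, vertically offset noisy lines
x = np.tile(np.linspace(0, 1, 50), 2)
y = 0.2 * x + np.repeat([0.0, 1.0], 50) + rng.normal(0, 0.02, 100)

# Scaling y by sqrt(w) is equivalent to using the weighted metric
# dx^2 + w*dy^2 with ordinary (Euclidean) k-means
w = 4.0
Z = np.column_stack([x, np.sqrt(w) * y])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```

With the y coordinate up-weighted, the two sloping lines end up in separate clusters even though their x ranges overlap completely.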