March 31, 2023, 18:44:13

Improving Deep Learning Model to Detect Train Wagon Gaps in Variable Conditions

Question

Our team records video streams of moving trains from various camera locations, with differing backgrounds and distances from the rails. Our task is to collect information about each wagon, which requires detecting the gaps between them. We trained a deep neural network using the YOLOv5 architecture with default data augmentation on a dataset of over 2000 labeled images, plus unlabeled images without gaps. However, we are seeing false positives and poor performance in low-light conditions.

Our current post-processing step runs the DBSCAN algorithm to group frames with "couplers" (see the image below for an example with a bounding box around the coupler) and filters out low-confidence groups based on mean confidence and standard deviation.
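To make the post-processing step concrete, here is a minimal sketch. It is not our production code: instead of DBSCAN it uses a simplified one-dimensional stand-in that chains detections whose frame indices are at most `eps` apart and drops groups smaller than `min_samples`. The filter rule (mean confidence minus one standard deviation must reach `min_score`) and all parameter values are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def group_frames(detections, eps=3, min_samples=2):
    """Chain detections whose frame indices are at most `eps` apart.

    `detections` is a list of (frame_index, confidence) tuples.
    Groups smaller than `min_samples` are dropped as noise.
    This is a simplified 1-D stand-in for DBSCAN, not DBSCAN itself.
    """
    groups, current = [], []
    for frame, conf in sorted(detections):
        if current and frame - current[-1][0] > eps:
            groups.append(current)
            current = []
        current.append((frame, conf))
    if current:
        groups.append(current)
    return [g for g in groups if len(g) >= min_samples]

def filter_groups(groups, min_score=0.5):
    """Keep groups whose mean confidence minus one std dev reaches `min_score`.

    The rule and the threshold are hypothetical; the real filter may differ.
    """
    kept = []
    for g in groups:
        confs = [c for _, c in g]
        if mean(confs) - pstdev(confs) >= min_score:
            kept.append(g)
    return kept

# Hypothetical detections: (frame index, detector confidence)
detections = [(10, 0.9), (11, 0.85), (12, 0.8), (40, 0.35), (41, 0.7), (90, 0.95)]
groups = group_frames(detections)
kept = filter_groups(groups)
```

With these made-up values, the isolated detection at frame 90 is discarded as noise, and the unstable group around frames 40 and 41 is rejected by the confidence filter.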

Additionally, we recently collected 50k images from different locations, both with and without couplers. The images were collected dynamically by the current application: an image was assigned the class "GAP" if we found a coupler in it with at least 60% confidence, images with couplers below 60% confidence were rejected, and images with no couplers were assigned the class "NO_GAP". Using these images, we trained a binary classifier with labels [GAP, NO_GAP] using the YOLOv8 architecture. However, we are unsure whether a binary classifier can generalize well enough for our task, as we treat many different concepts as "NO_GAP".
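The labeling rule described above can be written down as a short sketch. The function name `assign_label` and the representation of an image as a list of coupler confidences are hypothetical; only the 60% threshold and the three outcomes come from the description.

```python
def assign_label(confidences, threshold=0.6):
    """Return "GAP", "NO_GAP", or None (rejected) for one image.

    `confidences` holds the detector confidence of every coupler
    found in the image; an empty list means nothing was detected.
    """
    if not confidences:
        return "NO_GAP"   # no coupler detected at all
    if max(confidences) >= threshold:
        return "GAP"      # at least one confident coupler
    return None           # only low-confidence couplers: image rejected

labels = [assign_label(c) for c in ([], [0.3, 0.7], [0.4])]
```

One consequence visible in the sketch: an image in which the detector misses a coupler entirely ends up in "NO_GAP".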

We are considering other deep learning approaches, such as semi-supervised learning and contrastive learning, as potential solutions to our problems. We are also interested in trying different architectures, such as ViT with a patching approach, although we have limited experience with them.

Our main questions are:

1. What deep learning architectures or techniques would you recommend we explore to improve the accuracy of our model in detecting train wagon gaps under variable lighting and environmental conditions?
2. Is it worth staying with classification but using a different architecture, such as ViT with a patching approach? Are there any specific implementations or examples of these architectures that we can refer to?
3. We have a lot of unlabeled data. Is it worth trying self-supervised learning as a "pre-training" step? Is there a rule of thumb for things like the unlabeled/labeled data ratio, the required computing power, the choice of algorithm, and how to determine when to stop the pre-training process?

Example of video frames (with detected couplers)

Answer 1

Score: 1

Maybe some of this will be of use:

Why are you using a localization network, such as YOLO, for what is a classification task? YOLO produces a huge output vector, potentially detecting many objects of many classes anywhere in the image. This seems like overkill.
Why label the couplers in the background? This makes the network less confident and doesn't help you much (I guess). Removing those labels should make the problem easier to learn. Alternatively, you can give the two couplers different labels. This way the network is not confused by tiny, partially hidden couplers in the background and big ones in the front. I guess you can re-label the data automatically (if there is one label, it's probably the front coupler; if there are two labels, the bigger one is the front coupler?).
Something you might already be considering: if you use video data, some frames will allow easy detection and some will be more difficult (e.g. light reflections). Using multiple frames of the same wagon might help you get better results, e.g. by averaging the confidence or the like.
YOLO has hyperparameters to control the importance of the localization error, the class error, and the "objectness" error. The latter is the most important for you. It might help to put more weight on it if you don't want to go with a real classifier in the first place.
Seeing that you detect more than couplers (namely text), I just want to point out that your link says left-right flipping is performed as augmentation. This is sensible in general, but probably bad for text detection (letters on the wagon).
Your problem becomes easier if you use the external knowledge that couplers are always related to the position of the tracks. By aligning the images relative to the tracks, you could potentially crop the input image to the relevant area, decreasing the number of false positives in unlikely places. Smaller images can also increase inference and training speed.
In general, a classifier will always predict the class that causes the least "trouble". If your dataset is skewed towards NO_GAP images, it will learn to predict NO_GAP, as this will be true in most cases and is less risky. Therefore, you should provide roughly the same number of images for all classes. If this is not possible, one has to draw images from the "GAP" folder more often than from the "NO_GAP" folder to make up for it.
As there is a bounty on this question, I assume "money doesn't matter" and you have the resources to provide more manual labels. Classifying images into two classes is very quick. I even suggest that the developers do some of this themselves. Knowing your own data will teach you a lot about how to tackle the problem and give you further ideas.
If there is a lot of light variation, it might be helpful to standardize the data for each instance. E.g. one could experiment with data whose average HSV "value" channel equals the dataset's average "value" channel. A word of warning: I don't know the exact architecture that you are referring to, but pre-processing the data in a new way will kill the performance of a pre-trained / out-of-the-box network right away. Also, normalization layers might already provide some sort of channel adjustment.
Given that you do not have an excessive amount of training data, a default network might be overkill. For example, you are using grayscale images in an RGB network... this is inefficient. If you use a pre-trained network (trained on colour images), this might even be harmful. If you don't have a pre-trained network, self-supervised pre-training is probably a good idea.
Last but not least, here is another way of increasing accuracy for your use case: since cargo trains have predictable motion, one could post-process the gaps. All gaps would be separated by a time gap of roughly the same size (if wagons have a standard length). This can help to handle false positives and false negatives.
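The timing idea in the last point could look roughly like the sketch below. It assumes constant train speed and a standard wagon length, so that the median interval between detected gaps is a usable reference. The function name `check_gap_timing`, the `tolerance` value, and the timestamps are made up for illustration.

```python
from statistics import median

def check_gap_timing(gap_times, tolerance=0.3):
    """Compare each interval between consecutive gaps with the median interval.

    Returns (too_early, too_long): `too_early` lists gap times that arrive
    much sooner than expected (possible false positives), `too_long` lists
    (start, end) spans long enough to hide at least one missed gap
    (possible false negatives).
    """
    times = sorted(gap_times)
    intervals = [b - a for a, b in zip(times, times[1:])]
    if not intervals:
        return [], []
    typical = median(intervals)
    too_early, too_long = [], []
    for (start, end), dt in zip(zip(times, times[1:]), intervals):
        if dt < (1 - tolerance) * typical:
            too_early.append(end)
        elif dt > (2 - tolerance) * typical:
            too_long.append((start, end))
    return too_early, too_long

# Hypothetical detection times in seconds
gap_times = [0.0, 2.0, 4.1, 4.6, 6.0, 10.1, 12.0]
early, missed = check_gap_timing(gap_times)
```

Here the detection at 4.6 s is flagged as a likely false positive, and the span between 6.0 s and 10.1 s as a place where a gap was probably missed. Mixed wagon lengths or a train that accelerates would break the assumption behind the median.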

I assume a small CNN with fully connected layers at the end and fewer than a million parameters should be enough for a binary classifier with so little data. If you write such a network yourself, it will also be easier to implement an encoder-decoder network for pre-training the encoder (and later the classifier) on your unlabelled data (which I have never done myself).
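To give a feeling for the "less than a million parameters" figure, here is a back-of-the-envelope count for one possible small network. The layer sizes are purely an assumption for illustration: four 3x3 convolutions on a single-channel (grayscale) input, global average pooling so the count does not depend on image size, and two fully connected layers ending in a single GAP/NO_GAP output.

```python
def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Hypothetical layout: channels per conv stage, then dense layer widths.
conv_channels = [1, 16, 32, 64, 128]
dense_sizes = [128, 64, 1]

total = sum(conv_params(a, b) for a, b in zip(conv_channels, conv_channels[1:]))
total += sum(dense_params(a, b) for a, b in zip(dense_sizes, dense_sizes[1:]))
print(total)  # 105473
```

Even this layout stays at roughly a tenth of a million parameters, so there is plenty of room to widen it before reaching the limit.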

Well, all in all, this looks like a fun problem.
