Daily load of weekly records into Elasticsearch through Logstash, ignoring repeated records
Question
I'm totally new to Elastic and Logstash; we are currently evaluating them, expecting to use them for some of our data needs, and I have to say that it's going quite well so far.

Now I have a doubt, and I cannot find a clear answer in the official documentation.

We have a URL where we create, on demand, a JSON document with environmental data for the last 7 days. The data is updated hourly, so if I check it on Monday at 6 AM it will contain data from Monday at 5 AM back to last Tuesday at 6 AM, and if I check it on Tuesday at 2 PM it will contain data from Tuesday at 1 PM back to last Wednesday at 2 PM.

I want to load that data into Elasticsearch periodically, let's say on a daily basis, but I don't want to repeat values that I've already loaded. I mean, the first time my Logstash pipeline executes it will have to transfer data for the last seven days, but if it fires again a day later, it should only transfer records from the last day, ignoring the records of the 6 previous days that are already loaded in the index.

I've been able to load records from the URL into an Elasticsearch index, but I cannot see how to configure the pipeline to avoid those duplicate records.

If anyone could point me in the right direction, I would really appreciate it.
Answer 1
Score: 1
The best way would be to handle this on the side that serves that URL, where you could specify a date interval to retrieve instead of needlessly retrieving useless data every time.

But if that is not an option, there are different ways of achieving what you want in Logstash.
What you need is exactly what the jdbc input plugin provides, i.e. something that records the last value of a certain field in a local file. Unfortunately, the http_poller input doesn't support that... yet... even though an issue about this very need has been open since... 2016!!
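For reference, this is the kind of state tracking the jdbc input offers out of the box: it persists the last seen value of a column between runs and lets you reference it in the query. The connection settings, table and column names below are made up for illustration:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/envdata"  # hypothetical DB
    jdbc_user => "logstash"
    jdbc_driver_class => "org.postgresql.Driver"
    # only fetch rows newer than the value remembered from the previous run
    statement => "SELECT * FROM readings WHERE taken_at > :sql_last_value"
    use_column_value => true
    tracking_column => "taken_at"
    tracking_column_type => "timestamp"
    # the last seen value survives restarts in this file
    last_run_metadata_path => "/var/lib/logstash/.jdbc_last_run"
    schedule => "0 6 * * *"
  }
}
```

It is exactly this `last_run_metadata_path` mechanism that the http_poller input lacks.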
One way is to use the environment filter to load an environment variable that contains, for instance, the date of the last load. After each run you need to make sure that the environment variable is updated so that its value can be picked up during the next call.
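A minimal sketch of that approach. The URL, the cron schedule and the variable name `LAST_LOAD_DATE` are assumptions; the `ruby` filter drops every event whose timestamp is not strictly newer than the recorded last load:

```
input {
  http_poller {
    urls => { env_data => "https://example.com/last7days.json" }  # hypothetical URL
    schedule => { cron => "0 6 * * *" }  # once a day at 6 AM
    codec => "json"
  }
}

filter {
  # expose the LAST_LOAD_DATE environment variable as event metadata
  environment {
    add_metadata_from_env => { "last_load" => "LAST_LOAD_DATE" }
  }

  # cancel events that were already ingested during a previous run
  ruby {
    code => "
      last = event.get('[@metadata][last_load]')
      if last && !last.to_s.empty?
        event.cancel if event.get('@timestamp').time <= Time.parse(last)
      end
    "
  }
}
```

Note that the environment filter reads the variable when the pipeline starts, so updating `LAST_LOAD_DATE` for the next run has to happen outside Logstash, e.g. in a wrapper script that exports the current date before launching it.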
Another way is to use the memcached filter to store that value, but I think it's a bit far-fetched to install and maintain memcached just for a single value.
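The same idea with the memcached filter would look roughly like this (key and field names are made up; `get` pulls the stored date into event metadata, and `set` writes a new high-water mark back):

```
filter {
  # read the previously stored last-load date into metadata
  memcached {
    hosts => ["localhost:11211"]
    get => { "last_load" => "[@metadata][last_load]" }
  }

  # drop anything we have already seen
  ruby {
    code => "
      last = event.get('[@metadata][last_load]')
      event.cancel if last && event.get('@timestamp').time <= Time.parse(last)
    "
  }

  # push the current event's timestamp back as the new last-load marker
  mutate {
    add_field => { "last_seen" => "%{@timestamp}" }
  }
  memcached {
    hosts => ["localhost:11211"]
    set => { "last_seen" => "last_load" }
  }
}
```

Note that the `set` above naively stores the timestamp of whatever event passes last, so it only works if events arrive in chronological order; otherwise you would need to track the maximum yourself.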