telegraf: AWS IoT-Core to InfluxDB: i/o timeout

Question

I understand that you need assistance with a technical issue related to configuring Telegraf and connecting to AWS IoT-Core. It appears that you are experiencing connectivity issues when using Telegraf to subscribe to an MQTT topic on AWS IoT-Core.

Based on the provided information, you've tried various configurations and versions of Telegraf, but the issue persists. To further troubleshoot and resolve this issue, here are some steps you can take:

  1. Check Security Policies in AWS IoT-Core:
    Ensure that the security policy associated with your AWS IoT Thing allows the necessary permissions for the MQTT action, especially if you've made changes to the security policy.

  2. Double-Check Certificate and Key Files:
    Confirm that the certificate (cert.pem), private key (key.pem), and root certificate (ca.pem) files are correctly placed in the container and have the appropriate permissions. Any issues with these files can lead to TLS handshake failures.

  3. Verify Network Connectivity:
    Ensure that your container has outbound network connectivity to the AWS IoT-Core endpoint on port 8883 or 443 (depending on your configuration). You mentioned that the container can ping the AWS server, which is a good sign. Double-check that there are no network-level restrictions or firewalls blocking outgoing traffic.

  4. Review AWS IoT-Core Endpoint:
    Verify that the AWS IoT-Core endpoint (<AWS-ID>.iot.<LOCATION>.amazonaws.com) is correctly configured in your code. Make sure there are no typos or issues with the endpoint URL.

  5. TLS Version and Cipher Suites:
    You mentioned that Telegraf sometimes uses TLS 1.2 or TLS 1.3. Ensure that the TLS version and cipher suites configured in Telegraf are compatible with AWS IoT-Core's requirements. You may need to experiment with different TLS configurations to find one that works.

  6. Debugging Telegraf:
    You can enable debugging in Telegraf by setting debug = true in your telegraf.conf file. This will provide more detailed logs that may help diagnose the issue. Monitor the logs for any specific error messages that could pinpoint the problem.

  7. Review AWS IoT Logs:
    Check the AWS IoT-Core logs to see if there are any error messages or indications of why the connection is failing. AWS CloudWatch Logs or other AWS IoT-Core logging mechanisms can be helpful for this.

  8. Packet Capture Analysis:
    Continue using packet capture tools like Wireshark to analyze the network traffic. Look for any anomalies or errors in the TLS handshake process. Pay attention to any TLS alerts or renegotiations.

  9. AWS Support:
    If the issue persists after trying these steps, consider reaching out to AWS Support for assistance. They can provide more detailed insights into the AWS IoT-Core side of the configuration.

  10. Community Forums:
    Consider posting your issue on relevant technical forums or communities where others with experience in AWS IoT-Core and Telegraf may be able to offer insights and solutions.

Remember to make one change at a time and carefully monitor the results to isolate the source of the problem. Troubleshooting network and TLS issues can be complex, so a systematic approach is essential.
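The certificate check from step 2 can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive test: the function name `check_tls_files` is hypothetical, and the paths in the usage comment are the ones from the telegraf.conf in the question. It relies on `ssl.SSLContext.load_cert_chain`, which raises an error when the private key does not belong to the certificate.

```python
import os
import ssl


def check_tls_files(ca_path: str, cert_path: str, key_path: str) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    files = (("CA", ca_path), ("certificate", cert_path), ("private key", key_path))
    for label, path in files:
        if not os.path.isfile(path):
            problems.append(f"{label} file missing: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"{label} file not readable: {path}")
    if problems:
        return problems

    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        context.load_verify_locations(cafile=ca_path)
    except (ssl.SSLError, OSError) as exc:
        problems.append(f"CA file could not be parsed: {exc}")
    try:
        # Fails when the key does not match the certificate or a file is malformed.
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except (ssl.SSLError, OSError) as exc:
        problems.append(f"certificate/key pair rejected: {exc}")
    return problems


# Usage inside the container (paths taken from the telegraf.conf in the question):
# print(check_tls_files("/etc/telegraf/ca.pem",
#                       "/etc/telegraf/cert.pem",
#                       "/etc/telegraf/key.pem"))
```

An empty result only says the files are present, readable and consistent with each other; it says nothing about whether the broker accepts them.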

I have an existing (working) setup that uses InfluxDB's Native Subscriptions to transfer data from AWS IoT-Core (as the MQTT broker) to our InfluxDB-Cloud instance. Native Subscriptions are being removed as a feature in InfluxDB, and while I can still use them at the moment, there's no guarantee in the long run. The setup used port 8883.

Solution: use InfluxDB's telegraf to achieve the same functionality. The data is sent in 'influx' line protocol already, so no additional parsing is necessary. My setup uses the telegraf:1.27.1 docker image to deploy telegraf in a container. I'm volume-mapping the certificates and telegraf-configuration into the container.

Here's the telegraf.conf file:

    # Global settings
    [agent]
      interval = "5s"
      round_interval = true
      metric_batch_size = 1000
      metric_buffer_limit = 10000
      collection_jitter = "0s"
      flush_interval = "10s"
      flush_jitter = "0s"
      precision = "1ms"
      debug = true
      quiet = false

    [[inputs.mqtt_consumer]]
      servers = ["ssl://<AWS-ID>.iot.<LOCATION>.amazonaws.com:8883"]
      qos = 0
      connection_timeout = "30s"
      topics = ["374174.016/data"]
      client_id = "27-telegraf-e9cabc28-6a2b-4423-be74-31fb3dd231cd"
      data_format = "influx"
      # SSL configuration
      tls_ca = "/etc/telegraf/ca.pem"
      tls_cert = "/etc/telegraf/cert.pem"
      tls_key = "/etc/telegraf/key.pem"
      # insecure_skip_verify = false

    [[outputs.influxdb_v2]]
      urls = ["https://<LOCATION>.aws.cloud2.influxdata.com/"]
      token = "[REDACTED]"
      organization = "[REDACTED]"
      bucket = "TELEGRAF_DEV"

Here are the startup (and failure) logs for telegraf:

    2023-07-06T12:54:11Z I! Loading config: /etc/telegraf/telegraf.conf
    2023-07-06T12:54:11Z I! Starting Telegraf 1.27.1
    2023-07-06T12:54:11Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
    2023-07-06T12:54:11Z I! Loaded inputs: mqtt_consumer
    2023-07-06T12:54:11Z I! Loaded aggregators:
    2023-07-06T12:54:11Z I! Loaded processors:
    2023-07-06T12:54:11Z I! Loaded secretstores:
    2023-07-06T12:54:11Z I! Loaded outputs: influxdb_v2
    2023-07-06T12:54:11Z I! Tags enabled: host=d847c648021c
    2023-07-06T12:54:11Z I! [agent] Config: Interval:5s, Quiet:false, Hostname:"d847c648021c", Flush Interval:10s
    2023-07-06T12:54:11Z D! [agent] Initializing plugins
    2023-07-06T12:54:11Z D! [agent] Connecting outputs
    2023-07-06T12:54:11Z D! [agent] Attempting connection to [outputs.influxdb_v2]
    2023-07-06T12:54:11Z D! [agent] Successfully connected to outputs.influxdb_v2
    2023-07-06T12:54:11Z D! [agent] Starting service inputs
    2023-07-06T12:55:11Z E! [telegraf] Error running agent: starting input inputs.mqtt_consumer: network Error : read tcp 172.17.0.2:51268->52.29.126.248:8443: i/o timeout
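The last log line names the remote address and port that were actually dialed (8443 in this run, one of the ports tried, while the configuration shown above lists 8883). When switching between ports it is easy to lose track of which attempt a log belongs to, so here is a small sketch that pulls those fields out of such a line. The function name and the regular expression are my own and only cover the `read tcp A->B: reason` shape seen in this log.

```python
import re

# Matches the "read tcp <local>-><remote_host>:<remote_port>: <reason>" tail of the error line.
NETWORK_ERROR = re.compile(
    r"read tcp (?P<local>\S+?)->(?P<remote_host>[^:\s]+):(?P<remote_port>\d+): (?P<reason>.+)$"
)


def parse_network_error(line: str):
    """Return the connection details from a Telegraf network error line, or None."""
    match = NETWORK_ERROR.search(line)
    if match is None:
        return None
    return {
        "local": match.group("local"),
        "remote_host": match.group("remote_host"),
        "remote_port": int(match.group("remote_port")),
        "reason": match.group("reason"),
    }


line = (
    "2023-07-06T12:55:11Z E! [telegraf] Error running agent: starting input "
    "inputs.mqtt_consumer: network Error : read tcp "
    "172.17.0.2:51268->52.29.126.248:8443: i/o timeout"
)
details = parse_network_error(line)
print(details["remote_port"], details["reason"])
```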

I've also written a Python script that just reads a couple of messages off IoT-Core from the same topic, using the same certificate files (Python 3.9.6, awscrt==0.14.7, awsiot==0.1.3):

    import json
    import os.path
    import time
    import uuid
    from dataclasses import dataclass
    from typing import Callable, List, Any, Dict

    from awscrt import io
    from awscrt.mqtt import QoS
    from awsiot import mqtt_connection_builder  # type: ignore


    class AWSMQTTClient:
        def __init__(
            self,
            host_name: str,
            client_id: str,
            cert_dir: str = "."
        ):
            # write to files (compatibility with AWS library)
            self.cert_file = f"{cert_dir}/cert.pem"
            self.priv_file = f"{cert_dir}/key.pem"
            self.root_file = f"{cert_dir}/ca.pem"

            # check if files exist
            if not os.path.exists(self.cert_file):
                raise FileNotFoundError(f"cert file not found: {self.cert_file}")
            if not os.path.exists(self.priv_file):
                raise FileNotFoundError(f"priv file not found: {self.priv_file}")
            if not os.path.exists(self.root_file):
                raise FileNotFoundError(f"root file not found: {self.root_file}")

            # set up connection object
            event_loop_group = io.EventLoopGroup(1)
            host_resolver = io.DefaultHostResolver(event_loop_group)
            client_bootstrap = io.ClientBootstrap(event_loop_group, host_resolver)
            self.mqtt_connection = mqtt_connection_builder.mtls_from_path(
                endpoint=host_name,
                cert_filepath=self.cert_file,
                pri_key_filepath=self.priv_file,
                client_bootstrap=client_bootstrap,
                ca_filepath=self.root_file,
                client_id=f"{client_id}_{uuid.uuid4()}",
                clean_session=False,
                keep_alive_secs=6,
            )

        def subscribe_to_topic(
            self,
            topic_name: str,
            duration: float,
            on_receive: Callable,
        ) -> None:
            _connect_future = self.mqtt_connection.connect()
            sub = self.mqtt_connection.subscribe(topic=topic_name, qos=QoS(1), callback=on_receive)
            time.sleep(duration)
            _connect_future.result()
            _disconnect_future = self.mqtt_connection.disconnect()
            _disconnect_future.result()
            if sub[0].running():
                raise ConnectionError("connection not closed")


    @dataclass
    class TopicInfo:
        topic: str
        message: str


    @dataclass
    class TopicResponse:
        topic_infos: List[TopicInfo]

        def add_topic(self, topic_info: TopicInfo):
            self.topic_infos.append(topic_info)

        def on_topic_received(
            self,
            topic: str,
            payload: bytes,
            dup: Any, qos: Any, retain: Any,
            **kwargs: Dict[Any, Any],
        ) -> None:
            try:
                message_json = json.loads(payload)
                if "message" in message_json.keys():
                    message = message_json["message"]
                else:
                    message = f"{message_json}"
            except json.decoder.JSONDecodeError:
                message = f"{payload.decode('ascii')}"
            topic_info = TopicInfo(
                topic=topic,
                message=message,
            )
            if topic_info not in self.topic_infos:
                self.add_topic(topic_info=topic_info)


    if __name__ == '__main__':
        # settings
        _host_name = "<AWS ID>.iot.<LOCATION>.amazonaws.com"
        _client_id = f"{uuid.uuid4()}"
        _topic = "374174.016/data"

        aws_mqtt_client = AWSMQTTClient(
            host_name=_host_name,
            client_id=_client_id,
        )
        response = TopicResponse(topic_infos=[])
        aws_mqtt_client.subscribe_to_topic(
            topic_name=_topic,
            on_receive=response.on_topic_received,
            duration=7.0,
        )
        print("")
        print("----RECEIVED:----")
        print(response.topic_infos)
        print("-----------------")
        print("")
        assert response.topic_infos.__len__() != 0

The Python script works as expected - it reads (and logs) a couple of messages. From what I can tell it uses Port 443 though instead of 8883.
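Since the script never sets a port, the SDK chooses one itself. As far as I know, `mqtt_connection_builder` accepts an optional `port` keyword and otherwise picks 443 when ALPN is available and 8883 when it is not, which would match the observation above. To make the script exercise the same port as Telegraf, the port could be pinned explicitly. This is a sketch under that assumption; `build_connection_kwargs` and `connect_on_port` are hypothetical helper names, not part of the SDK.

```python
def build_connection_kwargs(host_name: str, client_id: str,
                            cert_dir: str = ".", port: int = 8883) -> dict:
    """Collect the keyword arguments for mtls_from_path, with the port pinned."""
    return {
        "endpoint": host_name,
        "port": port,  # assumption: the builder honours an explicit port
        "cert_filepath": f"{cert_dir}/cert.pem",
        "pri_key_filepath": f"{cert_dir}/key.pem",
        "ca_filepath": f"{cert_dir}/ca.pem",
        "client_id": client_id,
        "clean_session": False,
        "keep_alive_secs": 6,
    }


def connect_on_port(host_name: str, client_id: str, cert_dir: str = ".",
                    port: int = 8883, timeout: float = 30.0):
    """Connect to the broker on the given port; raises if the connection fails."""
    # Imported lazily so build_connection_kwargs works without the SDK installed.
    from awscrt import io
    from awsiot import mqtt_connection_builder

    event_loop_group = io.EventLoopGroup(1)
    host_resolver = io.DefaultHostResolver(event_loop_group)
    client_bootstrap = io.ClientBootstrap(event_loop_group, host_resolver)
    connection = mqtt_connection_builder.mtls_from_path(
        client_bootstrap=client_bootstrap,
        **build_connection_kwargs(host_name, client_id, cert_dir, port),
    )
    connection.connect().result(timeout)
    return connection


# Usage (endpoint placeholder as in the question):
# connect_on_port("<AWS ID>.iot.<LOCATION>.amazonaws.com", "port-test", port=8883)
```

If this fails on 8883 but the unmodified script succeeds, the difference is the port rather than the certificates or the client library.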

Here's the Makefile I'm using to deploy both telegraf and the Python script listen_to_mqtt.py via docker:

    ###########################
    # Telegraf test setup     #
    ###########################
    TELEGRAF_IMAGE=telegraf-test-image
    TELEGRAF_CONTAINER=telegraf-test-container
    TELEGRAF_LOCAL_DIR := $(shell pwd)
    TELEGRAF_INTERNAL_CONFIG_DIR = /etc/telegraf

    deploy_telegraf:
    	docker run \
    		--detach \
    		--name $(TELEGRAF_CONTAINER) \
    		-v $(TELEGRAF_LOCAL_DIR)/ca.pem:$(TELEGRAF_INTERNAL_CONFIG_DIR)/ca.pem \
    		-v $(TELEGRAF_LOCAL_DIR)/cert.pem:$(TELEGRAF_INTERNAL_CONFIG_DIR)/cert.pem \
    		-v $(TELEGRAF_LOCAL_DIR)/key.pem:$(TELEGRAF_INTERNAL_CONFIG_DIR)/key.pem \
    		-v $(TELEGRAF_LOCAL_DIR)/telegraf.conf:$(TELEGRAF_INTERNAL_CONFIG_DIR)/telegraf.conf:ro \
    		telegraf:1.27.1

    remove_telegraf:
    	-docker rm -f $(TELEGRAF_CONTAINER)

    watch_logs_telegraf:
    	docker logs -f $(TELEGRAF_CONTAINER)

    redeploy_telegraf:
    	make remove_telegraf
    	make deploy_telegraf
    	make watch_logs_telegraf

    access_container:
    	docker exec -it $(TELEGRAF_CONTAINER) bash

    # replace this with your Docker network interface
    DOCKER_NETWORK_INTERFACE = docker0

    analyze_traffic:
    	sudo tcpdump -i $(DOCKER_NETWORK_INTERFACE) -U -w ./tcpdump.pcap #'port 8883'

    analyze_traffic_file:
    	wireshark ./tcpdump.pcap

    analyze_traffic_file_python:
    	wireshark ./tcpdump-py.pcap

    perform_connection_check:
    	docker exec -it $(TELEGRAF_CONTAINER) ping a1q5qvt4b4ldkh-ats.iot.eu-central-1.amazonaws.com

    ###########################
    # Python equivalent       #
    ###########################
    TELEGRAF_PY_IMAGE=telegraf-test-image-py
    TELEGRAF_PY_CONTAINER=telegraf-test-container-py
    TELEGRAF_PY_DOCKERFILE=telegraf-py.Dockerfile

    build_python:
    	docker build -t $(TELEGRAF_PY_IMAGE) -f $(TELEGRAF_PY_DOCKERFILE) .

    deploy_python:
    	docker run \
    		--detach --rm \
    		--name $(TELEGRAF_PY_CONTAINER) \
    		$(TELEGRAF_PY_IMAGE)

    watch_logs_python:
    	docker logs -f $(TELEGRAF_PY_CONTAINER)

    remove_python:
    	-docker rm -f $(TELEGRAF_PY_CONTAINER)

    redeploy_python:
    	make remove_python
    	make build_python
    	make deploy_python
    	make watch_logs_python

And here's the Dockerfile for the Python script:

    FROM python:3.9.6 as py3106

    ENV PYTHONDONTWRITEBYTECODE=1
    ENV PYTHONUNBUFFERED=1

    WORKDIR /app

    # copy requirement files
    COPY ./requirements.txt ./requirements.txt

    # install awscli
    RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    RUN unzip awscliv2.zip
    RUN ./aws/install

    # install general dependencies (refresh the package index first, install non-interactively)
    RUN apt-get update && apt-get install -y gcc
    RUN python -m pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt

    # copy script & certificates
    COPY ./listen_to_mqtt.py ./listen_to_mqtt.py
    COPY ./ca.pem ./ca.pem
    COPY ./cert.pem ./cert.pem
    COPY ./key.pem ./key.pem

    # run the script
    CMD python listen_to_mqtt.py

I used Wireshark and tcpdump to analyze the network traffic from the container, though I'm admittedly new to this. With port 8883, telegraf fails within ~30s; I don't see any key exchange, and the [SYN] never receives a [SYN, ACK]. With port 443, I see certificate exchanges, and it takes ~60s for essentially the same error message to appear: i/o timeout.
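A [SYN] that never gets a [SYN, ACK] can also be observed without a packet capture: at the socket level it shows up as a connect timeout, whereas a closed port on a reachable host is usually refused right away. Note that ping only exercises ICMP, so a successful ping says nothing about a particular TCP port. Here is a minimal sketch of such a probe; `probe_tcp` is a hypothetical helper, and the host in the usage comment is the placeholder from the question.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a plain TCP connect attempt as open, timeout, refused or another error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer to the SYN at all: packets are likely dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks and similar.
        return f"error: {exc}"


# Usage, run from inside the container or from the same network:
# for port in (443, 8883, 8443):
#     print(port, probe_tcp("<AWS-ID>.iot.<LOCATION>.amazonaws.com", port))
```

A "timeout" on 8883 next to an "open" on 443 would point at filtering somewhere between the container and the broker rather than at the Telegraf configuration.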

So things I tried are:

  • changing the telegraf version (with 1.18.3 the container doesn't crash, but it fails to receive messages; later versions crash after the mentioned delay)
  • changing the port (443, 8883, 8443)
  • pinging the AWS server from inside the telegraf container before it crashes (works!)
  • setting insecure_skip_verify = false or insecure_skip_verify = true, or leaving it out entirely
  • playing around with the AWS IoT-Core security policy (i.e. TLS13_1_3_2022_10, TLS12_1_0_2015_01, etc.); I see via Wireshark that telegraf then uses TLS 1.3 or TLS 1.2, but apart from that no major change in behavior occurs

I'm expecting this to be about some tiny setting or configuration in either telegraf or AWS that I just don't know about. I'm not sure how I can proceed after spending close to a day on this already.

Answer 1

Score: 0

It turns out the solution was quite simple, though the Python script made it confusing: the outgoing port was blocked in our network. The Python script (as mentioned) used port 443, so it got through. Port 443 wouldn't work for telegraf, though, and port 8883 was blocked, which led to the timeout error.
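On why port 443 works for the Python script but not for telegraf: as far as I know, AWS IoT-Core accepts MQTT with X.509 client certificates on port 443 only when the client offers the ALPN protocol name `x-amzn-mqtt-ca` during the TLS handshake. The AWS SDK does this on its own; whether Telegraf 1.27.1 can be configured to do so is not something established here, so treat this as a likely explanation rather than a confirmed one. A minimal standard-library sketch of a TLS context that offers ALPN (`build_alpn_context` is a hypothetical helper name):

```python
import ssl

# ALPN protocol name that, to my knowledge, AWS IoT-Core expects for MQTT on port 443.
AWS_IOT_MQTT_ALPN = "x-amzn-mqtt-ca"


def build_alpn_context(ca_path=None, cert_path=None, key_path=None) -> ssl.SSLContext:
    """Build a client TLS context that offers the AWS IoT MQTT ALPN protocol."""
    # With ca_path=None the system trust store is used instead of ca.pem.
    context = ssl.create_default_context(cafile=ca_path)
    context.set_alpn_protocols([AWS_IOT_MQTT_ALPN])
    if cert_path and key_path:
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return context


# Usage (endpoint placeholder as in the question):
# import socket
# host = "<AWS-ID>.iot.<LOCATION>.amazonaws.com"
# context = build_alpn_context("ca.pem", "cert.pem", "key.pem")
# with socket.create_connection((host, 443), timeout=10) as raw:
#     with context.wrap_socket(raw, server_hostname=host) as tls:
#         print(tls.version(), tls.selected_alpn_protocol())
```

If the handshake completes and `selected_alpn_protocol()` returns the name above, port 443 is usable for MQTT from that network with a client that supports ALPN.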

huangapple
  • Published on July 6, 2023, 21:14:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76629239.html