Parse website sitemap.xml using Ansible XML.

Question

There is a website with a /sitemap.xml file such as the following:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/es/</loc>
    <lastmod>2023-02-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>https://example/en/</loc>
    <lastmod>2023-02-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>https://example.com/en/destinations/</loc>
    <lastmod>2021-09-16</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
[..]
</urlset>

Using Ansible (latest version) I am trying to:

  1. Download the file using ansible.builtin.uri and register the result into a variable.
  2. Either loop over the url nodes inside urlset and create a list with the loc and the priority, or
  3. Convert it to JSON and do the same.

I am stuck at points 2-3. This is my current code:

- name: Get the website 'sitemap.xml' file
  ansible.builtin.uri:
    url: "https://example.com/sitemap.xml"
    method: GET
    return_content: true
    headers:
      Accept: "application/xml"
    status_code: 200
    timeout: 5
  register: sitemap
  delegate_to: localhost

- name: Parse the retrieved XML file
  community.general.xml:
    xmlstring: "{{ sitemap.content }}"
    xpath: /s:urlset
    content: text
    namespaces:
      s: http://www.sitemaps.org/schemas/sitemap/0.9
  register: parsedxml
  delegate_to: localhost

Now parsedxml.xmlstring contains the XML of sitemap.xml, which I already had in the sitemap.content variable. So, basically, I haven't been able to either:

  1. Use community.general.xml to somehow build a list of dicts (with loc and priority) by looping over the list of url nodes, or
  2. Convert the XML file to JSON using the ansible.netcommon.parse_xml filter; I have not been able to produce a specification file to pass as a parameter to the filter, and the documentation for that filter seems to be missing.

Any hints on how to loop through all the url nodes and build such a list of dictionaries?

Answer 1

Score: 1

So, yeah, Ansible isn't great at dealing with XML, and the parse_xml filter you found is really designed for use by Ansible network module authors, which explains why it is so awkward to use.

This is my approach:

  tasks:
    # Query the same document once per child element; each loop iteration
    # registers its own list of matches under parsedxml.results.
    - name: Extract loc and priority from every url node
      community.general.xml:
        xmlstring: '{{ sitemap.content }}'
        xpath: /s:urlset/s:url/s:{{ item }}
        content: text
        namespaces:
          s: "http://www.sitemaps.org/schemas/sitemap/0.9"
      loop: [ loc, priority ]
      register: parsedxml
      vars:
        # Inline sample standing in for the content registered by the uri task
        sitemap:
          content: |
            <?xml version="1.0" encoding="UTF-8"?>
            <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
              <url>
                <loc>https://example.com/es/</loc>
                <lastmod>2023-02-15</lastmod>
                <changefreq>monthly</changefreq>
                <priority>0.5</priority>
              </url>
              <url>
                <loc>https://example/en/</loc>
                <lastmod>2023-02-15</lastmod>
                <changefreq>monthly</changefreq>
                <priority>0.5</priority>
              </url>
            </urlset>

    # Each zipped element pairs one loc match with one priority match,
    # roughly [ {"<ns>loc": "https://..."}, {"<ns>priority": "0.5"} ],
    # so the first (only) value of each dict is the text we want.
    - name: Zip the two result lists into a list of dicts
      ansible.builtin.set_fact:
        things: '{{ tmp | from_yaml }}'
      vars:
        tmp: |
          {% set zipped = parsedxml.results[0].matches
                    | zip(parsedxml.results[1].matches) %}
          {% for tup in zipped %}
          - url: {{ tup[0].values()|first }}
            pri: {{ tup[1].values()|first }}
          {% endfor %}

which produces:

{
    "ansible_facts": {
        "things": [
            {
                "pri": 0.5,
                "url": "https://example.com/es/"
            },
            {
                "pri": 0.5,
                "url": "https://example/en/"
            }
        ]
    }
}

As best I can tell, the community.general.xml module really is designed for more "surgical" changes than for generic XPath queries into a document, and it is definitely bad at "give me multiple keys". So we just cheat: run the XPath query more than once, once for each child key we want, and then | zip the two result lists back together.
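
If the Jinja template plus from_yaml step feels heavy, the same zip idea can probably be written as a loop that appends one dict per pair. This is only a sketch, not part of the original answer: it assumes parsedxml was registered exactly as above (results[0] holding the loc matches and results[1] the priority matches), and it reuses the illustrative key names url and pri.

```yaml
# Sketch: build the list by looping over the zipped match lists.
# Assumes each match is a single-key dict, so its first value is the text.
- name: Pair each loc with its priority
  ansible.builtin.set_fact:
    things: "{{ things | default([]) + [{'url': item.0.values() | first, 'pri': item.1.values() | first}] }}"
  loop: "{{ parsedxml.results[0].matches | zip(parsedxml.results[1].matches) | list }}"
```

Unlike the from_yaml version, pri stays a string here (for example "0.5"); append | float after the second | first if a number is needed.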

Posted on 2023-03-04 06:58:19
Original link: https://go.coder-hub.com/75632549.html