How to get the file name part from HTML src attribute of <script> tag using Regex pattern in Java

Question

        String javaScript = "<script src=\"https://www.xxx.co.uk/rta2/v-0.52.min.js\" class=\"RTA2-loader\" data-hosts=\"ted.xxx.co.uk\"></script>";

        Pattern scriptPattern = Pattern.compile("<script[^>]+src\\s*=\\s*[\"'](.*?)[\"'][^>]*>");

        Matcher script = scriptPattern.matcher(javaScript);
        if (script.find()) {
            String srcValue = script.group(1);
            String[] pathSegments = srcValue.split("[\\\\/]");
            String fileName = pathSegments[pathSegments.length - 1];
            System.out.println(fileName);
        }

Output:

v-0.52.min.js

English:

I need to get the file name from the src attribute of an HTML 'script' tag. I managed to get the value of the entire src attribute, but I am not sure how to get only the file name, including its extension. Below is the code with an example.

        String javaScript = "<script src=\"https://www.xxx.co.uk/rta2/v-0.52.min.js\" class=\"RTA2-loader\" data-hosts=\"ted.xxx.co.uk\"></script>";

        Pattern scriptPattern = Pattern.compile("<script[^>]+src\\s*=\\s*[\"'](.*?)[\"'][^>]*>");

        Matcher script = scriptPattern.matcher(javaScript);
        if (script.find()) {
            System.out.println(script.group(1));
        }

The code above prints https://www.xxx.co.uk/rta2/v-0.52.min.js

Instead of the entire URL, I want only the file name, i.e.

v-0.52.min.js

It should also support both '/' and '\' as path separators.

Please help.

Answer 1

Score: 0

String javaScript = "<script src=\"https://www.xxx.co.uk/rta2/v-0.52.min.js\" class=\"RTA2-loader\" data-hosts=\"ted.xxx.co.uk\"></script>";
Pattern pattern = Pattern.compile("<script src=\"[^\"]+(?:/|\\\\)([^\"]+)\"");
Matcher matcher = pattern.matcher(javaScript);
if (matcher.find()) {
    String src = matcher.group(1);
    System.out.println(src);
}

The regular expression searches for the literal string <script src=
followed by a single double quote character, i.e. "
followed by one or more characters that are not the double quote character
followed by either a single forward slash, i.e. /, or a single backslash, i.e. \
again followed by one or more characters that are not the double quote character (and these characters are placed in a capturing group)
and finally followed by another double quote character.

The above code displays the following:

v-0.52.min.js
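
Note that the pattern above only matches when src is the first attribute and is wrapped in double quotes. A slightly more tolerant variant is sketched below; it is an illustration under stated assumptions, not a complete HTML matcher. The class name ScriptSrcFileName and the method fileNameOf are hypothetical names chosen for this example. It assumes the quote character used around the value does not appear inside the value, and it does not strip query strings.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptSrcFileName {
    // "<script", then any attributes, then a whitespace-preceded src = "..." or '...'.
    // Group 1 is the opening quote; group 2 is the text after the last '/' or '\'.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(?:[^\"'>]*[/\\\\])?([^\"'/\\\\>]+)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the file name of the first script tag that has a src attribute, if any.
    static Optional<String> fileNameOf(String html) {
        Matcher m = SCRIPT_SRC.matcher(html);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<script src=\"https://www.xxx.co.uk/rta2/v-0.52.min.js\" class=\"RTA2-loader\" data-hosts=\"ted.xxx.co.uk\"></script>";
        // prints v-0.52.min.js
        System.out.println(fileNameOf(html).orElse("(no src found)"));
    }
}
```

Requiring whitespace before src (rather than a word boundary) keeps attributes such as data-src from matching by accident.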

Nonetheless, I wish to point out that using an HTML parser is preferred over regular expressions when it comes to parsing HTML.
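
If the src value is obtained from an HTML parser instead, only the file-name step remains, and that step needs no regular expression at all. Here is a minimal sketch using plain string operations. The class SrcFileName and the method fileName are hypothetical names for this example, and dropping a trailing ?query or #fragment is an assumption that goes beyond what the question asks for.

```java
class SrcFileName {
    // Returns the text after the last '/' or '\' in src, without any ?query or #fragment.
    static String fileName(String src) {
        // Assumption: anything from the first '?' or '#' onward is not part of the file name.
        int end = src.length();
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        String path = src.substring(0, end);
        // lastIndexOf returns -1 when a separator is absent, so lastSep + 1 is then 0.
        int lastSep = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(lastSep + 1);
    }

    public static void main(String[] args) {
        // prints v-0.52.min.js
        System.out.println(fileName("https://www.xxx.co.uk/rta2/v-0.52.min.js"));
        // prints app.js
        System.out.println(fileName("C:\\libs\\app.js?v=3"));
    }
}
```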

huangapple
  • Posted on July 26, 2020, 22:53:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63101768.html