Get all img src with Jsoup

Question

I have HTML with the following `img` elements:

```html
<img src="https://lh3.googleusercontent.com/...rw" srcset="https://lh3.googleusercontent.com/...rw 2x" class="T75of DYfLw" width="551" height="310" alt="Screenshot Image">

<img data-src="https://lh3.googleusercontent.com/...w720-h310-rw" ... data-srcset="https://lh3.googleusercontent.com/... w1440-h620-rw 2x" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="551" height="310" alt="Screenshot Image">
```

I want to get all screenshots that have the attribute `alt="Screenshot Image"`. So I need the value of the `srcset` attribute in the first case and of `data-srcset` in the second (two different attribute names, two different cases).

I wrote this code:

```java
List<String> src = htmlDocument.select("img[src]").stream()
        .filter(img -> img.attr("alt").equals("Screenshot Image"))
        .map(element -> element.absUrl("data-srcset").replace("2x", ""))
        // or, for the first case:
        // .map(element -> element.absUrl("srcset") ...
        .collect(Collectors.toList());
```

With this code I can't get the value in the first case, where the attribute is `srcset` rather than `data-srcset`. Can I get the URLs for both cases without an additional iteration, that is, without creating another stream and then merging all results into one collection? Maybe a regex or another method in the Jsoup library could help (it seems `.absUrl` doesn't work with a regex).

I also don't like the `replace` part, because some URL might contain `2x` as part of itself:

```java
.map(element -> element.absUrl("data-srcset").replace("2x", ""))
```

But without this manipulation I get an incorrect URL:

```
https://lh3.googleusercontent.com/Z...=w1440-h620-rw 2x
```

Can I improve this `replace` solution with something else?


# Answer 1
**Score**: 1

You could build a collection of candidate URLs for each element and then flatten the result with `flatMap`:

```java
List<String> src = htmlDocument.select("img[src]").stream()
        .filter(img -> img.attr("alt").equals("Screenshot Image"))
        .map(element -> {
            // Collect both candidate attributes for this element.
            List<String> url = new ArrayList<>();
            url.add(element.absUrl("data-srcset").replace("2x", ""));
            url.add(element.absUrl("srcset"));
            return url;
        })
        // Flatten the per-element lists into a single stream of URLs.
        .flatMap(List::stream)
        .collect(Collectors.toList());
```
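
The intermediate `ArrayList` is not strictly necessary. Below is a minimal sketch of the same flattening step, relying on the fact that Jsoup's `absUrl` returns an empty string when the attribute is missing: `Stream.of` yields both candidates and a filter drops the empty ones. The class name `CandidateUrls` and its methods are hypothetical, and plain strings stand in for the `absUrl(...)` results so the snippet runs without Jsoup.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CandidateUrls {
    // Hypothetical helper: emits whichever of the two attribute values is non-empty.
    static Stream<String> nonEmpty(String dataSrcset, String srcset) {
        return Stream.of(dataSrcset, srcset).filter(s -> s != null && !s.isEmpty());
    }

    // Each String[] stands in for one <img>: {data-srcset value, srcset value}.
    static List<String> collect(List<String[]> images) {
        return images.stream()
                .flatMap(attrs -> nonEmpty(attrs[0], attrs[1]))
                .collect(Collectors.toList());
    }
}
```

In the Jsoup stream this would correspond to something like `.flatMap(e -> CandidateUrls.nonEmpty(e.absUrl("data-srcset"), e.absUrl("srcset")))`. The values still carry their ` 2x` descriptor at this point.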

As for your last question: assuming your URLs don't contain whitespace, you could use:

```java
// StringUtils here is presumably org.apache.commons.lang3.StringUtils (Apache Commons Lang).
StringUtils.substringBefore(element.absUrl("data-srcset"), " ")
```
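
To illustrate why cutting at the first whitespace is safer than `replace("2x", "")`, here is a small sketch in plain Java with no extra dependency. The class `SrcsetUrls`, the method `firstUrl`, and the `example.com` URLs are made up for illustration, and the sketch assumes the attribute holds a single candidate of the form `URL descriptor`.

```java
class SrcsetUrls {
    // Returns the part of a single srcset candidate before the first whitespace,
    // i.e. the URL without a trailing descriptor such as "2x".
    static String firstUrl(String candidate) {
        String trimmed = candidate.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        // Limit 2: only the first token is needed.
        return trimmed.split("\\s+", 2)[0];
    }
}
```

For a hypothetical value such as `https://example.com/img2x.png 2x`, `firstUrl` keeps the file name intact, whereas `replace("2x", "")` would also remove the `2x` inside `img2x.png`.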

EDIT:

I had assumed an image could have both `srcset` and `data-srcset`. After reading the question again, I ended up with a better approach:

```java
List<String> src = htmlDocument.select("img[src]").stream()
        .filter(img -> img.attr("alt").equals("Screenshot Image"))
        // Use srcset when it is present, otherwise fall back to data-srcset.
        .map(element -> StringUtils.isNotEmpty(element.absUrl("srcset"))
                ? element.absUrl("srcset")
                : element.absUrl("data-srcset").replace("2x", ""))
        .collect(Collectors.toList());
```
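
Note that in the sample HTML from the question the `srcset` value also ends in ` 2x`, so both branches probably need the descriptor removed. Here is a hedged sketch of the same fallback in plain Java. The class `ScreenshotUrl` and the method `pick` are hypothetical, and the two parameters stand in for `element.absUrl("srcset")` and `element.absUrl("data-srcset")`.

```java
class ScreenshotUrl {
    // Prefers the srcset value, falls back to data-srcset, then drops
    // everything from the first space on (for example the " 2x" descriptor).
    static String pick(String srcset, String dataSrcset) {
        String raw = (srcset != null && !srcset.isEmpty()) ? srcset : dataSrcset;
        if (raw == null) {
            return "";
        }
        raw = raw.trim();
        int space = raw.indexOf(' ');
        return space < 0 ? raw : raw.substring(0, space);
    }
}
```

In the stream this could be used as `.map(e -> ScreenshotUrl.pick(e.absUrl("srcset"), e.absUrl("data-srcset")))`, which keeps everything in a single pass.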

Posted by huangapple on August 13, 2020, 23:24:13
Original link: https://go.coder-hub.com/63398290.html