Get all img src with Jsoup
Question
I have HTML with the following `img` elements:

```html
<img src="https://lh3.googleusercontent.com/...rw" srcset="https://lh3.googleusercontent.com/...rw 2x" class="T75of DYfLw" width="551" height="310" alt="Screenshot Image">
<img data-src="https://lh3.googleusercontent.com/...w720-h310-rw" ... data-srcset="https://lh3.googleusercontent.com/... w1440-h620-rw 2x" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="551" height="310" alt="Screenshot Image">
```

I want to get all screenshots, i.e. every image with the attribute `alt="Screenshot Image"`. The URL I need is the value of either `srcset` or `data-srcset` (two different attribute names, so two different cases).

I wrote this code:

```java
List<String> src = htmlDocument.select("img[src]").stream()
        .filter(img -> img.attr("alt").equals("Screenshot Image"))
        .map(element -> element.absUrl("data-srcset").replace("2x", ""))
        // or, for the first case:
        // .map(element -> element.absUrl("srcset")..
        .collect(Collectors.toList());
```

But this doesn't give me the value in the first case, where the attribute is `srcset` rather than `data-srcset`. Can I get the URLs for both cases without an additional iteration, i.e. without creating another stream and then merging all results into one collection? Maybe a regex or some other method in the Jsoup library can help (it seems `.absUrl` doesn't work with a regex)?

I also don't like the `replace` part (some URL might contain `2x` as part of itself):

```java
.map(element -> element.absUrl("data-srcset").replace("2x", ""))
```

But without this manipulation I get an incorrect URL:

`https://lh3.googleusercontent.com/Z...=w1440-h620-rw 2x`

Can I replace this `replace` approach with something better?
# Answer 1
**Score**: 1
You could try to create a collection of collections and then `flatMap` it:
```java
List<String> src = htmlDocument.select("img[src]").stream()
        .filter(img -> img.attr("alt").equals("Screenshot Image"))
        .map(element -> {
            // Collect both candidate attributes; absUrl returns "" when an attribute is missing.
            List<String> url = new ArrayList<>();
            url.add(element.absUrl("data-srcset").replace("2x", ""));
            url.add(element.absUrl("srcset"));
            return url;
        })
        .flatMap(List::stream)
        .collect(Collectors.toList());
```
For your last question, assuming your URLs don't contain whitespace, you could use `StringUtils` from Apache Commons Lang:
```java
StringUtils.substringBefore(element.absUrl("data-srcset"), " ")
```
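If you would rather not depend on `StringUtils`, the same idea can be sketched in plain Java. This is only a rough sketch: `SrcsetUrls` and `firstUrl` are hypothetical names made up for illustration, and it only looks at the first candidate of a srcset-style value (URL, whitespace, descriptor) rather than implementing the full parsing rules from the HTML spec.

```java
import java.util.Optional;

// Hypothetical helper (not part of Jsoup): extracts the URL from a
// srcset-style value such as "https://host/img-rw 2x".
final class SrcsetUrls {
    private SrcsetUrls() {}

    static Optional<String> firstUrl(String srcset) {
        if (srcset == null) {
            return Optional.empty();
        }
        String trimmed = srcset.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        // The URL is the first run of non-whitespace characters;
        // whatever follows (e.g. "2x") is the descriptor.
        String url = trimmed.split("\\s+", 2)[0];
        // Drop a trailing comma left over from "a.png, b.png 2x".
        while (url.endsWith(",")) {
            url = url.substring(0, url.length() - 1);
        }
        return url.isEmpty() ? Optional.empty() : Optional.of(url);
    }
}
```

Because only the part after the first whitespace is cut off, a `2x` that appears inside the URL itself is left alone, which is the case the `replace("2x", "")` version gets wrong.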
EDIT:
I assumed you could have both `srcset` and `data-srcset` in the same image. Reading the question again, I ended up with a better approach:
```java
List<String> src = htmlDocument.select("img[src]").stream()
        .filter(img -> img.attr("alt").equals("Screenshot Image"))
        .map(element -> StringUtils.isNotEmpty(element.absUrl("srcset")) ?
                element.absUrl("srcset") :
                element.absUrl("data-srcset").replace("2x", ""))
        .collect(Collectors.toList());
```
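Note that in the sample HTML both `srcset` and `data-srcset` end in ` 2x`, so the descriptor probably has to be stripped in both branches. One possible way to do that in a single pass is sketched below. `ScreenshotUrls`, `pick` and `collect` are hypothetical names, and the attribute lookup is passed in as a function so the sketch runs without Jsoup. It assumes the lookup returns an empty string (or `null`) for a missing attribute, which is how `absUrl` behaves, and that each value holds a single candidate.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: picks the first non-empty of "srcset" / "data-srcset"
// and keeps only the URL part (everything before the first whitespace).
final class ScreenshotUrls {
    private ScreenshotUrls() {}

    // attr maps an attribute name to its value, e.g. element::absUrl with Jsoup.
    static Optional<String> pick(UnaryOperator<String> attr) {
        return Stream.of("srcset", "data-srcset")
                .map(attr)
                .filter(value -> value != null && !value.trim().isEmpty())
                .map(value -> value.trim().split("\\s+", 2)[0])
                .findFirst();
    }

    // One pass over all images; images with neither attribute are skipped.
    static List<String> collect(Stream<UnaryOperator<String>> images) {
        return images
                .map(ScreenshotUrls::pick)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
    }

    // With Jsoup this would be used roughly as (untested here):
    //   ScreenshotUrls.collect(
    //       htmlDocument.select("img[src]").stream()
    //           .filter(img -> img.attr("alt").equals("Screenshot Image"))
    //           .map(img -> img::absUrl));
}
```

Passing the lookup as a function keeps the "which attribute wins" rule in one place, so adding another attribute name later only means extending the `Stream.of(...)` list.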