Using HtmlUnit to load dynamic web apps
Question
Pardon the long code, but I have been trying to write a web page loader class that can load both static and dynamic pages. The code below works fine with mostly static web pages, but it won't load SPA web apps (the CSS selector always fails), such as apps that run on Vue, React, GWT, etc., or websites that are gated by authentication (hence the HttpHeadersSpec data structure).
What could be wrong in this code?
public class WebPageLoader {
    private static final int WEB_CLIENT_POOL_SIZE = 3;
    private static final int JAVASCRIPT_WAIT_TIME = 10000;
    private static final int MAX_ATTEMPTS = 3;

    private final String targetUrl;
    private final String cssSelector;
    private final int maxAttempts;
    private final HttpHeadersSpec httpHeadersSpec;
    private Consumer<String> whenMaxAttemptsFailed;

    private static final BlockingQueue<WebClient> webClients = new LinkedBlockingQueue<>();

    static {
        // Create and initialize the pooled web clients statically
        for (int i = 0; i < WEB_CLIENT_POOL_SIZE; i++) {
            webClients.add(createWebClient());
        }
    }

    public WebPageLoader(String targetUrl) {
        this(targetUrl, null, MAX_ATTEMPTS, null); // Default max attempts is MAX_ATTEMPTS (3)
    }

    public WebPageLoader(String targetUrl, String cssSelector) {
        this(targetUrl, cssSelector, MAX_ATTEMPTS, null); // Default max attempts is MAX_ATTEMPTS (3)
    }

    public WebPageLoader(String targetUrl, String cssSelector, HttpHeadersSpec httpHeadersSpec) {
        this(targetUrl, cssSelector, MAX_ATTEMPTS, httpHeadersSpec); // Default max attempts is MAX_ATTEMPTS (3)
    }

    public WebPageLoader(String targetUrl, String cssSelector, int maxAttempts, HttpHeadersSpec httpHeadersSpec) {
        this.targetUrl = Objects.requireNonNull(targetUrl, "Target URL cannot be null");
        this.cssSelector = cssSelector;
        this.httpHeadersSpec = httpHeadersSpec != null ? httpHeadersSpec : new HttpHeadersSpec();
        // Use the caller-supplied value (the original assigned MAX_ATTEMPTS and ignored the parameter)
        this.maxAttempts = maxAttempts;
    }

    public void whenMaxAttemptsFailed(Consumer<String> whenMaxAttemptsFailed) {
        this.whenMaxAttemptsFailed = whenMaxAttemptsFailed;
    }

    public Webpage loadWebPageAfterCSSSelectorIsReady() throws InterruptedException, IOException {
        WebClient webClient = getAvailableWebClient();
        try {
            HtmlPage page = prepareAndLoadPage(webClient);
            if (cssSelector != null) waitForSelector(page);
            return createWebpage(page);
        } finally {
            // Release the web client for reuse
            releaseWebClient(webClient);
        }
    }

    protected HtmlPage prepareAndLoadPage(WebClient webClient) throws IOException {
        long startTime = System.currentTimeMillis();
        String domain = LinkProcessor.getDomainName(targetUrl);
        if (httpHeadersSpec != null) {
            applyHttpHeaders(webClient, domain, httpHeadersSpec);
        }
        HtmlPage page = webClient.getPage(targetUrl);
        if (httpHeadersSpec != null && httpHeadersSpec.getLocal_storage() != null) {
            for (StorageItemSpec item : httpHeadersSpec.getLocal_storage()) {
                page.executeJavaScript(
                        "localStorage.setItem('" + item.getKey() + "', '" + item.getValue() + "');");
            }
        }
        // Refresh the page after writing keys to local storage so that they are hopefully picked up
        page.refresh();
        webClient.waitForBackgroundJavaScriptStartingBefore(JAVASCRIPT_WAIT_TIME);
        long endTime = System.currentTimeMillis();
        long executionTime = endTime - startTime;
        System.out.println("Page execution time: " + executionTime + " ms");
        return page;
    }

    protected void waitForSelector(HtmlPage page) throws InterruptedException {
        if (cssSelector != null) {
            int attempts = 0;
            while (attempts < maxAttempts) {
                DomNodeList<DomNode> elements = page.querySelectorAll(cssSelector);
                if (elements.size() > 0) {
                    break;
                }
                synchronized (page) {
                    page.wait(JAVASCRIPT_WAIT_TIME); // wait for JAVASCRIPT_WAIT_TIME ms before trying again
                }
                attempts++;
            }
            if (attempts == maxAttempts) {
                if (whenMaxAttemptsFailed != null) {
                    whenMaxAttemptsFailed.accept(cssSelector);
                }
            }
        }
    }

    protected String getCssSelector() {
        return this.cssSelector;
    }

    private Webpage createWebpage(HtmlPage page) {
        int statusCode = page.getWebResponse().getStatusCode();
        Webpage webPage = new Webpage(page.asXml(), statusCode);
        webPage.setUrl(targetUrl);
        return webPage;
    }

    // Get an available web client from the pool, waiting until one is released if the pool is empty
    protected WebClient getAvailableWebClient() throws InterruptedException {
        WebClient webClient;
        synchronized (webClients) {
            while (webClients.isEmpty()) {
                try {
                    webClients.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            webClient = webClients.take();
        }
        return webClient;
    }

    // Release the web client back to the pool for reuse
    protected void releaseWebClient(WebClient webClient) {
        synchronized (webClients) {
            webClients.add(webClient);
            webClients.notifyAll();
        }
    }

    private static void applyHttpHeaders(WebClient webClient, String domain, HttpHeadersSpec httpHeadersSpec) {
        // Apply cookies
        if (httpHeadersSpec.getCookies() != null) {
            httpHeadersSpec.getCookies().forEach(cookieSpec ->
                    webClient.getCookieManager().addCookie(new Cookie(domain, cookieSpec.getKey(), cookieSpec.getValue())));
        }
    }

    // Create a WebClient with the desired settings
    private static WebClient createWebClient() {
        WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
        webClient.getOptions().setUseInsecureSSL(false);
        webClient.getOptions().setAppletEnabled(false);
        webClient.getOptions().setDownloadImages(false);
        webClient.getOptions().setPopupBlockerEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setPrintContentOnFailingStatusCode(false);
        webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
        webClient.getOptions().setCssEnabled(true);
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.getCookieManager().setCookiesEnabled(true);
        return webClient;
    }
}
Answer 1
Score: 1
There are still some problems with these applications, mainly because of some missing DOM features (e.g. ShadowDOM) or problems with the JavaScript support.
Both areas are constantly improving, but we have to look at this case by case. If you are able to open an issue pointing to a missing or misbehaving function (and maybe provide a small test case), there is a good chance of getting it fixed or improved.
Apart from these general problems, it might be a good idea to have a look at the HtmlUnitBrowser class from the Wetator (www.wetator.org) project. It has many generic implementations you might find useful for ideas on how to handle common problems.
E.g., have a look at waitForImmediateJobs() or getCurrentPage().
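As a rough illustration of the "re-read the current page, then check again" idea, here is a minimal, library-independent polling sketch. The class name PollingSketch and the method pollUntil are hypothetical names made up for this example; they are not part of HtmlUnit or Wetator, and this is not how Wetator implements waitForImmediateJobs() or getCurrentPage(). The point it tries to show is that the state is fetched afresh on every attempt instead of reusing an object obtained before the page changed.

```java
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper (not an HtmlUnit or Wetator API): polls a freshly
// supplied state until a readiness check passes or attempts run out.
final class PollingSketch {
    private PollingSketch() {}

    /**
     * Calls currentState.get() on every attempt so that a stale object is
     * never re-checked. Returns the first state that satisfies ready, or
     * Optional.empty() after maxAttempts unsuccessful attempts.
     */
    static <T> Optional<T> pollUntil(Supplier<T> currentState,
                                     Predicate<T> ready,
                                     int maxAttempts,
                                     long delayMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T state = currentState.get();
            if (state != null && ready.test(state)) {
                return Optional.of(state);
            }
            // Sleep only between attempts, not after the last one
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis);
            }
        }
        return Optional.empty();
    }
}
```

With HtmlUnit, the supplier could plausibly be something like `() -> (HtmlPage) webClient.getCurrentWindow().getEnclosedPage()` and the predicate `p -> !p.querySelectorAll(cssSelector).isEmpty()`, possibly with a call to `webClient.waitForBackgroundJavaScript(...)` before each check. That wiring is an assumption and untested here. It may matter for the question's code because the result of `page.refresh()` is discarded there, so `waitForSelector` appears to keep querying the page object from before the refresh.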