How to prevent a read timeout from failing requests while scraping data using jsoup in Java?

Question

I am learning how to scrape data from a web page using jsoup in Java. On my first try I successfully got the output, but when I ran it again, it gave an error message. Here is my code:

package solution;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebScraper {

    public static void main(String[] args) throws IOException {

        Document d = Jsoup.connect("https://www.wikihow.com/wikiHowTo?search=adjust+bass+on+computerr")
                          .timeout(6000)
                          .get();
        Elements ele = d.select("div#searchresults_list");
        for (Element element : ele.select("div.result")) {
            String img_url = element.select("div.result_title").text();
            System.out.println(img_url);
        }

    }
}

Here is the error message I got:

Exception in thread "main" java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:466)
    at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:460)
    at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:159)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110)
    at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1198)
    at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1107)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:400)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:372)
    at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:587)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:167)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:732)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:707)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:297)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:286)
    at solution.WebScraper.main(WebScraper.java:14)

Process finished with exit code 1

Can anyone help?

P.S. edit:

After solving this issue, I found there are several approaches to this problem, such as:

  1. Set a higher value for the timeout parameter, e.g. 8000 instead of the previous 6000.

  2. Make sure your internet connection is stable.

Thanks to everyone who gave advice on this problem.
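The two fixes above could be combined into a single retry loop that widens the timeout on each attempt. A minimal sketch, assuming jsoup is on the classpath; the method name `fetchWithRetry`, the helper `timeoutForAttempt`, and the specific retry counts are illustrative, not part of the original post:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RetryFetch {

    // Grow the timeout on each attempt: 8000 ms, 12000 ms, 16000 ms, ...
    static int timeoutForAttempt(int attempt) {
        return 8000 + attempt * 4000;
    }

    // Retry the request a few times before giving up, widening the timeout
    // each time. Rethrows the last SocketTimeoutException if every attempt
    // times out.
    static Document fetchWithRetry(String url, int maxAttempts) throws IOException {
        SocketTimeoutException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return Jsoup.connect(url)
                            .timeout(timeoutForAttempt(attempt))
                            .get();
            } catch (SocketTimeoutException e) {
                last = e; // slow network or slow server: try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        Document d = fetchWithRetry(
                "https://www.wikihow.com/wikiHowTo?search=adjust+bass+on+computerr", 3);
        System.out.println(d.title());
    }
}
```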


Answer 1

Score: 3

Possibly your internet connection speed is very low. Check your internet connection.

Or try the URL in a browser and check how long the HTML page takes to load.

Also, add a try-catch block.
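The try-catch suggestion could look roughly like this (a sketch only; the 8000 ms timeout and the fallback messages are my own additions). Note that `SocketTimeoutException` must be caught before the more general `IOException`:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SafeScraper {
    public static void main(String[] args) {
        try {
            Document d = Jsoup.connect("https://www.wikihow.com/wikiHowTo?search=adjust+bass+on+computerr")
                              .timeout(8000) // more generous than the original 6000 ms
                              .get();
            System.out.println(d.title());
        } catch (SocketTimeoutException e) {
            // The server did not respond in time; fail gracefully instead of crashing.
            System.err.println("Request timed out: " + e.getMessage());
        } catch (IOException e) {
            // Any other network or HTTP error.
            System.err.println("Request failed: " + e.getMessage());
        }
    }
}
```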


Answer 2

Score: 2

Some observations:

  1. The stack trace shows that the timeout occurred while the client was still going through the SSL setup. A few things can go wrong in that process.

  2. timeout(6000) sets the timeout to 6 seconds. That is pretty short if the network path is congested, the server is far away, the server is heavily loaded, and so on.

  3. You said it worked at first and then stopped working. This could be a load or congestion issue. Or the server might have seen repeated calls from your client asking for the same URL, interpreted them as a DOS attack or a misconfigured application, and put a block on your IP address.
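If repeated identical requests are indeed being throttled or blocked, identifying the client with a browser-like User-Agent and spacing requests apart may help. A hedged sketch using jsoup's `userAgent` setting; the specific User-Agent string and the 2-second delay are illustrative assumptions, not values from the answer:

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PoliteScraper {

    // Pause between consecutive requests, in milliseconds (illustrative value).
    static final long DELAY_MS = 2000;

    public static void main(String[] args) throws IOException, InterruptedException {
        String[] urls = {
            "https://www.wikihow.com/wikiHowTo?search=adjust+bass+on+computerr",
        };
        for (String url : urls) {
            Document d = Jsoup.connect(url)
                              // Identify as a regular browser; some servers block unknown clients.
                              .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                              .timeout(10000)
                              .get();
            System.out.println(d.title());
            Thread.sleep(DELAY_MS); // don't hammer the server with back-to-back requests
        }
    }
}
```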


huangapple
  • Published on 2020-10-08 22:15:53
  • Please keep this link when reposting: https://go.coder-hub.com/64264452.html