Why am I getting 403 status code in Java after a while?
Question
When I try to check status codes of the sites, I start getting a 403 response code after a while. When I first run the code, every site sends data back, but after the code repeats itself with a Timer, I see one webpage return a 403 response code. Here is my code:
```java
public class Main {

    public static void checkSites() {
        Timer ifSee403 = new Timer();
        try {
            File links = new File("./linkler.txt");
            Scanner scan = new Scanner(links);
            ArrayList<String> list = new ArrayList<>();
            while (scan.hasNext()) {
                list.add(scan.nextLine());
            }

            File linkStatus = new File("LinkStatus.txt");
            if (!linkStatus.exists()) {
                linkStatus.createNewFile();
            } else {
                System.out.println("File already exists");
            }

            BufferedWriter writer = new BufferedWriter(new FileWriter(linkStatus));

            for (String link : list) {
                try {
                    if (!link.startsWith("http")) {
                        link = "http://" + link;
                    }
                    URL url = new URL(link);
                    HttpURLConnection.setFollowRedirects(true);
                    HttpURLConnection http = (HttpURLConnection) url.openConnection();
                    http.setRequestMethod("HEAD");
                    http.setConnectTimeout(5000);
                    http.setReadTimeout(8000);
                    int statusCode = http.getResponseCode();
                    if (statusCode == 200) {
                        ifSee403.wait(5000);
                        System.out.println("Hello, here we go again");
                    }
                    http.disconnect();
                    System.out.println(link + " " + statusCode);
                    writer.write(link + " " + statusCode);
                    writer.newLine();
                } catch (Exception e) {
                    writer.write(link + " " + e.getMessage());
                    writer.newLine();
                    System.out.println(link + " " + e.getMessage());
                }
            }

            try {
                writer.close();
            } catch (Exception e) {
                System.out.println(e.getMessage());
            }

            System.out.println("Finished.");
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }

    public static void main(String[] args) throws Exception {
        Timer myTimer = new Timer();
        TimerTask sendingRequest = new TimerTask() {
            public void run() {
                checkSites();
            }
        };
        myTimer.schedule(sendingRequest, 0, 150000);
    }
}
```
How can I solve this? Thanks.
Edited comment:

- I've added `http.disconnect();` to close the connection after checking the status code.
- I've also added:

  ```java
  if (statusCode == 200) {
      ifSee403.wait(5000);
      System.out.println("Test message");
  }
  ```

  But it didn't work: the call throws a "current thread is not owner" error (`IllegalMonitorStateException`). I need to fix this, change 200 to 403, call `ifSee403.wait(5000)`, and then try to get the status code again.
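Note: the "current thread is not owner" message is not a compiler error but a runtime `IllegalMonitorStateException`. `Timer` simply inherits `wait()` from `Object`, and `Object.wait()` may only be called by a thread that holds that object's monitor; it is not a general "pause" facility. A minimal sketch of the usual alternative, `Thread.sleep`, where the retry helper and the simulated server are hypothetical illustrations rather than code from this question:

```java
import java.util.function.IntSupplier;

public class RetrySketch {

    // Hypothetical helper, not from the original post: runs a status
    // check, and if it sees a 403, pauses and then tries once more.
    static int checkWithRetry(IntSupplier statusCheck, long pauseMillis) {
        int status = statusCheck.getAsInt();
        if (status == 403) {
            try {
                // Thread.sleep pauses the current thread without needing
                // to own any monitor. Object.wait(), by contrast, throws
                // IllegalMonitorStateException ("current thread is not
                // owner") unless called inside synchronized (thatObject).
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            status = statusCheck.getAsInt();
        }
        return status;
    }

    public static void main(String[] args) {
        // Simulated server: the first call returns 403, the retry returns 200.
        int[] responses = {403, 200};
        int[] i = {0};
        System.out.println(checkWithRetry(() -> responses[i[0]++], 100)); // prints 200
    }
}
```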
Answer 1
Score: 1
One "alternative" - by the way - to IP / spoofing / anonymizing would be to try "obeying" what the security code expects you to do instead. If you are going to write a "scraper", and are aware there is "bot detection" that doesn't like you debugging your code while you visit the site over and over - you should try using the HTML download that I posted as an answer to the last question you asked.
If you download the HTML and save it to a file (once an hour), and then write your HTML parsing / monitoring code against the HTML contents of the file you have saved, you will (likely) be abiding by the security requirements of the web site and still be able to check availability.
If you wish to continue to use JSoup, that API has an option for receiving HTML as a `String`. So if you use the HTML scrape code I posted, and then write that HTML `String` to disk, you can feed it to JSoup as often as you like without setting off the bot-detection security checks.
If you play by their rules once in a while, you can write your tester without much hassle.
```java
import java.io.*;
import java.net.*;

...

// This line asks the "url" that you are trying to connect with for
// an instance of HttpURLConnection. These two classes (URL and
// HttpURLConnection) are in the standard JDK package java.net.
HttpURLConnection con = (HttpURLConnection) url.openConnection();

// Tells the connection to use "GET" ... and to "pretend" that you are
// using a "Chrome" web browser. Note, the User-Agent sometimes means
// something to the web server, and sometimes is fully ignored.
con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", "Chrome/61.0.3163.100");

// The classes InputStream, InputStreamReader, and BufferedReader
// are all JDK 1.0 java.io classes.
InputStream is = con.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
StringBuffer sb = new StringBuffer();
String s;

// This reads each line from the web server.
while ((s = br.readLine()) != null) sb.append(s + "\n");

// This writes the results from the web server to a file,
// using the classes java.io.File and java.io.FileWriter.
File outF = new File("SavedSite.html");
outF.createNewFile();
FileWriter fw = new FileWriter(outF);
fw.write(sb.toString());
fw.close();
```
Again, this code is very basic stuff that doesn't use any special JAR library code at all. The next method uses the JSoup library (which you have explicitly requested - even though I don't use it... it is just fine!). This is the method "parse", which will parse the `String` you have just saved. You may load this HTML `String` from disk and send it to JSoup using:

> Method documentation: org.jsoup.Jsoup.parse(File in, String charsetName, String baseUri)

If you wish to invoke JSoup, just pass it a `java.io.File` instance as follows:

```java
File f = new File("SavedSite.html");
Document d = Jsoup.parse(f, "UTF-8", url.toString());
```
I do not think you need timers at all...

AGAIN: the purpose of this answer is to show you how to save the server's response to a file on disk, so that you don't have to make lots of calls to the server - just one! If you restrict your calls to the server to once per hour, then you will (likely, but not guaranteed) avoid getting a `403 Forbidden` bot-detection problem.
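If you do end up keeping a repeating schedule for that once-per-hour fetch, it can be sketched with a `ScheduledExecutorService` instead of `java.util.Timer`. The one-hour period comes from this answer; the helper name and the task body are illustrative placeholders, not code from the original:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HourlyFetch {

    // Runs the task immediately, then at the given fixed rate. The task
    // is wrapped in a try/catch because an exception escaping a scheduled
    // task silently cancels all of its future runs.
    static void scheduleRepeating(ScheduledExecutorService scheduler,
                                  Runnable task, long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                task.run();
            } catch (Exception e) {
                System.out.println("Fetch failed: " + e.getMessage());
            }
        }, 0, period, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Placeholder standing in for the "download HTML and save it
        // to SavedSite.html" code shown earlier in this answer.
        scheduleRepeating(scheduler,
                () -> System.out.println("Fetching and saving HTML..."),
                1, TimeUnit.HOURS);
        Thread.sleep(200);   // let the immediate first run fire, then stop
        scheduler.shutdown();
    }
}
```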