Best way to download all images from a site using Java? Currently getting a 403 Status Error
Question
I am trying to download all the images off of a site, but I'm not sure if this is the best way, as I have tried setting a user agent and referrer to no avail. The 403 Status Error only occurs when trying to download the images from the src URLs; the page that has all the images in one place doesn't show any errors and supplies the src for each image. I am not sure if there is a way to download the images without visiting the src URLs, or whether there is a better way to do this entirely.
Here is my code so far:
private static void getPages() throws IOException {
    Document doc = Jsoup.connect("https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686")
            .get();
    Elements media = doc.getElementsByTag("img");
    System.out.println(media);
    Iterator<Element> ie = media.iterator();
    int i = 1;
    while (ie.hasNext()) {
        Response resultImageResponse = Jsoup.connect(ie.next().attr("src")).ignoreContentType(true)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0")
                .referrer("www.google.com").timeout(120000).execute();
        FileOutputStream out = (new FileOutputStream(new java.io.File("image #" + i++ + ".jpg")));
        out.write(resultImageResponse.bodyAsBytes());
        out.close();
    }
}
Answer 1
Score: 1
You have a few problems with your suggested approach:
1. You're trying to use JSoup to download file content data... JSoup is only for text data and won't return the image content/values. To download the image content you will need an HTTP request.
2. To download the images you also need to replicate the request a browser would make. Open Chrome, open developer tools and go to the network tab. Enter the URL for the page you want to scrape images from, and you'll see a bunch of requests being made. There'll be an individual request for each image somewhere in the view... if you click on the one labelled 1.jpg you'll see the request made to download the first image, and you'll then need to copy all the headers used to make that request. Note that both request AND response headers are shown in this view. Once you've replicated the request successfully, you can start testing which headers/cookies are actually required. I found the only real requirement was the "referer" header.
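In isolation, that referer requirement can be verified before building the full scraper. A minimal sketch, assuming the chapter page URL as the referer; the image URL here is a hypothetical example following the pattern of the srcs on that page:
import java.net.HttpURLConnection;
import java.net.URL;

public class RefererCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical image src, following the pattern scraped from the chapter page
        URL imageUrl = new URL("https://s5.mkklcdnv5.com/mangakakalot/r1/read_bleach_manga_online_for_free2/chapter_686_death_and_strawberry/1.jpg");
        HttpURLConnection conn = (HttpURLConnection) imageUrl.openConnection();
        // Send the chapter page itself as the referer, not an unrelated site
        conn.setRequestProperty("referer", "https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686");
        // 200 here (rather than 403) confirms the referer was the missing piece
        System.out.println(conn.getResponseCode());
    }
}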
I've stripped out most of what you might need/want, but something similar to the full example below is what you're after. I've pulled the comic book images in their entirety at full quality. I introduced a small sleep timer so as not to overload the server, as sometimes you'll get rate limited. Even without it you should be fine, but you don't want to get blocked for a lengthy period of time, so the slower you can allow the requests to come back to you the better. You could even make the requests in parallel.
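For completeness, a rough sketch of that parallel variant: the while loop in main (in the example below) could be swapped for a bounded thread pool, reusing makeImageRequest and writeToFile unchanged. The pool size and shutdown timeout are arbitrary choices of mine, and it needs the java.util.concurrent imports (ExecutorService, Executors, TimeUnit):
// Drop-in replacement for the download loop in main below
ExecutorService pool = Executors.newFixedThreadPool(4); // modest cap to stay polite
int i = 1;
for (Element img : media) {
    final int index = i++;
    final String imageUrlString = img.attr("src");
    pool.submit(() -> {
        try {
            HttpURLConnection response = makeImageRequest(url, imageUrlString);
            if (response.getResponseCode() == 200) {
                writeToFile(index, response);
            }
        } catch (IOException e) {
            System.out.println("Unable to download file: " + imageUrlString);
        }
    });
}
pool.shutdown();
pool.awaitTermination(10, TimeUnit.MINUTES); // wait for all downloads to finish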
I'm almost certain you could cut back even more on some of the code below to get a cleaner result... but it works, and I'm assuming that's more than enough of a result.
Interesting question.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;

public class JSoupExample {

    private static int TIMEOUT = 30000;
    private static final int BUFFER_SIZE = 4096;

    public static void main(String... args) throws InterruptedException, IOException {
        String url = "https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686";
        Document doc = Jsoup.connect(url).get();
        // Select only urls where the source starts with the relevant url (not all images)
        Elements media = doc.select("img[src^=\"https://s5.mkklcdnv5.com/mangakakalot/r1/read_bleach_manga_online_for_free2/chapter_686_death_and_strawberry/\"]");
        Iterator<Element> ie = media.iterator();
        int i = 1;
        while (ie.hasNext()) {
            String imageUrlString = ie.next().attr("src");
            System.out.println(imageUrlString + " ");
            try {
                HttpURLConnection response = makeImageRequest(url, imageUrlString);
                if (response.getResponseCode() == 200) {
                    writeToFile(i, response);
                }
            } catch (IOException e) {
                // skip file and move to next if unavailable
                e.printStackTrace();
                System.out.println("Unable to download file: " + imageUrlString);
            }
            i++; // increment image ID whatever the result of the request
            Thread.sleep(200L); // prevent yourself from being blocked due to rate limiting
        }
    }

    private static void writeToFile(int i, HttpURLConnection response) throws IOException {
        // opens input stream from the HTTP connection
        InputStream inputStream = response.getInputStream();
        // opens an output stream to save into file
        FileOutputStream outputStream = new FileOutputStream("image_" + i + ".jpg");
        int bytesRead = -1;
        byte[] buffer = new byte[BUFFER_SIZE];
        while ((bytesRead = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, bytesRead);
        }
        outputStream.close();
        inputStream.close();
        System.out.println("File downloaded");
    }

    private static HttpURLConnection makeImageRequest(String referer, String imageUrlString) throws IOException {
        URL imageUrl = new URL(imageUrlString);
        HttpURLConnection response = (HttpURLConnection) imageUrl.openConnection();
        response.setRequestMethod("GET");
        response.setRequestProperty("referer", referer);
        response.setConnectTimeout(TIMEOUT);
        response.setReadTimeout(TIMEOUT);
        response.connect();
        return response;
    }
}
I'd also want to ensure the right file extension is set based on the content type, as I believe some images were coming back in .png format rather than .jpeg. I'm also fairly sure the write-to-file logic can be cleaned up to be simpler/clearer, rather than reading in a byte stream manually.
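A sketch of both of those clean-ups together, assuming the server returns an accurate Content-Type header (the two-way extension mapping below is my own simplification covering only the two types mentioned):
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class WriteImage {
    // Picks the extension from the Content-Type header and streams the body
    // straight to disk with Files.copy, replacing the manual buffer loop.
    static void writeToFile(int i, HttpURLConnection response) throws IOException {
        String contentType = response.getContentType(); // e.g. "image/png" or "image/jpeg"
        String extension = "image/png".equals(contentType) ? ".png" : ".jpg";
        Path target = Paths.get("image_" + i + extension);
        try (InputStream in = response.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println("File downloaded: " + target);
    }
}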