HTTP CURL works - Java Jsoup doesn't
Question
I am trying to scrape some chat messages from a site (https://bs.to), but I have to log in first via HTTP POST. With curl, my request works fine:
curl -v -X POST ^
-H "Cookie: __bsduid=226mq3kt8oafl5f1le1hv3ognl; " ^
-d "login[user]=RainbowSimon&login[pass]=MY_PASSWORD&security_token=687f7de7247f9a95f7fccc6a" "https://bs.to" ^
--output "out.txt"
But when I tried to port it to Java with Jsoup, I got status code 200 and an HTML document, but I was not logged in:
Connection.Response loggedIn;
loggedIn = Jsoup.connect("http://bs.to")
.cookie("__bsduid", cookieUID)
.data("login[user]", loginUserName)
.data("login[pass]", loginUserPassword)
.data("security_token", securityTokenForm)
.method(Method.POST)
.execute();
System.out.println(loggedIn.statusCode());
System.out.println(loggedIn.parse());
I even retrieved the security_token and the cookie from the Java application and put them into the curl command, and that worked too.
Can anyone see the mistake I made when porting this to Java?
Answer 1
Score: 0
You get different responses because you send different requests. The main difference here is the headers.
Web browsers and curl automatically set some basic request headers for you, and Jsoup's defaults differ from curl's, so to send the same request you have to add them to the connection explicitly. You're using curl with -v, so the headers it sends are already visible:
> POST / HTTP/2
> Host: bs.to
> User-Agent: curl/7.60.0
> Accept: */*
> Cookie: __bsduid=226mq3kt8oafl5f1le1hv3ognl;
> Content-Length: 88
> Content-Type: application/x-www-form-urlencoded
Jsoup will not send these same values for User-Agent, Accept and Content-Type unless you set them yourself (what it sends by default depends on the Jsoup version). Some servers rely on such headers to tell real web browsers from crawlers. Try setting them to exactly the values above using .header(name, value) to simulate the same request.
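One way to avoid copying the headers by hand is to collect them from the curl -v output into a map. The sketch below is only an illustration: the class name CurlHeaders and the choice to skip Host, Content-Length and Cookie (the HTTP client computes the first two, and the cookie is already set through .cookie(...) in the question) are assumptions, not something from the original post. It uses only the JDK; the Jsoup call that would consume the map appears in a comment.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class CurlHeaders {
    // Assumption: these are left out because the client derives them itself
    // (Host, Content-Length) or they are set via a dedicated method (Cookie).
    private static final Set<String> SKIPPED = Set.of("host", "content-length", "cookie");

    /** Collects the "> Name: value" request-header lines of curl -v output into a map. */
    public static Map<String, String> parse(String verboseOutput) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : verboseOutput.split("\\R")) {
            if (!line.startsWith("> ")) {
                continue; // not a request line
            }
            String content = line.substring(2).trim();
            int colon = content.indexOf(':');
            if (colon <= 0) {
                continue; // the request line ("POST / HTTP/2") has no colon
            }
            String name = content.substring(0, colon).trim();
            String value = content.substring(colon + 1).trim();
            if (!SKIPPED.contains(name.toLowerCase(Locale.ROOT))) {
                headers.put(name, value);
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String curlOutput = String.join("\n",
                "> POST / HTTP/2",
                "> Host: bs.to",
                "> User-Agent: curl/7.60.0",
                "> Accept: */*",
                "> Content-Length: 88",
                "> Content-Type: application/x-www-form-urlencoded");
        Map<String, String> headers = parse(curlOutput);
        headers.forEach((name, value) -> System.out.println(name + ": " + value));

        // With jsoup on the classpath, the map could then be applied in one call:
        //   Jsoup.connect("https://bs.to").headers(headers) ... .method(Method.POST).execute();
    }
}
```

Note that the curl command in the question targets https://bs.to while the Jsoup snippet connects to http://bs.to; whether that matters for this site is untested, but using the same URL removes one more difference between the two requests.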
The other difference between curl and Jsoup is that curl seems to be using HTTP/2 while Jsoup uses HTTP/1.1, but that shouldn't matter. To make sure, try running curl with the --http1.1 switch.
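The protocol question can also be approached from the Java side. The following is a minimal sketch, assuming Java 11 or later, that swaps Jsoup for the JDK's own java.net.http client, because that client lets you state a preferred HTTP version. The class name Http2LoginRequest and the placeholder values are hypothetical; the header values are the ones from the curl trace above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;

public class Http2LoginRequest {

    /** Builds (but does not send) a POST that mirrors the curl request and prefers HTTP/2. */
    public static HttpRequest build(String cookieUid, String formBody) {
        return HttpRequest.newBuilder(URI.create("https://bs.to"))
                .version(HttpClient.Version.HTTP_2) // a preference, not a guarantee
                .header("User-Agent", "curl/7.60.0")
                .header("Accept", "*/*")
                .header("Cookie", "__bsduid=" + cookieUid)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values; substitute the real cookie, credentials and token.
        HttpRequest request = build("COOKIE_VALUE",
                "login[user]=USER&login[pass]=PASS&security_token=TOKEN");
        System.out.println(request.method() + " " + request.uri());

        // To actually send it (performs network I/O):
        //   HttpResponse<String> response = HttpClient.newHttpClient()
        //           .send(request, HttpResponse.BodyHandlers.ofString());
        //   response.version() then reports which protocol was really used.
    }
}
```

If the login succeeds with this request but not with Jsoup when the headers are otherwise identical, the protocol version becomes a more plausible suspect; if both fail, the difference lies elsewhere.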
I can't test any of the above because your cookies don't work for me, so you will have to experiment yourself.