How to parse HTML after submitting a search form that returns data from a database
Question
The search results come from a database on the backend, so how do I parse the HTML that contains them? In my code, I first connect directly to the target page with my cookies, then fill in and submit the search form. How do I parse the result of that search? Below are, in order: my service connection code, the response I get back after submitting the search form, and the HTML that the browser's network tab shows coming back from the same POST request.
Connection.Response redirectResponse = Jsoup.connect(internalConstant.getSearchResultURL())
.userAgent(USER_AGENT)
.sslSocketFactory(utilService.socketFactory())
.cookies(coky)
.method(Method.GET)
.execute();
//search file
String yestarday = utilService.getYesterdayDateString();
logger.info("yestarday date: " + yestarday);
Document responseDocument = redirectResponse.parse();
FormElement searchForm = (FormElement) responseDocument.select("form[id=searchDisputesForm]").first();
checkElement("form element", searchForm);
FormElement form = (FormElement) searchForm;
Element searchField = form.select("input[name=DateFrom]").first();
checkElement("date from: ", searchField);
searchField.val(yestarday);
Element searchField1 = form.select("input[name=DateTo]").first();
checkElement("Date to: ", searchField1);
searchField1.val(yestarday);
Connection.Response searchResponse = form.submit()
.cookies(coky)
.sslSocketFactory(utilService.socketFactory())
.userAgent(USER_AGENT)
.method(Method.POST)
.ignoreHttpErrors(true)
.execute();
// and here is what it gives me:
2020-10-15 11:37:38.193 INFO 3360 --- [http-nio-8095-exec-1] c.d.e.serviceImpl.UploaderTemplateImpl : searchResponse :<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>404 - File or directory not found.</title>
<style type="text/css">
<!--
body{margin:0;font-size:.7em;font-family:Verdana, Arial, Helvetica, sans-serif;background:#EEEEEE;}
fieldset{padding:0 15px 10px 15px;}
h1{font-size:2.4em;margin:0;color:#FFF;}
h2{font-size:1.7em;margin:0;color:#CC0000;}
h3{font-size:1.2em;margin:10px 0 0 0;color:#000000;}
#header{width:96%;margin:0 0 0 0;padding:6px 2% 6px 2%;font-family:"trebuchet MS", Verdana, sans-serif;color:#FFF;
background-color:#555555;}
#content{margin:0 0 0 2%;position:relative;}
.content-container{background:#FFF;width:96%;margin-top:8px;padding:10px;position:relative;}
-->
</style>
</head>
<body>
<div id="header">
<h1>Server Error</h1>
</div>
<div id="content">
<div class="content-container">
<fieldset>
<h2>404 - File or directory not found.</h2>
<h3>The resource you are looking for might have been removed, had its name changed, or is temporarily unavailable.</h3>
</fieldset>
</div>
</div>
</body>
</html>
// but the response in the browser's network tab shows what I want:
<form action="/123Mobile/Portal/Dispute/SearchDisputes" class="form-horizontal" enctype="multipart/form-data" id="searchDisputesForm" method="post"><input name="__RequestVerificationToken" type="hidden" value="ClhG-V26p7Aj3vi6W8tarCwHk_enagGT4mDVv2AZsU4MHATOBTbEQHuFmylooB5qvxF29aF1-wSvZZ26ijlcZJn5p_OSshL3KeqcEbAaYg01" />
<div class='col-md-12'>
<div class='panel panel-grey equal-height-column' id=''>
<div class='panel-heading'>
<h3 class='panel-title'>
<i class='fa fa-tasks'></i>
Search criteria
</h3>
</div>
<div class='panel-body'>
<div class='form-group'> <label class="col-md-2 control-label" for="DateFrom">Date From</label>
<div class='col-md-4 col-md-offset-0' id=''>
<input class="form-control datepicker" data-val="true" data-val-date="The field Date From must be a date." id="DateFrom" name="DateFrom" placeholder="" type="text" value="2020-10-14" />
<span class="field-validation-valid text-danger" data-valmsg-for="DateFrom" data-valmsg-replace="true"></span>
</div><label class="col-md-2 control-label" for="DateTo">Date To</label>
<div class='col-md-4 col-md-offset-0' id=''>
<input class="form-control datepicker" data-val="true" data-val-date="The field Date To must be a date." id="DateTo" name="DateTo" placeholder="" type="text" value="2020-10-14" />
<span class="field-validation-valid text-danger" data-valmsg-for="DateTo" data-valmsg-replace="true"></span>
</div></div>
<div class='col-md-12'>
<div class=' col-md-6'>
</div>
<button type='button' class='btn btn-success col-md-2 col-md-offset-1 ' id='ResetBtn' name='ResetBtn' onclick=''>
<span class='glyphicon glyphicon-refresh' aria-hidden='true'></span> Reset
</button>
<button type='submit' class='btn btn-primary col-md-2 col-md-offset-1 ' id='SearchBtn' name='SearchBtn' formaction='/123Mobile/Portal/Reports/SearchTransactionReport' onclick=''>
<span class='glyphicon glyphicon-search' aria-hidden='true'></span> Search
</button></div>
</div>
</div>
</div>
<div class='col-md-12'>
<div class='panel panel-grey equal-height-column' id=''>
<div class='panel-heading'>
<h3 class='panel-title'>
<i class='fa fa-tasks'></i>
Search results
</h3>
</div>
<div class='panel-body'>
<div style='' class='table-responsive'><table class='table table-hover table-striped'><thead><tr><th style='width:15%;'>File Name</th><th style='width:10%;'>Scheme</th><th style='width:15%;'>Proccessing Date</th><th style='width:8%;'></th><th style='width:8%;'></th></tr> </thead><tbody><tr><td style='-moz-word-break: break-all; -o-word-break: break-all; word-break: break-word;-ms-word-break: break-all;overflow-wrap: break-word;word-wrap: break-word; width: 15%;' class=''>Transaction-FIBWallet-20201014</td><td style='-moz-word-break: break-all; -o-word-break: break-all; word-break: break-word;-ms-word-break: break-all;overflow-wrap: break-word;word-wrap: break-word; width: 10%;' class=''>FIBWallet</td><td style='-moz-word-break: break-all; -o-word-break: break-all; word-break: break-word;-ms-word-break: break-all;overflow-wrap: break-word;word-wrap: break-word; width: 15%;' class=''>2020-10-14</td><td style='width: 8%;' class=''><a href="/123Mobile/Portal/Reports/DownloadTransactionReport/40072">Download as Text </a></td><td style='width: 8%;' class=''><a href="/123Mobile/Portal/Reports/DownloadTransactionReport/40073">Download as Excel </a></td></tr></tbody></table></div>
</div>
</div>
</div>
<div class='col-md-12'> <div class="pagination-container"><ul class="pagination"><li class="disabled PagedList-skipToFirst"><a>««</a></li><li class="disabled PagedList-skipToPrevious"><a rel="prev">«</a></li><li class="active"><a>1</a></li><li class="disabled PagedList-skipToNext"><a rel="next">»</a></li><li class="disabled PagedList-skipToLast"><a>»»</a></li></ul></div></div>
</form>
Answer 1
Score: 1
Unless you need the initial GET request to obtain a session cookie or other information from the response headers, I'd skip it and go straight to the POST form request.
You say the response you're getting looks something like
<h2>404 - File or directory not found.</h2>
which suggests either that the POST request is being sent without the correct headers, or that the page is a single-page app that makes subsequent (asynchronous) requests which then modify the page source.
I can't help much further with the information provided, other than to suggest using the Network tab in Chrome DevTools to see which requests are sent to the server when the form is posted. Look at the request headers.
The headers of interest are typically some sort of session cookie and possibly the Referer header. If those two alone don't get you what you want, you'll need to start replicating the request exactly. The form may also contain CSRF tokens that you need to obtain, in which case the initial GET request is essential for getting a valid token to send back in the POST.
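One detail worth checking when replicating the request exactly: in the HTML posted in the question, the Search button carries a formaction attribute (/123Mobile/Portal/Reports/SearchTransactionReport) that differs from the form's own action (/123Mobile/Portal/Dispute/SearchDisputes). A browser posts to the button's formaction when that button is clicked, whereas jsoup's FormElement.submit() builds its request from the form's action, as far as I know. Whether this mismatch is what causes the 404 here is an assumption, not something the question confirms. The sketch below uses only the standard library to show how the browser's target URL would be worked out. The class name SubmitTarget, the host example.com, and the page path are hypothetical.

```java
import java.net.URI;

public class SubmitTarget {

    /**
     * Returns the URL a browser would post to: the clicked button's
     * formaction if it has one, otherwise the form's action, resolved
     * against the URL of the page that contains the form.
     */
    static String resolve(String pageUrl, String formAction, String buttonFormAction) {
        boolean hasOverride = buttonFormAction != null && !buttonFormAction.isEmpty();
        String target = hasOverride ? buttonFormAction : formAction;
        return URI.create(pageUrl).resolve(target).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL; the real one comes from the asker's configuration.
        String page = "https://example.com/123Mobile/Portal/Reports/TransactionReport";
        // Attribute values copied from the HTML in the question.
        String formAction = "/123Mobile/Portal/Dispute/SearchDisputes";
        String buttonFormAction = "/123Mobile/Portal/Reports/SearchTransactionReport";

        System.out.println(resolve(page, formAction, buttonFormAction));
        // prints https://example.com/123Mobile/Portal/Reports/SearchTransactionReport
    }
}
```

With jsoup, the attribute could be read from the parsed form with something like form.select("button[type=submit]").first().attr("formaction"), and the resolved URL passed to the prepared request via form.submit().url(target) before calling execute(). The cookies, the Referer header, and the hidden __RequestVerificationToken field from the initial GET would still need to go along with the POST.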