How to parse table data from the CNBC Markets page?

Question

I am writing a program that takes user input to connect to a site, downloads its HTML into a text file, and retrieves data from a table twice a day. I understand the code will not be one-size-fits-all for any page (I will likely hardwire the URL into the code once I get it working). My issue at present is that my jsoup parser isn't reading in the tabular data properly. I'm not sure whether my element selectors are too generic. The table looks like it is in standard table/tr/td format, but my rows collection comes back with size 0. If someone could help me debug my parser, and possibly suggest where to look for making it grab data silently twice a day, I'd really appreciate it! There are no runtime or compile errors; I just need the output to be correct.
Source site: https://www.cnbc.com/us-markets/

Source code for the table (snippet):
<table class="BasicTable-table">
  <thead class="BasicTable-tableHeading BasicTable-tableHeadingSortable">
    <tr>
      <th class="BasicTable-textData"><span>SYMBOL <span class="icon-sort undefined"></span></span></th>
      <th class="BasicTable-numData"><span>PRICE <span class="icon-sort undefined"></span></span></th>
      <th class="BasicTable-numData">

My code:
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class StockScraper {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.println("Enter the complete url (including http://) of the site you would like to parse:");
        String html = input.nextLine();
        try {
            Document doc = Jsoup.connect(html).get();
            System.out.printf("Title: %s", doc.title());
            // Try to print site content
            System.out.println("");
            System.out.println("Writing html contents to 'html.txt'...");
            // Save html contents to text file
            PrintWriter outputfile = new PrintWriter("html.txt");
            outputfile.print(doc.outerHtml());
            outputfile.close();
            // Select stock data you want to retrieve
            System.out.println("Enter the name of the stock you want to check");
            String name = input.nextLine();
            // Pull data from CNBC Markets
            Element table = doc.select("table").get(0);
            Elements rows = table.select("tr");
            System.out.println(rows.size());
            for (int i = 1; i < rows.size(); i++) {
                Element rowx = rows.get(i);
                // Select cells from the current row (rowx), not from the whole rows collection
                Elements col = rowx.select("td");
                // Compare the cell's text; an Element never equals a String
                if (col.get(0).text().equals(name)) {
                    System.out.println("I worked!");
                    System.out.println(col.get(1).text());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


Answer 1
Score: 1
The problem here is that this site is a dynamic page: it loads its content after the browser has downloaded the initial HTML. Jsoup only fetches and parses that initial HTML and does not run JavaScript, so it is not adequate for scraping pages like this. You have a couple of options:

1) Use a tool that simulates a browser and makes all the necessary API calls. Two options are Selenium WebDriver and HtmlUnit.

2) Figure out which API calls on this site you are interested in, and call those APIs directly to get a JSON document you can parse. You can see the API details by opening the developer tools in your browser and looking at the Network tab. For this site, an example would be the following, which includes the stock quote for DJI:


https://quote.cnbc.com/quote-html-webservice/quote.htm?noform=1&partnerId=2&fund=1&exthrs=0&output=json&symbolType=issue&symbols=599362|579435|593933|49020635|49031016|5093160|617254|601065&requestMethod=extended
Returns:
ExtendedQuoteResult: {
  xmlns: "http://quote.cnbc.com/services/MultiQuote/2006",
  ExtendedQuote: [{
    QuickQuote: {
      symbol: ".DJI",
      code: "0",
      curmktstatus: "REG_MKT",
      FundamentalData: {
        yrlodate: "2020-03-23",
        yrloprice: "18213.65",
        yrhidate: "2020-02-12",
        yrhiprice: "29568.57"
      },
      mappedSymbol: {
        xsi:nil: "true"
      },
      source: "Exchange",
      cnbcId: "599362",
      prev_prev_closing: "21413.44",
      high: "22783.45",
      low: "21693.63",
      provider: "CNBC Quote Cache",
      streamable: "0",
      last_time: "2020-04-06T17:16:28.000-0400",
      countryCode: "US",
      previous_day_closing: "21052.53",
      altName: "Dow Industrials",
      reg_last_time: "2020-04-06T17:16:28.000-0400",
      last_time_msec: "1586207788000",
      altSymbol: ".DJI",
      change_pct: "7.73",
      providerSymbol: ".DJI",
      assetSubType: "Index",
      comments: "RIC",
      last: "22679.99",
      issue_id: "599362",
      cacheServed: "false",
      responseTime: "Mon Apr 06 19:12:09 EDT 2020",
      change: "1627.46",
      timeZone: "EDT",
      onAirName: "Dow Industrials",
      symbolType: "issue",
      assetType: "INDEX",
      volume: "614200990",
      fullVolume: "614200990",
      realTime: "true",
      name: "Dow Jones Industrial Average",
      quoteDesc: { },
      exchange: "Dow Jones Global Indexes",
      shortName: "DJIA",
      cachedTime: "Mon Apr 06 19:12:09 EDT 2020",
      currencyCode: "USD",
      open: "21693.63"
    }
  }
  ...
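
As a rough sketch of option 2, the endpoint above could be called from plain Java 11+ with java.net.http.HttpClient. Several things here are assumptions rather than anything CNBC documents: the class name QuoteFetcher and its helper methods are made up for illustration, the endpoint is undocumented and may change, the request asks for a single issue id instead of the pipe-separated list, and the real response is assumed to be ordinary JSON with quoted keys and string values. The regex lookup is only a stand-in so the example has no dependencies; a real JSON library such as Jackson or Gson would be the more robust choice.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteFetcher {
    // Endpoint taken from the answer above; it is undocumented and may change.
    static final String BASE = "https://quote.cnbc.com/quote-html-webservice/quote.htm";

    /** Builds the request URL for a single CNBC issue id, e.g. "599362" for .DJI. */
    static String buildUrl(String issueId) {
        return BASE + "?noform=1&partnerId=2&fund=1&exthrs=0&output=json"
                + "&symbolType=issue&symbols=" + issueId + "&requestMethod=extended";
    }

    /** Downloads the raw response body as text. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /**
     * Returns the first string value stored under the given key, if any.
     * Regex stand-in for a JSON parser: assumes quoted keys and
     * string values that contain no escaped quotes.
     */
    static Optional<String> extractField(String json, String field) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // With an issue id argument, query the live endpoint; otherwise use a canned sample.
        String json = args.length > 0
                ? fetch(buildUrl(args[0]))
                : "{\"symbol\":\".DJI\",\"last_time\":\"2020-04-06T17:16:28.000-0400\",\"last\":\"22679.99\"}";
        System.out.println(extractField(json, "symbol").orElse("?")
                + " last = " + extractField(json, "last").orElse("n/a"));
    }
}
```

Run without arguments, this prints ".DJI last = 22679.99" from the canned sample; run with an issue id such as 599362, it queries the live endpoint instead. For the twice-a-day requirement in the question, one possibility is to wrap the fetch in a ScheduledExecutorService.scheduleAtFixedRate call with a 12-hour period, or to leave the program as a one-shot and let the operating system's scheduler start it.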






