How to parse tabular data from a CNBC Markets page?

Question

I am writing a program that takes user input to connect to a site, downloads its HTML into a text file, and retrieves data from a table twice a day. I understand the code will not be one-size-fits-all for any page (I will likely hard-code the URL once I get it working). My issue at present is that my jsoup parser isn't reading in the tabular data properly. I'm not sure whether my element selectors are too generic. The table looks like it is in standard table/tr/td format, but my rows array comes back with size 0. If someone could help me debug my parser, and possibly suggest where to look for making it grab the data silently twice a day, I'd really appreciate it! There are no runtime or compile errors; I just need to get the correct output.

Source site: https://www.cnbc.com/us-markets/
Source code for the table (snippet):

<table class="BasicTable-table">
  <thead class="BasicTable-tableHeading BasicTable-tableHeadingSortable">
    <tr>
      <th class="BasicTable-textData"><span>SYMBOL <span class="icon-sort undefined"></span></span></th>
      <th class="BasicTable-numData"><span>PRICE <span class="icon-sort undefined"></span></span></th>
      <th class="BasicTable-numData">

My code:

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class StockScraper {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.println("Enter the complete url (including http://) of the site you would like to parse:");
        String html = input.nextLine();
        try {
            Document doc = Jsoup.connect(html).get();
            System.out.printf("Title: %s", doc.title());
            // Try to print site content
            System.out.println("");
            System.out.println("Writing html contents to 'html.txt'...");
            // Save html contents to text file
            PrintWriter outputfile = new PrintWriter("html.txt");
            outputfile.print(doc.outerHtml());
            outputfile.close();
            // Select stock data you want to retrieve
            System.out.println("Enter the name of the stock you want to check");
            String name = input.nextLine();
            // Pull data from CNBC Markets
            Element table = doc.select("table").get(0);
            Elements rows = table.select("tr");
            System.out.println(rows.size());
            // Start at 1 to skip the header row
            for (int i = 1; i < rows.size(); i++) {
                Element rowx = rows.get(i);
                // Select the cells of the current row (rowx), not of all rows
                Elements col = rowx.select("td");
                // Compare the cell's text, not the Element itself, with the name
                if (col.get(0).text().equals(name)) {
                    System.out.println("I worked!");
                    System.out.println(col.get(1).text());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Answer 1

Score: 1

The problem is that this page is dynamic: most of its content is loaded by JavaScript after the browser downloads the initial HTML. Jsoup only fetches and parses that initial HTML and does not execute JavaScript, so the table is not in the document it sees, which is why your rows come back empty. You have a couple of options:

  1. Use a tool that simulates a browser and makes all the necessary API calls. Two options are Selenium WebDriver and HtmlUnit.

  2. Figure out which API calls on this site you are interested in, and call those APIs directly to get a JSON document you can parse. You can see the API details by opening the developer tools in your browser and looking at the Network tab. For this site, one example is the following request, which includes the quote for DJI:

https://quote.cnbc.com/quote-html-webservice/quote.htm?noform=1&partnerId=2&fund=1&exthrs=0&output=json&symbolType=issue&symbols=599362|579435|593933|49020635|49031016|5093160|617254|601065&requestMethod=extended
Returns:
ExtendedQuoteResult: {
  xmlns: "http://quote.cnbc.com/services/MultiQuote/2006",
  ExtendedQuote: [{
    QuickQuote: {
      symbol: ".DJI",
      code: "0",
      curmktstatus: "REG_MKT",
      FundamentalData: {
        yrlodate: "2020-03-23",
        yrloprice: "18213.65",
        yrhidate: "2020-02-12",
        yrhiprice: "29568.57"
      },
      mappedSymbol: {
        xsi:nil: "true"
      },
      source: "Exchange",
      cnbcId: "599362",
      prev_prev_closing: "21413.44",
      high: "22783.45",
      low: "21693.63",
      provider: "CNBC Quote Cache",
      streamable: "0",
      last_time: "2020-04-06T17:16:28.000-0400",
      countryCode: "US",
      previous_day_closing: "21052.53",
      altName: "Dow Industrials",
      reg_last_time: "2020-04-06T17:16:28.000-0400",
      last_time_msec: "1586207788000",
      altSymbol: ".DJI",
      change_pct: "7.73",
      providerSymbol: ".DJI",
      assetSubType: "Index",
      comments: "RIC",
      last: "22679.99",
      issue_id: "599362",
      cacheServed: "false",
      responseTime: "Mon Apr 06 19:12:09 EDT 2020",
      change: "1627.46",
      timeZone: "EDT",
      onAirName: "Dow Industrials",
      symbolType: "issue",
      assetType: "INDEX",
      volume: "614200990",
      fullVolume: "614200990",
      realTime: "true",
      name: "Dow Jones Industrial Average",
      quoteDesc: { },
      exchange: "Dow Jones Global Indexes",
      shortName: "DJIA",
      cachedTime: "Mon Apr 06 19:12:09 EDT 2020",
      currencyCode: "USD",
      open: "21693.63"
    }
  }
  ...
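
As a rough illustration of option 2, here is a minimal sketch that needs only the JDK (Java 11+ for java.net.http). Several things in it are assumptions rather than anything the site documents: the class name CnbcQuoteSketch and the helper extractField are made up for this example; the URL is the one above trimmed to the single id 599362 (the cnbcId shown for .DJI), and the endpoint may not accept that or may have changed since; and the raw response is assumed to be ordinary JSON with quoted keys and flat string values, unlike the browser rendering shown above. The regex lookup is a stand-in for a real JSON library such as Jackson or Gson.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CnbcQuoteSketch {

    // URL from the answer, reduced to one symbol id (assumption: 599362 alone is accepted).
    static final String QUOTE_URL =
            "https://quote.cnbc.com/quote-html-webservice/quote.htm"
            + "?noform=1&partnerId=2&fund=1&exthrs=0&output=json"
            + "&symbolType=issue&symbols=599362&requestMethod=extended";

    /** Returns the first quoted string value stored under the given key, if any. */
    static Optional<String> extractField(String json, String key) {
        // Assumption: the value is a plain quoted string with no escaped quotes inside.
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Downloads the raw response body from the given URL. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Offline sample in the assumed wire format; pass "live" to call the endpoint instead.
        String sample = "{\"symbol\":\".DJI\",\"last\":\"22679.99\",\"change_pct\":\"7.73\"}";
        String body = (args.length > 0 && args[0].equals("live")) ? fetch(QUOTE_URL) : sample;
        System.out.println(extractField(body, "symbol").orElse("?")
                + " last=" + extractField(body, "last").orElse("?"));
    }
}
```

Run without arguments, the sketch parses the built-in sample and prints ".DJI last=22679.99". To request several symbols at once, as the original URL does, the "|" separators would need to be percent-encoded as %7C, because URI.create rejects a raw pipe character.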

huangapple
  • Posted on April 7, 2020 at 06:36:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/61070037.html