Android parse nested tables with Jsoup

Question

I am trying to parse an HTML page online to retrieve data from a table with Jsoup. The page I want to parse contains more than one table.

How can I do that?

Here is a sample page that I want to parse:

https://www.cpu-world.com/info/AMD/AMD_A4-Series.html

The data I want to extract is the Model Name and the URL of the details page.

Edit:

This is some of the code I'm using to extract data from the details page.

try {
    /**
     * Works to iterate through the items at the following website
     * https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html
     */
    URL url = new URL("https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html");

    Document doc = Jsoup.parse(url, 3000);

    // spec_table is the name of the class associated with the table
    Elements table = doc.select("table.spec_table");
    Elements rows = table.select("tr");

    Iterator<Element> rowIterator = rows.iterator();
    rowIterator.next();
    boolean wasMatch = false;

    // Loop through all items in list
    while (rowIterator.hasNext()) {
        Element row = rowIterator.next();
        Elements cols = row.select("td");
        String rowName = cols.get(0).text();
    }
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

I've been reading some tutorials as well as the documentation and I can't seem to figure out how to navigate the web pages to extract the data I'm looking for. I understand the HTML and CSS, but am just learning about Jsoup.

(I tagged this as Android because that's where I'm using the Java code. Guess it's not necessary to be that specific.)

Answer 1

Score: 0

This looks like what you're after:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.net.URL;

public class CpuWorld {

    public static void main(String[] args) throws IOException {
        // Details page for a single CPU model
        URL url = new URL("https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html");

        // Fetch and parse the page with a 3000 ms timeout
        Document doc = Jsoup.parse(url, 3000);

        // Find the row whose cell contains "Model number", then the link inside its bold text
        String modelNumber = doc.select("table tr:has(td:contains(Model number)) td b a").text();
        String modelUrl = doc.select("table tr:has(td:contains(Model number)) td b a").attr("href");

        System.out.println(modelNumber + " : " + modelUrl);
    }
}

Let me know if this is not what you're after.

EDIT: Results:

A4-3300 : https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html

Process finished with exit code 0

EDIT:

This is crazier than a box of frogs but here we go... I'll leave you to put 2 and 2 together to iterate through the URLs to get the individual details you're after:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

public class CpuWorld {

    public static final String CPU_WORLD_COM_URL = "https://www.cpu-world.com/info/AMD/AMD_A4-Series.html";

    public static final String SCRAMBLED_DATA_HEADER = "<!--\r\nfunction JSC_Process () {var bk,xg,qh,k,aj,y,e,cq,u,a,ei;\r\na=\"";
    public static final String SCRAMBLED_DATA_FOOTER = "//- qh=[\"\"];k=[2];cq=[7600];if (CW_AB){if\t((AB_v!='0')&&(AB_v!='X')&&(AB_Gl((AB_v=='')?99:3)==3)){y=1;AB_cb=function(){JSC_Process();};}else{y=2;}}for(aj=e=0;aj<k.length;aj++){ei=cq[aj];bk=qh[aj];if (!bk) bk=\"JSc_\"+aj;u=CW_E(bk);if (u){bk=\" jsc_a\";if (y>=k[aj]){xg=a.substr(e,ei);xg=xg.replace(/(.)(.)/g,\"$2$1\");u.innerHTML=xg.replace(/\\\\n/g,\"\\n\");bk='';}u.className=u.className.replace(/(^| )jsc_\\w+$/,bk);}e+=ei;}}JSC_Process();";

    public static void main(String[] args) throws IOException {
        Document tableData = getTableData(CPU_WORLD_COM_URL);

        // Note: :contains(a) matches cells whose text contains the letter "a"
        // (case-insensitive); it does not test for a child <a> element.
        List<String> fullUrls = tableData.select("table tr td:contains(a) a").stream()
                .map(e -> "https://www.cpu-world.com/" + e.attr("href"))
                .collect(Collectors.toList());

        List<String> fullModels = tableData.select("table tr td:contains(a) a").stream()
                .map(e -> e.text())
                .collect(Collectors.toList());

        for (int i = 0; i < fullUrls.size(); i++) {
            System.out.println(fullModels.get(i) + " : " + fullUrls.get(i));
        }
    }

    private static Document getTableData(String url) {
        Connection.Response response = null;
        try {
            response = Jsoup
                    .connect(url)
                    .headers(getHeaders())
                    .method(Connection.Method.GET)
                    .data()
                    .execute();

        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Elements script = Jsoup.parse(response.body()).select("script");

        // take substring of the child node from after the header and before the footer (- 6 more chars which seem dynamic)
        // The script tag containing JSC_Process is the one with the data in (but all mangled).
        Optional<String> scrambledData = script.stream()
                .filter(element -> element.data().contains("JSC_Process"))
                .map(node -> node.data().substring(SCRAMBLED_DATA_HEADER.length(), (node.data().length() - SCRAMBLED_DATA_FOOTER.length() - 6)))
                .findFirst();

        String tableData = Unscrambler.unscramble(scrambledData.orElseThrow(() -> new RuntimeException("scrambled data not found in relevant script tag")));

        Document doc = Jsoup.parse(tableData);
        return doc;
    }

    private static boolean isNotEmptyString(Element node) {
        return node.data() != null && !node.data().equals("");
    }

    /**
     * Trick the server into thinking we're not a bot
     * by telling it we were referred by the server itself
     * and that we're using a Mozilla/Safari browser.
     */
    private static Map<String, String> getHeaders() {
        Map<String, String> headersMap = new HashMap<>();
        headersMap.put("User-Agent", "Mozilla/5.0 Safari/537.36");
        headersMap.put("Referer", CPU_WORLD_COM_URL);
        return headersMap;
    }
}
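
A side note on the URL building in main above: it prefixes "https://www.cpu-world.com/" to each href. If an href already starts with "/" or is already absolute, plain concatenation would give a malformed URL. Here is a minimal sketch of a more forgiving approach using only java.net.URI. The class name LinkResolver and the sample hrefs are made up for illustration, and I have not checked which form the site's hrefs actually take.

```java
import java.net.URI;

class LinkResolver {

    // Base page the links were found on (taken from CPU_WORLD_COM_URL above).
    static final URI BASE = URI.create("https://www.cpu-world.com/info/AMD/AMD_A4-Series.html");

    // Resolve an href against the base page. Root-relative, relative and
    // already-absolute hrefs are all handled by URI.resolve.
    // Assumption: the href is already a legal URI (no raw spaces etc.);
    // otherwise URI.resolve throws IllegalArgumentException.
    static String resolve(String href) {
        return BASE.resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical href values, one per form
        System.out.println(resolve("/CPUs/K10/AMD-A4-Series%20A4-3300.html"));
        System.out.println(resolve("../../CPUs/K10/AMD-A4-Series%20A4-3300.html"));
        System.out.println(resolve("https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html"));
    }
}
```

All three calls in main should print the same absolute URL. Jsoup also has Element.absUrl("href"), but that only works when the document was parsed with a base URI, e.g. Jsoup.parse(html, baseUri), which getTableData does not currently do.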

class Unscrambler {

    public static String unscramble(String data) {
        String a = data.replace("\\\"", "'")
                .replace("\\\\", "\\")
                .replace("\\r", "")
                .replace("\\n", "")
                .replace("\"+\r\n\"", ""); // remove gunk that mucks up processing in java
        StringBuilder buffer = new StringBuilder();
        int e = 0;
        int ei = 2;

        // This is effectively what the code in the footer is doing, heavily un-obfuscated:
        // walk through the string two chars at a time and swap each pair.
        // Note: the loop bound stops before the final pair of characters.
        for (int aj = 0; aj < a.length() - 2; aj += 2) {
            String xg = a.substring(e, ei);
            buffer.append(xg.substring(1, 2) + xg.substring(0, 1));
            e += 2;
            ei += 2;
        }
        return buffer.toString().replace("\n", "");
    }
}
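
To make the swapping step easier to see in isolation: the footer script's xg.replace(/(.)(.)/g,"$2$1") exchanges each adjacent pair of characters, and the loop in Unscrambler.unscramble does the same by hand. Below is a minimal stand-alone sketch of just that step, assuming the header, footer and escape sequences have already been stripped. The class name PairSwapDemo and the sample string are made up for illustration; they are not taken from the site.

```java
class PairSwapDemo {

    // Swap every adjacent pair of characters, mirroring the JavaScript
    // replace(/(.)(.)/g, "$2$1") from the page's footer script.
    // A trailing unpaired character is left as-is because the regex only
    // matches complete pairs. As in the JavaScript, '.' does not match
    // newline characters.
    static String swapPairs(String scrambled) {
        return scrambled.replaceAll("(.)(.)", "$2$1");
    }

    public static void main(String[] args) {
        // Hypothetical scrambled fragment
        String scrambled = "b<A>-43300/<>b";
        System.out.println(swapPairs(scrambled)); // prints <b>A4-3300</b>
    }
}
```

Unlike the loop in Unscrambler.unscramble, which stops before the final pair, this version swaps every complete pair, so the two are not drop-in replacements for each other without also revisiting the "- 6" offset used when trimming the footer.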

huangapple
  • Posted on October 19, 2020, 15:50:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64423175.html