October 19, 2020, 15:50:21

Android parse nested tables with Jsoup

Question

I am trying to parse an HTML page online to retrieve data from a table with Jsoup. The page I want to parse contains more than one table.

How can I do that?

Here is a sample page that I want to parse:

https://www.cpu-world.com/info/AMD/AMD_A4-Series.html

The data I want to extract is the Model Name and the URL of the details page.

Edit:

This is some of the code I'm using to extract data from the details page.

try {
    /**
     * Works to iterate through the items at the following website
     * https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html
     */
    URL url = new URL("https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html");

    Document doc = Jsoup.parse(url, 3000);

    // spec_table is the name of the class associated with the table
    Elements table = doc.select("table.spec_table");
    Elements rows = table.select("tr");

    Iterator<Element> rowIterator = rows.iterator();
    rowIterator.next();
    boolean wasMatch = false;

    // Loop through all items in list
    while (rowIterator.hasNext()) {
        Element row = rowIterator.next();
        Elements cols = row.select("td");
        String rowName = cols.get(0).text();
    }
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

I've been reading some tutorials as well as the documentation and I can't seem to figure out how to navigate the web pages to extract the data I'm looking for. I understand the HTML and CSS, but am just learning about Jsoup.

(I tagged this as Android because that's where I'm using the Java code. Guess it's not necessary to be that specific.)

Answer 1

Score: 0

This looks like what you're after:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.URL;

public class CpuWorld {

    public static void main(String[] args) throws IOException {
        // Details page for a single CPU model
        URL url = new URL("https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html");

        // Fetch and parse the page with a 3000 ms timeout
        Document doc = Jsoup.parse(url, 3000);

        // Find the link in the table row whose cell text contains "Model number"
        Elements modelLink = doc.select("table tr:has(td:contains(Model number)) td b a");
        String modelNumber = modelLink.text();
        String modelUrl = modelLink.attr("href");

        System.out.println(modelNumber + " : " + modelUrl);
    }
}

Let me know if this is not what you're after.

EDIT: Results:

A4-3300 : https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html

Process finished with exit code 0
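For the general case of a data table nested inside another table, a minimal offline sketch may help. It assumes Jsoup is on the classpath. The markup, the `models` class name, the `example.com` base URI, and the `NestedTableSketch` / `modelLinks` names are all hypothetical and are not taken from cpu-world.com. The idea is to scope the selector to the inner table and let Jsoup resolve relative links against a base URI:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

class NestedTableSketch {

    // Hypothetical markup: an outer layout table wrapping the data table.
    static final String HTML =
            "<table class='layout'><tr><td>"
          + "<table class='models'>"
          + "<tr><th>Model</th><th>Cores</th></tr>"
          + "<tr><td><a href='/CPUs/X/Example-100.html'>Example-100</a></td><td>2</td></tr>"
          + "<tr><td><a href='/CPUs/X/Example-200.html'>Example-200</a></td><td>4</td></tr>"
          + "</table>"
          + "</td></tr></table>";

    // Returns model name -> absolute details URL, in document order.
    static Map<String, String> modelLinks(String html, String baseUri) {
        // The base URI lets absUrl() turn relative hrefs into absolute URLs.
        Document doc = Jsoup.parse(html, baseUri);
        Map<String, String> links = new LinkedHashMap<>();
        // Only links inside cells of the inner table are selected.
        for (Element a : doc.select("table.models td a[href]")) {
            links.put(a.text(), a.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        modelLinks(HTML, "https://example.com/")
                .forEach((model, url) -> System.out.println(model + " : " + url));
    }
}
```

Scoping by a class on the inner table avoids matching links that sit in the outer table. If the real page has no usable class or id, a structural selector such as `tr:has(td:contains(...))`, as used above, is an alternative.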

EDIT:

This is crazier than a box of frogs but here we go... I'll leave you to put 2 and 2 together to iterate through the URLs to get the individual details you're after:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

public class CpuWorld {

    public static final String CPU_WORLD_COM_URL = "https://www.cpu-world.com/info/AMD/AMD_A4-Series.html";

    public static final String SCRAMBLED_DATA_HEADER = "<!--\r\nfunction JSC_Process () {var bk,xg,qh,k,aj,y,e,cq,u,a,ei;\r\na=\"";
    public static final String SCRAMBLED_DATA_FOOTER = "//- qh=[\"\"];k=[2];cq=[7600];if (CW_AB){if\t((AB_v!='0')&&(AB_v!='X')&&(AB_Gl((AB_v=='')?99:3)==3)){y=1;AB_cb=function(){JSC_Process();};}else{y=2;}}for(aj=e=0;aj<k.length;aj++){ei=cq[aj];bk=qh[aj];if (!bk) bk=\"JSc_\"+aj;u=CW_E(bk);if (u){bk=\" jsc_a\";if (y>=k[aj]){xg=a.substr(e,ei);xg=xg.replace(/(.)(.)/g,\"$2$1\");u.innerHTML=xg.replace(/\\\\n/g,\"\\n\");bk='';}u.className=u.className.replace(/(^| )jsc_\\w+$/,bk);}e+=ei;}}JSC_Process();";

    public static void main(String[] args) throws IOException {
        Document tableData = getTableData(CPU_WORLD_COM_URL);

        // Build an absolute details URL for every link in the unscrambled table
        List<String> fullUrls = tableData.select("table tr td:contains(a) a").stream()
                .map(e -> "https://www.cpu-world.com/" + e.attr("href"))
                .collect(Collectors.toList());

        // The link text is the model name
        List<String> fullModels = tableData.select("table tr td:contains(a) a").stream()
                .map(e -> e.text())
                .collect(Collectors.toList());

        for (int i = 0; i < fullUrls.size(); i++) {
            System.out.println(fullModels.get(i) + " : " + fullUrls.get(i));
        }
    }

    private static Document getTableData(String url) {
        Connection.Response response = null;
        try {
            response = Jsoup
                    .connect(url)
                    .headers(getHeaders())
                    .method(Connection.Method.GET)
                    .data()
                    .execute();

        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Elements script = Jsoup.parse(response.body()).select("script");

        // Take the substring of the script body after the header and before the footer
        // (minus 6 more chars, which seem dynamic).
        // The script tag containing JSC_Process is the one holding the data (all mangled).
        Optional<String> scrambledData = script.stream()
                .filter(element -> element.data().contains("JSC_Process"))
                .map(node -> node.data().substring(SCRAMBLED_DATA_HEADER.length(), (node.data().length() - SCRAMBLED_DATA_FOOTER.length()-6)))
                .findFirst();

        String tableData = Unscrambler.unscramble(scrambledData.orElseThrow(() -> new RuntimeException("scrambled data not found in relevant script tag")));

        Document doc = Jsoup.parse(tableData);
        return doc;
    }

    private static boolean isNotEmptyString(Element node) {
        return node.data() != null && !node.data().equals("");
    }

    /**
    * Trick the server into thinking we're not a bot
    * by telling it we were referred by the server itself
    * and that we're using a Mozilla/Safari browser.
    **/
    private static Map<String, String> getHeaders() {
        Map<String, String> headersMap = new HashMap<>();
        headersMap.put("User-Agent", "Mozilla/5.0 Safari/537.36");
        headersMap.put("Referer", CPU_WORLD_COM_URL);
        return headersMap;
    }
}

class Unscrambler {

    public static final String SCRAMBLED_DATA_HEADER = "<!--\r\nfunction JSC_Process () {var bk,xg,qh,k,aj,y,e,cq,u,a,ei;\r\na=\"";
    public static final String SCRAMBLED_DATA_FOOTER = "qh=[\"\"];k=[2];cq=[7600];if (CW_AB){if\t((AB_v!='0')&&(AB_v!='X')&&(AB_Gl((AB_v=='')?99:3)==3)){y=1;AB_cb=function(){JSC_Process();};}else{y=2;}}for(aj=e=0;aj<k.length;aj++){ei=cq[aj];bk=qh[aj];if (!bk) bk=\"JSc_\"+aj;u=CW_E(bk);if (u){bk=\" jsc_a\";if (y>=k[aj]){xg=a.substr(e,ei);xg=xg.replace(/(.)(.)/g,\"$2$1\");u.innerHTML=xg.replace(/\\\\n/g,\"\\n\");bk='';}u.className=u.className.replace(/(^| )jsc_\\w+$/,bk);}e+=ei;}}JSC_Process();";

    public static String unscramble(String data) {
        String a = data.replace("\\\"", "'")
                .replace("\\\\", "\\")
                .replace("\\r", "")
                .replace("\\n", "")
                .replace("\"+\r\n\"", ""); // remove gunk that mucks up processing in java
        StringBuilder buffer = new StringBuilder();
        int e = 0;
        int ei = 2;

        // This is effectively what the code in the footer is doing. Heavily un-obfuscated below.
        // Walk through the string two chars at a time and swap each pair around.
        for (int aj = 0; aj < a.length() - 2; aj += 2) {
            String xg = a.substring(e, ei);
            buffer.append(xg.charAt(1)).append(xg.charAt(0));
            e += 2;
            ei += 2;
        }
        return buffer.toString().replace("\n", "");
    }
}
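The pair swap at the heart of the unscrambling step can also be written with a regular expression, mirroring the `xg.replace(/(.)(.)/g,"$2$1")` call in the page's script footer. The following is only a sketch of that one step, not a drop-in replacement for `Unscrambler`: the `PairSwapSketch` and `swapPairs` names are hypothetical, it does none of the escape clean-up, and its handling of the end of the string differs from the loop above.

```java
class PairSwapSketch {

    // Swap every adjacent pair of characters, e.g. "badc" -> "abcd".
    // A trailing unpaired character is left in place.
    // Note: '.' does not match line terminators, as in JavaScript.
    static String swapPairs(String scrambled) {
        return scrambled.replaceAll("(.)(.)", "$2$1");
    }

    public static void main(String[] args) {
        // A scrambled "<table>" tag, two characters at a time
        System.out.println(swapPairs("t<bael>")); // prints <table>
    }
}
```

Applying `swapPairs` twice returns the original string, which is why the same operation serves for both scrambling and unscrambling.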
