September 25, 2020, 03:27:59

Why is my Jsoup Code not Returning the Correct Elements?

Question

I am working on an app in Android Studio and am having some trouble web-scraping with JSoup. I have successfully connected to the webpage and returned some basic elements to test the library, but now I cannot actually get the elements I need for my app.

I am trying to get a number of elements with the "data-at" attribute. The weird thing is, a few elements with the "data-at" attribute are returned, but not the ones I am looking for. For whatever reason my code is not extracting all of the elements that share the "data-at" attribute on the web page.

This is the URL of the webpage I am scraping:
https://express.liatoyotaofcolonie.com/inventory?f=dealer.name%3ALia%20Toyota%20of%20Colonie&f=submodel%3ACamry&f=trim%3ALE&f=year%3A2020

The method containing the web-scraping code:

    @Override
    protected String doInBackground(Void... params) {
        String title = "";
        Document doc;
        Log.d(TAG, queryString.toString());
        try {
            // Fetch the page and parse the HTML returned by the server.
            doc = Jsoup.connect(queryString.toString()).get();
            // Select every element that carries a data-at attribute.
            Elements content = doc.select("[data-at]");
            for (Element e : content) {
                Log.d(TAG, e.text());
            }
        } catch (IOException e) {
            Log.e(TAG, e.toString());
        }
        return title;
    }

[Image: The results in Logcat]

[Image: The element I want to retrieve]

[Image: One of the elements that is actually being retrieved]

Answer 1

Score: 1

This is because some of the content, including the elements you are looking for, is created asynchronously and is not present in the initial DOM (JavaScript ;))

When you view the source of the page, you will notice that there are only 17 occurrences of data-at, while running document.querySelectorAll("[data-at]") in the browser console returns 29 nodes.

What you get with JSoup is the static content of the page (the initial DOM). You won't be able to fetch dynamically created content, because JSoup does not run the required JavaScript.

To overcome this, you will have to either fetch and parse the required resources manually (e.g. trace which AJAX calls the browser makes) or use a headless browser setup. Selenium + headless Chrome should be enough.

The latter option will let you scrape practically any web application, including SPAs, which is not possible with plain Jsoup.
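For the first option (requesting the data behind the page directly), a rough sketch using only `HttpURLConnection` might look like the following. The class name, method names, and the endpoint in `main` are hypothetical; the real URL has to be discovered in the browser's network tab, and the site may require additional headers or cookies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class JsonEndpointFetcher {

    /** Prepares (but does not yet send) a GET request that asks for JSON. */
    static HttpURLConnection open(String endpoint) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Accept", "application/json");
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the request and returns the response body as a string. */
    static String fetch(String endpoint) {
        HttpURLConnection conn = open(endpoint);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoint: replace with the request seen in the network tab.
        System.out.println(fetch("https://example.com/api/inventory?year=2020"));
    }
}
```

On Android this has to run off the main thread, which the `doInBackground` method in the question already does. The returned JSON can then be parsed with whatever JSON library the app uses.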

Answer 2

Score: 0

I'm going to try one more time. The "problematic lines" in your code are these:

>         doc = Jsoup.connect(queryString.toString()).get();
>         Elements content = doc.select("[data-at]");

The `queryString` you request is a URL that points to a page containing quite a bit of script code. When you open that page in a browser and inspect it with the developer tools, the HTML you see is not exactly the same HTML that the server sends and JSoup receives.

If the HTML that is sent contains any `<script>` tags (and the URL in your question does), and those scripts are involved in the initial loading of the page, then JSoup will not see anything they produce. _It only parses what it receives; it cannot process any dynamic content._

There are four ways that I know of to get the "post-script-loaded" version of the HTML from a dynamic web page. The first is likely the most popular method (in Java) that I have heard about on Stack Overflow:

 - **Selenium** [This answer][1] shows how the tool can run JavaScript. These are some [Selenium docs][2], and [this page][3] has a great "first class" for using the tool to retrieve post-script-processed HTML. Again, JSoup cannot retrieve HTML that is generated in the browser by script (JS/AJAX/Angular/React), since it is just a parser.
 - **Puppeteer** This requires Node.js, a JavaScript runtime. Calling a simple Node.js program from Java could perhaps work, but it would be a two-language solution. I've never used it. Here is [an answer][4] that shows getting, more or less, what you are trying to get: the HTML after the script has run.
 - **WebView** Android Java programmers have a popular class called `WebView` ([documented here][5]) that I was told about only recently, although it has been out for years. It executes script in an embedded browser, and the resulting HTML can be read back. Here is [an answer][6] that shows JavaScript injection to retrieve DOM tree elements from a `WebView` instance (which is how I was told it is done).
 - **Splash** My favorite tool, which I don't think many people have heard of, but which has been the simplest for me. Here is [their explanation][7] of Splash as a "JavaScript rendering service". Since this is the one I have been using, the code snippet below shows how Splash can retrieve post-script-processed HTML.

To run Splash (this requires access to Docker), you start a Splash server as shown below. I typed these two commands into a GCP (Google Cloud Platform) shell instance, and the server started right up without any configuration:

>  Pull the image:
>  `$ sudo docker pull scrapinghub/splash`
>
>  Start the container:
>  `$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash`
>
>  In your code, just prepend this string to your URLs:
>  `"http://localhost:8050/render.html?url="`

So in your code, you would use the following instead, and Splash would (more likely) load all the HTML elements that you are not finding:

>     String SPLASH_URL = "http://localhost:8050/render.html?url=";
>     // Requires java.net.URLEncoder; encoding keeps the target's own query string inside "url".
>     String target = URLEncoder.encode(queryString.toString(), "UTF-8");
>     doc = Jsoup.connect(SPLASH_URL + target).get();
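Because the inventory URL carries its own `&f=...` parameters, it has to be percent-encoded before it is appended to `render.html?url=`; otherwise Splash would treat everything after the first `&` as its own arguments. A minimal sketch of such a helper, using only the Java standard library, could look like this. The class and method names are made up for illustration, and the endpoint is the one assumed in the answer above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SplashUrls {

    // Assumed Splash endpoint; adjust host and port to your own setup.
    static final String RENDER_ENDPOINT = "http://localhost:8050/render.html?url=";

    /** Builds a Splash render.html URL for the given target page (hypothetical helper). */
    static String renderUrl(String targetUrl) {
        try {
            // Percent-encode the whole target so '&', '?', '=' and '%'
            // stay inside the single "url" parameter.
            return RENDER_ENDPOINT + URLEncoder.encode(targetUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM and on Android.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder target URL with its own query string.
        String target = "https://example.com/inventory?f=year%3A2020&f=trim%3ALE";
        System.out.println(renderUrl(target));
    }
}
```

In the question's code this would be used as `Jsoup.connect(SplashUrls.renderUrl(queryString.toString())).get()`. Note that on an Android device or emulator `localhost` refers to the device itself, so the host would need to point at the machine actually running Splash (the emulator exposes the development machine as `10.0.2.2`).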

