How can I scrape the HTML data I want in Java?

Question


I'm practicing scraping data from websites, and I'm stuck on one site: https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1. I want to get the Kurum - İlan Numarası - Şehir (Corporation - Notice Number - City) data. I think I can't scrape the div. When I run code that includes the selector div.search-results-header row, it doesn't work. I also want to get the first 20 pages of this site. How can I do this? The markup is complicated, so I'm adding images as attachments. If you can at least tell me how to get Kurum, I think I can handle the rest. Thank you.

Here is the code I'm working on for the project:

public static void main(String[] args) throws Exception {

    File iflasHukuku = new File("/Users/Berkan/Desktop/Iflas Hukuku.txt");
    iflasHukuku.createNewFile();

    FileWriter fileWriter = new FileWriter(iflasHukuku);
    BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);

    final Document document = Jsoup.connect("https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1").get();


    for(Element x: document.select(".search-results-table-container container mb-4 ng-tns-c6-3 ng-star-inserted")) {

        final String kurumAdi = x.select("div.search-results-header row").text();
        System.out.println(kurumAdi);

    }
}

Answer 1

Score: 1


The page appears to be an Angular app, so you cannot simply fetch the HTML with Jsoup.connect: the browser has to execute the JavaScript to render the page. Instead, use WebDriver to load the page, take its page source, and pass that to Jsoup.

See this:

import io.github.bonigarcia.wdm.WebDriverManager;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class JSoupTest {

    public static void main(String[] args) {
        WebDriverManager.chromedriver().setup(); // downloads the driver

        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setHeadless(true); // run Chrome without a visible window

        WebDriver driver = new ChromeDriver(chromeOptions);
        try {
            driver.get("https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1");

            // wait up to 30 seconds until the rendered page contains the text "İlan Açıklaması"
            WebDriverWait wait = new WebDriverWait(driver, 30);
            wait.until(webDriver -> webDriver.getPageSource().contains("İlan Açıklaması"));

            // hand the rendered HTML to Jsoup
            final Document document = Jsoup.parse(driver.getPageSource());

            Elements rows = document.select(".search-results-row");
            for (Element row : rows) {
                System.out.println(row.text());
                // parse it further
            }
        } finally {
            driver.quit(); // close the browser even if scraping fails
        }
    }
}
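A side note on the selector from the question, since it would fail even on fully rendered HTML. In CSS selector syntax a space is the descendant combinator, so div.search-results-header row looks for a <row> element inside the div, not for a div carrying both classes. Classes on the same element are chained with dots. The sketch below illustrates this on made-up markup: only the two class names on the outer div come from the question, while the inner <span> and the SelectorDemo class are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {

    // Hypothetical markup: only the two class names on the div come from the question.
    static final String HTML =
            "<div class=\"search-results-header row\"><span>Kurum</span></div>";

    // Returns how many elements of the sample markup a CSS selector matches.
    static int matches(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        // Space = descendant combinator: searches for a <row> tag inside the div.
        System.out.println(matches("div.search-results-header row"));
        // Chained dots = a single element that has both classes.
        System.out.println(matches("div.search-results-header.row"));
    }
}
```

The same reasoning applies to the longer selector in the question's for loop: each class name after the first would need a leading dot and no spaces in between. Generated class names such as ng-tns-c6-3 may also change between builds of the site, so matching on fewer, stable classes is probably safer.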

Required Dependencies:

<dependency>
    <groupId>io.github.bonigarcia</groupId>
    <artifactId>webdrivermanager</artifactId>
    <version>4.2.2</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.141.59</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-support</artifactId>
    <version>3.141.59</version>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.2-jre</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
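For the "first 20 pages" part of the question, one possible approach is to generate the page URLs up front and load each one in turn. This is only a sketch: it assumes the currentPage query parameter seen in the question's URL is what selects the page, which has not been verified against the site. The PageUrls class and firstPages method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrls {

    // Listing URL from the question, without the page number at the end.
    private static final String BASE =
            "https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=";

    // Builds the URLs for pages 1..count (assumes currentPage selects the page).
    static List<String> firstPages(int count) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= count; page++) {
            urls.add(BASE + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : firstPages(20)) {
            // In the scraper, each URL would go to driver.get(url), followed by
            // the same wait-and-parse steps used for the first page.
            System.out.println(url);
        }
    }
}
```

Reusing a single WebDriver instance across all pages, and quitting it once at the end, avoids starting a new browser for every page.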

huangapple
  • Posted on October 1, 2020, 03:45:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64144711.html