How can I scrape the HTML data I want in Java?
Question
I'm practicing scraping data from websites and I'm stuck on one whose URL is https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1. I want to get the Kurum - İlan Numarası - Şehir (Corporation - Notice Number - City) data. I think the problem is that I can't scrape the div: when I run the code that uses the selector div.search-results-header row, it doesn't work. I also want to scrape the first 20 pages of this site. How can I do that? The page source is a complicated bunch of code, so I'm adding images as attachments. If you can at least tell me how to get Kurum, I think I can handle the rest. Thank you.
This is the code I'm working on for the project:
public static void main(String[] args) throws Exception {
    File iflasHukuku = new File("/Users/Berkan/Desktop/Iflas Hukuku.txt");
    iflasHukuku.createNewFile();
    FileWriter fileWriter = new FileWriter(iflasHukuku);
    BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
    final Document document = Jsoup.connect("https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1").get();
    for (Element x : document.select(".search-results-table-container container mb-4 ng-tns-c6-3 ng-star-inserted")) {
        final String kurumAdi = x.select("div.search-results-header row").text();
        System.out.println(kurumAdi);
    }
}
Answer 1
Score: 1
It appears the webpage is an Angular app, so you cannot simply grab the HTML content with Jsoup.connect: a browser has to execute the JavaScript before the page is rendered. Instead, use a WebDriver to load the page, take its pageSource, and hand that to Jsoup.
See this:
import io.github.bonigarcia.wdm.WebDriverManager;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class JSoupTest {
    public static void main(String[] args) {
        WebDriverManager.chromedriver().setup(); // downloads the matching ChromeDriver binary
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setHeadless(true);
        WebDriver driver = new ChromeDriver(chromeOptions);
        driver.get("https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1");
        // Wait until Angular has rendered the results before grabbing the page source
        WebDriverWait wait = new WebDriverWait(driver, 30);
        wait.until(webDriver -> driver.getPageSource().contains("İlan Açıklaması"));
        // Hand the fully rendered HTML to Jsoup
        final Document document = Jsoup.parse(driver.getPageSource());
        for (Element x : document.select(".search-results-row")) {
            System.out.println(x.text());
            // parse it further
        }
        driver.quit(); // release the browser
    }
}
Required Dependencies:
<dependency>
    <groupId>io.github.bonigarcia</groupId>
    <artifactId>webdrivermanager</artifactId>
    <version>4.2.2</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.141.59</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-support</artifactId>
    <version>3.141.59</version>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.2-jre</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
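For the second part of the question (fetching the first 20 pages): since the listing URL already carries a currentPage query parameter, a minimal sketch is to drive the same headless browser through pages 1 to 20 and parse each rendered page with Jsoup. This assumes the parameter simply increments and that every page uses the same .search-results-row markup; the class name JSoupPagingTest is illustrative.

import io.github.bonigarcia.wdm.WebDriverManager;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class JSoupPagingTest {
    public static void main(String[] args) {
        WebDriverManager.chromedriver().setup();
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setHeadless(true);
        WebDriver driver = new ChromeDriver(chromeOptions);
        WebDriverWait wait = new WebDriverWait(driver, 30);
        try {
            // Assumption: pages 1..20 differ only in the currentPage parameter
            String baseUrl = "https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=";
            for (int page = 1; page <= 20; page++) {
                driver.get(baseUrl + page);
                // Same readiness check as above: wait for Angular to render the table
                wait.until(webDriver -> driver.getPageSource().contains("İlan Açıklaması"));
                Document document = Jsoup.parse(driver.getPageSource());
                for (Element row : document.select(".search-results-row")) {
                    // row.text() should contain Kurum, İlan Numarası and Şehir;
                    // print one row to see how to split the fields apart
                    System.out.println(row.text());
                }
            }
        } finally {
            driver.quit(); // always release the browser, even on failure
        }
    }
}

Reusing one browser instance across all 20 pages keeps the run much faster than starting a fresh ChromeDriver per page.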