Selenium - Java - Extract text from site with no change in structure

Question

I am trying to extract text from the page linked below:

https://bible.usccb.org/bible/readings/090120.cfm

Code:

String quote2 = driver.findElement(By.xpath("//*[@id='block-usccb-readings-content']/div/div[6]/div/div/div/div/div[2]")).getText();

This captures the text, but I need to write it to a text file, and when I do, it comes out as a single paragraph. My expected and actual output are below.

Expected:

R. (17) The Lord is just in all his ways.
The LORD is gracious and merciful,
slow to anger and of great kindness.
The LORD is good to all
and compassionate toward all his works.
R. The Lord is just in all his ways.
Let all your works give you thanks, O LORD,
and let your faithful ones bless you.
Let them discourse of the glory of your Kingdom
and speak of your might.

Current:

R. (17) The Lord is just in all his ways. The LORD is gracious and merciful, slow to anger and of great kindness. The LORD is good to all and compassionate toward all his works. R. The Lord is just in all his ways. Let all your works give you thanks, O LORD, and let your faithful ones bless you. Let them discourse of the glory of your Kingdom and speak of your might. R. The Lord is just in all his ways. Making known to men your might and the glorious splendor of your Kingdom. Your Kingdom is a Kingdom for all ages, and your dominion endures through all generations. R. The Lord is just in all his ways. The LORD is faithful in all his words and holy in all his works. The LORD lifts up all who are falling and raises up all who are bowed down. R. The Lord is just in all his ways.   

Is it possible to keep the line breaks?

Answer 1

Score: 2

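You can get the content with its formatting intact by using the code below. It reads the element's innerHTML rather than its text, so the markup that produces the line breaks is preserved, and writes it to an HTML file.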

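import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class SaveReading { // class name is arbitrary
    public static void main(String[] args) {
        String base = System.getProperty("user.dir") + "\\src\\test\\resources\\executables\\";
        System.setProperty("webdriver.chrome.driver", base + "chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://bible.usccb.org/bible/readings/090120.cfm");
            By target = By.xpath("//*[@id='block-usccb-readings-content']/div/div[6]/div/div/div/div/div[2]");
            // Selenium 3 signature; Selenium 4 expects new WebDriverWait(driver, Duration.ofSeconds(20)).
            WebElement el = new WebDriverWait(driver, 20).until(ExpectedConditions.elementToBeClickable(target));
            // innerHTML returns the element's markup, not just its visible text.
            String str = el.getAttribute("innerHTML");
            // try-with-resources closes the writer even if write() throws.
            try (BufferedWriter writer = new BufferedWriter(new FileWriter(base + "Download.html"))) {
                writer.write(str);
            } catch (IOException e) {
                e.printStackTrace();
            }
        } finally {
            driver.quit(); // close the browser even if the wait or lookup fails
        }
    }
}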
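
The question asked for a plain-text file, while the code above saves HTML. The following is a minimal sketch of one way to bridge that gap. It assumes the lines inside the innerHTML string are separated by <br> tags, which has not been verified against the page. The class BrToLines, the method toLines, the sample string, and the file name reading.txt are all made up for illustration; none of them come from Selenium or from the original answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Selenium or of the original answer.
final class BrToLines {

    /** Splits an innerHTML fragment into trimmed, non-empty text lines at each <br> tag. */
    static List<String> toLines(String innerHtml) {
        String text = innerHtml
                .replaceAll("(?i)<br\\s*/?>", "\n") // assumption: <br> marks each line break
                .replaceAll("<[^>]+>", "")          // drop any remaining tags
                .replace("&nbsp;", " ");            // only this one entity is decoded here
        List<String> lines = new ArrayList<>();
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Sample fragment made up for illustration; in practice pass el.getAttribute("innerHTML").
        String sample = "R. The Lord is just in all his ways.<br>\n"
                + "The LORD is gracious and merciful,<br />slow to anger and of great kindness.";
        Path out = Paths.get("reading.txt"); // hypothetical output file name
        // Files.write puts each list element on its own line using the platform line separator.
        Files.write(out, toLines(sample), StandardCharsets.UTF_8);
    }
}
```

If the page wraps each line in its own element instead of using <br>, the first replaceAll would need a different pattern, and a real HTML parser would be a safer choice than regular expressions for anything beyond a simple fragment like this.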

huangapple
  • Posted on September 1, 2020 at 13:02:19
  • Please keep this link when reposting: https://go.coder-hub.com/63681657.html