January 6, 2020, 22:25:02

English:

Save data in XML file in Python

Question

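You can try the following changes to make sure all five reviews are saved in the file. First, create the root of the XML tree in the main function and pass it to xml_tree, so that every review is added to the same tree. Then move the file write out of xml_tree, so the file is no longer overwritten each time a review is added. Here is the modified code: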

import requests
from bs4 import BeautifulSoup
import re
import json
# xml.etree.cElementTree was removed in Python 3.9; ElementTree is the supported module.
import xml.etree.ElementTree as ET
source = requests.get('https://www.tripadvisor.ch/Hotel_Review-g188113-d228146-Reviews-Coronado_Hotel-Zurich.html#REVIEWS').text
soup = BeautifulSoup(source, 'lxml')
pattern = re.compile(r'window.__WEB_CONTEXT__={pageManifest:(\{.*\})};')
script = soup.find("script", text=pattern)
dictData = pattern.search(script.text).group(1)
jsonData = json.loads(dictData)
def get_countrycitydata():
    countrycity_dict = dict()
    country_data = jsonData['urqlCache']['3960485871']['data']['locations']
    for data in country_data:
        data1 = data['parents']
        countrycity_dict["country_name"] = data1[2]['name']
        countrycity_dict["tripadvisorid_city"] = data1[0]['locationId']
        countrycity_dict["city_name"] = data1[0]['name']
    return countrycity_dict
def get_hoteldata():
    hotel_dict = dict()
    locations = jsonData['urqlCache']['669061039']['data']['locations']
    for data in locations:
        hotel_dict["tripadvisorid_hotel"] = data['locationId']
        hotel_dict["hotel_name"] = data['name']
    return hotel_dict
def get_reviews():
    all_dictionaries = []
    for locations in jsonData['urqlCache']['669061039']['data']['locations']:
        for reviews in locations['reviewListPage']['reviews']:
            review_dict = {}
            review_dict["reviewid"] = reviews['id']
            review_dict["reviewurl"] = reviews['absoluteUrl']
            review_dict["reviewlang"] = reviews['language']
            review_dict["reviewtitle"] = reviews['title']
            reviewtext = reviews['text']
            clean_reviewtext = reviewtext.replace('\n', ' ')
            review_dict["reviewtext"] = clean_reviewtext
            all_dictionaries.append(review_dict)
    return all_dictionaries
def xml_tree(new_dict, root):
    country = ET.SubElement(root, "country")
    ET.SubElement(country, "name").text = new_dict["country_name"]
    city = ET.SubElement(country, "city")
    ET.SubElement(city, "tripadvisorid").text = str(new_dict["tripadvisorid_city"])
    ET.SubElement(city, "name").text = new_dict["city_name"]
    hotels = ET.SubElement(city, "hotels")
    hotel = ET.SubElement(hotels, "hotel")
    ET.SubElement(hotel, "tripadvisorid").text = str(new_dict["tripadvisorid_hotel"])
    ET.SubElement(hotel, "name").text = new_dict["hotel_name"]
    reviews = ET.SubElement(hotel, "reviews")
    for review_data in new_dict["reviews"]:
        review = ET.SubElement(reviews, "review")
        ET.SubElement(review, "reviewid").text = str(review_data["reviewid"])
        ET.SubElement(review, "reviewurl").text = review_data["reviewurl"]
        ET.SubElement(review, "reviewlang").text = review_data["reviewlang"]
        ET.SubElement(review, "reviewtitle").text = review_data["reviewtitle"]
        ET.SubElement(review, "reviewtext").text = review_data["reviewtext"]
def main():
    city_dict = get_countrycitydata()
    hotel_dict = get_hoteldata()
    review_list = get_reviews()
    root = ET.Element("countries")
    # Build the tree once: xml_tree() already loops over every review,
    # so calling it per review would duplicate the whole country block.
    new_dict = {**city_dict, **hotel_dict, "reviews": review_list}
    xml_tree(new_dict, root)
    tree = ET.ElementTree(root)
    # Write the file a single time, after the whole tree is complete.
    tree.write("test.xml", encoding='unicode')
if __name__ == "__main__":
    main()

This way the XML tree is built in the main function and written to the file once, so all five reviews end up in the same file. Hope this helps!
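One way to confirm that every review reached the file is to parse it back and count the review elements. The sketch below is only an illustration: `sample_reviews` is a hypothetical stand-in for the list returned by get_reviews(), the nesting is shortened, and the file goes to a temporary directory instead of test.xml. `ET.indent` is available from Python 3.9 on, so it is guarded here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the result of get_reviews().
sample_reviews = [
    {"reviewid": 1000 + i, "reviewtitle": f"Title {i}"} for i in range(5)
]

root = ET.Element("countries")
reviews = ET.SubElement(root, "reviews")  # nesting shortened for the sketch
for data in sample_reviews:
    review = ET.SubElement(reviews, "review")
    ET.SubElement(review, "reviewid").text = str(data["reviewid"])
    ET.SubElement(review, "reviewtitle").text = data["reviewtitle"]

# Optional pretty-printing; ET.indent exists only in Python 3.9 and later.
if hasattr(ET, "indent"):
    ET.indent(root)

# Write the finished tree exactly once.
path = os.path.join(tempfile.mkdtemp(), "test.xml")
ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# Read the file back and count the <review> elements that were saved.
saved_count = len(ET.parse(path).getroot().findall(".//review"))
print(saved_count)
```

Writing with encoding="utf-8" and an XML declaration is a choice made for this sketch so that non-ASCII names such as Zürich are read back unambiguously.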

English:

I am trying to save my data to an XML file. This data comes from a website where I want to collect the reviews. There are always five reviews per page, which I want to save in XML format in a file. The problem is that if I print out the XML tree with print(ET.tostring(root, encoding='utf8').decode('utf8')) then there are all five reviews that I want to have. But if I save them into the file with tree.write("test.xml", encoding='unicode') then I only see one review... Here is my code:

import requests
from bs4 import BeautifulSoup
import re
import json
import xml.etree.cElementTree as ET
source = requests.get('https://www.tripadvisor.ch/Hotel_Review-g188113-d228146-Reviews-Coronado_Hotel-Zurich.html#REVIEWS').text
soup = BeautifulSoup(source, 'lxml')
pattern = re.compile(r'window.__WEB_CONTEXT__={pageManifest:(\{.*\})};')
script = soup.find("script", text=pattern)
dictData = pattern.search(script.text).group(1)
jsonData = json.loads(dictData)
def get_countrycitydata():
    countrycity_dict = dict()
    country_data = jsonData['urqlCache']['3960485871']['data']['locations']
    for data in country_data:
        data1 = data['parents']
        countrycity_dict["country_name"] = data1[2]['name']
        countrycity_dict["tripadvisorid_city"] = data1[0]['locationId']
        countrycity_dict["city_name"] = data1[0]['name']
    return countrycity_dict
def get_hoteldata():
    hotel_dict = dict()
    locations = jsonData['urqlCache']['669061039']['data']['locations']
    for data in locations:
        hotel_dict["tripadvisorid_hotel"] = data['locationId']
        hotel_dict["hotel_name"] = data['name']
    return hotel_dict
def get_reviews():
    all_dictionaries = []
    for locations in jsonData['urqlCache']['669061039']['data']['locations']:
        for reviews in locations['reviewListPage']['reviews']:
            review_dict = {}
            review_dict["reviewid"] = reviews['id']
            review_dict["reviewurl"] = reviews['absoluteUrl']
            review_dict["reviewlang"] = reviews['language']
            review_dict["reviewtitle"] = reviews['title']
            reviewtext = reviews['text']
            clean_reviewtext = reviewtext.replace('\n', ' ')
            review_dict["reviewtext"] = clean_reviewtext
            all_dictionaries.append(review_dict)
    return all_dictionaries
def xml_tree(new_dict): # should I change something here???
    root = ET.Element("countries")
    country = ET.SubElement(root, "country")
    ET.SubElement(country, "name").text = new_dict["country_name"]
    city = ET.SubElement(country, "city")
    ET.SubElement(city, "tripadvisorid").text = str(new_dict["tripadvisorid_city"])
    ET.SubElement(city, "name").text = new_dict["city_name"]
    hotels = ET.SubElement(city, "hotels")
    hotel = ET.SubElement(hotels, "hotel")
    ET.SubElement(hotel, "tripadvisorid").text = str(new_dict["tripadvisorid_hotel"])
    ET.SubElement(hotel, "name").text = new_dict["hotel_name"]
    reviews = ET.SubElement(hotel, "reviews")
    review = ET.SubElement(reviews, "review")
    ET.SubElement(review, "reviewid").text = str(new_dict["reviewid"])
    ET.SubElement(review, "reviewurl").text = new_dict["reviewurl"]
    ET.SubElement(review, "reviewlang").text = new_dict["reviewlang"]
    ET.SubElement(review, "reviewtitle").text = new_dict["reviewtitle"]
    ET.SubElement(review, "reviewtext").text = new_dict["reviewtext"]
    tree = ET.ElementTree(root)
    tree.write("test.xml", encoding='unicode')
    print(ET.tostring(root, encoding='utf8').decode('utf8'))
##########################################################
def main():
    city_dict = get_countrycitydata()
    hotel_dict = get_hoteldata()
    review_list = get_reviews()
    for index in range(len(review_list)):
        new_dict = {**city_dict, **hotel_dict, **review_list[index]}
        xml_tree(new_dict)
if __name__ == "__main__":
    main()

How can I change the XML tree so that all five reviews are saved in the file? The XML file should look like this:

<countries>
  <country>
    <name>Schweiz</name>
    <city>
      <tripadvisorid>188113</tripadvisorid>
      <name>Zürich</name>
      <hotels>
        <hotel>
          <tripadvisorid>228146</tripadvisorid>
          <name>Hotel Coronado</name>
          <reviews>
            <review>
              <reviewid>672052111</reviewid>
              <reviewurl>https://www.tripadvisor.ch/ShowUserReviews-g188113-d228146-r672052111-Coronado Hotel-Zurich.html</reviewurl>
              <reviewlang>de</reviewlang>
              <reviewtitle>Optimale Lage und Preis</reviewtitle>
              <reviewtext>Hervorragendes Hotel.Beste Erfahrun mit Service und Zimme.Die Qalität der Betten ist optimalr. Zimmer sind trotz geringer Größe sehr gut ausgestattet.Der Föhn war in diesem Fall (nicht in früheren)etwas lahm</reviewtext>
            </review>
            <review>
              second review here ...
            </review>
            <review>
              third review here ...
            </review>
            ...
          </reviews>
        </hotel>
      </hotels>
    </city>
  </country>
</countries>

Thank you in advance for all suggestions!

Answer 1

Score: 2


Because xml_tree(new_dict) is called inside a for loop, tree.write() runs once per review and overwrites your file each time.
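The overwriting can be reproduced without the scraping code. In this minimal sketch the review ids are hypothetical and a temporary path replaces test.xml; each loop iteration builds a fresh tree and writes it to the same file, as xml_tree(new_dict) does in the question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

path = os.path.join(tempfile.mkdtemp(), "test.xml")

# Each iteration creates a new root and writes it to the same path,
# so every write replaces whatever the previous one stored.
for review_id in ["r1", "r2", "r3", "r4", "r5"]:  # hypothetical ids
    root = ET.Element("countries")
    ET.SubElement(root, "review").text = review_id
    ET.ElementTree(root).write(path, encoding="unicode")

# Only the review from the last iteration is left in the file.
remaining = [el.text for el in ET.parse(path).getroot().iter("review")]
print(remaining)
```

The print call inside the loop shows each tree as it is built, which is why all five reviews appear on the console while the file keeps only the last one.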

Open your file in a (append) mode with open():

tree.write(open('test.xml', 'a'), encoding='unicode')

See documentation here
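One caveat with append mode: every call still writes a complete tree with its own <countries> root, so the file ends up with several top-level elements instead of the single nested structure requested in the question, and an XML parser will reject it. The sketch below illustrates this under simplified assumptions: `make_tree` is a hypothetical stand-in for xml_tree(new_dict), and the ids and temporary path are made up.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def make_tree(review_id):
    # Hypothetical one-review tree, standing in for xml_tree(new_dict).
    root = ET.Element("countries")
    ET.SubElement(root, "review").text = review_id
    return ET.ElementTree(root)

path = os.path.join(tempfile.mkdtemp(), "test.xml")

# Append mode keeps every write, but each one adds its own root element.
for review_id in ["r1", "r2", "r3"]:
    with open(path, "a", encoding="utf-8") as handle:
        make_tree(review_id).write(handle, encoding="unicode")

# Count how many root elements ended up in the file.
with open(path, encoding="utf-8") as handle:
    root_count = handle.read().count("<countries>")

try:
    ET.parse(path)
    well_formed = True
except ET.ParseError:
    # More than one top-level element is not a well-formed XML document.
    well_formed = False

print(root_count, well_formed)
```

If a single <countries> document is the goal, building one tree and writing it once after the loop avoids both the overwriting and the multiple roots.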
