Save data in XML file in Python

Question

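To make sure all five reviews end up in the file, build the XML tree once and write it once. Create the root element in main(), pass it to xml_tree() so that every review is added to the same tree, and call tree.write() only after the tree is complete. Calling it once per review overwrites the file each time. The modified code: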

    import requests
    from bs4 import BeautifulSoup
    import re
    import json
    # xml.etree.cElementTree was removed in Python 3.9; ElementTree is the same API.
    import xml.etree.ElementTree as ET

    source = requests.get('https://www.tripadvisor.ch/Hotel_Review-g188113-d228146-Reviews-Coronado_Hotel-Zurich.html#REVIEWS').text
    soup = BeautifulSoup(source, 'lxml')
    pattern = re.compile(r'window.__WEB_CONTEXT__={pageManifest:(\{.*\})};')
    script = soup.find("script", text=pattern)
    dictData = pattern.search(script.text).group(1)
    jsonData = json.loads(dictData)


    def get_countrycitydata():
        countrycity_dict = dict()
        country_data = jsonData['urqlCache']['3960485871']['data']['locations']
        for data in country_data:
            data1 = data['parents']
            countrycity_dict["country_name"] = data1[2]['name']
            countrycity_dict["tripadvisorid_city"] = data1[0]['locationId']
            countrycity_dict["city_name"] = data1[0]['name']
        return countrycity_dict


    def get_hoteldata():
        hotel_dict = dict()
        locations = jsonData['urqlCache']['669061039']['data']['locations']
        for data in locations:
            hotel_dict["tripadvisorid_hotel"] = data['locationId']
            hotel_dict["hotel_name"] = data['name']
        return hotel_dict


    def get_reviews():
        all_dictionaries = []
        for locations in jsonData['urqlCache']['669061039']['data']['locations']:
            for reviews in locations['reviewListPage']['reviews']:
                review_dict = {}
                review_dict["reviewid"] = reviews['id']
                review_dict["reviewurl"] = reviews['absoluteUrl']
                review_dict["reviewlang"] = reviews['language']
                review_dict["reviewtitle"] = reviews['title']
                reviewtext = reviews['text']
                clean_reviewtext = reviewtext.replace('\n', ' ')
                review_dict["reviewtext"] = clean_reviewtext
                all_dictionaries.append(review_dict)
        return all_dictionaries


    def xml_tree(new_dict, root):
        country = ET.SubElement(root, "country")
        ET.SubElement(country, "name").text = new_dict["country_name"]
        city = ET.SubElement(country, "city")
        ET.SubElement(city, "tripadvisorid").text = str(new_dict["tripadvisorid_city"])
        ET.SubElement(city, "name").text = new_dict["city_name"]
        hotels = ET.SubElement(city, "hotels")
        hotel = ET.SubElement(hotels, "hotel")
        ET.SubElement(hotel, "tripadvisorid").text = str(new_dict["tripadvisorid_hotel"])
        ET.SubElement(hotel, "name").text = new_dict["hotel_name"]
        reviews = ET.SubElement(hotel, "reviews")
        # One <review> element per scraped review, all under the same <reviews>.
        for review_data in new_dict["reviews"]:
            review = ET.SubElement(reviews, "review")
            ET.SubElement(review, "reviewid").text = str(review_data["reviewid"])
            ET.SubElement(review, "reviewurl").text = review_data["reviewurl"]
            ET.SubElement(review, "reviewlang").text = review_data["reviewlang"]
            ET.SubElement(review, "reviewtitle").text = review_data["reviewtitle"]
            ET.SubElement(review, "reviewtext").text = review_data["reviewtext"]


    def main():
        city_dict = get_countrycitydata()
        hotel_dict = get_hoteldata()
        # Merge the shared data and attach the full review list once.
        # Calling xml_tree() once per review would add a duplicate <country> each time.
        new_dict = {**city_dict, **hotel_dict, "reviews": get_reviews()}
        root = ET.Element("countries")
        xml_tree(new_dict, root)
        # Write the finished tree to disk a single time.
        tree = ET.ElementTree(root)
        tree.write("test.xml", encoding='unicode')


    if __name__ == "__main__":
        main()

This way the XML tree is built once in main() and written to the file in a single call, so all five reviews end up in the same file.
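The build-once, write-once approach can be checked without contacting the website. The following is a minimal sketch, not the original poster's code: the `sample` dictionary, the `build_tree` helper and the temporary file name are hypothetical stand-ins for the scraped data, and only two of the review fields are included to keep it short. It also indents the tree before writing when `ET.indent` is available (Python 3.9 and later), since `tree.write()` otherwise puts everything on one line.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the scraped dictionaries.
sample = {
    "country_name": "Schweiz",
    "tripadvisorid_city": 188113,
    "city_name": "Zürich",
    "tripadvisorid_hotel": 228146,
    "hotel_name": "Hotel Coronado",
    "reviews": [
        {"reviewid": 1, "reviewtitle": "first"},
        {"reviewid": 2, "reviewtitle": "second"},
        {"reviewid": 3, "reviewtitle": "third"},
    ],
}


def build_tree(data):
    """Build the whole <countries> tree once, with one <review> per entry."""
    root = ET.Element("countries")
    country = ET.SubElement(root, "country")
    ET.SubElement(country, "name").text = data["country_name"]
    city = ET.SubElement(country, "city")
    ET.SubElement(city, "tripadvisorid").text = str(data["tripadvisorid_city"])
    ET.SubElement(city, "name").text = data["city_name"]
    hotel = ET.SubElement(ET.SubElement(city, "hotels"), "hotel")
    ET.SubElement(hotel, "tripadvisorid").text = str(data["tripadvisorid_hotel"])
    ET.SubElement(hotel, "name").text = data["hotel_name"]
    reviews = ET.SubElement(hotel, "reviews")
    for item in data["reviews"]:
        review = ET.SubElement(reviews, "review")
        ET.SubElement(review, "reviewid").text = str(item["reviewid"])
        ET.SubElement(review, "reviewtitle").text = item["reviewtitle"]
    return ET.ElementTree(root)


tree = build_tree(sample)
if hasattr(ET, "indent"):  # ET.indent exists only on Python 3.9+
    ET.indent(tree, space="  ")

# Write to a temporary location so the sketch does not touch test.xml.
path = os.path.join(tempfile.mkdtemp(), "sample.xml")
tree.write(path, encoding="utf-8", xml_declaration=True)  # single write

# Parse the file back and count the reviews that were actually saved.
review_count = len(ET.parse(path).getroot().findall(".//review"))
print(review_count)
```

Reading the file back with `ET.parse()` is a quick way to confirm that every review reached the disk, rather than relying on what `print(ET.tostring(root))` shows in memory.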

Original question:

I am trying to save my data to an XML file. This data comes from a website where I want to collect the reviews. There are always five reviews per page, which I want to save in XML format in a file. The problem is that if I print out the XML tree with print(ET.tostring(root, encoding='utf8').decode('utf8')) then there are all five reviews that I want to have. But if I save them into the file with tree.write("test.xml", encoding='unicode') then I only see one review... Here is my code:

    import requests
    from bs4 import BeautifulSoup
    import re
    import json
    import xml.etree.cElementTree as ET

    source = requests.get('https://www.tripadvisor.ch/Hotel_Review-g188113-d228146-Reviews-Coronado_Hotel-Zurich.html#REVIEWS').text
    soup = BeautifulSoup(source, 'lxml')
    pattern = re.compile(r'window.__WEB_CONTEXT__={pageManifest:(\{.*\})};')
    script = soup.find("script", text=pattern)
    dictData = pattern.search(script.text).group(1)
    jsonData = json.loads(dictData)


    def get_countrycitydata():
        countrycity_dict = dict()
        country_data = jsonData['urqlCache']['3960485871']['data']['locations']
        for data in country_data:
            data1 = data['parents']
            countrycity_dict["country_name"] = data1[2]['name']
            countrycity_dict["tripadvisorid_city"] = data1[0]['locationId']
            countrycity_dict["city_name"] = data1[0]['name']
        return countrycity_dict


    def get_hoteldata():
        hotel_dict = dict()
        locations = jsonData['urqlCache']['669061039']['data']['locations']
        for data in locations:
            hotel_dict["tripadvisorid_hotel"] = data['locationId']
            hotel_dict["hotel_name"] = data['name']
        return hotel_dict


    def get_reviews():
        all_dictionaries = []
        for locations in jsonData['urqlCache']['669061039']['data']['locations']:
            for reviews in locations['reviewListPage']['reviews']:
                review_dict = {}
                review_dict["reviewid"] = reviews['id']
                review_dict["reviewurl"] = reviews['absoluteUrl']
                review_dict["reviewlang"] = reviews['language']
                review_dict["reviewtitle"] = reviews['title']
                reviewtext = reviews['text']
                clean_reviewtext = reviewtext.replace('\n', ' ')
                review_dict["reviewtext"] = clean_reviewtext
                all_dictionaries.append(review_dict)
        return all_dictionaries


    def xml_tree(new_dict):  # should I change something here???
        root = ET.Element("countries")
        country = ET.SubElement(root, "country")
        ET.SubElement(country, "name").text = new_dict["country_name"]
        city = ET.SubElement(country, "city")
        ET.SubElement(city, "tripadvisorid").text = str(new_dict["tripadvisorid_city"])
        ET.SubElement(city, "name").text = new_dict["city_name"]
        hotels = ET.SubElement(city, "hotels")
        hotel = ET.SubElement(hotels, "hotel")
        ET.SubElement(hotel, "tripadvisorid").text = str(new_dict["tripadvisorid_hotel"])
        ET.SubElement(hotel, "name").text = new_dict["hotel_name"]
        reviews = ET.SubElement(hotel, "reviews")
        review = ET.SubElement(reviews, "review")
        ET.SubElement(review, "reviewid").text = str(new_dict["reviewid"])
        ET.SubElement(review, "reviewurl").text = new_dict["reviewurl"]
        ET.SubElement(review, "reviewlang").text = new_dict["reviewlang"]
        ET.SubElement(review, "reviewtitle").text = new_dict["reviewtitle"]
        ET.SubElement(review, "reviewtext").text = new_dict["reviewtext"]
        tree = ET.ElementTree(root)
        tree.write("test.xml", encoding='unicode')
        print(ET.tostring(root, encoding='utf8').decode('utf8'))

    ##########################################################

    def main():
        city_dict = get_countrycitydata()
        hotel_dict = get_hoteldata()
        review_list = get_reviews()
        for index in range(len(review_list)):
            new_dict = {**city_dict, **hotel_dict, **review_list[index]}
            xml_tree(new_dict)


    if __name__ == "__main__":
        main()

How can I change the XML tree so that all five reviews are saved in the file? The XML file should look like this:

    <countries>
      <country>
        <name>Schweiz</name>
        <city>
          <tripadvisorid>188113</tripadvisorid>
          <name>Zürich</name>
          <hotels>
            <hotel>
              <tripadvisorid>228146</tripadvisorid>
              <name>Hotel Coronado</name>
              <reviews>
                <review>
                  <reviewid>672052111</reviewid>
                  <reviewurl>https://www.tripadvisor.ch/ShowUserReviews-g188113-d228146-r672052111-Coronado Hotel-Zurich.html</reviewurl>
                  <reviewlang>de</reviewlang>
                  <reviewtitle>Optimale Lage und Preis</reviewtitle>
                  <reviewtext>Hervorragendes Hotel.Beste Erfahrun mit Service und Zimme.Die Qalität der Betten ist optimalr. Zimmer sind trotz geringer Größe sehr gut ausgestattet.Der Föhn war in diesem Fall (nicht in früheren)etwas lahm</reviewtext>
                </review>
                <review>
                  second review here ...
                </review>
                <review>
                  third review here ...
                </review>
                ...
              </reviews>
            </hotel>
          </hotels>
        </city>
      </country>
    </countries>

Thank you in advance for all suggestions!

Answer 1

Score: 2


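Because xml_tree(new_dict) is called inside a for loop, tree.write() runs once per review and overwrites the file each time, so only the last review is left.

One workaround is to open the file in a (append) mode with open():

    with open('test.xml', 'a') as f:
        tree.write(f, encoding='unicode')

Note that this appends a separate <countries> root element for every review, so the result is not a single well-formed XML document. To get the structure shown in the question, build one tree containing all reviews and write it once.

See the Python documentation for open() for the available file modes.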
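The difference between the two file modes can be reproduced on a small scale. This is a sketch under stated assumptions: the `write_each` helper, the file names and the three placeholder reviews are all hypothetical and only mimic the pattern in the question, where a fresh tree is written on every loop iteration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_each(path, mode, count=3):
    """Write `count` one-review trees to `path`, reopening it in `mode` each time."""
    for i in range(count):
        root = ET.Element("reviews")
        ET.SubElement(root, "review").text = f"review {i}"
        with open(path, mode, encoding="utf-8") as handle:
            ET.ElementTree(root).write(handle, encoding="unicode")


folder = tempfile.mkdtemp()

# Mode "w" truncates on every open, so only the last review survives.
overwritten = os.path.join(folder, "overwritten.xml")
write_each(overwritten, "w")
survivors = [r.text for r in ET.parse(overwritten).getroot().iter("review")]

# Mode "a" keeps everything, but the file now holds three root elements.
appended = os.path.join(folder, "appended.xml")
write_each(appended, "a")
try:
    ET.parse(appended)
    appended_is_well_formed = True
except ET.ParseError:
    appended_is_well_formed = False

print(survivors, appended_is_well_formed)
```

Append mode keeps every review on disk, but an XML parser rejects the file because a document may have only one root element. That is why collecting all reviews under a single root and writing once is usually the better fit for the layout requested in the question.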

Posted by huangapple on January 6, 2020, 22:25:02
Source: https://go.coder-hub.com/59613778.html