Parsing csv file from a button press using python

Question


I have the following URL, https://pubmed.ncbi.nlm.nih.gov/?term=IBD, which I want to parse data from.

(I found nothing against scraping in their terms.) The site is public, and there is an 'Export' button that downloads some historic data as CSV; I want to automate downloading the file content with Python.

I have tried many options over the past day; this is my most recent attempt:

    import requests
    from bs4 import BeautifulSoup

    def parse_history():
        url = "https://pubmed.ncbi.nlm.nih.gov/?term=IBD"
        web_page = requests.get(url)
        soup = BeautifulSoup(web_page.content, "html.parser")
        form = soup.find('form', id='side-export-search-by-year-form')
        download_url = form.get('action')
        form_data = {}
        for input_field in form.find_all('input'):
            form_data[input_field.get('name')] = input_field.get('value')
        csrf_token = get_csrf_token(url)  # my helper that extracts the CSRF token (omitted)
        form_data["csrfmiddlewaretoken"] = csrf_token
        response = requests.post(f"{url}{download_url}", data=form_data)
        # Save the downloaded file
        with open('history.csv', 'wb') as f:
            f.write(response.content)

and I get a 403 error with an "invalid security token" message in the HTML body.

Any ideas? I'd prefer not to use Selenium.
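For reference, when a form POST like this fails with a 403, it helps to dump exactly which hidden fields the form declares. A stdlib-only sketch on hard-coded markup (the field names and values here are illustrative, not the real PubMed form):

```python
from html.parser import HTMLParser

# Hard-coded stand-in for the export form's markup; the real
# PubMed form will contain different hidden fields and values.
FORM_HTML = """
<form id="side-export-search-by-year-form" action="/term-chart-export/">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123">
  <input type="hidden" name="term" value="IBD">
</form>
"""

class InputCollector(HTMLParser):
    """Collects name/value pairs from <input> tags."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr_map = dict(attrs)
            self.fields[attr_map.get("name")] = attr_map.get("value")

parser = InputCollector()
parser.feed(FORM_HTML)
print(parser.fields)  # shows every field the browser would submit
```

Comparing this dump against the payload your script actually sends often reveals a missing or stale field.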

Answer 1 (score: 1)

    import requests
    from bs4 import BeautifulSoup

    def main():
        base_url = "https://pubmed.ncbi.nlm.nih.gov"
        url = f"{base_url}/?term=IBD"
        with requests.Session() as s:
            # Use the session (not bare requests.get) so cookies persist
            # between the initial GET and the export POST.
            web_page = s.get(url, timeout=5)
            soup = BeautifulSoup(web_page.content, "html.parser")
            form = soup.find("form", id="side-export-search-by-year-form")
            action_url = form.get("action")
            download_url = f"{base_url}{action_url}"
            form_data = {}
            for input_field in form.find_all("input"):
                form_data[input_field.get("name")] = input_field.get("value")
            form_data["term"] = "IBD"
            cookies = web_page.cookies
            cookies.update({"pm-iosp": ""})
            headers = {
                "Content-Type": "application/x-www-form-urlencoded",
                "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.63",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8",
                "origin": "https://pubmed.ncbi.nlm.nih.gov",
                "referer": "https://pubmed.ncbi.nlm.nih.gov/?term=IBD",
                "sec-ch-ua": "'Chromium';v='110', 'Not A(Brand';v='24', 'Microsoft Edge';v='110'",
                "sec-fetch-dest": "document",
                "sec-fetch-mode": "navigate",
                "sec-fetch-site": "same-origin",
                "sec-fetch-user": "?1",
                "sec-gpc": "1",
                "upgrade-insecure-requests": "1",
                "dnt": "1",
                "Cache-Control": "max-age=0",
            }
            response = s.post(url=download_url, data=form_data, cookies=cookies,
                              headers=headers, timeout=5)
            with open("history.csv", "wb") as f:
                f.write(response.content)

    if __name__ == '__main__':
        main()
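As a quick sanity check on the downloaded file, you could parse it with the standard csv module. A minimal sketch on sample data (the column names here are illustrative; inspect the actual header row of the export first):

```python
import csv
import io

# Stand-in for the downloaded history.csv; the real export's column
# names may differ, so check the file's header row before relying on them.
sample = "Year,Count\n2021,4321\n2022,4567\n2023,4890\n"

rows = list(csv.DictReader(io.StringIO(sample)))
counts = {row["Year"]: int(row["Count"]) for row in rows}
print(counts)
```

If the POST silently returned an HTML error page instead of CSV, `DictReader` will produce nonsense column names, which makes this a cheap way to detect a failed export.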

Answer 2 (score: 0)

    import subprocess
    import re

    BASE_NCBI_URL = "https://pubmed.ncbi.nlm.nih.gov"
    YEAR_COUNT_DATA_REGEX = r'yearCounts:\s*"\[(.*?)\]"'

    def get_term_history_amount_stat(term: str):
        url = f"{BASE_NCBI_URL}?term={term}"
        output = subprocess.check_output(['curl', url])
        year_counts_match = re.search(YEAR_COUNT_DATA_REGEX, output.decode(), re.DOTALL)
        if year_counts_match:
            return eval(year_counts_match.group(1))
        else:
            raise Exception(f"Can't parse historical data for term {term}")

I implemented it with subprocess and curl; for some reason curl doesn't hit any permission errors. It's less convenient to parse this way, but it works.
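Since the captured `yearCounts` payload is a Python-literal-like list of year/count pairs, `ast.literal_eval` is a safer drop-in for `eval` here, as it refuses to execute arbitrary expressions. A minimal sketch on a hard-coded sample (the real page markup may quote or escape the values differently):

```python
import ast
import re

# Hard-coded stand-in for the page source fetched with curl; the real
# markup may quote the years or HTML-escape the inner quotes.
page = 'var stats = { yearCounts: "[[2021, 10], [2022, 12]]" };'

match = re.search(r'yearCounts:\s*"\[(.*?)\]"', page, re.DOTALL)
# Re-wrap the captured inner text in brackets and parse it with
# ast.literal_eval instead of calling eval() on untrusted page content.
pairs = ast.literal_eval(f"[{match.group(1)}]")
print(pairs)
```

`literal_eval` only accepts literals (lists, tuples, numbers, strings), so a malicious or malformed page raises an exception rather than running code.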


huangapple
  • Published on 2023-03-09 18:41:39
  • Please retain this link when reposting: https://go.coder-hub.com/75683480.html