Parsing a CSV file from a button press using Python
Question
I have the following URL, https://pubmed.ncbi.nlm.nih.gov/?term=IBD, which I want to parse data from (I found nothing against scraping in their terms of use). The site is public, and there is an 'Export' button that downloads some historic data as CSV; I want to automate downloading the file content with Python.
I tried many options over the past day; this is my most recent attempt:

import requests
from bs4 import BeautifulSoup

def parse_history():
    url = "https://pubmed.ncbi.nlm.nih.gov/?term=IBD"
    web_page = requests.get(url)
    soup = BeautifulSoup(web_page.content, "html.parser")

    # Find the export form and the URL it posts to.
    form = soup.find('form', id='side-export-search-by-year-form')
    download_url = form.get('action')

    # Collect the form's input fields as POST data.
    form_data = {}
    for input_field in form.find_all('input'):
        form_data[input_field.get('name')] = input_field.get('value')

    csrf_token = get_csrf_token(url)  # helper defined elsewhere
    form_data["csrfmiddlewaretoken"] = csrf_token
    response = requests.post(f"{url}{download_url}", data=form_data)

    # Save the downloaded file
    with open('history.csv', 'wb') as f:
        f.write(response.content)

This returns a 403 error with an "invalid security token" message in the HTML body.
Any ideas? I would prefer not to use Selenium.
Answer 1
Score: 1
import requests
from bs4 import BeautifulSoup

def main():
    base_url = "https://pubmed.ncbi.nlm.nih.gov"
    url = f"{base_url}/?term=IBD"

    with requests.Session() as s:
        web_page = s.get(url, timeout=5)
        soup = BeautifulSoup(web_page.content, "html.parser")

        # The export form carries hidden inputs, including the CSRF token.
        form = soup.find("form", id="side-export-search-by-year-form")
        action_url = form.get("action")
        download_url = f"{base_url}{action_url}"

        form_data = {}
        for input_field in form.find_all("input"):
            form_data[input_field.get("name")] = input_field.get("value")
        form_data["term"] = "IBD"

        # Reuse the cookies set by the first response.
        cookies = web_page.cookies
        cookies.update({"pm-iosp": ""})

        # Browser-like headers; origin and referer match the site itself.
        headers = {
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.63",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8",
            "origin": "https://pubmed.ncbi.nlm.nih.gov",
            "referer": "https://pubmed.ncbi.nlm.nih.gov/?term=IBD",
            "sec-ch-ua": "'Chromium';v='110', 'Not A(Brand';v='24', 'Microsoft Edge';v='110'",
            "sec-fetch-dest": "document",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "same-origin",
            "sec-fetch-user": "?1",
            "sec-gpc": "1",
            "upgrade-insecure-requests": "1",
            "dnt": "1",
            "Cache-Control": "max-age=0",
        }

        response = s.post(url=download_url, data=form_data, cookies=cookies, headers=headers, timeout=5)
        with open("history.csv", "wb") as f:
            f.write(response.content)

if __name__ == '__main__':
    main()
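
This presumably works because the CSRF token is sent together with the cookies from the same initial response, and the origin/referer headers match the site. As a quick sanity check after running it, the saved file can be loaded with the standard csv module; the header row mentioned in the comment is illustrative, not confirmed from the actual export:

import csv

with open("history.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # e.g. a year/count header row (illustrative)
    rows = list(reader)

print(header)
print(f"{len(rows)} data rows")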
Answer 2
Score: 0
import subprocess
import re

BASE_NCBI_URL = "https://pubmed.ncbi.nlm.nih.gov"
# The page embeds the per-year counts as a JavaScript value: yearCounts: "[...]"
YEAR_COUNT_DATA_REGEX = r'yearCounts:\s*"\[(.*?)\]"'

def get_term_history_amount_stat(term: str):
    url = f"{BASE_NCBI_URL}?term={term}"
    output = subprocess.check_output(['curl', url])
    year_counts_match = re.search(YEAR_COUNT_DATA_REGEX, output.decode(), re.DOTALL)
    if year_counts_match:
        return eval(year_counts_match.group(1))
    raise Exception(f"Can't parse historical data for term {term}")
I implemented it with subprocess and curl; for some reason curl does not run into any permission errors. It is less convenient to parse this way, but it works.
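
One caveat worth noting: eval on text scraped from a page will execute whatever that text contains. Since the eval call above evidently succeeds, the captured text must already be literal data, so ast.literal_eval is a safer drop-in; it accepts only plain Python literals and raises on anything else. A minimal sketch:

import ast

def parse_year_counts(captured: str):
    # literal_eval accepts only plain literals (numbers, strings, lists,
    # tuples, dicts), so code embedded in the fetched page cannot run.
    return ast.literal_eval(captured)

This would replace eval(year_counts_match.group(1)) with parse_year_counts(year_counts_match.group(1)).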