Unknown archive format! How can I extract URLs from a WARC file in Jupyter?

Question

I'm trying to extract website URLs from a .WARC (Web ARChive) file from the Common Crawl dataset ([commoncrawl.org](https://commoncrawl.org/2023/04/mar-apr-2023-crawl-archive-now-available/)).
After decompressing the file, I wrote the following code to read it:

import pandas as pd
from warcio.archiveiterator import ArchiveIterator

# Function to parse a WARC file and extract URLs
def extract_urls_from_warc(file_path):
    urls = []
    with open(file_path, 'rb') as file:
        for record in ArchiveIterator(file):
            if record.rec_type == 'response':
                # The target URL is stored in the WARC record headers,
                # not in the HTTP response payload
                url = record.rec_headers.get_header('WARC-Target-URI')
                urls.append(url)

    # Create a DataFrame with the extracted URLs
    df = pd.DataFrame(urls, columns=['URL'])
    return df

# Provide the path to the WARC file
warc_file_path = r"./commoncrawl.warc/commoncrawl.warc"

# Call the function to extract URLs from the WARC file and create a DataFrame
df = extract_urls_from_warc(warc_file_path)

# Display the DataFrame with URLs
print(df)

After running this code, I received this error message:

ArchiveLoadFailed: Unknown archive format, first line: ['crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/warc/CC-MAIN-20230320083513-20230320113513-00000.warc.gz']

I am using Python 3.10.9 in Jupyter.

I want to read and extract page URLs from the .WARC file using Jupyter.

Answer 1

Score: 1

The error message indicates that the input file is not a WARC file but a listing of WARC file locations. A single Common Crawl main dataset consists of tens of thousands of WARC files, and the listing references all of them. To process the WARC files:

  1. Select one or a few of the WARC files in the listing (processing all of them is not feasible on a laptop, a desktop computer, or in a Jupyter notebook).

  2. Prepend https://data.commoncrawl.org/ to every WARC file path to get the download URL(s); see the sketch below. For further details, please see https://commoncrawl.org/access-the-data/.
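
For illustration, here is a minimal sketch of both steps, assuming the `warcio` and `requests` packages are installed. The WARC path is the one from the error message above; the file is streamed rather than downloaded in full:

import pandas as pd
import requests
from warcio.archiveiterator import ArchiveIterator

# One WARC path picked from the listing (here: the path from the error message above)
warc_path = "crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/warc/CC-MAIN-20230320083513-20230320113513-00000.warc.gz"

# Step 2: prepend the data server to get the download URL
warc_url = "https://data.commoncrawl.org/" + warc_path

urls = []
# Stream the gzipped WARC file; warcio decompresses it on the fly
with requests.get(warc_url, stream=True) as response:
    response.raise_for_status()
    for record in ArchiveIterator(response.raw):
        if record.rec_type == 'response':
            urls.append(record.rec_headers.get_header('WARC-Target-URI'))

df = pd.DataFrame(urls, columns=['URL'])
print(df)

Streaming keeps memory use low, but note that a single WARC file is still roughly a gigabyte compressed, so even one file takes a while to process.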
