0x0 DataFrame when using try and catch in Julia


Question

I am trying to use try and catch to skip unavailable URLs when scraping data, but I end up with a 0x0 DataFrame. My code is as follows:

using HTTP
using Gumbo
using Cascadia
using DataFrames
using DelimitedFiles
using CSV

links = ["0100111338", "0100105077", "0100110528", "0107464283", "0105342089"]
base_url = "https://infodoanhnghiep.com/tim-kiem/ma-so-thue/"

urls = []
for link in links
    push!(urls, base_url * link * "/")
end

print(urls)

information = []

for url in urls

    r = HTTP.get(url)

    # Parse HTML
    h = parsehtml(String(r.body))

    try
        companies = eachmatch(sel".company-item", h.root)
        name = string(companies[1][2][1][1])
        tax = string(companies[1][4][1])
        address = string(companies[1][5][1])
        push!(df, (name = name, tax = tax, address = address))
    catch
        push!(df, (name = "", tax = "", address = ""))
    end
end

# Create a DataFrame from the collected information
df = DataFrame(information)

# Export the DataFrame to a CSV file with UTF-8 encoding
output_file = "information.csv"
CSV.write(output_file, df, writeheader=true, delim=',', quotechar='"', utf8=true)

However, if I do not use try and catch, I get a 5x3 DataFrame, as follows:

information = []
for url in urls
    r = HTTP.get(url)

    # Parse HTML
    h = parsehtml(String(r.body))

    companies = eachmatch(sel".company-item", h.root)
    name = string(companies[1][2][1][1])
    tax = string(companies[1][4][1])
    address = string(companies[1][5][1])

    push!(information, (name = name, tax = tax, address = address))
end

# Create a DataFrame from the collected information
df = DataFrame(information)

# Export the DataFrame to a CSV file with UTF-8 encoding
output_file = "information.csv"
CSV.write(output_file, df, writeheader=true, delim=',', quotechar='"', utf8=true)

I would appreciate any suggestions on how to ignore unavailable URLs when scraping the website. Thanks, all!


Answer 1

Score: 1

In your loop you push the data into df, which has not been defined yet at that point. You should push into information instead, i.e.:

try
    companies = eachmatch(sel".company-item", h.root)
    name = string(companies[1][2][1][1])
    tax = string(companies[1][4][1])
    address = string(companies[1][5][1])
    push!(information, (name = name, tax = tax, address = address))
catch
    push!(information, (name = "", tax = "", address = ""))
end
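
Since the original goal is to skip unavailable URLs, it may also help to move the HTTP.get call inside the try block, so that a failed request (an unreachable host or an error status code) is caught instead of aborting the loop. Below is a minimal sketch along those lines, reusing the selectors from the question; recording blank fields on failure is an assumption carried over from the original catch branch:

using HTTP
using Gumbo
using Cascadia
using DataFrames
using CSV

information = []

for url in urls
    try
        r = HTTP.get(url)                     # throws on unreachable URLs or error status codes
        h = parsehtml(String(r.body))         # parse the HTML body
        companies = eachmatch(sel".company-item", h.root)
        name    = string(companies[1][2][1][1])
        tax     = string(companies[1][4][1])
        address = string(companies[1][5][1])
        push!(information, (name = name, tax = tax, address = address))
    catch
        # Request failed or the expected elements were missing: record blank fields
        push!(information, (name = "", tax = "", address = ""))
    end
end

# Build the DataFrame only after the loop, from the rows collected above
df = DataFrame(information)
CSV.write("information.csv", df)

CSV.write writes UTF-8 by default, so the plain call above should be enough for this case.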

You should actually have gotten an error that df was undefined, but you were probably running in a session where df had already been defined earlier. When you hit errors like this, it is sometimes a good idea to close the session and rerun in a fresh process, to catch variables that are actually undefined but do not trigger errors because they are still lingering from a previous run.
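
Because a leftover df from an earlier run was still defined, the push! calls went into that stale DataFrame instead of information, so information stayed empty and DataFrame(information) came out 0x0. A couple of quick REPL checks (hypothetical, reusing the variable names above) can confirm that kind of leftover state before re-running the loop:

@isdefined(df)        # true means a stale df from an earlier run will silently absorb the push! calls
isempty(information)  # true explains the 0x0 result: there are no rows to build columns from

Running the script in a fresh process (for example with julia script.jl) guarantees a clean namespace, which is the restart suggested above.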

