0x0 DataFrame when using try and catch in Julia

Question

I am trying to use try and catch to skip unavailable URLs when scraping data, but I get a 0x0 DataFrame. My code is as follows:

using HTTP
using Gumbo
using Cascadia
using DataFrames
using DelimitedFiles
using CSV

links = ["0100111338", "0100105077", "0100110528", "0107464283", "0105342089"]
base_url = "https://infodoanhnghiep.com/tim-kiem/ma-so-thue/"

urls = []
for link in links
    push!(urls, base_url * link * "/")
end

print(urls)

information = []

for url in urls

    r = HTTP.get(url)

    # Parse HTML
    h = parsehtml(String(r.body))

    try
        companies = eachmatch(sel".company-item", h.root)
        name = string(companies[1][2][1][1])
        tax = string(companies[1][4][1])
        address = string(companies[1][5][1])
        push!(df, (name = name, tax = tax, address = address))
    catch
        push!(df, (name = "", tax = "", address = ""))
    end
end

# Create a DataFrame from the collected information
df = DataFrame(information)

# Export the DataFrame to a CSV file with UTF-8 encoding
output_file = "information.csv"
CSV.write(output_file, df, writeheader=true, delim=',', quotechar='"', utf8=true)

However, if I do not use try and catch, I get a 5x3 DataFrame, as follows:

information = []
for url in urls
    r = HTTP.get(url)

    # Parse HTML
    h = parsehtml(String(r.body))

    companies = eachmatch(sel".company-item", h.root)
    name = string(companies[1][2][1][1])
    tax = string(companies[1][4][1])
    address = string(companies[1][5][1])

    push!(information, (name = name, tax = tax, address = address))
end

# Create a DataFrame from the collected information
df = DataFrame(information)

# Export the DataFrame to a CSV file with UTF-8 encoding
output_file = "information.csv"
CSV.write(output_file, df, writeheader=true, delim=',', quotechar='"', utf8=true)

I would appreciate any suggestions on how to ignore unavailable URLs when scraping the website. Thanks, all!

Answer 1

Score: 1

In your loop you push to df, which has not been defined yet at that point. You should push to information instead, i.e.:

try
    companies = eachmatch(sel".company-item", h.root)
    name = string(companies[1][2][1][1])
    tax = string(companies[1][4][1])
    address = string(companies[1][5][1])
    push!(information, (name = name, tax = tax, address = address))
catch
    push!(information, (name = "", tax = "", address = ""))
end

You actually should have gotten an error that df was undefined (an UndefVarError), but you were probably running a session in which df had already been defined earlier in your workspace. When you run into errors like this, it is often a good idea to close the session and rerun the code in a fresh process, so that variables which only appear to be defined because of leftover workspace state are caught.
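
For the original goal of skipping unavailable URLs, note that HTTP.get sits outside the try block in the question's code, so a failing request still aborts the loop before the catch can run. A minimal sketch of the full corrected loop, with the request moved inside try, could look like the following; the selector and element indices are copied unchanged from the question, and it is an assumption that they match every result page:

using HTTP, Gumbo, Cascadia, DataFrames, CSV

# URL list built as in the question.
links = ["0100111338", "0100105077", "0100110528", "0107464283", "0105342089"]
base_url = "https://infodoanhnghiep.com/tim-kiem/ma-so-thue/"
urls = [base_url * link * "/" for link in links]

information = []

for url in urls
    try
        # A request to an unavailable URL throws here and is handled below.
        r = HTTP.get(url)

        # Parse the HTML body and extract the same fields as in the question.
        h = parsehtml(String(r.body))
        companies = eachmatch(sel".company-item", h.root)
        name = string(companies[1][2][1][1])
        tax = string(companies[1][4][1])
        address = string(companies[1][5][1])
        push!(information, (name = name, tax = tax, address = address))
    catch e
        # Keep one (empty) row per failed URL and log which one was skipped.
        @warn "Skipping $url" exception = e
        push!(information, (name = "", tax = "", address = ""))
    end
end

# Build the DataFrame only after the loop has filled information.
df = DataFrame(information)
CSV.write("information.csv", df)

Julia strings are already UTF-8, so CSV.write produces a UTF-8 file with its default settings and no extra encoding keyword should be needed.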
