0x0 DataFrame when using try and catch in Julia
Question
I am trying to use try and catch to skip unavailable URLs when scraping data, but I get a 0x0 DataFrame. My code is as follows:
using HTTP
using Gumbo
using Cascadia
using DataFrames
using DelimitedFiles
using CSV
links = ["0100111338", "0100105077", "0100110528", "0107464283", "0105342089"]
base_url = "https://infodoanhnghiep.com/tim-kiem/ma-so-thue/"
urls = []
for link in links
    push!(urls, base_url * link * "/")
end
print(urls)
information = []
for url in urls
    r = HTTP.get(url)
    # Parse HTML
    h = parsehtml(String(r.body))
    try
        companies = eachmatch(sel".company-item", h.root)
        name = string(companies[1][2][1][1])
        tax = string(companies[1][4][1])
        address = string(companies[1][5][1])
        push!(df, (name = name, tax = tax, address = address))
    catch
        push!(df, (name = "", tax = "", address = ""))
    end
end
# Create a DataFrame from the collected information
df = DataFrame(information)
# Export the DataFrame to a CSV file with UTF-8 encoding
output_file = "information.csv"
CSV.write(output_file, df, writeheader=true, delim=',', quotechar='"', utf8=true)
However, if I do not use try and catch, I get a 5x3 DataFrame, as follows:
information = []
for url in urls
    r = HTTP.get(url)
    # Parse HTML
    h = parsehtml(String(r.body))
    companies = eachmatch(sel".company-item", h.root)
    name = string(companies[1][2][1][1])
    tax = string(companies[1][4][1])
    address = string(companies[1][5][1])
    push!(information, (name = name, tax = tax, address = address))
end
# Create a DataFrame from the collected information
df = DataFrame(information)
# Export the DataFrame to a CSV file with UTF-8 encoding
output_file = "information.csv"
CSV.write(output_file, df, writeheader=true, delim=',', quotechar='"', utf8=true)
I would appreciate any suggestions for ignoring unavailable URLs when scraping a website. Thanks, all!
Answer 1
Score: 1
In your loop you push to df, which is not yet defined at that point. You should push to information instead, i.e.:
try
    companies = eachmatch(sel".company-item", h.root)
    name = string(companies[1][2][1][1])
    tax = string(companies[1][4][1])
    address = string(companies[1][5][1])
    push!(information, (name = name, tax = tax, address = address))
catch
    push!(information, (name = "", tax = "", address = ""))
end
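As a side note, in the question's loop HTTP.get is called before the try block, so an exception thrown by the request itself would not be caught (HTTP.jl throws on connection failures and, by default, on error status codes). Below is a minimal sketch of moving the whole fetch-and-parse step inside the try. The names collect_rows and fake_fetch are hypothetical, and the stub stands in for the real HTTP.get and parsing code so that the example runs without network access.

```julia
# Hypothetical helper: `fetch_row` stands in for HTTP.get plus the parsing step
# and is expected to throw when a URL is unavailable.
function collect_rows(fetch_row, urls)
    # A typed vector keeps every row in the same (name, tax, address) shape.
    information = NamedTuple{(:name, :tax, :address), NTuple{3, String}}[]
    for url in urls
        try
            # Both the request and the parsing happen inside the try block.
            push!(information, fetch_row(url))
        catch
            @warn "Skipping unavailable URL" url
            push!(information, (name = "", tax = "", address = ""))
        end
    end
    return information
end

# Stub used only for illustration: it fails for URLs ending in "bad/".
fake_fetch(url) = endswith(url, "bad/") ? error("unavailable") :
    (name = "Company", tax = "0100111338", address = "Hanoi")

rows = collect_rows(fake_fetch, ["https://example.com/ok/", "https://example.com/bad/"])
```

With the real scraper, fetch_row would wrap the HTTP.get, parsehtml and eachmatch calls from the question, and DataFrame(rows) would then build the table as before.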
You should actually have gotten an error saying that df was undefined (an UndefVarError), but you were probably running in a session in which df had already been defined. In that case the rows went into the old df while information stayed empty, which would explain the 0x0 result from DataFrame(information). When you hit this sort of error, it is often a good idea to close the session and rerun the script in a fresh process, so that variables that are actually undefined raise an error instead of silently reusing a value left over from earlier in the session.
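The effect can be reproduced in a fresh session with a small sketch. Here undefined_table is a hypothetical name that is deliberately never defined, playing the role of df in a clean process. The try block fails with an UndefVarError and nothing reaches information.

```julia
information = []

# `undefined_table` is intentionally never defined, like `df` in a fresh session.
caught = try
    push!(undefined_table, (name = "x", tax = "y", address = "z"))
    nothing
catch err
    err  # keep the exception so it can be inspected instead of being swallowed
end

println(typeof(caught))        # the error raised inside the try block
println(length(information))   # nothing was collected
```

Binding the exception with catch err and logging it, rather than using a bare catch, makes this kind of mistake visible while still letting the loop continue.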