英文:
Julia: Return Minimum Date in DataFrame
问题
The question is fairly simple. How do I return the minimum purchase date for each customer using Tidier?
#params
f = "path"
df = CSV.File(f) |> DataFrame
df = @chain df begin
@select(SHOPIFY_ORDER_ID, CUSTOMER_ID, SHIPMONTH, GROSS_REVENUE, Country)
@rename(order_id = SHOPIFY_ORDER_ID,
customer_id = CUSTOMER_ID,
date = SHIPMONTH,
revenue = GROSS_REVENUE,
country = Country)
@filter(country != "CA")
@filter(!ismissing(date))
@filter(revenue != 0.0)
end
# logic to calculate summary stats
df_sum = @chain df begin
@group_by(customer_id)
@mutate(
cohort = min(date)
)
end
min(df[!, :date])
For df_sum I receive the following error:
ERROR: ArgumentError: argument is not a permutation Stacktrace: [1]
invperm(a::Vector{Int64}) @ Base .\combinatorics.jl:282 [2]
groupby(df::DataFrame, cols::Cols{Tuple{Symbol}}; sort::Bool, skipmissing::Bool) @ DataFrames
C:path\\.julia\packages\DataFrames\LteEl\src\groupeddataframe\groupeddataframe.jl:264
[3] top-level scope @ path.jl:453
When attempting to identify the min date in the data.frame I receive the error:
ERROR: MethodError: no method matching min(::Vector{Union{Missing, Dates.Date}})
Closest candidates are:
min(::Any, ::Missing) @ Base missing.jl:134
min(::Any, ::Any) @ Base operators.jl:481
min(::Any, ::Any, ::Any, ::Any...) @ Base operators.jl:578 ...
Stacktrace: [1] top-level scope @ c:\path\script.jl:28
Which indicates to me that min doesn't work where there is a Missing data type, but I'm not sure how to solve from there.
英文:
The question is fairly simple. How do I return the minimum purchase date for each customer using Tidier?
using Tidier, DataFrames, Plots, CSV
#params
f = "path"
df = CSV.File(f) |> DataFrame
df = @chain df begin
@select(SHOPIFY_ORDER_ID, CUSTOMER_ID, SHIPMONTH, GROSS_REVENUE, Country)
@rename(order_id = SHOPIFY_ORDER_ID,
customer_id = CUSTOMER_ID,
date = SHIPMONTH,
revenue = GROSS_REVENUE,
country = Country)
@filter(country != "CA")
@filter(!ismissing(date))
@filter(revenue != 0.0)
end
# logic to calculate summary stats
df_sum = @chain df begin
@group_by(customer_id)
@mutate(
cohort = min(date)
)
end
min(df[!, :date])
for df_sum I receive the following error:
> ERROR: ArgumentError: argument is not a permutation Stacktrace: [1]
> invperm(a::Vector{Int64}) @ Base .\combinatorics.jl:282 [2]
> groupby(df::DataFrame, cols::Cols{Tuple{Symbol}}; sort::Bool,
> skipmissing::Bool) @ DataFrames
> C:path\.julia\packages\DataFrames\LteEl\src\groupeddataframe\groupeddataframe.jl:264
> [3] top-level scope @
> path.jl:453
When attemtping to identify the min date in the data.frame I receive the error:
> ERROR: MethodError: no method matching min(::Vector{Union{Missing,
> Dates.Date}})
>
> Closest candidates are: min(::Any, ::Missing) @ Base
> missing.jl:134 min(::Any, ::Any) @ Base operators.jl:481
> min(::Any, ::Any, ::Any, ::Any...) @ Base operators.jl:578 ...
>
> Stacktrace: [1] top-level scope @
> c:\path\script.jl:28
Which indicates to me that min doesn't work where there is a Missing data type, but I'm not sure how to solve from there.
答案1
得分: 1
你可能需要使用 minimum
而不是 min
。我看不到你的数据。如果你有缺失值,那么 minimum
应该仍然可以正常工作,但如果你想要 maximum
,你需要先使用 skipmissing
。
英文:
You probably need to use minimum
instead of min
. I do not see your data. If you have missing values then minimum
should still just work, but if you wanted maximum
you would need to skipmissing
first.
答案2
得分: 1
Elaborating on Bogumił Kamiński's answer, you could try the following code:
df_sum = @chain df begin
@group_by(customer_id)
@mutate(
minimum_date = minimum(skipmissing(date))
)
end
The other thing to consider is whether you want to add a column to your existing dataset, or whether you simply want to return the minimum date only for each customer.
Here are two alternative approaches:
The first one will return only the customer_id and the minimum date for each customer.
df_sum = @chain df begin
@group_by(customer_id)
@summarize(
minimum_date = minimum(skipmissing(date))
)
end
In case you want to return the whole row, here's the second approach:
df_sum = @chain df begin
@group_by(customer_id)
@filter(
date == minimum(skipmissing(date))
)
@ungroup
end
Without having access to the original dataset, it's hard to confirm if these will work for you. If these don't work, please let us know!
英文:
Elaborating on Bogumił Kamiński’s answer, you could try the following code:
df_sum = @chain df begin
@group_by(customer_id)
@mutate(
minimum_date = minimum(skipmissing(date))
)
end
The other thing to consider is whether you want to add a column to your existing dataset, or whether you simply want to return the minimum date only for each customer.
Here are two alternative approaches:
The first one will return only the customer_id and the minimum date for each customer.
df_sum = @chain df begin
@group_by(customer_id)
@summarize(
minimum_date = minimum(skipmissing(date))
)
end
In case you want to return the whole row, here’s the second approach:
df_sum = @chain df begin
@group_by(customer_id)
@filter(
date == minimum(skipmissing(date))
)
@ungroup
end
Without having access to the original dataset, it’s hard to confirm if these will work for you. If these don’t work, please let us know!
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论