英文:
Find rows in Julia dataframe with substrings in several columns
问题
I'm new to Julia (and StackOverflow).
I want to extract rows, originally from an Excel file, that contain a substring in more than one column.
I found the following answer which solves the first of three aspects of my issue: https://stackoverflow.com/questions/66376876/finding-rows-in-a-julia-dataframe-where-substrings-are-not-found
Example extended with more columns:
using DataFrames
df1 = DataFrame(col1=[1,62,13,24], col2=["ab","bc","cd","de"], col3=["de","bc","cd","de"], col4=["de","bc","cd","de"])
What I additionally like to do is:
1.) How do I specify a (consecutive) range of columns to look through ("second column":"last used column")?
I can list the columns explicitly, but I would like to mention them relatively. I can't rely on the names to be the same with the new version of the Excel file.
df2 = filter(r -> any(occursin.(["e"], [r.col3, r.col4])), df1)
2.) Select only rows that contain more than one occurrence of the substring.
That's what I tried:
df2 = filter(r -> (sum(occursin.(["e"], r.col2)) > 1), df1)
which returns only an empty DataFrame.
empty DataFrame
Is a DataFrame the "Julian" way to represent my Excel file at all? In my mind, coming from the "old" programming language world, using column names contained in the Excel file as designators in the script source code feels odd.
英文:
I'm new to Julia (and StackOverflow).
I want to extract rows, originally from an Excel file, that contain a substring in more than one column.
I found the following answer which solves the first of three aspects of my issue: https://stackoverflow.com/questions/66376876/finding-rows-in-a-julia-dataframe-where-substrings-are-not-found
Example extended with more columns:
using DataFrames
df1 = DataFrame(col1=[1,62,13,24], col2=["ab","bc","cd","de"], col3=["de","bc","cd","de"], col4=["de","bc","cd","de"])
What I additionally like to do is:
1.) How do I specify a (consecutive) range of columns to look through ("second column":"last used column")?
I can list the columns explicitly, but I would like to mention them relatively. I can't rely on the names to be the same with new version of the Excel file.
df2 = filter(r -> any(occursin.(["e"], [r.col3,r.col4])), df1)
2.) Select only rows that contain more than one occurance of the substring.
That's what I tried:
df2 = filter(r -> (sum(occursin.(["e"], r.col2)) > 1), df1)
which returns only an empty DataFrame.
empty DataFrame(https://i.stack.imgur.com/MSxBp.png)
Is a DataFrame the "Julian" way to represent my Excel file at all? In my mind coming from the "old" programming language world using column names contained in the Excel file as designators in the script source code feels odd.
答案1
得分: 1
这是你可能想要的内容:
julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col4)]) > 1, df1)
2×4 DataFrame
Row │ col1 col2 col3 col4
│ Int64 String String String
─────┼───────────────────────────────
1 │ 1 ab de de
2 │ 24 de de de
julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col3)]) > 1, df1)
1×4 DataFrame
Row │ col1 col2 col3 col4
│ Int64 String String String
─────┼───────────────────────────────
1 │ 24 de de de
解释:
r[Between(:col2, :col4)]
保留了位于列:col2
和:col4
之间的所有列(包括它们自己);在第二个示例中,我仅包括了列直到:col3
,这会删除一行,符合预期。contains("e")
返回一个函数,接受一个字符串,并返回true
,如果该字符串包含 "e",例如contains("e")("abc")
为false
,contains("e")("abec")
为true
。count
计算条件函数(通常称为谓词)作为第一个参数返回true
的次数。
英文:
You most likely want this:
julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col4)]) > 1, df1)
2×4 DataFrame
Row │ col1 col2 col3 col4
│ Int64 String String String
─────┼───────────────────────────────
1 │ 1 ab de de
2 │ 24 de de de
julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col3)]) > 1, df1)
1×4 DataFrame
Row │ col1 col2 col3 col4
│ Int64 String String String
─────┼───────────────────────────────
1 │ 24 de de de
Explanation:
r[Between(:col2, :col4)]
keeps all columns between columns:col2
and:col4
(including them); in the second example I included columns only up to:col3
which drops one row as expectedcontains("e")
returns a function taking a string and returningtrue
if this string contains"e"
, examplecontains("e")("abc")
isfalse
andcontains("e")("abec")
istrue
count
counts number of time the condition function (typically called predicate) passed as a first argument returnstrue
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论