在Julia数据框中查找多列中包含子字符串的行。

huangapple go评论50阅读模式
英文:

Find rows in Julia dataframe with substrings in several columns

问题

I'm new to Julia (and StackOverflow).

I want to extract rows, originally from an Excel file, that contain a substring in more than one column.

I found the following answer which solves the first of three aspects of my issue: https://stackoverflow.com/questions/66376876/finding-rows-in-a-julia-dataframe-where-substrings-are-not-found

Example extended with more columns:

using DataFrames
df1 = DataFrame(col1=[1,62,13,24], col2=["ab","bc","cd","de"], col3=["de","bc","cd","de"], col4=["de","bc","cd","de"])

What I additionally like to do is:
1.) How do I specify a (consecutive) range of columns to look through ("second column":"last used column")?
I can list the columns explicitly, but I would like to mention them relatively. I can't rely on the names to be the same with the new version of the Excel file.

df2 = filter(r -> any(occursin.(["e"], [r.col3, r.col4])), df1)

2.) Select only rows that contain more than one occurrence of the substring.
That's what I tried:

df2 = filter(r -> (sum(occursin.(["e"], r.col2)) > 1), df1)

which returns only an empty DataFrame.
empty DataFrame

Is a DataFrame the "Julian" way to represent my Excel file at all? In my mind, coming from the "old" programming language world, using column names contained in the Excel file as designators in the script source code feels odd.

英文:

I'm new to Julia (and StackOverflow).
I want to extract rows, originally from an Excel file, that contain a substring in more than one column.

I found the following answer which solves the first of three aspects of my issue: https://stackoverflow.com/questions/66376876/finding-rows-in-a-julia-dataframe-where-substrings-are-not-found

Example extended with more columns:

using DataFrames
df1 = DataFrame(col1=[1,62,13,24], col2=["ab","bc","cd","de"], col3=["de","bc","cd","de"], col4=["de","bc","cd","de"])

What I additionally like to do is:
1.) How do I specify a (consecutive) range of columns to look through ("second column":"last used column")?
I can list the columns explicitly, but I would like to mention them relatively. I can't rely on the names to be the same with new version of the Excel file.

df2 = filter(r -> any(occursin.(["e"], [r.col3,r.col4])), df1)

2.) Select only rows that contain more than one occurance of the substring.
That's what I tried:

df2 = filter(r -> (sum(occursin.(["e"], r.col2)) > 1), df1)

which returns only an empty DataFrame.
empty DataFrame(https://i.stack.imgur.com/MSxBp.png)

Is a DataFrame the "Julian" way to represent my Excel file at all? In my mind coming from the "old" programming language world using column names contained in the Excel file as designators in the script source code feels odd.

答案1

得分: 1

这是你可能想要的内容:

julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col4)]) > 1, df1)
2×4 DataFrame
 Row │ col1   col2    col3    col4
Int64  String  String  String
─────┼───────────────────────────────
   11  ab      de      de
   224  de      de      de

julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col3)]) > 1, df1)
1×4 DataFrame
 Row │ col1   col2    col3    col4
Int64  String  String  String
─────┼───────────────────────────────
   124  de      de      de

解释:

  • r[Between(:col2, :col4)] 保留了位于列 :col2:col4 之间的所有列(包括它们自己);在第二个示例中,我仅包括了列直到 :col3,这会删除一行,符合预期。
  • contains("e") 返回一个函数,接受一个字符串,并返回 true,如果该字符串包含 "e",例如 contains("e")("abc")falsecontains("e")("abec")true
  • count 计算条件函数(通常称为谓词)作为第一个参数返回 true 的次数。
英文:

You most likely want this:

julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col4)]) > 1, df1)
2×4 DataFrame
 Row │ col1   col2    col3    col4
     │ Int64  String  String  String
─────┼───────────────────────────────
   1 │     1  ab      de      de
   2 │    24  de      de      de

julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col3)]) > 1, df1)
1×4 DataFrame
 Row │ col1   col2    col3    col4
     │ Int64  String  String  String
─────┼───────────────────────────────
   1 │    24  de      de      de

Explanation:

  • r[Between(:col2, :col4)] keeps all columns between columns :col2 and :col4 (including them); in the second example I included columns only up to :col3 which drops one row as expected
  • contains("e") returns a function taking a string and returning true if this string contains "e", example contains("e")("abc") is false and contains("e")("abec") is true
  • count counts number of time the condition function (typically called predicate) passed as a first argument returns true

huangapple
  • 本文由 发表于 2023年4月4日 17:12:07
  • 转载请务必保留本文链接:https://go.coder-hub.com/75927544.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定