Loop a dataframe and check if there is the same name as another column.

Question

I'm one step away from finishing cleaning my dataframe, but I ran into a problem when cleaning its "name" column. The dataframe also has a "company" column.

For each row, I want to check whether the *company value* matches the *company name* contained in the "name" column. If TRUE: keep the row. If FALSE: delete the row.
This looks simple when the company name is the first word of the name, but what if some names put the company name somewhere else in the "name" column (not at the first index)?

I've tried the code below, but it doesn't seem to work:

for x in file["company"]:  # loop over company values
    for i in file["name"]:  # loop over names
        i = i.title().split(" ")  # capitalize each word and split the name
        for j in i:
            if j == x:  # if the "company name" is equal to the "company value"
                pass  # ignore
            else:
                file.drop("""THAT ROW""")  # remove that row

Answer 1

Score: 1

What you want is to drop the rows whose name value does not contain that row's company value.

Let's start with a toy dataframe:

import pandas as pd

data = [
    {
        'name': 'Hyundai something',
        'company': 'Hyundai',
    },
    {
        'name': 'Tesla Model Y',
        'company': 'Tesla',
    },
    {
        'name': 'something Hyundai',
        'company': 'Hyundai',
    },
    {
        'name': 'something',
        'company': 'Hyundai',
    },
    {
        'name': 'XYZ Ford Car',
        'company': 'Ford',
    },
]
df = pd.DataFrame(data)

Next, we use apply with axis=1 to iterate over the rows and return True if the company value is a substring of the name value. Note that I added .lower() to ignore case. You can adjust this to fit what you need.

contained = df.apply(
    lambda row: row['company'].lower() in row['name'].lower(),
    axis=1
)
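One adjustment that may be needed: if either column can hold missing values (None/NaN), calling .lower() on them raises an AttributeError. A minimal sketch of a guard, assuming such rows should count as "not contained"; the dataframe `df_missing` and the helper `company_in_name` are hypothetical names used only for illustration:

```python
import pandas as pd

# Hypothetical data with missing values in both columns
df_missing = pd.DataFrame({
    'name': ['Hyundai something', None, 'XYZ Ford Car'],
    'company': ['Hyundai', 'Tesla', None],
})

def company_in_name(row):
    # Treat rows with a missing name or company as "not contained"
    if pd.isna(row['name']) or pd.isna(row['company']):
        return False
    return row['company'].lower() in row['name'].lower()

contained_missing = df_missing.apply(company_in_name, axis=1)
print(contained_missing.tolist())  # [True, False, False]
```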

Finally, you can either filter your dataframe by condition, or you can drop the False indices.

df.drop(contained[~contained].index)  # returns a new dataframe; assign it to keep the result

The result looks like this:

                name  company
0  Hyundai something  Hyundai
1      Tesla Model Y    Tesla
2  something Hyundai  Hyundai
4       XYZ Ford Car     Ford
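
The other option mentioned above, filtering by condition, might look like the following minimal sketch. It rebuilds a smaller version of the toy dataframe so that it runs on its own; the variable name `filtered` is just an illustrative choice.

```python
import pandas as pd

# Smaller version of the toy dataframe from above
df = pd.DataFrame({
    'name': ['Hyundai something', 'Tesla Model Y', 'something', 'XYZ Ford Car'],
    'company': ['Hyundai', 'Tesla', 'Hyundai', 'Ford'],
})

# True where the company value is a case-insensitive substring of the name value
contained = df.apply(
    lambda row: row['company'].lower() in row['name'].lower(),
    axis=1,
)

# Boolean indexing keeps only the rows where the mask is True
filtered = df[contained]
print(filtered)
```

Like drop, boolean indexing returns a new dataframe and keeps the original index labels, so the row labelled 2 is simply absent from `filtered`.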

Answer 2

Score: 1

Use boolean indexing to select the rows where the lowercased company value is one of the words of the lowercased, split name value. A list comprehension is used here to improve performance:

df = file[[x.lower() in y.lower().split() for x, y in zip(file['company'], file['name'])]]
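
Note that this check matches whole words, whereas the `in` check on full strings in the first answer matches substrings. A minimal sketch of the difference, using made-up rows; the names `pairs`, `substring_match` and `word_match` are hypothetical:

```python
import pandas as pd

# Made-up rows chosen to show where the two checks disagree
file = pd.DataFrame({
    'name': ['Fordson tractor', 'Land Rover Defender', 'XYZ Ford Car'],
    'company': ['Ford', 'Land Rover', 'Ford'],
})

pairs = list(zip(file['company'], file['name']))

# Substring check: 'ford' is found inside 'fordson'
substring_match = [x.lower() in y.lower() for x, y in pairs]

# Whole-word check: compares against the list of words in the name,
# so a multi-word company such as 'land rover' never equals a single word
word_match = [x.lower() in y.lower().split() for x, y in pairs]

print(substring_match)  # [True, True, True]
print(word_match)       # [False, False, True]
```

Which behaviour is right depends on the data: whole-word matching avoids false positives such as "Fordson", while substring matching copes with company names that contain spaces.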

huangapple
  • Posted on June 1, 2023, 10:42:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76378354.html