How to loop over a dataframe and check whether one column's value appears in another column
Question
I'm almost done cleaning my dataframe, but I ran into a problem while cleaning the "name" column of the dataframe below:
I want to check, for each row, whether the *company* value matches the *company name* contained in the "name" column. If TRUE: ignore it (keep the row). If FALSE: delete the row.
It may look simple, since the company name is usually the first word, but what if some names put the company name somewhere else in the "name" column (not at the first index)?
I've tried the code below, but it doesn't seem to work:
    for x in file["company"]:  # loop over company
        for i in file["name"]:  # loop over name
            i = i.title().split(" ")  # capitalize each word and split the name
            for j in i:
                if j == x:  # if the "company name" is equal to the "company value"
                    pass  # ignore
                else:
                    file.drop("""THAT ROW""")  # remove that row
Answer 1
Score: 1
What you want is to drop rows whose name value does not contain its company value.
Let's start with a toy dataframe:

    import pandas as pd

    data = [
        {
            'name': 'Hyundai something',
            'company': 'Hyundai',
        },
        {
            'name': 'Tesla Model Y',
            'company': 'Tesla',
        },
        {
            'name': 'something Hyundai',
            'company': 'Hyundai',
        },
        {
            'name': 'something',
            'company': 'Hyundai',
        },
        {
            'name': 'XYZ Ford Car',
            'company': 'Ford',
        },
    ]
    df = pd.DataFrame(data)
Next, we use apply to iterate over the rows and return True if the company value is within the name value. Note that I added .lower() to ignore case. You can adjust this to fit what you need.

    contained = df.apply(
        lambda row: row['company'].lower() in row['name'].lower(),
        axis=1
    )
Finally, you can either filter your dataframe by condition, or you can drop the False indices.

    # drop the rows where contained is False
    df.drop(contained[~contained].index)

The result looks like this:

                    name  company
    0  Hyundai something  Hyundai
    1      Tesla Model Y    Tesla
    2  something Hyundai  Hyundai
    4       XYZ Ford Car     Ford
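
The "filter by condition" route mentioned above is not shown in the answer. A minimal sketch of it, assuming the same kind of toy columns, might look like the following. Keep in mind that both boolean indexing and df.drop return a new dataframe rather than modifying df, so the result has to be assigned; the name `cleaned` is only illustrative.

```python
import pandas as pd

# Toy data, assumed for illustration (a shortened version of the example above)
df = pd.DataFrame({
    'name': ['Hyundai something', 'Tesla Model Y', 'something', 'XYZ Ford Car'],
    'company': ['Hyundai', 'Tesla', 'Hyundai', 'Ford'],
})

# True where the company value appears inside the name value (case-insensitive)
contained = df.apply(
    lambda row: row['company'].lower() in row['name'].lower(),
    axis=1,
)

# Boolean indexing keeps only the rows where the mask is True.
# This returns a new dataframe, so assign it (the same applies to df.drop).
cleaned = df[contained]
print(cleaned)
```

Here the row whose name is just "something" is left out of `cleaned`, while `df` itself is unchanged.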
Answer 2
Score: 1
Use boolean indexing. Build the mask in a list comprehension (for better performance): lowercase both values, split the name column into words, and keep the rows whose company is among those words:

    df = file[[x.lower() in y.lower().split() for x, y in zip(file['company'], file['name'])]]
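
One difference from the first answer is worth keeping in mind: `in` on a string is a substring test, while `in` on the list returned by `.split()` only matches whole words. The sketch below compares the two on made-up rows; the data and the names `substring_mask` and `word_mask` are assumptions for illustration, not part of either answer.

```python
import pandas as pd

# Made-up rows chosen to show where the two tests disagree
file = pd.DataFrame({
    'name': ['Ford Focus', 'Fordham Trucks', 'Land Rover Defender'],
    'company': ['Ford', 'Ford', 'Land Rover'],
})

pairs = list(zip(file['company'], file['name']))

# Substring test: 'ford' is also found inside 'fordham'
substring_mask = [c.lower() in n.lower() for c, n in pairs]

# Whole-word test: the company must equal one of the split words,
# so a multi-word company such as 'Land Rover' never matches
word_mask = [c.lower() in n.lower().split() for c, n in pairs]

print(substring_mask)  # [True, True, True]
print(word_mask)       # [True, False, False]
```

Which behaviour is right depends on the data: the whole-word test avoids accidental partial matches, but it needs extra handling if company values can contain spaces.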