Make a new column in a pandas DataFrame based on another column's string

Question

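You want to add a new column based on specific string values in the "Product Type" column, but if the first row contains "HW-VG54-NAH" the whole column is set to "Gateway", even when other rows have different product types. If the first row is "HW-CM31", the whole column is set to "Camera".

One option is to change the logic so that each row's value is set only when that row's "Product Type" matches a condition. This can be done as follows:

# Add the new column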
df['Device Type'] = np.where(df['Product Type'].str.contains('VG'), 'Gateway',
                              np.where(df['Product Type'].str.contains('AG'), 'Asset',
                                       np.where(df['Product Type'].str.contains('CM3'), 'Camera', '')))

This sets "Device Type" from each row's own "Product Type" value rather than from the first row alone, which should resolve the problem.

I am trying to add a new column to a pandas DataFrame based on whether another column contains a specific string. I am doing this while looping through a list of CSV files.
Here is what I have:

csv_files = glob.glob(os.path.join(path, "*.csv"))
print('Found',len(csv_files),'files')

#Headers to be checked
header_list = ['Created Date', 'Order Number', 'Shipping Address', 'Shipping Contact', 'Shipping Email', 'Product Type', 'Quantity', 'Serial', 'Activation Status']

list_of_df = []



for f in csv_files: 
    
    print('File Name:', f.split("\\")[-1])
    
    #read the file
    df = pd.read_csv(f, index_col=None, header=0)

    #check type
    conditions = [
        (df['Product Type'].str.contains('VG').any()),
        (df['Product Type'].str.contains('AG').any()),
        (df['Product Type'].str.contains('CM3').any())
        ]
    values = ['Gateway', 'Asset', 'Camera']

    # check the headers
    import_headers = df.axes[1]
    a = [i for i in import_headers if i not in header_list]
    
    if not a:
        print('Headers are OK, file is imported')
        #Prepare the data
        #Drop columns
        df.drop(df.columns[[2,3,4]], axis=1, inplace=True)

        #Fill Activation Status with 0
        df["Activation Status"] = df["Activation Status"].fillna("0")
        df['Activation Status'] = df['Activation Status'].replace('Activated', '1')
        
        #remove unwanted rows
        df = df.loc[df['Product Type'].str.contains('HW-', regex=True, na=True)]
        
        #add new column
        df['Device Type'] =np.select(conditions, values)
        
        list_of_df.append(df)
    else:
        print('Headers are not OK, file is not imported')
        print('Headers not found:', a)
        print('Headers found:', import_headers)

df = pd.concat(list_of_df, axis=0, ignore_index=True)

The Product Type column contains values like these: HW-VG54-NAH, HW-CM31, HW-AG46

The new column is added, but if the first row contains HW-VG54-NAH, the whole column is set to Gateway, even though there are rows with different product types.
If the first row is HW-CM31, the whole column is set to Camera.
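
The symptom described above is consistent with how the conditions are built: calling .any() reduces each str.contains result to a single True/False for the whole file, so np.select has nothing row-specific to choose from. The sketch below builds row-wise masks instead. The three-row sample frame, the na=False arguments, and the default="" argument are assumptions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

# Made-up sample data using the product codes quoted in the question.
df = pd.DataFrame({"Product Type": ["HW-VG54-NAH", "HW-CM31", "HW-AG46"]})

# With .any(), the test collapses to one boolean for the entire column.
whole_column_flag = df["Product Type"].str.contains("VG").any()

# Without .any(), each condition is a boolean Series with one entry per row.
conditions = [
    df["Product Type"].str.contains("VG", na=False),
    df["Product Type"].str.contains("AG", na=False),
    df["Product Type"].str.contains("CM3", na=False),
]
values = ["Gateway", "Asset", "Camera"]

# default="" keeps the result a string array when no pattern matches.
df["Device Type"] = np.select(conditions, values, default="")
print(df)
```

In the loop above, the masks would need to be built after the HW- row filter, so that their length matches the filtered DataFrame.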

Answer 1

Score: 0

With the following toy dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "Order Number": [9, 8, 0, 5],
        "Product Type": ["HW-VG54-NAH", "HW-CM31", "HW-TU35", "HW-AG46"],
    }
)

print(df)
# Output
   Order Number Product Type
0             9  HW-VG54-NAH
1             8      HW-CM31
2             0      HW-TU35
3             5      HW-AG46

Here is one way to do it, where a dictionary replaces conditions and values in your code:

device_types = {"VG": "Gateway", "AG": "Asset", "CM3": "Camera"}

Then add the new column using str.extract with a regular expression:

df["Device Type"] = (
    df["Product Type"]
    .str.extract(f"({'|'.join(device_types.keys())})")  # will return 'VG' for first row
    .applymap(lambda x: device_types.get(x, pd.NA))  # will return device_types['VG']
)

Then:

   Order Number Product Type Device Type
0             9  HW-VG54-NAH     Gateway
1             8      HW-CM31      Camera
2             0      HW-TU35        <NA>
3             5      HW-AG46       Asset
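
DataFrame.applymap is deprecated as of pandas 2.1 in favor of DataFrame.map, so the snippet above may emit a warning on newer versions. A possible variant, sketched here as an alternative rather than as the answer's own method, asks str.extract for a Series (expand=False) and looks the keys up with Series.map. The pattern variable is a name introduced for illustration; unmatched rows come out as NaN rather than <NA>.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Order Number": [9, 8, 0, 5],
        "Product Type": ["HW-VG54-NAH", "HW-CM31", "HW-TU35", "HW-AG46"],
    }
)
device_types = {"VG": "Gateway", "AG": "Asset", "CM3": "Camera"}

# One capture group that matches any dictionary key: (VG|AG|CM3)
pattern = f"({'|'.join(device_types)})"

# expand=False returns a Series of extracted keys (NaN where nothing matches);
# Series.map then looks each key up in the dictionary.
df["Device Type"] = (
    df["Product Type"].str.extract(pattern, expand=False).map(device_types)
)
print(df)
```

Because the result is a Series rather than a one-column DataFrame, the assignment does not depend on how pandas aligns a DataFrame to a single column.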

  • Posted by huangapple on 2023-04-19 22:18:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76055599.html