How do I extract strings from a CSV file and write them as a list of strings?

Question

I would like to extract strings from certain columns of a CSV file when a condition in another column is met. I then want to write the extracted strings as a list to a text file.

I am new to pandas, so there is probably an obvious solution for this, but the file generated by the code below turns out empty. If I print my variable "extracted rows" in line 12, I only get this: "Series([], dtype: object)". Any ideas?

import pandas as pd

def process_csv(file_name):
    # Read the CSV file
    df = pd.read_csv(file_name)

    # Assuming the columns are named as 'Column5', 'Column4' and 'Column3'
    # Convert 'Column5' to numeric
    df['Column5'] = pd.to_numeric(df['Column5'], errors='coerce')

    # Extract rows where 'Column5' is >= 18
    extracted_rows = df[df['Column5'] >= 18]

    # Create new strings by concatenating 'Column4' and 'Column3' (they need to be in reverse order in the generated string for my purpose)
    combined_strings = extracted_rows['Column4'] + " " + extracted_rows['Column3']
    
    print(combined_strings)

    # Write the combined strings to a txt file
    with open('file.txt', 'w') as f:
        for item in combined_strings:
            f.write('%s\n' % item)

process_csv('file.csv')

UPDATE: Following a suggestion, I worked with apply and tried to handle cases in which values in column five contain two numbers separated by '-'. But now I only get the rows that actually contained '-'. This is driving me a little crazy:

import pandas as pd

def process_csv(file_name):
    # Read the CSV file
    df = pd.read_csv(file_name)

    # Check if strings in column 5 contain '-'
    # If so split at '-' and take the first part
    # Otherwise, keep the original string
    df.iloc[:, 4] = df.iloc[:, 4].apply(lambda x: x.split('-')[0] if len(str(x)) > 3 and '-' in str(x) else x)

    # Convert column 5 to numeric, set invalid parsing as NaN
    df.iloc[:, 4] = pd.to_numeric(df.iloc[:, 4], errors='coerce')

    # Replace NaNs (resulted from invalid parsing) with a negative number
    df.iloc[:, 4].fillna(-1, inplace=True)

    # Extract rows where column 5 is >= 18
    extracted_rows = df[df.iloc[:, 4] >= 18]

    # Create new strings by concatenating column 4 and column 3
    combined_strings = extracted_rows.iloc[:, 3] + " " + extracted_rows.iloc[:, 2]

    print(combined_strings)

    # Write the combined strings to a txt file
    with open('file.txt', 'w') as f:
        for item in combined_strings:
            f.write("%s\n" % item)

process_csv('file.csv')

Answer 1

Score: 0

You could use DataFrame.apply. For more information, see the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'b', 'c'], 'Col2': ['a', 'b', 'e'], 'Col3': ['e', 'f', 'g']})


def do_something(row):
    # In this function, the first input parameter is the "row"
    # of the DataFrame. You could have more input parameters,
    # but this could be quite complicated.
    if row['Col1'] == row['Col2']:
        return row['Col1'] + ' ' + row['Col3']
    # Rows that do not match fall through and return None.

df.apply(do_something, axis=1)

The following is the output:

0     a e
1     b f
2    None
dtype: object
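
If only the matching rows are wanted as a plain list of strings, one possible follow-up is to drop the None entries and convert the result with tolist(). This is a small sketch that goes beyond the code above; the DataFrame and function are repeated only so that the snippet runs on its own, and the names result and strings are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'b', 'c'], 'Col2': ['a', 'b', 'e'], 'Col3': ['e', 'f', 'g']})


def do_something(row):
    # Combine Col1 and Col3 only when Col1 equals Col2.
    if row['Col1'] == row['Col2']:
        return row['Col1'] + ' ' + row['Col3']
    return None


# apply(axis=1) calls the function once per row and returns a Series.
result = df.apply(do_something, axis=1)

# dropna() removes the None entries; tolist() gives a plain Python list.
strings = result.dropna().tolist()
print(strings)  # ['a e', 'b f']
```

A list like this can then be written to a text file line by line, as in the question.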

You could, of course, also store the output in a new column of your DataFrame:

df.loc[:, 'output'] = df.apply(do_something, axis=1)
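
Applied to the question itself, the whole task might look roughly like the sketch below. It uses vectorized string methods instead of apply, which is a different technique from the one shown above. The sample data, the column headers, and the function name combined_names are all hypothetical, since the real file is not shown. The sketch assumes that column 5 holds either a single number or a range such as "18-25", and that only the part before '-' matters.

```python
import io
import pandas as pd

# Hypothetical sample data; the real file's columns and values are unknown.
csv_text = """id,city,last,first,age
1,Berlin,Meyer,Anna,17
2,Paris,Dubois,Luc,18-25
3,Rome,Rossi,Mia,42
4,Oslo,Hansen,Ole,unknown
"""


def combined_names(source, min_value=18):
    # Read everything as text so the fifth column can be cleaned uniformly.
    df = pd.read_csv(source, dtype=str)

    # Keep the part before '-' (e.g. "18-25" -> "18"), then convert to numbers.
    # Values that cannot be parsed become NaN.
    first_part = df.iloc[:, 4].str.split('-').str[0]
    values = pd.to_numeric(first_part, errors='coerce')

    # NaN >= min_value is False, so unparseable rows are excluded by the mask.
    selected = df[values >= min_value]

    # Column 4 followed by column 3, as in the question.
    return (selected.iloc[:, 3] + ' ' + selected.iloc[:, 2]).tolist()


strings = combined_names(io.StringIO(csv_text))
print(strings)

# Write one string per line to a text file.
with open('file.txt', 'w') as f:
    f.write('\n'.join(strings) + '\n')
```

Keeping the numeric values in a separate Series, rather than writing them back with iloc, leaves the original column untouched and avoids any dtype surprises from in-place assignment.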

Hope this helps!

huangapple
  • Published on July 24, 2023 at 18:33:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76753610.html