How do I extract strings from a CSV file and write them as a list of strings?
Question
I would like to extract strings from certain columns of a CSV file when a condition on another column is met, and then write the extracted strings as a list to a .txt file.
I am new to pandas, so there is probably an obvious solution, but the file generated by the code below turns out empty. If I print my variable "extracted rows" in line 12, I only get this: "Series([], dtype: object)". Any ideas?
    import pandas as pd

    def process_csv(file_name):
        # Read the CSV file
        df = pd.read_csv(file_name)
        # Assuming the columns are named 'Column5', 'Column4' and 'Column3'
        # Convert 'Column5' to numeric
        df['Column5'] = pd.to_numeric(df['Column5'], errors='coerce')
        # Extract rows where 'Column5' is >= 18
        extracted_rows = df[df['Column5'] >= 18]
        # Create new strings by concatenating 'Column4' and 'Column3'
        # (they need to be in reverse order in the generated string for my purpose)
        combined_strings = extracted_rows['Column4'] + " " + extracted_rows['Column3']
        print(combined_strings)
        # Write the combined strings to a txt file
        with open('file.txt', 'w') as f:
            for item in combined_strings:
                f.write('%s\n' % item)

    process_csv('file.csv')
UPDATE: Following a suggestion, I worked with apply and tried to handle the cases in which a value in column five contains two numbers separated by '-'. But now I only get the rows that actually contained '-'. It drives me a little crazy:
    import pandas as pd

    def process_csv(file_name):
        # Read the CSV file
        df = pd.read_csv(file_name)
        # Check if strings in column 5 contain '-'
        # If so, split at '-' and take the first part
        # Otherwise, keep the original string
        df.iloc[:, 4] = df.iloc[:, 4].apply(lambda x: x.split('-')[0] if len(str(x)) > 3 and '-' in str(x) else x)
        # Convert column 5 to numeric, set invalid parsing as NaN
        df.iloc[:, 4] = pd.to_numeric(df.iloc[:, 4], errors='coerce')
        # Replace NaNs (resulting from invalid parsing) with a negative number
        df.iloc[:, 4].fillna(-1, inplace=True)
        # Extract rows where column 5 is >= 18
        extracted_rows = df[df.iloc[:, 4] >= 18]
        # Create new strings by concatenating column 4 and column 3
        combined_strings = extracted_rows.iloc[:, 3] + " " + extracted_rows.iloc[:, 2]
        print(combined_strings)
        # Write the combined strings to a txt file
        with open('file.txt', 'w') as f:
            for item in combined_strings:
                f.write("%s\n" % item)

    process_csv('file.csv')
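One way to narrow down an empty result like this is to check which raw values fail numeric conversion before filtering. The following is only a minimal sketch: the sample data is made up to stand in for file.csv (the real file's contents are not shown in the question), and the column names are the ones assumed in the code above.

```python
import io

import pandas as pd

# Hypothetical sample standing in for file.csv; the real layout is unknown.
sample = io.StringIO(
    "id,Column3,Column4,Column5\n"
    "1,Smith,Anna,17\n"
    "2,Jones,Ben,18-25\n"
    "3,Miller,Cara,30\n"
    "4,Brown,Dan,unknown\n"
)
df = pd.read_csv(sample)

raw = df["Column5"]
numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present in the file but could not be parsed as numbers.
unparsed = raw[numeric.isna() & raw.notna()]

print(raw.dtype)          # dtype pandas chose when reading the column
print(unparsed.tolist())  # the raw strings that became NaN
```

If nearly every value shows up in the unparsed list, the comparison with 18 has nothing to match, which would explain an empty Series.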
Answer 1
Score: 0
You could use apply. For more information, see the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
    import pandas as pd

    df = pd.DataFrame({'Col1': ['a', 'b', 'c'], 'Col2': ['a', 'b', 'e'], 'Col3': ['e', 'f', 'g']})

    def do_something(row):
        # In this function, the first input parameter is the "row"
        # of the DataFrame. You could have more input parameters,
        # but this could be quite complicated.
        if row['Col1'] == row['Col2']:
            return row['Col1'] + ' ' + row['Col3']

    df.apply(do_something, axis=1)
The following is the output:
    0     a e
    1     b f
    2    None
    dtype: object
You could of course store the output in a new column of your DataFrame by doing this:

    df.loc[:, 'output'] = df.apply(do_something, axis=1)
Hope this helps!
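As a possible way to tie this back to the question, here is a minimal sketch that handles values such as "18-25" by taking only the leading number, then filters and writes the combined strings. It uses the vectorized str.extract accessor rather than apply, which is a swapped-in technique and not what the answer above proposes. The sample data is hypothetical, and the column names are the ones assumed in the question.

```python
import io

import pandas as pd

# Hypothetical sample standing in for file.csv; the real layout is unknown.
sample = io.StringIO(
    "id,Column3,Column4,Column5\n"
    "1,Smith,Anna,17\n"
    "2,Jones,Ben,18-25\n"
    "3,Miller,Cara,30\n"
    "4,Brown,Dan,unknown\n"
)
df = pd.read_csv(sample)

# Take the leading run of digits from each value ("18-25" -> "18"),
# then convert; values without leading digits become NaN.
leading = df["Column5"].astype(str).str.extract(r"^\s*(\d+)", expand=False)
age = pd.to_numeric(leading, errors="coerce")

# NaN >= 18 evaluates to False, so unparsable rows drop out here
# without needing a fillna(-1) step.
selected = df[age >= 18]

# Column4 first, then Column3, as in the question.
combined = (selected["Column4"] + " " + selected["Column3"]).tolist()

# Write one combined string per line.
with open("file.txt", "w", encoding="utf-8") as f:
    f.writelines(f"{item}\n" for item in combined)
```

Keeping the parsed numbers in a separate Series (age here) leaves the original column untouched, which makes it easier to compare raw and converted values when something looks off.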