How to write a txt file from Python with the header date copied to every record
Question
I have a text file that looks like this (excerpt only):
DC000D20221110012022100019
DC011D AV0019000300080180003340501031800481200000
DC011D AV0019000300083180003361901031900071900000
DC011D AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D CG0019000300080220100264401000000000000000
DC011D CG0019000300080220400885101000039990700000
DC011D CG0019000300080220400885101000040013000000
I have about 1 million records like this, covering 3 years. The first record, 'DC000D20221110012022100019', is a header with the date at position [6:14]. I need to import this data into a dataframe for exploratory analysis, so I need the date on every record, not only on the header. I need something like this:
DC000D20221110012022100019
DC011D20221110 AV0019000300080180003340501031800481200000
DC011D20221110 AV0019000300083180003361901031900071900000
DC011D20221110 AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D20221209 CG0019000300080220100264401000000000000000
DC011D20221209 CG0019000300080220400885101000039990700000
DC011D20221209 CG0019000300080220400885101000040013000000
That way I can import it into a pandas DataFrame easily.
From the main text file above, I am preparing another text file containing only the DC011 records, as follows:
# File for submodule DC011
File1 = open(r"\path\CRGDEC\CRGDEC.txt")
File2 = open(r"\path\CRGDEC_DC011.txt", "w")
for line in File1.readlines():
    if line.startswith('DC011'):
        File2.write(line)
But this drops the DC000 header records, so I can't use fillna() in my DataFrame to build the date column.
Help is much appreciated!
N.B. I have other modules in the same layout as well (such as DC012, DC013, DC014).
Answer 1
Score: 0
If you add a check for the header record, you can write it out as well, and remember its date to insert into the subsequent lines:
saveDate = ""  # stays empty until the first DC000 header is read
with open(r"\path\CRGDEC\CRGDEC.txt") as File1, open(r"\path\CRGDEC_DC011.txt", "w") as File2:
    for line in File1:
        if line.startswith('DC000'):
            saveDate = line[6:14]  # header date, YYYYMMDD
            File2.write(line)
        elif line.startswith('DC011'):
            # line[6:] keeps the separator space, matching the desired output
            line2write = line[:6] + saveDate + line[6:]
            File2.write(line2write)
I didn't test this, but it should be near correct.
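Since the question mentions other modules (DC012, DC013, DC014) with the same layout, the same idea can be sketched as a small generator that works on any iterable of lines. This is only an illustration under assumptions: the name `stamp_records` and its `header_prefix` and `modules` parameters are made up here, and it assumes every module keeps its 6-character prefix and takes the date from the most recent DC000 header. The sample lines are the ones from the question.

```python
from typing import Iterable, Iterator, Tuple


def stamp_records(lines: Iterable[str],
                  header_prefix: str = "DC000",
                  modules: Tuple[str, ...] = ("DC011", "DC012", "DC013", "DC014"),
                  ) -> Iterator[str]:
    """Yield detail records with the latest header date inserted after column 6.

    Hypothetical helper: header lines are consumed but not yielded.
    """
    current_date = ""  # empty until the first header has been seen
    for line in lines:
        if line.startswith(header_prefix):
            current_date = line[6:14]  # YYYYMMDD taken from the header
        elif line.startswith(modules):  # str.startswith accepts a tuple
            yield line[:6] + current_date + line[6:]


sample = [
    "DC000D20221110012022100019",
    "DC011D AV0019000300080180003340501031800481200000",
    "DC011D AV0019000300083180003361901031900071900000",
    "DC011D AV0019000300089180003378701032100515800000",
    "DC000D20221209012022100019",
    "DC011D CG0019000300080220100264401000000000000000",
    "DC011D CG0019000300080220400885101000039990700000",
    "DC011D CG0019000300080220400885101000040013000000",
]

stamped = list(stamp_records(sample))
for record in stamped:
    print(record)
```

Because the function accepts any iterable, an open file object can be passed in directly, so a file with about a million lines is processed one line at a time without loading it into memory.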
Answer 2
Score: 0
You can import the file into pandas directly by specifying the separator and the column names in read_csv, then apply a few operations to get the desired format:
import pandas as pd
from io import StringIO
data = StringIO("""DC000D20221110012022100019
DC011D AV0019000300080180003340501031800481200000
DC011D AV0019000300083180003361901031900071900000
DC011D AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D CG0019000300080220100264401000000000000000
DC011D CG0019000300080220400885101000039990700000
DC011D CG0019000300080220400885101000040013000000
""")
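# Header lines have no second field, so their 'value' is missing
df = pd.read_csv(data, sep=r"\s+", engine="python", names=['ref', 'value'])
# Take the date from each header row, then carry it forward to the detail rows
df['date'] = df.loc[df['value'].isna(), 'ref'].str[6:14]
df['date'] = df['date'].ffill()
# Append the date to 'ref' on the detail rows only
mask = df['value'].notna()
df.loc[mask, 'ref'] = df.loc[mask, 'ref'] + df.loc[mask, 'date']
print(df.drop(columns='date'))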
Output:
ref value
0 DC000D20221110012022100019 None
1 DC011D20221110 AV0019000300080180003340501031800481200000
2 DC011D20221110 AV0019000300083180003361901031900071900000
3 DC011D20221110 AV0019000300089180003378701032100515800000
4 DC000D20221209012022100019 None
5 DC011D20221209 CG0019000300080220100264401000000000000000
6 DC011D20221209 CG0019000300080220400885101000039990700000
7 DC011D20221209 CG0019000300080220400885101000040013000000
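If the goal is exploratory analysis, it may be more convenient to keep the date as a real datetime column instead of gluing it back into the string. The following is a minimal sketch of that variation, not part of the answer above: it assumes headers always start with DC000, and the column names `module`, `date` and `payload` are invented for illustration.

```python
import pandas as pd
from io import StringIO

raw = StringIO("""DC000D20221110012022100019
DC011D AV0019000300080180003340501031800481200000
DC011D AV0019000300083180003361901031900071900000
DC011D AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D CG0019000300080220100264401000000000000000
DC011D CG0019000300080220400885101000039990700000
DC011D CG0019000300080220400885101000040013000000
""")

# One string per input line
lines = pd.Series(raw.read().splitlines(), name="line")

# Mark the header rows and carry each header date forward to its detail rows
is_header = lines.str.startswith("DC000")
dates = lines.where(is_header).str[6:14].ffill()

# Assumed column names; header rows are dropped once their date is propagated
detail = pd.DataFrame({
    "module": lines.str[:5],
    "date": pd.to_datetime(dates, format="%Y%m%d"),
    "payload": lines.str[6:].str.strip(),
})[~is_header].reset_index(drop=True)

print(detail)
```

With a datetime column in place, filtering by period or grouping by month can be done without further string slicing, and other modules can be selected with a filter on `module`.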