How to write to a txt file from Python so that the date from the header is copied to every record

Question

I have a text file that looks like this (snapshot only):

DC000D20221110012022100019
DC011D           AV0019000300080180003340501031800481200000
DC011D           AV0019000300083180003361901031900071900000
DC011D           AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D           CG0019000300080220100264401000000000000000
DC011D           CG0019000300080220400885101000039990700000
DC011D           CG0019000300080220400885101000040013000000

I have about 1 million records like this, covering 3 years. The first record, 'DC000D20221110012022100019', is a header with the date at position [6:14]. I need to import this data into a dataframe for exploratory analysis, so I need the date on every record, not just on the header. So I need something like this:

DC000D20221110012022100019
DC011D20221110   AV0019000300080180003340501031800481200000
DC011D20221110   AV0019000300083180003361901031900071900000
DC011D20221110   AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D20221209   CG0019000300080220100264401000000000000000
DC011D20221209   CG0019000300080220400885101000039990700000
DC011D20221209   CG0019000300080220400885101000040013000000

That way I can easily import it into a pandas DataFrame.

From the main text file above, I am preparing another txt file containing only the DC011 records, as below:

# File for submodule DC011
# "with" closes both files, so the output is flushed to disk
with open(r"\path\CRGDEC\CRGDEC.txt") as File1, open(r"\path\CRGDEC_DC011.txt", "w") as File2:
    for line in File1:
        if line.startswith('DC011'):
            File2.write(line)

But this eliminates the DC000 header records, so I can't use the fillna() option in my df to prepare the Date column.
Help is much appreciated!!

N.B. I have other modules structured in a similar way as well (such as DC012, DC013, DC014).

Answer 1

Score: 0

If you add an if to check for the header record, you can write the header too, and remember its date so it can be inserted into the subsequent lines:

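with open(r"\path\CRGDEC\CRGDEC.txt") as File1, open(r"\path\CRGDEC_DC011.txt", "w") as File2:
    saveDate = ""   # just in case a record appears before the first header
    for line in File1:
        if line.startswith('DC000'):
            saveDate = line[6:14]   # remember the date from the header
            File2.write(line)
        elif line.startswith('DC011'):
            # overwrite the blank columns [6:14] with the date, keeping the record width
            line2write = line[:6] + saveDate + line[14:]
            File2.write(line2write)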

I didn't test this, but it should be near correct.
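
Since the question mentions other modules (DC012, DC013, DC014), the same idea can be sketched as a small generator that takes the record prefixes as a parameter. This is only a sketch under stated assumptions: the function name `stamp_records`, the constants, and the made-up DC012 sample line are illustrative and not part of the original post, and it assumes every record type keeps columns [6:14] blank for the date.

```python
from typing import Iterable, Iterator, Tuple

HEADER_PREFIX = "DC000"    # header record type, as described in the question
DATE_SLICE = slice(6, 14)  # date position [6:14], as described in the question


def stamp_records(lines: Iterable[str], prefixes: Tuple[str, ...]) -> Iterator[str]:
    """Yield headers unchanged and matching records with the header date filled in."""
    current_date = ""  # empty until the first header has been seen
    for line in lines:
        if line.startswith(HEADER_PREFIX):
            current_date = line[DATE_SLICE]
            yield line
        elif line.startswith(prefixes):
            # Overwrite columns 6..13 so the record keeps its fixed width;
            # records seen before any header are passed through unchanged.
            yield line[:6] + current_date + line[14:] if current_date else line


# Small sample based on the question; the DC012 line is invented for illustration.
pad = " " * 11
sample = [
    "DC000D20221110012022100019\n",
    f"DC011D{pad}AV0019000300080180003340501031800481200000\n",
    f"DC012D{pad}XX0000000000000000000000000000000000000000\n",
    "DC000D20221209012022100019\n",
    f"DC011D{pad}CG0019000300080220100264401000000000000000\n",
]

stamped = list(stamp_records(sample, ("DC011",)))
print("".join(stamped), end="")
```

Because a file object is itself an iterable of lines, the generator could be fed an open file and its output passed to `writelines()` on the target file, so the whole input never has to be held in memory.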

Answer 2

Score: 0

You can import the file into pandas directly by specifying the separator and the column names in read_csv, then apply a few operations to get the desired format:

import pandas as pd
from io import StringIO

data = StringIO("""DC000D20221110012022100019
DC011D           AV0019000300080180003340501031800481200000
DC011D           AV0019000300083180003361901031900071900000
DC011D           AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D           CG0019000300080220100264401000000000000000
DC011D           CG0019000300080220400885101000039990700000
DC011D           CG0019000300080220400885101000040013000000
""")

# header rows contain no whitespace, so they parse with 'value' missing
df = pd.read_csv(data, sep=r"\s+", engine="python", names=['ref', 'value'])
df['date'] = df.loc[df['value'].isna(), 'ref'].str[6:14]   # date on header rows only
df['date'] = df['date'].ffill()                            # carry it down to the records
mask = df['value'].notna()
df.loc[mask, 'ref'] = df.loc[mask, 'ref'] + df.loc[mask, 'date']
print(df.drop(columns='date'))

Output:

                          ref                                       value
0  DC000D20221110012022100019                                        None
1              DC011D20221110  AV0019000300080180003340501031800481200000
2              DC011D20221110  AV0019000300083180003361901031900071900000
3              DC011D20221110  AV0019000300089180003378701032100515800000
4  DC000D20221209012022100019                                        None
5              DC011D20221209  CG0019000300080220100264401000000000000000
6              DC011D20221209  CG0019000300080220400885101000039990700000
7              DC011D20221209  CG0019000300080220400885101000040013000000
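
If the goal is a separate date column rather than a date glued onto the record key, a variation on the same forward-fill idea could slice the raw lines by position instead of splitting on whitespace. The following is a minimal sketch, not the answerer's method: the column names `record_type`, `date`, and `value` are my own choices, and it assumes headers start with DC000 and carry the date at [6:14], as the question states.

```python
from io import StringIO

import pandas as pd

# Sample built from the question; the 11 blanks after "DC011D" are spelled out.
pad = " " * 11
data = StringIO(
    "DC000D20221110012022100019\n"
    f"DC011D{pad}AV0019000300080180003340501031800481200000\n"
    f"DC011D{pad}AV0019000300083180003361901031900071900000\n"
    "DC000D20221209012022100019\n"
    f"DC011D{pad}CG0019000300080220100264401000000000000000\n"
)

raw = pd.Series(data.read().splitlines())
is_header = raw.str.startswith("DC000")

# Keep [6:14] on header rows only, then carry it forward to the records below.
date = raw.str[6:14].where(is_header).ffill()

records = pd.DataFrame(
    {
        "record_type": raw.str[:5],
        "date": pd.to_datetime(date, format="%Y%m%d"),
        "value": raw.str[6:].str.strip(),
    }
)[~is_header].reset_index(drop=True)

print(records)
```

Dropping the header rows at the end leaves one row per record with a real datetime column, which tends to be more convenient for grouping or filtering by date during exploratory analysis.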

huangapple
  • Posted on March 15, 2023, 19:50:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75744307.html