March 15, 2023 19:50:24

How to write to txt from Python so that the date from the header is copied to every record in the txt file

Question


I have a text file that looks like this (snapshot only):

DC000D20221110012022100019
DC011D           AV0019000300080180003340501031800481200000
DC011D           AV0019000300083180003361901031900071900000
DC011D           AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D           CG0019000300080220100264401000000000000000
DC011D           CG0019000300080220400885101000039990700000
DC011D           CG0019000300080220400885101000040013000000

I have about 1 million records like this, covering 3 years. The first record, 'DC000D20221110012022100019', is a header that carries the date at position [6:14]. I need to load this data into a dataframe for exploratory analysis, so I need the date on every record, not only on the header. I need something like this:

DC000D20221110012022100019
DC011D20221110   AV0019000300080180003340501031800481200000
DC011D20221110   AV0019000300083180003361901031900071900000
DC011D20221110   AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D20221209   CG0019000300080220100264401000000000000000
DC011D20221209   CG0019000300080220400885101000039990700000
DC011D20221209   CG0019000300080220400885101000040013000000

That way I can import it into a pandas DataFrame easily.

From the main text file above, I am preparing another txt file with only the DC011 records, as below:

# File for submodule DC011

File1 = open(r"\path\CRGDEC\CRGDEC.txt")
File2 = open(r"\path\CRGDEC_DC011.txt", "w")

for line in File1.readlines():
    if line.startswith('DC011'):
        File2.write(line)

But this drops the DC000 header records, so I can't use fillna() in my df to build the date column.
Any help is much appreciated!

N.B. I have other modules laid out the same way (such as DC012, DC013, DC014).

Answer 1

Score: 0


If you add a check for the header record, you can write it out as well, and remember its date so it can be inserted into the records that follow:

saveDate = ""   # stays empty until the first DC000 header is seen
with open(r"\path\CRGDEC\CRGDEC.txt") as File1, open(r"\path\CRGDEC_DC011.txt", "w") as File2:
    for line in File1:
        if line.startswith('DC000'):
            saveDate = line[6:14]   # header date, YYYYMMDD
            File2.write(line)
        elif line.startswith('DC011'):
            # replace the 8 blank columns at [6:14] with the header date
            line2write = line[:6] + saveDate + line[14:]
            File2.write(line2write)

I haven't tested this, but it should be close.
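Since the question mentions other modules (DC012, DC013, DC014), the same idea can be wrapped in a small generator. This is only a sketch: the name stamp_header_dates and its prefix parameter are made up for illustration, and it assumes the other modules use the same layout as DC011, with the date columns [6:14] left blank.

```python
from typing import Iterable, Iterator


def stamp_header_dates(lines: Iterable[str], prefix: str = "DC011") -> Iterator[str]:
    """Yield DC000 headers unchanged and `prefix` records with the header date in columns [6:14]."""
    current_date = ""  # empty until the first DC000 header is seen
    for line in lines:
        if line.startswith("DC000"):
            current_date = line[6:14]
            yield line
        elif line.startswith(prefix):
            # Assumes columns [6:14] are blank in the record, as in the DC011 sample.
            yield line[:6] + current_date + line[14:]


pad = " " * 11  # blank columns after the 6-character record type
sample = [
    "DC000D20221110012022100019\n",
    "DC011D" + pad + "AV0019000300080180003340501031800481200000\n",
    "DC000D20221209012022100019\n",
    "DC011D" + pad + "CG0019000300080220100264401000000000000000\n",
]

result = list(stamp_header_dates(sample))
print("".join(result), end="")
```

Because the function accepts any iterable of lines, an open file object can be passed in directly and the output written with writelines(), which avoids loading a million records into memory at once.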

Answer 2

Score: 0


You can load the file into pandas directly by specifying the separator and the column names in read_csv, then apply a few operations to get the desired format:

import pandas as pd
from io import StringIO

data = StringIO("""DC000D20221110012022100019
DC011D           AV0019000300080180003340501031800481200000
DC011D           AV0019000300083180003361901031900071900000
DC011D           AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D           CG0019000300080220100264401000000000000000
DC011D           CG0019000300080220400885101000039990700000
DC011D           CG0019000300080220400885101000040013000000
""")

df = pd.read_csv(data, sep=r"\s+", engine="python", names=['ref', 'value'])
df['date'] = df.loc[df['value'].isna(), 'ref'].str[6:14]
df['date'] = df['date'].ffill()
mask = ~df['value'].isna()
df.loc[mask, 'ref'] = df[mask]['ref'] + df[mask]['date']
print(df.drop(columns='date'))

Output:

                          ref                                       value
0  DC000D20221110012022100019                                        None
1              DC011D20221110  AV0019000300080180003340501031800481200000
2              DC011D20221110  AV0019000300083180003361901031900071900000
3              DC011D20221110  AV0019000300089180003378701032100515800000
4  DC000D20221209012022100019                                        None
5              DC011D20221209  CG0019000300080220100264401000000000000000
6              DC011D20221209  CG0019000300080220400885101000039990700000
7              DC011D20221209  CG0019000300080220400885101000040013000000
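If the goal is a real date column for analysis, and not a rewritten text line, the same forward-fill idea can be applied to the raw fixed-width lines. The sketch below is one possible variant, not the answer's method: the column names module, date and payload are hypothetical, and it assumes the record body starts at position 17, as in the DC011 sample.

```python
from io import StringIO

import pandas as pd

pad = " " * 11  # blank columns that follow the 6-character record type
raw = StringIO(
    "DC000D20221110012022100019\n"
    f"DC011D{pad}AV0019000300080180003340501031800481200000\n"
    f"DC011D{pad}AV0019000300083180003361901031900071900000\n"
    "DC000D20221209012022100019\n"
    f"DC011D{pad}CG0019000300080220100264401000000000000000\n"
)

lines = pd.Series(raw.read().splitlines())
is_header = lines.str.startswith("DC000")

# Keep the date only on header rows, then carry it forward to the records below.
dates = lines.str[6:14].where(is_header).ffill()

records = pd.DataFrame(
    {
        "module": lines.str[:5],
        "date": pd.to_datetime(dates, format="%Y%m%d"),
        "payload": lines.str[17:],
    }
)[~is_header].reset_index(drop=True)

print(records)
```

Slicing by position keeps the layout independent of how many spaces separate the fields, and the header rows are dropped only after their dates have been propagated.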
