Writing a mainframe format file through PySpark

Question

I have a requirement to read a Hive DB table and write that information in text format in EBCDIC encoding, as it will be used as input to a mainframe process. We are using the cp037 encoding.
To do that, I am reading the Hive table into a DataFrame and then calling a custom UDF to convert the numeric values into COMP-3 format (used to store numeric values on the mainframe), then decoding them with cp037 for the mainframe EBCDIC charset. The problem is that while writing to a text file or displaying the values, some extra junk characters get added.

First, I made sure that the custom UDF works correctly. To check, I encoded the output of the custom UDF back using cp037, and the output below is exactly as expected. newcol1 here is the MF COMP-3 equivalent of field1, and newcol2 is the MF COMP-3 equivalent of field2.

root
 |-- field1: long (nullable = true)
 |-- newcol1: binary (nullable = true)
 |-- field2: long (nullable = true)
 |-- newcol2: binary (nullable = true)
+---------------+----------------------+---------------------------+----------------+
|field1         |newcol1               |field2                     |newcol2         |
+---------------+----------------------+---------------------------+----------------+
|2023097000     |[00 02 02 30 97 00 0C]|320264                     |[00 03 20 26 4F]|
|2023097000     |[00 02 02 30 97 00 0C]|343012                     |[00 03 43 01 2F]|
|2023100000     |[00 02 02 31 00 00 0C]|343012                     |[00 03 43 01 2F]|
|2023100000     |[00 02 02 31 00 00 0C]|320264                     |[00 03 20 26 4F]|
+---------------+----------------------+---------------------------+----------------+
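
(For reference, a minimal packed-decimal encoder consistent with the bytes above might look like the following sketch. The 7- and 5-byte field widths and the F sign nibble for field2 are inferred from the output shown, and the names are illustrative only.)

from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType

def to_comp3(value, num_bytes, sign="C"):
    # Packed decimal (COMP-3): one digit per nibble, with the sign in the
    # low nibble of the last byte (C = positive, D = negative, F = unsigned).
    nibbles = str(abs(value)) + ("D" if value < 0 else sign)
    return bytes.fromhex(nibbles.rjust(num_bytes * 2, "0"))

# Widths inferred from the output above: 7 bytes for field1, 5 for field2.
comp3_field1 = udf(lambda v: to_comp3(v, 7), BinaryType())
comp3_field2 = udf(lambda v: to_comp3(v, 5, "F"), BinaryType())

For example, to_comp3(320264, 5, "F") returns the bytes 00 03 20 26 4F, matching newcol2 above.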

But when I try to write/display the same values after decoding, junk characters appear:

root
 |-- field1: long (nullable = true)
 |-- newcol11: string (nullable = true)
 |-- field2: long (nullable = true)
 |-- newcol21: string (nullable = true)

+---------------+--------+---------------------------+--------+
|field1        |newcol11|field2                     |newcol21|
+---------------+--------+---------------------------+--------+
|2023097000     |^@^B^B~Pp^@^L |320264                     |^@^C~@^W|   |
|2023097000     |^@^B^B~Pp^@^L |343012                     |^@^C?^A^G   |
|2023100000     |^@^B^B~Q^@^@^L |343012                     |^@^C?^A^G   |
|2023100000     |^@^B^B~Q^@^@^L |320264                     |^@^C~@^W|   |
+---------------+--------+---------------------------+--------+

The mainframe file looks like the following (note that I have turned HEX ON to show the actual nibbles):

********************************
                                
                                
 -------------------------------
2023097000...Ì.p..320264..Ì..|  
FFFFFFFFFF00073900FFFFFF00722444
20230970000228070C32026403806F00
 -------------------------------

Notice that the byte X'78' is the junk character being added.
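
(A note on the mechanism: once the packed bytes are decoded into a string column and written as text, the file contains the UTF-8 encoding of those characters rather than the original bytes, so any later character conversion can produce bytes that were never in the data. A minimal illustration, using a value from the table above:

raw = bytes([0x00, 0x03, 0x20, 0x26, 0x4F])   # newcol2 for 320264
s = raw.decode("cp037")                       # the decoded string column
print(s.encode("cp037").hex())                # '000320264f' -- lossless round trip
print(s.encode("utf-8").hex())                # what the text writer emits; differs from raw

Only re-encoding with cp037 itself reproduces the packed bytes; the UTF-8 bytes in the text file no longer match them.)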

The PySpark file write command I am using is:

df.coalesce(1).write.format('text').option("encoding","cp037").mode('overwrite').save('/some_location/test/comp3_outputdata/')

It seems the encoding option is ignored when writing in text format.

According to the link below, PySpark does not support an encoding option when calling the text method:

https://stackoverflow.com/questions/66605006/pass-encoding-option-in-pyspark-text-method
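
(One possible fallback, sketched here under stated assumptions: if the final EBCDIC record has already been assembled into a single BinaryType column, hypothetically named record, and the output is small enough to stream through the driver, Spark's text writer can be bypassed and the raw bytes written with plain Python:

# Sketch only: emit raw bytes from the driver instead of using the text writer.
with open("/some_location/test/comp3_output.bin", "wb") as f:
    for row in df.select("record").toLocalIterator():
        f.write(bytes(row["record"]))

This avoids any character decode/encode round trip on the packed fields.)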

I can't afford to use any other format, as my output file will be picked up directly by a mainframe process.
Can somebody please help me with this problem?

Answer 1

Score: 1

COMP-3 is a term used in COBOL to refer to packed-decimal format on IBM Z hardware. Other formats include COMP, which is binary; think of that as an int32.

In the decimal format there are several decimal instructions that can process the number for various arithmetic operations. It's not clear from your question exactly what you want to do, but since you refer to COMP-3, it's a reasonable assumption that you want a COBOL program to process your numeric data as packed decimal and your textual information using the CP-037 code page. As a side note, CP-037 is generally used on European systems and CP-1047 on North American systems.

The binary structure of packed decimal is unrelated to the code page.

You'll need to process each column based on its data type. Based on your example, you'll have something like this:

field1 - str
newcol1 - comp-3
field2 - str
newcol2 - comp-3

Rather than converting the entire record using the CP-037 encoding (which is valid only for field1 and field2), you need to convert field1 and field2 and leave newcol1 and newcol2 untouched: copy them to the new output record as-is. Since they are binary, running them through an encoding will result in funky translations like the x'78' you are seeing.
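
(A minimal sketch of that approach, assuming the record layout is simply the four columns concatenated and all names are illustrative: only the display fields pass through cp037, while the packed-decimal bytes are appended untouched.

from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType

@udf(BinaryType())
def build_record(field1, newcol1, field2, newcol2):
    # Character fields: convert to EBCDIC with cp037.
    # Packed-decimal fields: already mainframe bytes, append as-is.
    return (str(field1).encode("cp037")
            + bytes(newcol1)
            + str(field2).encode("cp037")
            + bytes(newcol2))

records = df.select(build_record("field1", "newcol1", "field2", "newcol2").alias("record"))

The resulting BinaryType column can then be written with a byte-oriented writer rather than the text writer.)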

Answer 2

Score: 1

In solving this particular problem statement we encountered multiple issues, so I am writing up here in detail the process we followed to resolve them, for future reference.

  1. Problem 1 -> Getting junk values in the MF file (the original problem statement of this thread).
    For this, as suggested by @Hogstrom, we encoded only the COMP-specific fields (using cp037) rather than encoding the entire file. When we still faced the same issue, we fixed the SFT profile: we used an MBCS SFT profile with parameters (link below) to explicitly specify the source and destination encodings. Once we set the source encoding to utf-8 and the destination encoding to cp037, the issue was resolved.

https://www.ibm.com/docs/en/cpl?topic=zos-mbcs-conversion-during-microsoft-windows-copy

  2. Problem 2 -> In code page cp037, X'15' and X'25' are interchangeable, both indicating newline. The issue we faced was that if a hex byte of x'15' appeared inside a comp3 field, that value got changed to x'25'. We did a lot of research, and there seems to be no straightforward way around it; it is something of a bug in this particular code page. We discussed it with IBM and they suggested we use code page IBM1047. Note that @Hogstrom already mentioned this in his response, but somehow I didn't notice it. Note that the whole-file encoding was still utf8, but the destination encoding changed from cp037 to cp1047.

  3. Problem 3 -> This time we noticed that some extra whitespace characters were being added on each line, with no definite logic: for example, no extra space was added for 2.7, but an extra character was added for 2.5. Although all the data-related issues were resolved, this caused a layout issue for us. To solve it, we forcefully re-encoded the entire PySpark output file from utf8 to cp1047. Since I was writing in text mode in PySpark, and that API does not support encoding, we wrote a separate Python program altogether for that (see the sketch after this list).

  4. Problem 4 -> At this point we noticed an extra character, x'15', being added at the end of each record. This was to be expected, as the PySpark text write API uses \n as the default line separator. To resolve it, we used a custom line separator and, in the follow-up Python code, used rstrip to remove that custom line separator (again, see the sketch after this list). This fixed the issue.

  5. Problem 5 -> Finally, with the MBCS SFT profile setting, the output file will always be in VB mode, while in our case we needed to support both VB and FB. So we removed the MBCS SFT profile and used a simple SFT profile with mode=binary. That satisfied the requirement and produced a perfect file with a perfect layout.
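
For reference, here is a minimal sketch of the separate re-encoding pass from Problems 3 and 4. The paths, the separator character, and the part-file name are assumptions, and cp1047 is not among CPython's built-in codecs, so a codec source such as the third-party ebcdic package would be needed:

# import ebcdic  # third-party package that registers additional EBCDIC codecs

SEP = "\u00a7"  # hypothetical custom line separator, passed to the Spark
                # text writer as .option("lineSep", SEP)

src_path = "/some_location/test/comp3_outputdata/part-00000.txt"  # assumed name
dst_path = "/some_location/test/mainframe_input.bin"

with open(src_path, encoding="utf-8") as src, open(dst_path, "wb") as dst:
    for record in src.read().split(SEP):
        record = record.rstrip("\n")  # strip any trailing newline as well
        if record:                    # skip the empty tail after the last separator
            dst.write(record.encode("cp1047"))

Each record is re-encoded from utf8 to cp1047 and written back-to-back with no separator, which is what a fixed-layout mainframe input expects.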

I hope the above steps will help anybody facing issues while working on similar requirements.
