unix: Reading a UTF-8 file from a PATH in COBOL

rkkpypqq  posted on 2022-12-03 in Unix

While exploring an experimental feature, I found some astounding behaviour when trying to load a Unix file that is UTF-8 onto a mainframe system using COBOL, declaring the FD record as Unicode alphanumeric.
If my record length is 10 (i.e. 01 chunk PIC U(10)), the first 10 characters are loaded correctly. Then 30 characters (apparently 3x the record length, according to my exploration) are skipped, and the next 10 characters are read into my next record, and so on.
Source Code of my program:

       IDENTIFICATION DIVISION.
       PROGRAM-ID.   loadutf8.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT XMLFILE ASSIGN TO "XML".
       DATA DIVISION.
       FILE SECTION.
       FD XMLFILE RECORDING MODE F.
          01  chunk PIC  U(10).
       WORKING-STORAGE SECTION.
          01  EOF   PIC  X.
       PROCEDURE DIVISION.
       START-PROGRAM.
           OPEN INPUT XMLFILE.
           PERFORM WITH TEST AFTER UNTIL EOF = 'T'
              READ XMLFILE
              AT END MOVE 'T' TO EOF
              NOT AT END
                DISPLAY FUNCTION DISPLAY-OF(chunk)
              END-READ
           END-PERFORM.
           CLOSE XMLFILE.
           GOBACK.
       END PROGRAM loadutf8.

JOB-CARD:

//COBOL     EXEC IGYWCLG,LNGPRFX=IGY630 
//SYSIN       DD DISP=SHR,DSN=COB.SRC(loadutf8) 
//GO.XML      DD PATH='/u/utf8.xml'

My UTF-8 file:

<?xml  ?>
<!-- 0 --><!-- 1 --><!-- 2 --><!-- 3 --><!-- 4 --><!-- 5 --><x>???</x>

Output observed:

<?xml  ?>   
<!-- 3 -->

To me, it looks like it consistently reads one chunk of the declared size, skips three times that amount, then reads the next chunk, and so on.
What could be causing this?
Is there a best practice for this, i.e. how to load a Unix XML file using a variable with USAGE UTF-8? Preferably without any hacks, just using 'standard' language features.
Just asking this out of curiosity; any idea on how to explain the observed outcome is appreciated.

9gm1akwq (answer #1)

IBM Enterprise COBOL V6.3 seems to have introduced native UTF-8 support. I have no experience with it, but from reading the manuals I can explain what is happening. However, I *cannot* say whether this is the intended behaviour or a bug.
Anyway, in the Programming Guide (V6.4), topic Defining UTF-8 data items, one can read:

Fixed character-length UTF-8 data items.

A UTF-8 data item of this type is defined when the PICTURE clause contains one or more 'U' characters, or a single 'U' character followed by a repetition factor, and neither the BYTE-LENGTH phrase of the PICTURE clause nor the DYNAMIC LENGTH clause is specified.
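For example, both of the following (illustrative declarations of my own, not taken from the question) would define a fixed character-length UTF-8 item of 10 characters:

          01  ITEM-A PIC U(10).        *> single 'U' with repetition factor
          01  ITEM-B PIC UUUUUUUUUU.   *> one 'U' per character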
The manual further says:
For a fixed character-length UTF-8 data item, the number of bytes reserved in memory for the data item is 4 × n, where n is the number of characters specified in the definition of the item. Note that, because of the variable-length nature of UTF-8 encoding, even when n characters are moved into a UTF-8 data item of length n, not all of the 4 × n reserved bytes are necessarily needed to hold the data, depending on the size of each character in the data.
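Applied to the record in the question (my own arithmetic, following the rule above), this matches the observed skip exactly:

chunk PIC U(10)   ->  n = 10  ->  4 x 10 = 40 bytes reserved
file is plain ASCII  ->  10 characters = 10 bytes used
40 reserved - 10 used = 30 bytes, exactly the apparent skip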
In the chapter about Processing QSAM files, we can read:
You can also use QSAM to access byte-stream files in the z/OS UNIX file system. These files are byte-oriented binary sequential files with no record structure. The record definitions that you code in your COBOL program and the length of the variables that you read and write determine the amount of data transferred.
From this I conclude that COBOL simply tells the underlying I/O routines (QSAM) to read as many bytes as are *reserved for the receiving variable*, in your case 40 bytes at a time. It simply reads that given number of bytes, not characters, and puts them into the input buffer.
Only when the variable is used later on (e.g. in the DISPLAY statement) are the bytes interpreted as UTF-8 characters. Only then is the variable's length in terms of the *defined* number of UTF-8 characters taken into account.
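To make this concrete for the file in the question (my own trace; it assumes the newline ending the first line counts as one byte, as it does in a byte-stream file):

READ 1: bytes 0-39  = "<?xml  ?>" + newline + "<!-- 0 --><!-- 1 --><!-- 2 -->"
        DISPLAY shows only the first 10 characters: "<?xml  ?>" plus the newline
READ 2: bytes 40-79 = "<!-- 3 --><!-- 4 --><!-- 5 -->" + the rest
        DISPLAY shows only the first 10 characters: "<!-- 3 -->"

This matches the observed output exactly.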
I did some quick tests, reading files that contain characters needing more than one UTF-8 byte, and the displayed data shifted accordingly.
I am not sure yet how to successfully process UTF-8 UNIX files with COBOL.
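One untested idea, based only on the BYTE-LENGTH phrase mentioned in the manual quote above: declaring the record with a fixed byte length instead of a fixed character length should make the reserved size, and therefore the number of bytes transferred per READ, exactly 10. A minimal sketch of the changed FD, with the caveat that a multi-byte UTF-8 character could then be split across two READs at a 10-byte boundary:

       FD XMLFILE RECORDING MODE F.
      *   BYTE-LENGTH reserves exactly 10 bytes for chunk,
      *   so each READ should transfer 10 bytes, not 4 x 10 = 40.
          01  chunk PIC U BYTE-LENGTH 10.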
