linux sed grep -P: replace a string across a newline, taking the next line into account

x7yiwoj4  posted on 2022-12-11  in Linux

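I created a file in which I need to replace the last "," with "" (nothing) so that it becomes valid JSON. The problem is that I can't figure out how to do this with sed, or even with grep and piping. I'm really stumped. Any help would be greatly appreciated.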
test.json

[
{MANY OTHER RECORDS, MAKING FILE 3.5Gig (making sed fail because of memory, so newlines were added)},
{"ID":"57705e4a-158c-4d4e-9e07-94892acd98aa","USERNAME":"jmael","LOGINTIMESTAMP":"2021-11-30"},
{"ID":"b8b67609-50ed-4cdc-bbb4-622c7e6a8cd2","USERNAME":"henrydo","LOGINTIMESTAMP":"2021-12-15"},
{"ID":"a44973d0-0ec1-4252-b9e6-2fd7566c6f7d","USERNAME":"null","LOGINTIMESTAMP":"2021-10-31"},
]

Using grep -P does match the part I need to replace:

grep -Pzo '"},\n]' test.json
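
With -z, grep treats a file that has no NUL bytes as one single record, so the check above still pulls the whole file through grep. The same test can be confined to the end of the file. A minimal sketch, assuming GNU tail and a GNU grep built with PCRE support; the function name `has_trailing_comma` and the 64-byte default window are illustrative choices, not part of the original question:

```shell
#!/usr/bin/env bash
# has_trailing_comma FILE [BYTES]   (hypothetical helper name)
# Succeeds (exit status 0) if the last BYTES bytes of FILE contain '},'
# immediately followed by a newline and ']'.
has_trailing_comma() {
  local file=$1 window=${2:-64}
  # tail -c reads only the end of the file, so grep never sees the rest.
  tail -c "$window" "$file" | grep -Pzq '},\n\]'
}
```

Example use: `has_trailing_comma test.json && echo "trailing comma found"`.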

6psbrbz9  #1

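One workable solution is to use perl to read the last n bytes of the file, locate the superfluous comma within those bytes (e.g. with a regular expression), and then overwrite that comma with a space character: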

perl -e '
    $n = 16;                         # how many bytes to read
    open $fh, "+<", $ARGV[0];        # open file in read & write mode
    seek $fh, -$n, 2;                # go to the end minus some bytes
    $n = read $fh, $str, $n;         # load the end of the file
    if ( $str =~ /,\s*]\s*$/s ) {    # get position of comma
        seek $fh, -($n - $-[0]), 1;  # go to position of comma
        print $fh " ";               # replace comma with space char
    }
    close $fh;                       # close file
' test.json

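The advantage of this solution is that it reads only a few bytes of the file to make the replacement, so memory consumption is close to zero and the rest of the file is never read.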
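
The same seek-and-overwrite idea can also be sketched with ordinary command-line tools instead of perl. This is a rough equivalent under stated assumptions: bash, plus GNU stat, tail, tr and dd. The function name `blank_trailing_comma` and the 16-byte default window are hypothetical choices that mirror `$n = 16` above.

```shell
#!/usr/bin/env bash
# blank_trailing_comma FILE [BYTES]   (hypothetical helper name)
# Overwrites the comma before the final ']' with a space, reading only the
# last BYTES bytes of FILE and writing exactly one byte back.
blank_trailing_comma() {
  local file=$1 n=${2:-16} size tail_bytes offset
  local re=',[ ]*\][ ]*$'              # comma, spaces, ']', spaces, end

  size=$(stat -c %s "$file") || return 1
  if (( n > size )); then n=$size; fi

  # Read the last n bytes. Every whitespace byte becomes a space (so the
  # command substitution cannot strip trailing newlines) and every byte
  # other than ',' ']' or space becomes 'x' (so the length stays exact).
  tail_bytes=$(tail -c "$n" "$file" \
    | LC_ALL=C tr '[:space:]' ' ' \
    | LC_ALL=C tr -c ',] ' 'x')

  if [[ $tail_bytes =~ $re ]]; then
    # The match runs to the end of the file, so its length gives the
    # byte offset of the comma counted from the start of the file.
    offset=$(( size - ${#BASH_REMATCH[0]} ))
    # Write a single space at that offset without truncating the file.
    printf ' ' | dd of="$file" bs=1 seek="$offset" conv=notrunc status=none
  fi
}
```

Example use: `blank_trailing_comma test.json`. As with the perl version, the file is modified in place, so it is worth trying on a copy first.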


pgvzfuti  #2

Using GNU sed:

$ sed -Ez 's/([^]]*),/\1/' test.json
[
{MANY OTHER RECORDS, MAKING FILE 3.5Gig (making sed fail because of memory, so newlines were added)},
{"ID":"57705e4a-158c-4d4e-9e07-94892acd98aa","USERNAME":"jmael","LOGINTIMESTAMP":"2021-11-30"},
{"ID":"b8b67609-50ed-4cdc-bbb4-622c7e6a8cd2","USERNAME":"henrydo","LOGINTIMESTAMP":"2021-12-15"},
{"ID":"a44973d0-0ec1-4252-b9e6-2fd7566c6f7d","USERNAME":"null","LOGINTIMESTAMP":"2021-10-31"}
]

vwkv1x7d  #3

Remove the last comma in the file with GNU sed:

sed -zE 's/,([^,]*)$/\1/' file

Output to stdout:

[
{MANY OTHER RECORDS, MAKING FILE 3.5Gig (making sed fail because of memory, so newlines were added)},
{"ID":"57705e4a-158c-4d4e-9e07-94892acd98aa","USERNAME":"jmael","LOGINTIMESTAMP":"2021-11-30"},
{"ID":"b8b67609-50ed-4cdc-bbb4-622c7e6a8cd2","USERNAME":"henrydo","LOGINTIMESTAMP":"2021-12-15"},
{"ID":"a44973d0-0ec1-4252-b9e6-2fd7566c6f7d","USERNAME":"null","LOGINTIMESTAMP":"2021-10-31"}
]

See: man sed and The Stack Overflow Regular Expressions FAQ
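
Both -z one-liners make sed hold the file as a single record when it contains no NUL bytes, which is the memory problem the question describes for the 3.5 GB input. Since each record now sits on its own line, a two-line sliding window is enough. A sketch, assuming the closing `]` is on a line of its own directly after the last record; the function name `strip_last_comma` is illustrative:

```shell
#!/usr/bin/env bash
# strip_last_comma [FILE...]   (hypothetical helper name)
# Streams the input with a two-line window and drops the comma that ends the
# line directly before a line consisting only of ']'. Writes to stdout.
#   $!N  append the next input line, unless already on the last line
#   s    remove a comma that is followed by newline + ']' + end of window
#   P    print the first line of the window
#   D    delete that first line and restart with what remains
strip_last_comma() {
  sed '$!N; s/,\(\n]\)$/\1/; P; D' "$@"
}
```

Example use: `strip_last_comma test.json > fixed.json`. Memory use stays at roughly two lines, but the result is a second copy of the file, so enough free disk space is needed.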


hec6srdp  #4

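So here is the final solution I used. It isn't the prettiest, but it has no memory problems and it does what I need. Thanks to Cyrus for the help. Hope this helps someone.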

for file in *.json; do
  [[ -f $file ]] || continue

  _FILESIZE=$(stat -c%s "$file")

  if [[ $_FILESIZE -gt 2050000000 ]]; then

    echo "${file} is too large = ${_FILESIZE} bytes. Will be split to work on."

    # Get the name of the file without its extension.
    _FILENAME=${file%.*}

    # Split the large file: 3-character suffix, 1G chunks, no zero-byte files, numeric suffix.
    split -a 3 -e -d -b 1G -- "$file" "${_FILENAME}_"

    # Collect the chunks. The glob expands in sorted order, so the last element is the final chunk.
    _SPLIT_FILES=( "${_FILENAME}"_[0-9][0-9][0-9] )
    _FINAL_FILE_NAME_SPLIT=${_SPLIT_FILES[-1]}

    # The last chunk has the change we need to make: "null"}, \n ]  ->  "null"} \n ]
    sed -i -zE 's/},([^,]*)$/}\1/' "$_FINAL_FILE_NAME_SPLIT"

    # Rebuild the original file from the chunks.
    cat -- "${_SPLIT_FILES[@]}" > "$file"

    # Remove only the chunks created above.
    rm -- "${_SPLIT_FILES[@]}"

  else

    sed -i -zE 's/},([^,]*)$/}\1/' "$file"

  fi

  # Check that the file is a valid JSON file.
  jq 'length' "$file"

  # View the change.
  tail -c 50 "$file"

  echo " "
  echo " "

done
