PowerShell: split a string into an array of substrings such that none, when encoded as UTF-8, has a byte length > n

Posted by rn0zuynd on 2023-04-30 in Shell


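How can I split a string into an array of substrings such that none, when encoded as UTF-8 with a trailing null, has a byte length > n?
n is always >= 5, so a 4-byte encoded character + null will fit.
The only ways I can think of at the moment are:

Split, encode, test the length; if it is too big, repeat with a smaller split.
Encode, iterate, mask and count bytes while keeping track of boundaries, and split manually.

Is there a better way?
The output array could also be an array of UTF-8 byte arrays rather than strings, if that is easier.
I already know about code points, encodings, surrogates and so on. This is specifically about the byte length of the UTF-8 encoding.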

Source: https://stackoverflow.com/questions/75898613/powershell-spilt-a-string-to-an-array-of-sub-strings-such-that-none-when-encode

2 answers


sshcrbum1#

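OK, first off, everything I know about the binary encoding process for UTF-8 I just learned from the UTF-8 Wikipedia page, so caveat emptor :-). That said, the code below seems to give the correct result for a bunch of test data when compared to @mklement0's answer (which I have no doubt is correct), so maybe there is some mileage in it...
Happy to hear if there are any howlers in there. I think it *should* work in principle, even if the implementation below is wrong somewhere :-).
In any case, the core function is the one below, which returns the *positions* of the chunks in a UTF-8 encoded byte array. I figured that once you know the positions you might want to use the bytes in place (e.g. with Stream.Write(byte[] buffer, int offset, int count) or similar), so I have avoided extracting a copy of the chunks into a new list, but it is pretty easy to use the output to do that if required (see further down).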

# extracts the *positions* of chunks of bytes in a utf8 byte array
# such that no multi-byte codepoints are split across chunks, and
# all chunks are a maximum of $MaxLen bytes
#
# note - assumes $Utf8Bytes is a *valid* utf8 byte array, so may
# need error handling if you expect invalid data to be passed in.
function Get-UTF8ChunkPositions
{
    param( [byte[]] $Utf8Bytes, [int] $MaxLen )

    # from https://en.wikipedia.org/wiki/UTF-8
    #
    # Code point ↔ UTF-8 conversion
    # -----------------------------
    # First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Code points
    # U+0000            U+007F           0xxxxxxx                                        128
    # U+0080            U+07FF           110xxxxx  10xxxxxx                             1920
    # U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx                  61440
    # U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx      1048576

    # stores the start position of each chunk
    $startPositions = [System.Collections.Generic.List[int]]::new();

    $i = 0;
    while( $i -lt $Utf8Bytes.Length )
    {

        # remember the start position for the current chunk
        $startPositions.Add($i);

        # jump past the end of the current chunk, optimistically assuming we won't land in
        # the middle of a multi-byte codepoint (but we'll deal with that in a minute)
        $i += $MaxLen;

        # if we've gone past the end of the array then we're done as there's no more
        # chunks after the current one, so there's no more start positions to record
        if( $i -ge $Utf8Bytes.Length )
        {
            break;
        }

        # if we're in the middle of a multi-byte codepoint, backtrack until we're not.
        # we're then at the start of the *next* chunk, and the chunk length is definitely
        # smaller than $MaxLen
        #
        # 0xC0 = [Convert]::ToInt32("11000000", 2);
        # 0x80 = [Convert]::ToInt32("10000000", 2);
        while( ($Utf8Bytes[$i] -band 0xC0) -eq 0x80 )
        {
            $i -= 1;
        }

    }

    # we know all the start positions now, so turn them into ranges.
    # (we'll add a dummy item to help build the last range)
    $startPositions.Add($Utf8Bytes.Length);
    for( $i = 1; $i -lt $startPositions.Count; $i++ )
    {
        [PSCustomObject] @{
            "Start"  = $startPositions[$i-1];
            "Length" = $startPositions[$i] - $startPositions[$i-1];
        };
    }

}

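This function takes advantage of the fact that a *valid* UTF-8 byte stream is self-synchronizing, which means we do not actually *have* to walk through every byte in the stream. We can jump anywhere we like and find the start of an encoded code point by backtracking to the nearest byte that starts with 00xxxxxx, 01xxxxxx or 11xxxxxx (or equivalently, *doesn't* start with 10xxxxxx), and that byte is the start of the *next* chunk.
For example, if we jump n bytes from the start of the current chunk and find a byte that starts with 10xxxxxx: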

    n-2       n-1        n        n+1
... 11110xxx  10xxxxxx  10xxxxxx  10xxxxxx ...
                        ^^^^^^^^

then we backtrack to n-2 as the start of the *next* chunk (so the current chunk logically ends one byte earlier, at n-3):

    n-2       n-1        n        n+1
... 11110xxx  10xxxxxx  10xxxxxx  10xxxxxx ...
    ^^^^^^^^
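
The backtracking step on its own can be sketched as a small helper. This is only an illustration: the name Find-CodePointStart is hypothetical (it is not part of the answer's function), and it assumes the input is valid UTF-8. The emoji is built from its code point so that the result does not depend on the encoding of the script file.

```powershell
# Hypothetical helper: walks backwards from $Index until it reaches a byte
# that is NOT a continuation byte (10xxxxxx), i.e. the start of a code point.
function Find-CodePointStart
{
    param( [byte[]] $Utf8Bytes, [int] $Index )

    # 0xC0 = 11000000 (mask), 0x80 = 10000000 (continuation byte marker)
    while( ($Index -gt 0) -and (($Utf8Bytes[$Index] -band 0xC0) -eq 0x80) )
    {
        $Index -= 1;
    }
    $Index;
}

# "a" + U+1F44D (thumbs up) + "b" encodes to: 97 240 159 145 141 98
$text  = "a" + [char]::ConvertFromUtf32(0x1F44D) + "b";
$bytes = [System.Text.Encoding]::UTF8.GetBytes($text);

Find-CodePointStart -Utf8Bytes $bytes -Index 3;   # 1 (backtracks over 145 and 159 to 240)
Find-CodePointStart -Utf8Bytes $bytes -Index 5;   # 5 (98 is already a start byte)
```

Because a code point occupies at most 4 bytes, the loop moves back at most 3 positions on valid input, which is why the chunking cost depends on the number of chunks rather than on the number of bytes.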

Example:

# set up some test data
$str  = "1234abc€défg👍abüb";

$utf8 = [System.Text.Encoding]::UTF8.GetBytes($str);
"$utf8"
# 49 50 51 52 97 98 99 226 130 172 100 195 169 102 103 240 159 145 141 97 98 195 188 98

$maxLen = 5 - 1; # exclude NUL terminator
$positions = Get-UTF8ChunkPositions -Utf8Bytes $utf8 -MaxLen $maxLen;
$positions
# Start Length
# ----- ------
#     0      4
#     4      3
#     7      4
#    11      4
#    15      4
#    19      4
#    23      1

Chunks

If you actually want an array of chunked byte arrays with null terminators, you can convert the positions into chunks like this:

$chunks = $positions | foreach-object {
    $chunk = new-object byte[] ($_.Length + 1); # allow room for NUL terminator
    [Array]::Copy($utf8, $_.Start, $chunk, 0, $_.Length);
    $chunk[$chunk.Length - 1] = 0; # utf-8 encoded NUL is 0
    @(, $chunk); # send the chunk to the pipeline
};

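Strings

Once you have the individual null-terminated UTF-8 encoded chunks, you can convert them back into null-terminated strings like this, if you really need them: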

$strings = $chunks | foreach-object { [System.Text.Encoding]::UTF8.GetString($_) }
$strings
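
If the strings are wanted without the trailing NUL, one option is to decode straight from the original byte array with the GetString(byte[], int, int) overload, skipping the intermediate copies. The sketch below is self-contained, so the two positions are written out by hand for a short sample rather than computed; in practice they would come from the position list produced earlier.

```powershell
# Sample data: "abc", the euro sign (U+20AC, 3 bytes in UTF-8), then "d"
$euro   = [char] 0x20AC;
$sample = "abc${euro}d";
$bytes  = [System.Text.Encoding]::UTF8.GetBytes($sample);   # 97 98 99 226 130 172 100

# Hand-written positions for a 4-byte limit (assumed here for illustration)
$ranges = @(
    [PSCustomObject] @{ Start = 0; Length = 3 },
    [PSCustomObject] @{ Start = 3; Length = 4 }
);

# Decode each range directly from the shared byte array, no NUL appended
$decoded = $ranges | foreach-object {
    [System.Text.Encoding]::UTF8.GetString($bytes, $_.Start, $_.Length);
};

# Joining the decoded pieces gives back the original string
($decoded -join "") -ceq $sample;   # True
```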

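Performance

Performance-wise, this seems to scale reasonably well, even if you have to encode a string into a UTF-8 byte array specifically to call it. It also runs a lot faster for larger values of $MaxLen, as it gets to jump longer distances for each chunk...
A rough test on my machine: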

$sample = "1234abc€défg👍abüb" * 1000000;
$sample.Length / 1mb;
# 17.17 (approx., counted in UTF-16 code units)

Measure-Command {
    $utf8 = [System.Text.Encoding]::UTF8.GetBytes($sample);
    $maxlen = 1024;
    $positions = Get-UTF8ChunkPositions $utf8 $maxlen;
    $chunks = $positions | foreach-object {
        $chunk = [byte[]]::new($_.Length + 1); # allow room for NUL terminator
        [Array]::Copy($utf8, $_.Start, $chunk, 0, $_.Length);
        $chunk[$chunk.Length - 1] = 0; # utf-8 encoded NUL is 0
        @(, $chunk); # send the chunk to the pipeline
    };
};

# TotalMilliseconds : 7986.131

Although if performance is your main concern, PowerShell probably isn't the right choice for you :-)...


slsn1g292#

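Note:

This answer creates a list of (.NET) *substrings* by partitioning the input string into substrings whose *UTF-8 byte representation* does not exceed a given *byte* count.
To create a list of these UTF-8 byte representations instead, i.e. *byte arrays*, see mclayton's helpful answer.

The following:

Iterates over the input string character by character, or surrogate pair by surrogate pair.
Note that .NET [char] instances, which .NET [string] instances are composed of, are unsigned 16-bit Unicode *code units*. They can therefore only *directly* encode Unicode glyphs with code points up to U+FFFF (glyphs in the so-called BMP, the Basic Multilingual Plane) and need a (surrogate) *pair* of code units to encode Unicode characters outside that plane, i.e. those with code points *greater than* U+FFFF, a range that notably includes emoji such as 👍.
Extracts substrings based on the UTF-8 byte-sequence length, which is inferred from each single character's Unicode code point, or from whether a character and the one after it form a surrogate pair (which implies a code point greater than U+FFFF, which in turn implies 4 bytes), based on the Wikipedia UTF-8 conversion table.

A tip of the hat to mclayton for pointing out that keeping track of the *indices* in the original string, combined with .Substring() calls, is enough to extract the chunks, so there is no need for a string builder.

Uses a System.Collections.Generic.List`1 instance to collect all chunks in a list.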

# Sample string
$str = '1234abc€défg👍'

$maxChunkLen = 4 # the max. *byte* count, excluding the trailing NUL

$chunks = [System.Collections.Generic.List[string]]::new()
$chunkByteCount = 0
$chunkStartNdx = 0
for ($i = 0; $i -lt $str.Length; ++$i) {
  $codePoint = [int] $str[$i]
  $isSurrogatePair = [char]::IsSurrogatePair($str, $i)
  # Note: A surrogate pair encoded as UTF-8 is a *single* non-BMP
  #       character encoded in 4 bytes, not 2 3-byte *surrogates*.
  $thisByteCount = 
    if ($isSurrogatePair) { 4 } 
    elseif ($codePoint -ge 0x800) { 3 } 
    elseif ($codePoint -ge 0x80) { 2 } 
    else { 1 }
  if ($chunkByteCount + $thisByteCount -gt $maxChunkLen) {
    # Including this char. / surrogate pair would make the chunk too long.
    # Add the current chunk plus a trailing NUL to the list...
    $chunks.Add($str.Substring($chunkStartNdx, ($i - $chunkStartNdx)) + "`0")
    # ... and start a new chunk with this char. / surrogate pair.
    $chunkStartNdx = $i
    $chunkByteCount = $thisByteCount
  }
  else {
    # Still fits into the current chunk.
    $chunkByteCount += $thisByteCount
  }
  if ($isSurrogatePair) { ++$i }
}
# Add a final chunk to the list, if present.
if ($chunkStartNdx -lt $str.Length) { $chunks.Add($str.Substring($chunkStartNdx) + "`0") }

# Output the resulting chunks
$chunks

Output (annotated):

1234 # 1 + 1 + 1 + 1 bytes (+ NUL)
abc  # 1 + 1 + 1 bytes (+ NUL)
€d   # 3 + 1 bytes (+ NUL)
éfg  # 2 + 1 + 1 bytes (+ NUL)
👍   # 4 bytes (+ NUL)

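Note that all chunks except the first have *fewer* than 4 characters (4 being the byte-count limit, excluding the NUL). This is because they contain characters that are multi-byte in UTF-8 and/or because the *next* character is such a character and therefore does not fit into the current chunk.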
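
The point about code units versus UTF-8 bytes can be checked directly. The following is a minimal sketch using only standard .NET members; the variable names are illustrative, and the emoji is built from its code point so that the script does not depend on the encoding of the script file.

```powershell
# Build the thumbs-up emoji (U+1F44D) from its code point
$thumbsUp = [char]::ConvertFromUtf32(0x1F44D)

$codeUnits = $thumbsUp.Length                                      # 2 UTF-16 code units
$isPair    = [char]::IsSurrogatePair($thumbsUp, 0)                 # True
$byteCount = [System.Text.Encoding]::UTF8.GetByteCount($thumbsUp)  # 4 UTF-8 bytes
$codePoint = [char]::ConvertToUtf32($thumbsUp, 0)                  # 128077 (0x1F44D)

# For comparison, a BMP character: the euro sign is 1 code unit but 3 UTF-8 bytes
$euro          = [string] [char] 0x20AC
$euroUnits     = $euro.Length                                      # 1
$euroByteCount = [System.Text.Encoding]::UTF8.GetByteCount($euro)  # 3
```

GetByteCount is also a convenient way to cross-check the byte counts that the loop above infers from code-point ranges.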
