Go: how to use unsafe to get a byte slice from a string without a memory copy

lf3rwulv  posted on 2022-12-07 in Go

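I have read https://github.com/golang/go/issues/25484, which describes a copy-free conversion from []byte to string.
I would like to know whether there is a way to convert a string to a byte slice without copying memory.
I am writing a program that processes terabytes of data, and copying every string in memory twice would slow it down. I don't care about mutability or unsafety; this is for internal use only, and I just need it to be as fast as possible.
Example: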

var s string
// some processing on s, for some reasons, I must use string here
// ...
// then output to a writer
gzipWriter.Write([]byte(s))  // !!! Here I want to avoid the memory copy, no WriteString

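So the question is: is there a way to avoid the memory copy? I know I probably need the unsafe package, but I don't know how to use it for this. I have been searching for a while without finding an answer, and the related answers that SO shows do not work either.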

lmyy7pcs1#

Getting the content of a string as a []byte without copying is, in general, only possible using unsafe, because strings in Go are immutable, and without a copy it would be possible to modify the contents of the string (by changing the elements of the byte slice).
So, using unsafe, this is what it could look like (corrected, working solution):

func unsafeGetBytes(s string) []byte {
    return (*[0x7fff0000]byte)(unsafe.Pointer(
        (*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
    )[:len(s):len(s)]
}

This solution is from Ian Lance Taylor.
One thing to note here: the empty string "" has no bytes, as its length is zero. This means there is no guarantee what the Data field may be; it may be zero or an arbitrary address shared among zero-size variables. If an empty string may be passed, that must be checked explicitly (although there's no need to get the bytes of an empty string without copying...):

func unsafeGetBytes(s string) []byte {
    if s == "" {
        return nil // or []byte{}
    }
    return (*[0x7fff0000]byte)(unsafe.Pointer(
        (*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
    )[:len(s):len(s)]
}

The original, wrong solution was:

func unsafeGetBytesWRONG(s string) []byte {
    return *(*[]byte)(unsafe.Pointer(&s)) // WRONG!!!!
}

See Nuno Cruces's answer below for reasoning.
Testing it:

s := "hi"
data := unsafeGetBytes(s)
fmt.Println(data, string(data))

data = unsafeGetBytes("gopher")
fmt.Println(data, string(data))

Output (try it on the Go Playground):

[104 105] hi
[103 111 112 104 101 114] gopher

BUT: You wrote that you want this because you need performance. You also mentioned that you want to compress the data. Please know that compressing data (using gzip) requires a lot more computation than just copying a few bytes! You will not see any noticeable performance gain from this!

Instead, when you want to write strings to an io.Writer, it's recommended to do it via the io.WriteString() function, which avoids copying the string when possible: it checks whether the writer has a WriteString() method and, if so, calls it, which most likely does better than copying the string. For details, see What's the difference between ResponseWriter.Write and io.WriteString?
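As a rough sketch of that advice applied to the gzip case from the question: as far as I know, gzip.Writer itself has no WriteString method, so io.WriteString on it falls back to Write([]byte(s)). Putting a bufio.Writer in front (which does implement WriteString) lets the string be copied into a fixed buffer in pieces instead of being converted to a fresh []byte first. The helper names gzipString and gunzipString are made up for this illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipString compresses s. The bufio.Writer in front of the gzip.Writer
// implements WriteString, so io.WriteString copies s into the buffer
// piece by piece instead of allocating a full []byte(s) first.
func gzipString(s string) ([]byte, error) {
	var out bytes.Buffer
	zw := gzip.NewWriter(&out)
	bw := bufio.NewWriter(zw)
	if _, err := io.WriteString(bw, s); err != nil {
		return nil, err
	}
	// Flush the buffered bytes into gzip, then close gzip to write the trailer.
	if err := bw.Flush(); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

// gunzipString reverses gzipString; it exists only to check the round trip.
func gunzipString(data []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	plain, err := io.ReadAll(zr)
	if err != nil {
		return "", err
	}
	return string(plain), nil
}

func main() {
	packed, err := gzipString("hello, gopher")
	if err != nil {
		panic(err)
	}
	plain, err := gunzipString(packed)
	if err != nil {
		panic(err)
	}
	fmt.Println(plain)
}
```

This still copies the bytes once (into the bufio buffer), but it needs no unsafe and no per-string allocation; whether that matters next to the cost of compression is something only a benchmark on real data can tell.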
There are also ways to access the contents of a string without converting it to []byte, such as indexing, or using a loop where the compiler optimizes away the copy:

s := "something"
for i, v := range []byte(s) { // Copying s is optimized away
    // ...
}
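The indexing option mentioned above can be sketched as follows. Indexing a string yields a byte directly, so no conversion is involved at all; countByte is a hypothetical helper chosen only for illustration.

```go
package main

import "fmt"

// countByte counts occurrences of c in s by indexing the string directly.
// s[i] yields a byte, so no []byte conversion (and no copy) is involved.
func countByte(s string, c byte) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] == c {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countByte("something", 'e')) // prints 1
}
```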

Also see related questions:

[]byte(string) vs []byte(*string)
What are the possible consequences of using unsafe conversion from []byte to string in go?
What is the difference between the string and []byte in Go?
Does conversion between alias types in Go create copies?
How does type conversion internally work? What is the memory utilization for the same?

js4nwp542#

After some extensive investigation, I believe I've discovered the most efficient way of getting a []byte from a string as of Go 1.17 (this is for i386/x86_64 gc; I haven't tested other architectures). The trade-off for efficient code here is that it is inefficient to write, though.
Before I say anything else, it should be made clear that the differences are ultimately very small and probably inconsequential -- the info below is for fun/educational purposes only.

Summary

With some minor alterations, the accepted answer illustrating the technique of slicing a pointer to array is the most efficient way. That being said, I wouldn't be surprised if unsafe.Slice becomes the (decisively) better choice in the future.

unsafe.Slice

unsafe.Slice currently has the advantage of being slightly more readable, but I'm skeptical about its performance. It looks like it makes a call to runtime.unsafeslice. The following is the gc amd64 1.17 assembly of the function provided in Atamiri's answer (FUNCDATA omitted). Note the stack check (lack of NOSPLIT):

unsafeGetBytes_pc0:
        TEXT    "".unsafeGetBytes(SB), ABIInternal, $48-16
        CMPQ    SP, 16(R14)
        PCDATA  $0, $-2
        JLS     unsafeGetBytes_pc86
        PCDATA  $0, $-1
        SUBQ    $48, SP
        MOVQ    BP, 40(SP)
        LEAQ    40(SP), BP

        PCDATA  $0, $-2
        MOVQ    BX, ""..autotmp_4+24(SP)
        MOVQ    AX, "".s+56(SP)
        MOVQ    BX, "".s+64(SP)
        MOVQ    "".s+56(SP), DX
        PCDATA  $0, $-1
        MOVQ    DX, ""..autotmp_5+32(SP)
        LEAQ    type.uint8(SB), AX
        MOVQ    BX, CX
        MOVQ    DX, BX
        PCDATA  $1, $1
        CALL    runtime.unsafeslice(SB)
        MOVQ    ""..autotmp_5+32(SP), AX
        MOVQ    ""..autotmp_4+24(SP), BX
        MOVQ    BX, CX
        MOVQ    40(SP), BP
        ADDQ    $48, SP
        RET
unsafeGetBytes_pc86:
        NOP
        PCDATA  $1, $-1
        PCDATA  $0, $-2
        MOVQ    AX, 8(SP)
        MOVQ    BX, 16(SP)
        CALL    runtime.morestack_noctxt(SB)
        MOVQ    8(SP), AX
        MOVQ    16(SP), BX
        PCDATA  $0, $-1
        JMP     unsafeGetBytes_pc0

Other unimportant fun facts about the above (easily subject to change): compiled size of 3326 B; an inline cost of 7; correct escape analysis: s leaks to ~r1 with derefs=0.

Carefully Modifying *reflect.SliceHeader

This method has the advantage/disadvantage of letting one modify the internal state of a slice directly. Unfortunately, due to its multi-line nature and use of uintptr, the GC can easily mess things up if one is not careful about keeping a reference to the original string. (Here I avoided creating temporary pointers to reduce inline cost and to avoid needing to add runtime.KeepAlive):

func unsafeGetBytes(s string) (b []byte) {
    (*reflect.SliceHeader)(unsafe.Pointer(&b)).Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
    (*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
    (*reflect.SliceHeader)(unsafe.Pointer(&b)).Len = len(s)
    return
}

The corresponding assembly on amd64 ( FUNCDATA omitted):

TEXT    "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
        SUBQ    $32, SP
        MOVQ    BP, 24(SP)
        LEAQ    24(SP), BP

        MOVQ    AX, "".s+40(SP)
        MOVQ    BX, "".s+48(SP)
        MOVQ    $0, "".b(SP)
        MOVUPS  X15, "".b+8(SP)
        MOVQ    "".s+40(SP), DX
        MOVQ    DX, "".b(SP)
        MOVQ    "".s+48(SP), CX
        MOVQ    CX, "".b+16(SP)
        MOVQ    "".s+48(SP), BX
        MOVQ    BX, "".b+8(SP)
        MOVQ    "".b(SP), AX
        MOVQ    24(SP), BP
        ADDQ    $32, SP
        RET

Other unimportant fun facts about the above (easily subject to change): compiled size of 3700 B; an inline cost of 20; subpar escape analysis: s leaks to {heap} with derefs=0.

Unsafer version of modifying SliceHeader

Adapted from Nuno Cruces' answer. This relies on the inherent structural similarity between StringHeader and SliceHeader, so in a sense it breaks "more easily". Additionally, it temporarily creates an illegal state where cap(b) (being 0) is less than len(b).

func unsafeGetBytes(s string) (b []byte) {
    *(*string)(unsafe.Pointer(&b)) = s
    (*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
    return
}

Corresponding assembly ( FUNCDATA omitted):

TEXT    "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
        SUBQ    $32, SP
        MOVQ    BP, 24(SP)
        LEAQ    24(SP), BP
        MOVQ    AX, "".s+40(FP)

        MOVQ    $0, "".b(SP)
        MOVUPS  X15, "".b+8(SP)
        MOVQ    AX, "".b(SP)
        MOVQ    BX, "".b+8(SP)
        MOVQ    BX, "".b+16(SP)
        MOVQ    "".b(SP), AX
        MOVQ    BX, CX
        MOVQ    24(SP), BP
        ADDQ    $32, SP
        NOP
        RET

Other unimportant details: compiled size 3636 B, inline cost of 11 , with subpar escape analysis: s leaks to {heap} with derefs=0 .

Slicing a pointer to array

This is the accepted answer (shown here for comparison) -- its primary disadvantage is its ugliness (viz. the magic number 0x7fff0000). There's also the tiniest possibility of getting a string bigger than the array, and an unavoidable bounds check.

func unsafeGetBytes(s string) []byte {
    return (*[0x7fff0000]byte)(unsafe.Pointer(
        (*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
    )[:len(s):len(s)]
}

Corresponding assembly ( FUNCDATA removed).

TEXT    "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $24-16
        SUBQ    $24, SP
        MOVQ    BP, 16(SP)
        LEAQ    16(SP), BP

        PCDATA  $0, $-2
        MOVQ    AX, "".s+32(SP)
        MOVQ    BX, "".s+40(SP)
        MOVQ    "".s+32(SP), AX
        PCDATA  $0, $-1
        TESTB   AL, (AX)
        NOP
        CMPQ    BX, $2147418112
        JHI     unsafeGetBytes_pc54
        MOVQ    BX, CX
        MOVQ    16(SP), BP
        ADDQ    $24, SP
        RET
unsafeGetBytes_pc54:
        MOVQ    BX, DX
        MOVL    $2147418112, BX
        PCDATA  $1, $1
        NOP
        CALL    runtime.panicSlice3Alen(SB)
        XCHGL   AX, AX

Other unimportant details: compiled size 3142 B, inline cost of 9, with correct escape analysis: s leaks to ~r1 with derefs=0.
Note the runtime.panicSlice3Alen -- this is the bounds check that verifies that len(s) does not exceed 0x7fff0000.

Improved slicing pointer to array

This is what I've concluded to be the most efficient method as of Go 1.17. I basically modified the accepted answer to eliminate the bounds check, and found a "more meaningful" constant (math.MaxInt32) to use than 0x7fff0000. Using MaxInt32 preserves 32-bit compatibility.

func unsafeGetBytes(s string) []byte {
    const MaxInt32 = 1<<31 - 1
    return (*[MaxInt32]byte)(unsafe.Pointer((*reflect.StringHeader)(
                    unsafe.Pointer(&s)).Data))[:len(s)&MaxInt32:len(s)&MaxInt32]
}

Corresponding assembly ( FUNCDATA removed):

TEXT    "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $0-16

        PCDATA  $0, $-2
        MOVQ    AX, "".s+8(SP)
        MOVQ    BX, "".s+16(SP)
        MOVQ    "".s+8(SP), AX
        PCDATA  $0, $-1
        TESTB   AL, (AX)
        ANDQ    $2147483647, BX
        MOVQ    BX, CX
        RET

Other unimportant details: compiled size 3188 B, inline cost of 13, and correct escape analysis: s leaks to ~r1 with derefs=0.
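One consequence of replacing the bounds check with a mask is worth spelling out: the mask cannot panic, so a length above MaxInt32 is silently reduced instead of being rejected. The sketch below only reproduces the arithmetic of len(s)&MaxInt32. The function name maskedLen and the use of int64 (so that it also compiles on 32-bit targets) are assumptions made for this demo.

```go
package main

import "fmt"

const maxInt32 = 1<<31 - 1

// maskedLen mirrors the len(s)&MaxInt32 expression from the function above.
// int64 is used here only so that the demo compiles on 32-bit targets too.
func maskedLen(n int64) int64 {
	return n & maxInt32
}

func main() {
	fmt.Println(maskedLen(6))         // 6: short lengths pass through unchanged
	fmt.Println(maskedLen(1<<31 - 1)) // 2147483647: the largest length that survives
	fmt.Println(maskedLen(1 << 31))   // 0: a 2 GiB string would come back empty
	fmt.Println(maskedLen(1<<31 + 7)) // 7: longer strings are silently truncated
}
```

So for strings of 2 GiB or more on a 64-bit platform, this variant trades a panic for a wrong (shortened) result, which may or may not be acceptable.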

xu3bshqb3#

In Go 1.17, I'd recommend unsafe.Slice, because it is more readable:

unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))

I think this also works (without violating any unsafe.Pointer rule), and it has the advantage that it works for consts:

*(*[]byte)(unsafe.Pointer(&struct{string; int}{s, len(s)}))
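The note below is about the original state of the accepted answer. The accepted answer now mentions Ian Lance Taylor's (authoritative) solution. I am keeping the note because it points out a common mistake.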

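The accepted answer is wrong and can produce the panic that @RFC mentioned in the comments. @icza's explanation about GC and keep-alive is misleading.
The reason the capacity is zero (or even an arbitrary value) is more mundane.
A slice is: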

type SliceHeader struct {
    Data uintptr
    Len  int
    Cap  int
}

A string is:

type StringHeader struct {
    Data uintptr
    Len  int
}

Converting a byte slice to a string can be done "safely", the way strings.Builder does it:

func (b *Builder) String() string {
    return *(*string)(unsafe.Pointer(&b.buf))
}

This copies the Data pointer and Len from the slice into the string.
The opposite conversion is not "safe", because Cap is not set to a correct value.
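The size mismatch between the two headers can be checked directly. The following is a small sketch, assuming the gc toolchain's layout (the language specification does not promise it); headerWords is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords reports how many machine words the string and slice headers occupy.
func headerWords() (stringWords, sliceWords uintptr) {
	var s string
	var b []byte
	word := unsafe.Sizeof(uintptr(0))
	return unsafe.Sizeof(s) / word, unsafe.Sizeof(b) / word
}

func main() {
	sw, bw := headerWords()
	fmt.Println(sw, bw) // 2 3 with the gc toolchain
}
```

Reading a three-word []byte through a pointer to a two-word string therefore picks up whatever happens to sit in memory after the string as the capacity, which is why the naive cast yields a zero or arbitrary Cap.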

The following (my original answer) is wrong, because it violates unsafe.Pointer rule #1.

Here is the correct code that fixes the panic:

var buf = *(*[]byte)(unsafe.Pointer(&str))
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)

Or alternatively:

var buf []byte
*(*string)(unsafe.Pointer(&buf)) = str
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)

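I should add that all of these conversions are unsafe in the sense that strings are supposed to be immutable, whereas byte arrays/slices are mutable.
However, if you are certain that the byte slice will never be modified, the conversions above will not run into bounds (or GC) problems.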

nxowjjhe4#

As of Go 1.17 we can use unsafe.Slice, so the accepted answer can be rewritten as:

func unsafeGetBytes(s string) []byte {
        return unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))
}

l5tcr1uw5#

I managed to achieve this as follows:

func TestString(t *testing.T) {

    b := []byte{'a', 'b', 'c', '1', '2', '3', '4'}
    s := *(*string)(unsafe.Pointer(&b))
    sb := *(*[]byte)(unsafe.Pointer(&s))

    addr1 := unsafe.Pointer(&b)
    addr2 := unsafe.Pointer(&s)
    addr3 := unsafe.Pointer(&sb)

    fmt.Print("&b=", addr1, "\n&s=", addr2, "\n&sb=", addr3, "\n")

    hdr1 := (*reflect.StringHeader)(unsafe.Pointer(&b))
    hdr2 := (*reflect.SliceHeader)(unsafe.Pointer(&s))
    hdr3 := (*reflect.SliceHeader)(unsafe.Pointer(&sb))

    fmt.Print("b.data=", hdr1.Data, "\ns.data=", hdr2.Data, "\nsb.data=", hdr3.Data, "\n")

    b[0] = 'X'
    sb[1] = 'Y'  // if sb is from a string directly, this will cause nil panic
    fmt.Print("s=", s, "\nsb=")
    for _, c := range sb {
        fmt.Printf("%c", c)
    }
    fmt.Println()

}

Output:

=== RUN   TestString
&b=0xc000218000
&s=0xc00021a000
&sb=0xc000218020
b.data=824635867152
s.data=824635867152
sb.data=824635867152
s=XYc1234
sb=XYc1234

These variables all share the same memory.

6jjcrrmo6#

Go 1.20 (February 2023)

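You can use unsafe.StringData to greatly simplify YenForYang's answer.
StringData returns a pointer to the underlying bytes of str. For an empty string the return value is unspecified, and may be nil.
Since Go strings are immutable, the bytes returned by StringData must not be modified.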

func main() {
    str := "foobar"
    d := unsafe.StringData(str)
    b := unsafe.Slice(d, len(str))
    fmt.Printf("%T, %s\n", b, b) // []uint8, foobar (byte is alias of uint8)
}

Go tip playground: https://go.dev/play/p/FIXe0rb8YHE?v=gotip
Remember that you cannot assign to b[n], because the memory is still read-only.
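Wrapped into a reusable helper, the Go 1.20 approach might look like the sketch below. The names bytesOf and sharesMemory are made up for this illustration; the explicit empty-string check follows from the documented caveat that StringData is unspecified for "". unsafe.SliceData (also added in Go 1.20) is used only to confirm that no copy was made.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesOf returns a read-only view of s without copying (Go 1.20+).
// The returned slice must never be written to.
func bytesOf(s string) []byte {
	if s == "" {
		return nil // StringData is unspecified for the empty string
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// sharesMemory reports whether b starts at the same address as s.
func sharesMemory(s string, b []byte) bool {
	return len(b) > 0 && unsafe.StringData(s) == unsafe.SliceData(b)
}

func main() {
	s := "foobar"
	b := bytesOf(s)
	fmt.Println(string(b), len(b), cap(b), sharesMemory(s, b)) // foobar 6 6 true
	fmt.Println(string(b[1:4]))                                // oob (re-slicing only reads)
}
```

Because unsafe.Slice sets the capacity equal to the length, an append on the result allocates a new backing array instead of writing past the end of the string.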

ncecgwcz7#

Simple, no reflect, and I think it is portable. s is the string and b is the byte slice.

var b []byte
bb:=(*[3]uintptr)(unsafe.Pointer(&b))[:]
copy(bb, (*[2]uintptr)(unsafe.Pointer(&s))[:])
bb[2] = bb[1]
// use b

Remember that the byte values must not be modified (doing so will crash). Re-slicing is fine.
