Checking whether a string contains only ASCII characters in Go

Posted by wfsdck30 12 months ago in Go


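Does Go have a method, or a recommended way, to check whether a string contains only ASCII characters? What is the right way to do it?
From my research, one solution is to check for any character greater than 127.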

func isASCII(s string) bool {
    for _, c := range s {
        if c > unicode.MaxASCII {
            return false
        }
    }

    return true
}


Source: https://stackoverflow.com/questions/53069040/checking-a-string-contains-only-ascii-characters

4 Answers


fnx2tebb1#

In Go, we care about performance, so we would benchmark your code:

func isASCII(s string) bool {
    for _, c := range s {
        if c > unicode.MaxASCII {
            return false
        }
    }
    return true
}

BenchmarkRange-4    20000000    82.0 ns/op

A faster (better, more idiomatic) version that avoids unnecessary rune conversions:

func isASCII(s string) bool {
    for i := 0; i < len(s); i++ {
        if s[i] > unicode.MaxASCII {
            return false
        }
    }
    return true
}

BenchmarkIndex-4    30000000    55.4 ns/op
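The byte-indexed loop gives the same answer as the rune loop because UTF-8 encodes every non-ASCII code point using only bytes with the high bit set (0x80 and above), and because `range` decodes an invalid byte as U+FFFD, which is also above `unicode.MaxASCII`. The following is a minimal sketch that checks this agreement on a few sample inputs; the sample strings are chosen here for illustration and are not part of the original answer.

```go
package main

import (
	"fmt"
	"unicode"
)

// isASCIIRange decodes the string rune by rune. Invalid UTF-8 bytes
// decode to U+FFFD, which is greater than unicode.MaxASCII.
func isASCIIRange(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// isASCIIIndex inspects raw bytes and skips UTF-8 decoding entirely.
func isASCIIIndex(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] > unicode.MaxASCII {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative inputs: empty, plain ASCII, a two-byte code point,
	// invalid UTF-8, and control characters up to 0x7F.
	samples := []string{"", "plain ascii", "caf\u00e9", "\xff\xfe", "tab\tand\x7f"}
	for _, s := range samples {
		fmt.Printf("%q range=%v index=%v\n", s, isASCIIRange(s), isASCIIIndex(s))
	}
}
```

Both functions should report the same result for every input, so the choice between them comes down to speed and readability rather than correctness.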

ascii_test.go:

package main

import (
    "testing"
    "unicode"
)

func isASCIIRange(s string) bool {
    for _, c := range s {
        if c > unicode.MaxASCII {
            return false
        }
    }
    return true
}

func BenchmarkRange(b *testing.B) {
    str := ascii()
    b.ResetTimer()
    for N := 0; N < b.N; N++ {
        is := isASCIIRange(str)
        if !is {
            b.Fatal("notASCII")
        }
    }
}

func isASCIIIndex(s string) bool {
    for i := 0; i < len(s); i++ {
        if s[i] > unicode.MaxASCII {
            return false
        }
    }
    return true
}

func BenchmarkIndex(b *testing.B) {
    str := ascii()
    b.ResetTimer()
    for N := 0; N < b.N; N++ {
        is := isASCIIIndex(str)
        if !is {
            b.Fatal("notASCII")
        }
    }
}

func ascii() string {
    byt := make([]byte, unicode.MaxASCII+1)
    for i := range byt {
        byt[i] = byte(i)
    }
    return string(byt)
}

Output:

$ go test ascii_test.go -bench=.
BenchmarkRange-4    20000000    82.0 ns/op
BenchmarkIndex-4    30000000    55.4 ns/op
$


euoag5mw2#

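It looks like your approach is the best one.
ASCII is simply defined as:
ASCII encodes 128 specified characters into seven-bit integers
So characters have values from 0 to 2^7 - 1 (that is, 0-127, or 0x00-0x7F).
Go provides no built-in method for checking that every character in a string (or every byte in a slice) has a numeric value within a particular range, so your code seems to be the best way to do it.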
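Although there is no dedicated range-check function, the general-purpose `strings.IndexFunc` from the standard library can express the same test in one line. This is a minimal sketch, not a replacement for the loop; `isASCIIFunc` is a hypothetical helper name introduced here for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isASCIIFunc (hypothetical name) reports whether s has no rune above
// unicode.MaxASCII. strings.IndexFunc returns -1 when no rune matches.
func isASCIIFunc(s string) bool {
	return strings.IndexFunc(s, func(r rune) bool {
		return r > unicode.MaxASCII
	}) == -1
}

func main() {
	fmt.Println(isASCIIFunc("south north"))
	fmt.Println(isASCIIFunc("caf\u00e9"))
}
```

Like the `range` loop, this decodes runes, so it is unlikely to be faster than a plain byte loop; its appeal is brevity.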


yxyvkwin3#

Another option:

package main
import "golang.org/x/exp/utf8string"

func main() {
   {
      b := utf8string.NewString("south north").IsASCII()
      println(b) // true
   }
   {
      b := utf8string.NewString("🧡💛💚💙💜").IsASCII()
      println(b) // false
   }
}

https://pkg.go.dev/golang.org/x/exp/utf8string#String.IsASCII


e5nqia274#

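The accepted answer is faster than the originally proposed solution, but I would argue that the original is more idiomatic. It is idiomatic because, almost always, when you look at the contents of a string in Go you care about code points rather than individual UTF-8 encoded byte values. Range is 100% idiomatic here.
In this particular case, indexing the string and comparing bytes happens to be faster and still gives a correct result. But it is still not an optimized solution. Not even close.
So let's bikeshed.
For a non-idiomatic (though based on stdlib code) implementation that is considerably faster, I came up with the following: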

func isASCIIIndexBy8s32(s string) bool {
    // idea adapted from here:
    // https://cs.opensource.google/go/go/+/refs/tags/go1.21.5:src/unicode/utf8/utf8.go;l=528
    for len(s) > 0 {
        if len(s) >= 8 {
            first32 := uint32(s[0]) | uint32(s[1])<<8 | uint32(s[2])<<16 | uint32(s[3])<<24
            second32 := uint32(s[4]) | uint32(s[5])<<8 | uint32(s[6])<<16 | uint32(s[7])<<24
            if (first32|second32)&0x80808080 != 0 {
                return false
            }
            s = s[8:]
            continue
        }
        if s[0] > unicode.MaxASCII {
            return false
        }
        s = s[1:]
    }
    return true
}

This is more than twice as fast as the indexing example, and a further optimization for 64-bit CPUs gains roughly another 20% on top of that.
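The speedup comes from testing several bytes with one mask: a byte is non-ASCII exactly when its top bit (0x80) is set, so packing bytes into a word and AND-ing with a repeated 0x80 pattern checks all of them at once. The sketch below isolates that step for a single 8-byte chunk; `pack8` and `hasHighBit` are hypothetical helper names used only for illustration.

```go
package main

import "fmt"

// pack8 (hypothetical helper) packs the first 8 bytes of s into a
// uint64 in little-endian order. It panics if len(s) < 8.
func pack8(s string) uint64 {
	_ = s[7] // single bounds check for the eight reads below
	return uint64(s[0]) | uint64(s[1])<<8 | uint64(s[2])<<16 | uint64(s[3])<<24 |
		uint64(s[4])<<32 | uint64(s[5])<<40 | uint64(s[6])<<48 | uint64(s[7])<<56
}

// hasHighBit (hypothetical helper) reports whether any of the 8 bytes
// packed into w has its top bit set, i.e. is 0x80 or greater.
func hasHighBit(w uint64) bool {
	return w&0x8080808080808080 != 0
}

func main() {
	ascii := "ASCII!!!"       // 8 bytes, all below 0x80
	mixed := "caf\u00e9 au"   // 8 bytes, includes 0xC3 0xA9
	fmt.Println(hasHighBit(pack8(ascii)))
	fmt.Println(hasHighBit(pack8(mixed)))
}
```

The byte order used when packing does not affect the result, since the mask tests the same bit in every byte position.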
The complete code, including tests and benchmarks:

package main

import (
    "testing"
    "unicode"
)

func isASCIIRange(s string) bool {
    for _, c := range s {
        if c > unicode.MaxASCII {
            return false
        }
    }
    return true
}

func isASCIIIndex(s string) bool {
    for i := 0; i < len(s); i++ {
        if s[i] > unicode.MaxASCII {
            return false
        }
    }
    return true
}

func isASCIIIndexBy8s32(s string) bool {
    // idea adapted from here:
    // https://cs.opensource.google/go/go/+/refs/tags/go1.21.5:src/unicode/utf8/utf8.go;l=528
    for len(s) > 0 {
        if len(s) >= 8 {
            first32 := uint32(s[0]) | uint32(s[1])<<8 | uint32(s[2])<<16 | uint32(s[3])<<24
            second32 := uint32(s[4]) | uint32(s[5])<<8 | uint32(s[6])<<16 | uint32(s[7])<<24
            if (first32|second32)&0x80808080 != 0 {
                return false
            }
            s = s[8:]
            continue
        }
        if s[0] > unicode.MaxASCII {
            return false
        }
        s = s[1:]
    }
    return true
}

func isASCIIIndexBy8s64(s string) bool {
    // optimizing the 32 bit example above for 64 bit cpus
    for len(s) > 0 {
        if len(s) >= 8 {
            chunk := uint64(s[0]) | uint64(s[1])<<8 | uint64(s[2])<<16 | uint64(s[3])<<24 |
                uint64(s[4])<<32 | uint64(s[5])<<40 | uint64(s[6])<<48 | uint64(s[7])<<56
            if chunk&0x8080808080808080 != 0 {
                return false
            }
            s = s[8:]
            continue
        }
        if s[0] > unicode.MaxASCII {
            return false
        }
        s = s[1:]
    }
    return true
}

var tests = []struct {
    n string
    f func(string) bool
}{
    {
        n: "range",
        f: isASCIIRange,
    }, {
        n: "index",
        f: isASCIIIndex,
    }, {
        n: "index 8 32",
        f: isASCIIIndexBy8s32,
    }, {
        n: "index 8 64",
        f: isASCIIIndexBy8s64,
    },
}

func TestAsciis(t *testing.T) {
    atxt := "this is ascii text"
    utxt := "this is some unicode 💩💩and a lot more ascii text afterwards"
    for _, tst := range tests {
        t.Run(tst.n, func(t *testing.T) {
            if !tst.f(atxt) {
                t.Errorf("%s failed for %s", tst.n, atxt)
            }
            if tst.f(utxt) {
                t.Errorf("%s failed for %s", tst.n, utxt)
            }
        })
    }
}

func BenchmarkAsciis(b *testing.B) {
    str := ascii()
    for _, bnch := range tests {
        b.Run(bnch.n, func(b *testing.B) {
            b.ResetTimer()
            for n := 0; n < b.N; n++ {
                if !bnch.f(str) {
                    b.Errorf("not ascii")
                }
            }
        })
    }
}

func ascii() string {
    byt := make([]byte, unicode.MaxASCII+1)
    for i := range byt {
        byt[i] = byte(i)
    }
    return string(byt)
}

Running this gives:

=== RUN   TestAsciis
=== RUN   TestAsciis/range
=== RUN   TestAsciis/index
=== RUN   TestAsciis/index_8_32
=== RUN   TestAsciis/index_8_64
--- PASS: TestAsciis (0.00s)
    --- PASS: TestAsciis/range (0.00s)
    --- PASS: TestAsciis/index (0.00s)
    --- PASS: TestAsciis/index_8_32 (0.00s)
    --- PASS: TestAsciis/index_8_64 (0.00s)
goos: darwin
goarch: amd64
pkg: playground
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkAsciis
BenchmarkAsciis/range
BenchmarkAsciis/range-8             13469913            86.72 ns/op
BenchmarkAsciis/index
BenchmarkAsciis/index-8             24108583            49.56 ns/op
BenchmarkAsciis/index_8_32
BenchmarkAsciis/index_8_32-8        62092825            20.89 ns/op
BenchmarkAsciis/index_8_64
BenchmarkAsciis/index_8_64-8        77044797            15.91 ns/op
PASS

