Go vs Rust: String

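Both Go and Rust have a built-in string type (string and String, respectively), and both are backed by a sequence of bytes. Rust guarantees that a String is valid UTF-8, while a Go string can hold arbitrary bytes and is UTF-8 encoded by convention (string literals in Go source are UTF-8).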

The official documentation of both languages distinguishes between a byte, a Unicode Scalar Value (Rust) / Unicode Code Point (Go) (also see the difference between Scalar Value and Code Point), and a grapheme cluster (the so-called character).
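To make the first two levels concrete, here is a minimal Go sketch that counts bytes and Code Points of the same string. The helper name byteAndRuneCount is only for illustration; grapheme clusters are left out because the standard library has no API for them.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount (hypothetical helper) returns the number of bytes
// and the number of Code Points in s.
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneCount("नमस्ते")
	fmt.Println(b, r) // 18 6: six Code Points, three bytes each
}
```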

Next, let’s see the language behavior for each case.

Index and Loop

Rust doesn’t allow users to index or loop over a String directly. Because Rust prefers to expose possible errors, it asks users to be specific about what they index or loop over. The standard library provides chars() (Scalar Values) and bytes() (bytes):

for c in "नमस्ते".chars() {
    println!("{}", c);
}

// Output:
// न
// म
// स
// ्
// त
// े

for b in "नमस्ते".bytes() {
    println!("{}", b);
}

// Output:
// 224
// 164
// ...
// 165
// 135

In contrast, Go always allows users to index a string; the index refers to the underlying bytes:

s := "नमस्ते"
fmt.Println(s[0])

// Output:
// 224
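If the goal is to index by Code Point rather than by byte, one option is to convert the string to a []rune first. The sketch below assumes a hypothetical helper runeAt; note that the conversion decodes and copies the whole string, so it is not a constant-time lookup.

```go
package main

import "fmt"

// runeAt (hypothetical helper) returns the i-th Code Point of s.
// It converts the whole string to []rune, which costs O(len(s)).
func runeAt(s string, i int) rune {
	return []rune(s)[i]
}

func main() {
	s := "नमस्ते"
	fmt.Println(s[0])                 // 224 (first byte)
	fmt.Println(string(runeAt(s, 0))) // न (first Code Point)
}
```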

For looping, Go provides two ways. The first is to loop by index, which iterates over the bytes. A for range loop, by contrast, decodes one UTF-8-encoded rune (an alias for int32, a.k.a. Code Point) on each iteration:

for i := 0; i < len(s); i++ {
    fmt.Println(s[i])
}
// Output:
// 224
// 164
// ...
// 165
// 135

for idx, p := range s {
    fmt.Printf("%s (byte offset: %d)\n", string(p), idx)
}
// Output:
// न (byte offset: 0)
// म (byte offset: 3)
// स (byte offset: 6)
// ् (byte offset: 9)
// त (byte offset: 12)
// े (byte offset: 15)

Note that the for range loop converts each rune to a string. Without the conversion, printing the rune (for example with fmt.Println) would show its numeric Code Point value rather than its string representation. Also note that the index advances by the number of bytes the Code Point occupies.
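The byte offsets reported by for range can be reproduced by decoding manually with utf8.DecodeRuneInString, which returns the rune and its width in bytes. This is only a sketch of the idea, with a hypothetical helper name runeOffsets, not a description of how the compiler implements range.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets (hypothetical helper) returns the byte offset at which
// each Code Point of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := 0; i < len(s); {
		// size is the number of bytes the decoded rune occupies
		// (1 for an invalid byte, so the loop always advances).
		_, size := utf8.DecodeRuneInString(s[i:])
		offsets = append(offsets, i)
		i += size
	}
	return offsets
}

func main() {
	fmt.Println(runeOffsets("नमस्ते")) // [0 3 6 9 12 15]
}
```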

Slice

Rust allows users to slice a String into a string slice (i.e. &str). However, this is a dangerous operation, since the program panics if a slice boundary is not a valid char (Scalar Value) boundary.

let s = "नमस्ते";
&s[..2];
// &s[..3] // works

This panics:

thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside 'न' (bytes 0..3) of `नमस्ते`', src/main.rs:5:6
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Go is less strict and allows users to index or slice into the middle of a Code Point. It also provides the unicode/utf8 package, which lets users (e.g.) validate a string or a rune.

Loop Per Grapheme Cluster (i.e. Character)

Rust has no built-in support for this, but the unicode-segmentation crate provides it:

use unicode_segmentation::UnicodeSegmentation;

for g in "नमस्ते्".graphemes(true) {
    println!("{}", g);
}

// Output:
// न
// म
// स्
// ते्

Go has no official support either, but several community packages are available: github.com/rivo/uniseg, github.com/blevesearch/segment.

Normalization

For Rust, see https://docs.rs/unicode-normalization; for Go, see https://blog.golang.org/normalization.
