

How to Split a String in POSIX Shell

Handling multibyte characters gets tricky

November 12, 2021

The POSIX Shell specification defines a minimalist shell language with few features compared to stalwarts like Bash. Yet POSIX shells are everywhere: Dash is the default /bin/sh on Ubuntu, and many Docker containers only have busybox ash. Minimalist shells use fewer resources and are faster than featureful shells like Bash and Zsh¹. They are more likely to be secure too, as the reduced feature sets are easier to reason about and provide a smaller attack surface.

The trouble starts when a developer needs a feature that POSIX shells don’t provide: I wanted to split a string into characters, yet there is no builtin function to do this. Common advice is to pipe the string to sed:

#!/bin/sh
split_string () {
IFS='	'
   # '%s' keeps % and \ in the input from being read as printf directives
   for c in $(printf '%s\n' "$1" | sed 's/./&\t/g');do
     printf '%s\n' "$c"
   done
}
while read -r line;do
IFS='
'
  split_string "$line"
done

This code defines a function called split_string which sets the Internal Field Separator (IFS) to a horizontal tab and pipes its argument to sed. The substitution s/./&\t/g replaces every character with itself plus a horizontal tab. Because IFS is set to tab, the for loop splits sed's output on tabs and prints one character per line. The while loop reads input one line at a time; its body resets IFS to a literal newline (POSIX shell has no character escapes like \n) and then calls split_string on the line.

echo foo | ./split-sed.sh
f
o
o
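
The field splitting that the for loop relies on can be seen in isolation. This is only an illustrative sketch: count_fields, tab and sample are hypothetical names that are not part of the script above, and the tab is produced with printf so that it is visible in the source.

```shell
#!/bin/sh
# Hypothetical helper: count the fields that an unquoted expansion of $2
# yields when IFS is set to the separator given in $1.
count_fields () {
  old_ifs=$IFS
  IFS=$1
  set -f            # disable globbing so characters like * stay literal
  set -- $2         # unquoted on purpose: field splitting happens here
  set +f
  IFS=$old_ifs
  echo "$#"
}

tab=$(printf '\t')              # a tab without typing a literal one
sample=$(printf 'a\tb c\td')    # three tab-separated fields

count_fields "$tab" "$sample"   # prints 3: "a", "b c" and "d"
count_fields ' ' "$sample"      # prints 2: the only space separates two fields
```

Only the separator changes between the two calls, which is why a tab that arrives in the input is indistinguishable from one that sed added.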

There are a couple of issues here. First, it uses tab as a sentinel, which means any tabs present in the input are silently dropped.

echo 'f	oo' | ./split-sed.sh
f
o
o

You could pick another sentinel character that is unlikely to appear in the input, but whichever one you pick is dropped in the same way. Second, the performance is atrocious. For every line of input, the command substitution launches a subshell, forks twice, and copies the output back. On my laptop it takes over 22 seconds to chow down a 312 KB text file.

Here’s a faster way:

#!/bin/sh
IFS='
'
split_string () {
  OPTIND=1;
  while getopts ":" opt "-$1"
    do printf '%s\n' "${OPTARG:-:}"
  done
}
while read -r line;do
  split_string "$line"
done

This version of split_string² uses the builtin getopts to parse its argument as if it were a program's option string. The leading colon in the optstring puts getopts in silent mode, where each unrecognized option character is stored in OPTARG; because the optstring names no options at all, every character is reported this way. The argument is prefixed with a dash so that it looks like an option string. If OPTARG is unset or null, it's because the character was a colon, so the code uses parameter expansion to supply a colon as the default value. This code is about 55x faster at processing the large text file than the sed version. It also has no trouble with tab:

echo 'f	oo' | ./split-getopts.sh
f
	
o
o
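
To see what getopts hands back on each iteration, the loop can print both variables instead of only OPTARG. The sketch below is illustrative; show_getopts is a hypothetical name. It leans on the POSIX rule that, when the optstring starts with a colon, an unrecognized option character sets the named variable to ? and OPTARG to the character found.

```shell
#!/bin/sh
# Hypothetical helper: print the variables getopts sets for each
# character of $1 when the optstring is a lone colon.
show_getopts () {
  OPTIND=1                        # restart parsing on every call
  while getopts ":" opt "-$1"; do
    # ${OPTARG:-:} substitutes a colon when OPTARG is unset or null
    printf 'opt=%s OPTARG=%s\n' "$opt" "${OPTARG:-:}"
  done
}

show_getopts ab
# opt=? OPTARG=a
# opt=? OPTARG=b
```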

However, getopts splits the string bytewise, which shreds multibyte characters. That is why this snowman's three bytes are printed on three separate lines:

echo '⛄' | ./split-getopts.sh



Understanding locale

The locale environment variables tell programs which language and cultural conventions to use, such as date and number formats or weights and measures.

The locale program prints the current settings:

locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

So my environment’s locale is set to “en_US.UTF-8”. This translates as US English in the UTF-8 character encoding. The LC_CTYPE variable defines the locale programs should use for character classification and case conversion.

Now some detective work is needed. I want to find the definitions for LC_CTYPE for my locale. Since the definitions are separate from the encoding, I’m looking for a file named en_US. The locale manpage says the definition files³ are stored under /usr/share/i18n/locales and sure enough, there is a file named /usr/share/i18n/locales/en_US. Searching for LC_CTYPE:

grep -A1 ^LC_CTYPE /usr/share/i18n/locales/en_US
LC_CTYPE
copy "en_GB"

This says that its LC_CTYPE definition is a copy of the en_GB definition. So I grep that file:

grep -A1 ^LC_CTYPE /usr/share/i18n/locales/en_GB
LC_CTYPE
copy "i18n"

Which is a copy of the i18n definition:

grep -A2 ^LC_CTYPE /usr/share/i18n/locales/i18n
LC_CTYPE

copy "i18n_ctype"

Which is a copy of the i18n_ctype definition. Looking at the file /usr/share/i18n/locales/i18n_ctype, it doesn’t have a copy directive, it has the definitions! The entry for the print character class looks like this (truncated for brevity):

print /
   ...
   <U2440>..<U244A>;<U2460>..<U2B73>;<U2B76>..<U2B95>;<U2B98>..<U2C2E>;/
   ...

This is a list of Unicode codepoints that should be considered part of the print character class. The snowman has the codepoint U+26C4, which falls in the range <U2460>..<U2B73> above. So with the correct locale settings, POSIX shells should be able to pattern match snowmen, and indeed all printable Unicode codepoints!
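
The range membership can be checked with plain shell arithmetic, and the same snippet shows why the snowman is a multibyte character. This is a rough sketch: in_range is a hypothetical helper, and the snowman is written as octal escapes for its UTF-8 bytes (E2 9B 84) so the example does not depend on the editor's encoding.

```shell
#!/bin/sh
# Hypothetical helper: succeed if hex codepoint $1 lies in [$2, $3].
in_range () {
  [ "$((0x$1))" -ge "$((0x$2))" ] && [ "$((0x$1))" -le "$((0x$3))" ]
}

if in_range 26C4 2460 2B73; then
  echo "U+26C4 is in <U2460>..<U2B73>"
fi

# One codepoint, but three bytes once encoded as UTF-8.
printf '\342\233\204' | wc -c     # counts 3 bytes
```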

Splitting multibyte characters

Knowing that the locale changes the print character class for pattern matching, the obvious fix for the shredded snowman is to buffer bytes until they form a printable character:

#!/bin/sh
IFS='
'
split_string () {
  OPTIND=1;
  while getopts ":" opt "$1";do
    buf="$buf${OPTARG:-:}"
    case "$buf" in
      ([[:print:]])
        printf '%s\n' "$buf" && buf=
    esac
  done
  [ -n "$buf" ] && printf '%s\n' "$buf" && buf=
}
while read -r line;do
  split_string "-$line"
done

And now the snowman prints as expected:

echo '⛄' | ./split-buffer.sh 
⛄

What about two snowmen?

echo '⛄⛄' | ./split-buffer.sh 
⛄⛄

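Uh oh. Dash does not use the locale in pattern matching, so a multibyte sequence never matches [[:print:]], and the single snowman only appeared because of the final flush after the loop. (Bash and Zsh do use the locale, which is one reason they are slower than Dash.) Another way this scheme can fail is with combining characters, as these modify the character preceding them: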

echo 'ǹ' | bash scratch/split-buffer.sh 
n

The ǹ is a composition of two codepoints: U+006E (n) and U+0300 (combining grave accent). Even though I’m using Bash, the “n” is a printable character, so the code prints it immediately, separating it from the combining character. In other words, this algorithm splits on printable codepoints, not characters⁴.
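
A quick way to find out which behavior a given shell has is to match a multibyte character against the single-character pattern ?. The probe below is a hypothetical sketch, not part of the scripts above; the snowman is again written as octal escapes for its UTF-8 bytes.

```shell
#!/bin/sh
# Hypothetical probe: does this shell's pattern matching treat $1 as a
# single character, or as a sequence of several bytes?
probe () {
  case $1 in
    ?) echo "one character" ;;
    *) echo "several bytes" ;;
  esac
}

probe a                              # one character, in any shell
probe "$(printf '\342\233\204')"     # depends on the shell and the locale
```

Going by the behavior described above, Dash would be expected to report several bytes for the snowman, while Bash in a UTF-8 locale would report one character.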

Notes

  1. Speed was the primary reason cited by Ubuntu. In testing the code in this article, Dash was 2-3 times faster than Bash and Zsh.
  2. Thanks to Koichi Nakashima for improving my original answer.
  3. man 5 locale.
  4. Dash and busybox ash behavior could be improved to split on ASCII characters by pattern matching OPTARG before appending to buf.
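
One possible reading of note 4, as an untested-on-every-shell sketch rather than a definitive fix: match each OPTARG on its own before touching the buffer, so that a printable single-byte character always flushes whatever bytes are pending. The name split_string_ascii is hypothetical.

```shell
#!/bin/sh
# Hypothetical variant of split_string: printable single-byte characters
# are printed immediately; everything else is buffered until the next one.
split_string_ascii () {
  OPTIND=1
  buf=
  while getopts ":" opt "-$1"; do
    c=${OPTARG:-:}
    case $c in
      [[:print:]])
        # c is printable by itself: flush pending bytes, then print c
        if [ -n "$buf" ]; then printf '%s\n' "$buf"; buf=; fi
        printf '%s\n' "$c"
        ;;
      *)
        # not printable alone, for example one byte of a multibyte character
        buf=$buf$c
        ;;
    esac
  done
  # flush whatever is left at the end of the string
  if [ -n "$buf" ]; then printf '%s\n' "$buf"; buf=; fi
}

split_string_ascii foo
# f
# o
# o
```

Under this scheme adjacent multibyte characters still come out together, since nothing separates their bytes; it only guarantees that ASCII characters are split from their neighbors.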

Tags: posix getopts ash bash dash zsh unicode utf8 locale