Splitting Quoted Strings in Bash
December 28, 2021
A few weeks ago I was writing a macro loader for my JSON Processor, and I ran into an odd case where code I entered in the terminal was splitting differently to code I loaded from a file. It turns out that Bash’s word splitting behaves differently for arguments than it does for variables.
When processing args Bash will not split quoted words that contain a delimiter:
printf "%s\n" foo 'bar baz'
foo
bar baz
But if those args are stored in a string variable, the quoted expansion is passed as a single word:
input='foo "bar baz"'
printf "%s\n" "$input"
foo "bar baz"
Unquoted, the variable is split on whitespace and the quote characters are kept as literal text:
printf "%s\n" $input
foo
"bar
baz"
What I really want is to split the variable into an array of words. Then printf works as I want it to:
input=(foo "bar baz")
printf "%s\n" "${input[@]}"
foo
bar baz
I can’t use read -a because that will split the input ignoring quotes. In fact the only built-in way to accomplish this is via eval:
input='foo "bar baz"'
eval "printf \"%s\n\" $input"
foo
bar baz
But using eval invites a whole set of complications I’d rather avoid. And spawning an external program (awk, xargs) is too slow for my use case. What to do?
Bash really doesn’t have a good way to parse a string into substrings, while respecting quotes.
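To make the two dead ends above concrete, here is a minimal sketch of what read -a and eval do with quoted input. The variable names words, hostile and args, and the INJECTED marker, are made up for illustration.

```shell
#!/bin/bash
input='foo "bar baz"'

# read -a splits on IFS only; quote characters are kept as ordinary text,
# so this yields three words: foo, "bar and baz"
read -r -a words <<< "$input"
printf '%d words\n' "${#words[@]}"
printf '%s\n' "${words[@]}"

# eval re-parses the string as shell code, so a command substitution
# hidden in the input is executed while the array is being built.
hostile='foo "$(echo INJECTED)"'
eval "args=($hostile)"
printf '%s\n' "${args[@]}"
```

The second half is the reason eval is risky for untrusted input: the echo here is harmless, but any command could stand in its place.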
Bash Quoting
Bash splits words on unescaped spaces, horizontal tabs and newlines. So to determine when to split a word we need to understand Bash escape rules. There are four(ish) kinds:
- Unquoted - tab/space/newline can be escaped with backslash:
foo bar\ baz
- Single quoted - literal string, no escapes recognized:
'foo' 'bar baz'
- Double quoted - double quotes can be escaped with backslash:
"foo" "bar baz"
- ANSI-C - single quotes can be escaped and many other backslash sequences are recognized:
$'foo' 'bar baz'
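The four kinds of quoting can be compared side by side by letting Bash do the splitting and printing what arrives. This is only a sketch; show is a hypothetical helper, not part of the code in this post.

```shell
#!/bin/bash
# show prints how many arguments it received and each one in angle brackets.
show() { printf '%d words:' "$#"; printf ' <%s>' "$@"; printf '\n'; }

show foo bar\ baz             # unquoted: backslash escapes the space
show 'foo' 'bar\ baz'         # single quoted: the backslash is literal
show "foo" "bar \"baz\""      # double quoted: \" is an escaped quote
show $'foo' $'bar\'s baz'     # ANSI-C: \' is an escaped single quote
```

Each call receives exactly two words; only the second word differs, depending on which escape rules applied.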
Splitting words
Decoding and interpreting ANSI-C escape sequences seems difficult, and I don’t need to support it for my use case. To handle the other three kinds, I need to write code that handles each set of escape rules, scans a string, and keeps track of which kind of quoting is currently in effect.
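#!/bin/bash
# Split $1 into words, honouring unquoted, single-quoted and double-quoted
# escape rules. Sets WORDS (array), WORDC (count) and WORDERR (error text).
wordsplit () {
  WORDS=()
  WORDC=0
  WORDERR=
  local idx=0 quo= word= c= esc=
  while :;do
    c="${1:$idx:1}"   # current character; empty once past the end of input
    (( idx++ ))
    if [ -z "$quo" ];then
      # Not inside quotes
      if [ "$c" = '\' ];then
        if [ -z "$esc" ];then
          esc=1   # remember the backslash; it escapes the next character
          continue
        else
          esc=
        fi
      elif ([[ "$c" == [$' \t\n'] ]]&&[ -z "$esc" ])||[ -z "$c" ];then
        # Unescaped delimiter or end of input: finish the current word
        if [ -n "$word" ];then
          WORDS+=("$word")
          word=
          (( WORDC++ ))
        fi
        [ -z "$c" ] && break
        continue
      elif ([ "$c" = "'" ]||[ "$c" = '"' ])&&[ -z "$esc" ];then
        quo="$c"   # opening quote
        continue
      fi
    elif [ -z "$c" ];then
      # Input ended while a quote was still open
      WORDERR="found unterminated string at col $idx: '$1'"
      return 1
    elif [ "$c" = '\' ]&&[ "$quo" = '"' ];then
      # In double quotes a backslash only escapes $ \ ` " and newline
      if [[ "${1:$idx:1}" == [$\\\`\"] ]] || [ "${1:$idx:1}" = $'\n' ];then
        c="${1:$idx:1}"
        (( idx++ ))
      fi
    elif [ "$c" = "$quo" ];then
      quo=   # closing quote
      continue
    fi
    esc=
    word+="$c"
  done
  return 0
}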
The wordsplit function accepts a single argument, the string to split, and saves the result in the WORDS array (Bash doesn’t really support return values). It also stores the word count in WORDC and, on failure, an error message in WORDERR. It inspects the string one character at a time, storing the current character in the local variable c.
The first conditional block, which tests whether quo is empty, handles unquoted text; all word splitting happens outside of quotes, so this is the longest block. The second top-level condition [ -z "$c" ] handles the unterminated string error: the input ended while a quote was still open. The third conditional block handles double quote escapes; only five characters are escapable in double quotes, and because any number of backslashes can appear in a row, this block skips ahead one character instead of counting how many backslashes it has seen so far (imagine: "\\\\\\\\\\\\\\\"). The last top-level condition [ "$c" = "$quo" ] matches a terminating quote character.
Here it is in action:
wordsplit ' foo bar\ baz "a b\"c" \d';printf "%s\n" "${WORDS[@]}"
foo
bar baz
a b"c
d
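The double-quoted word in the example above depends on the rule that a backslash escapes only five characters inside double quotes. That rule can be checked against Bash itself with a short sketch; the variable names escaped, kept and joined are made up for illustration.

```shell
#!/bin/bash
# Inside double quotes a backslash escapes only $ ` " \ and newline.
escaped="\$ \` \" \\"   # each backslash is consumed
kept="\d \n \q"         # not escapable here, so the backslashes stay
joined="foo\
bar"                    # Bash removes a backslash-newline pair entirely

printf '%s\n' "$escaped" "$kept" "$joined"
```

One difference worth checking: Bash drops a backslash-newline pair entirely, while the wordsplit listing above appears to keep the newline in the word.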
I’ve uploaded the code to GitHub. It has a test suite, but shell quoting is a treacherous business, so if you find any bugs please let me know!
Notes
- Greg’s wiki page on Bash quoting has more detail than the GNU manual (the whole wiki is pretty great).