Splitting Quoted Strings in Bash

December 28, 2021

A few weeks ago I was writing a macro loader for my JSON Processor, and I ran into an odd case where code I entered in the terminal was splitting differently to code I loaded from a file. It turns out that Bash only honours quotes that appear literally in the command; quotes that arrive through a variable expansion are treated as ordinary characters, so arguments and variables split differently.

When processing args Bash will not split quoted words that contain a delimiter:

printf "%s\n" foo 'bar baz'
foo
bar baz

But if those args are in a string:

input='foo "bar baz"'
printf "%s\n" "$input"
foo "bar baz"

Or unquoted:

printf "%s\n" $input
foo
"bar
baz"

What I really want is to split the variable into an array of words. Then printf works as I want it to:

input=(foo "bar baz")
printf "%s\n" "${input[@]}"
foo
bar baz

I can’t use read -a because it splits the input on whitespace and ignores quotes. In fact the only built-in way to accomplish this is via eval:
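To make the read -a limitation concrete, here is a minimal sketch using the same input string as above. The variable name words is just an illustrative choice.

```bash
#!/bin/bash
# Sketch: read -a splits on IFS only; quote characters are ordinary data.
input='foo "bar baz"'
read -r -a words <<< "$input"

printf '%s\n' "${#words[@]}"   # number of words found
printf '<%s>\n' "${words[@]}"  # each word, bracketed so boundaries are visible
```

This reports three words (foo, "bar and baz") with the quote characters still attached, which is the same result as the unquoted $input expansion.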

input='foo "bar baz"'
eval "printf \"%s\n\" $input"
foo
bar baz

But using eval invites a whole set of complications I’d rather avoid. And spawning another program (awk, xargs) is too slow for my use case. What to do?
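One of those complications, sketched below: eval re-parses the whole string as shell code, so anything executable hidden in the input runs too. The $(echo injected) part is a hypothetical, harmless stand-in for untrusted data.

```bash
#!/bin/bash
# Sketch: eval honours the quotes, but it also runs the command
# substitution embedded in the input string.
input='foo "bar baz" $(echo injected)'
output=$(eval "printf '%s\n' $input")
printf '%s\n' "$output"
```

The quoted words are split correctly (foo, then bar baz), but the third line of output is injected, produced by a command that the input string smuggled in.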

“Bash really doesn’t have a good way to parse a string into substrings, while respecting quotes.” (superuser)

Bash Quoting

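Bash splits words on unescaped spaces, horizontal tabs and newlines. So to determine when to split a word we need to understand Bash escape rules. There are four(ish) kinds:

  1. Backslash escapes outside of quotes: a backslash preserves the literal value of the next character.
  2. Single quotes: every character is literal, so a single quote cannot appear inside single quotes.
  3. Double quotes: a backslash only escapes $, `, ", \ and newline.
  4. ANSI-C quoting ($'...'): backslash sequences such as \n and \t are decoded.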
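As a quick illustration of backslash escapes, single quotes, double quotes and ANSI-C quoting, the sketch below builds one word with each and prints them in brackets. The array name words is only for illustration.

```bash
#!/bin/bash
# Each entry below is a single word; the brackets make the boundaries visible.
words=(
  foo\ bar        # backslash escape: the space is literal
  'a "b" \c'      # single quotes: everything is literal
  "d \"e\" \f"    # double quotes: \" becomes ", but \f keeps its backslash
  $'g\th'         # ANSI-C quoting: \t is decoded to a tab
)
printf '[%s]\n' "${words[@]}"
```

Four lines are printed, one per word, even though three of the four words contain whitespace.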

Splitting Words

Decoding and interpreting ANSI-C escape sequences seems difficult, and I don’t need to support it for my use case. To handle the other three kinds, I need code that scans a string, applies each set of escape rules, and keeps track of which kind of quoting is currently in effect.

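#!/bin/bash
# wordsplit STRING
# Splits STRING into words using Bash's backslash, single quote and
# double quote rules (ANSI-C quoting is not supported).
# Sets WORDS (the words), WORDC (their count) and, on failure, WORDERR.
wordsplit () {
  WORDS=()
  WORDC=0
  WORDERR=
  # idx: index of the next character, quo: active quote character (if any),
  # word: word being built, c: current character, esc: set after an unquoted backslash
  local idx=0 quo= word= c= esc=
  while :;do
    c="${1:$idx:1}" # empty once idx is past the end of the string
    (( idx++ ))
    if [ -z "$quo" ];then
      # outside of quotes
      if [ "$c" = '\' ];then
        if [ -z "$esc" ];then
          # unescaped backslash: drop it and escape the next character
          esc=1
          continue
        else
          # escaped backslash: fall through and keep it as a literal
          esc=
        fi
      elif ([[ "$c" == [$' \t\n'] ]]&&[ -z "$esc" ])||[ -z "$c" ];then
        # unescaped whitespace or end of input ends the current word
        if [ -n "$word" ];then
          WORDS+=("$word")
          word=
          (( WORDC++ ))
        fi
        [ -z "$c" ] && break
        continue
      elif ([ "$c" = "'" ]||[ "$c" = '"' ])&&[ -z "$esc" ];then
        # unescaped quote character: start a quoted section
        quo="$c"
        continue
      fi
    elif [ -z "$c" ];then
      # input ended while a quote was still open
      WORDERR="found unterminated string at col $idx: '$1'"
      return 1
    elif [ "$c" = '\' ]&&[ "$quo" = '"' ];then
      # in double quotes a backslash only escapes $ \ ` " and newline
      if [[ "${1:$idx:1}" == [$\\\`\"] ]] || [ "${1:$idx:1}" = $'\n' ];then
        c="${1:$idx:1}"
        (( idx++ ))
      fi
    elif [ "$c" = "$quo" ];then
      # closing quote
      quo=
      continue
    fi
    # literal character: append it to the current word
    esc=
    word+="$c"
  done
  return 0
}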

The function wordsplit accepts a single argument, the string to split, and saves the result in the WORDS array (Bash functions can only return an exit status, not a value). It also sets WORDC to the number of words and, on failure, WORDERR to an error message. It inspects the string one character at a time, storing the current character in the local variable c.

The first conditional block, which runs when quo is empty, handles unquoted text; all word splitting happens outside of quotes, so this is the longest block. The second top-level condition [ -z "$c" ] handles the unterminated string error: the input ended while a quote was still open. The third conditional block handles double quote escapes. Only five characters can be escaped in double quotes ($, `, ", \ and newline), and because any number of backslashes can appear in a row, this block consumes the escaped character immediately instead of counting how many backslashes it has seen so far (imagine: "\\\\\\\\\\\\\\\"). The last top-level condition [ "$c" = "$quo" ] matches a terminating quote character. Single quotes need no block of their own: inside them every character except the closing quote falls through and is appended as-is.
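The double quote rule is easy to check against Bash itself. In this small sketch (the variable names are only for illustration) the backslash survives unless it precedes one of the five escapable characters:

```bash
#!/bin/bash
# Inside double quotes a backslash is removed only before $ ` " \ or a newline.
kept="a\b"      # b is not special, so the backslash stays: a\b
quote="a\"b"    # a"b
slash="a\\b"    # a\b
dollar="a\$b"   # a$b
printf '%s\n' "$kept" "$quote" "$slash" "$dollar"
```

Note that "a\b" and "a\\b" produce the same three characters, which is why a splitter has to look at the character after the backslash rather than treat every backslash as an escape.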

Here it is in action:

wordsplit ' foo bar\ baz	"a b\"c" \d';printf "%s\n" "${WORDS[@]}"
foo
bar baz
a b"c
d

I’ve uploaded the code to GitHub. It has a test suite, but shell quoting is a treacherous business, so if you find any bugs please let me know!

Notes

  1. Greg’s wiki page on Bash quoting has more detail than the GNU manual (the whole wiki is pretty great).

Tags: bash wordsplit awk