Quoting differences between MySQL and PostgreSQL, and converting between them

TECHNICALDOLTGRES
5 min read

Introduction

We're hard at work building Doltgres, a Postgres-compatible database with git-inspired version control features. Doltgres is built on top of Dolt, which uses MySQL's syntax and wire protocol. Dolt has tens of thousands of tests of its SQL engine, tests we would love to re-use to make sure Doltgres works the same way. So we're in the process of porting those tests to Doltgres.

One of the first challenges we encountered in this process is that many of our tests use MySQL's rules for quoting identifiers and string literals, and Postgres's rules are quite different.

Quoting differences between MySQL and PostgreSQL

There are two things you might want to quote in your SQL query: identifiers, like table and column names; and string literals. For example, here's a MySQL query that quotes things:

INSERT INTO `mydb`.`mytable` VALUES ('hello', "goodbye");

This query works fine in MySQL, but won't parse in Postgres. Instead, it needs to look like this:

INSERT INTO "mydb"."mytable" VALUES ('hello', 'goodbye');

This is because the quote characters (`, ", ') have different meanings in the two dialects.

Quote character MySQL Postgres
Double quote (") String literal Identifier
Single quote (') String literal String literal
Backtick (`) Identifier N/A

In other words: MySQL lets you input string literals with either single or double quotes, and uses backticks for quoting identifiers. Postgres is stricter: you must use only single quotes for string literals and double quotes for identifiers, and backticks have no special meaning.

Dolt's SQL engine tests use both single and double quotes interchangeably, and there are tens of thousands of them. We want all those tests to run on Doltgres, but we don't want to rewrite them all, or maintain two different verions for the two databases.

What if we could convert MySQL's quoting syntax to Postgres's automatically?

Converting from MySQL quoting to Postgres quoting

It doesn't seem like it, but converting between these two formats is actually a semi-challenging algorithm to write, because of escaping. Escaping is how you embed a quote character inside a string. The standard way to do this in SQL (and many other languages) is to double-up the quoted character, like this:

SELECT "This is a ""string literal""";

Each of the pairs of consecutive double quote characters reduce to a single character without breaking the literal. So it produces the result:

This is a "string literal"

Accommodating escapes makes it difficult to use a solution like regular expressions (which bring their own difficulties). There are many ways to approach the problem, but I chose to use a simple state machine. Here it is as an ASCII-art diagram.

               ┌───────────────────*─────────────────────────┐
               │                   ┌─*─┐                     *
               │               ┌───┴───▼──────┐         ┌────┴─────────┐
               │     ┌────"───►│ In double    │◄───"────┤End double    │
               │     │         │ quoted string│────"───►│quoted string?│
               │     │         └──────────────┘         └──────────────┘
               ├─────(──────────────────*───────────────────┐
      ┌─*──┐   ▼     │                                      *
      │    ├─────────┴┐            ┌─*─┐                    │
      └───►│ Not in   │        ┌───┴───▼─────┐          ┌───┴──────────┐
           │ string   ├───'───►│In single    │◄────'────┤End single    │
  ────────►└─────────┬┘        │quoted string│─────'───►│quoted string?│
  START        ▲     │         └─────────────┘          └──────────────┘
               └─────(──────────────────*───────────────────┐
                     │            ┌─*──┐                    *
                     │        ┌───┴────▼────┐           ┌───┴──────────┐
                     └───`───►│In backtick  │◄─────`────┤End backtick  │
                              │quoted string│──────`───►│quoted string?│
                              └─────────────┘           └──────────────┘

Now, there's actually a small complication here: MySQL provides a second way to escape quote characters, by prefacing them by a backslash (\) character. (Postgres doesn't allow backslash escapes with the standard server settings). Rather than doubling the number of states in our state machine, we'll just add a separate state-tracking variable for backslashes and patch the logic into the state machine.

In Go, it looks like this:

type stringParserState byte

const (
	notInString stringParserState = iota
	inDoubleQuote
	maybeEndDoubleQuote
	inSingleQuote
	maybeEndSingleQuote
	inBackticks
	maybeEndBackticks
)

const singleQuote = '\''
const doubleQuote = '"'
const backtick = '`'
const backslash = '\\'

// normalizeStrings normalizes a query string to convert any MySQL syntax to Postgres syntax
func normalizeStrings(q string) string {
	state := notInString
	lastCharWasBackslash := false
	normalized := strings.Builder{}

	for _, c := range q {
		switch state {
		case notInString:
			switch c {
			case singleQuote:
				state = inSingleQuote
				normalized.WriteRune(singleQuote)
			case doubleQuote:
				state = inDoubleQuote
				normalized.WriteRune(singleQuote)
			case backtick:
				state = inBackticks
				normalized.WriteRune(doubleQuote)
			default:
				normalized.WriteRune(c)
			}
		case inDoubleQuote:
			switch c {
			case backslash:
				if lastCharWasBackslash {
					normalized.WriteRune(c)
				}
				lastCharWasBackslash = !lastCharWasBackslash
			case doubleQuote:
				if lastCharWasBackslash {
					normalized.WriteRune(c)
					lastCharWasBackslash = false
				} else {
					state = maybeEndDoubleQuote
				}
			case singleQuote:
				normalized.WriteRune(singleQuote)
				normalized.WriteRune(singleQuote)
				lastCharWasBackslash = false
			default:
				lastCharWasBackslash = false
				normalized.WriteRune(c)
			}
		case maybeEndDoubleQuote:
			switch c {
			case doubleQuote:
				state = inDoubleQuote
				normalized.WriteRune(doubleQuote)
			default:
				state = notInString
				normalized.WriteRune(singleQuote)
				normalized.WriteRune(c)
			}
		case inSingleQuote:
			switch c {
			case backslash:
				if lastCharWasBackslash {
					normalized.WriteRune(c)
				}
				lastCharWasBackslash = !lastCharWasBackslash
			case singleQuote:
				if lastCharWasBackslash {
					normalized.WriteRune(c)
					normalized.WriteRune(c)
					lastCharWasBackslash = false
				} else {
					state = maybeEndSingleQuote
				}
			default:
				lastCharWasBackslash = false
				normalized.WriteRune(c)
			}
		case maybeEndSingleQuote:
			switch c {
			case singleQuote:
				state = inSingleQuote
				normalized.WriteRune(singleQuote)
				normalized.WriteRune(singleQuote)
			default:
				state = notInString
				normalized.WriteRune(singleQuote)
				normalized.WriteRune(c)
			}
		case inBackticks:
			switch c {
			case backtick:
				state = maybeEndBackticks
			default:
				normalized.WriteRune(c)
			}
		case maybeEndBackticks:
			switch c {
			case backtick:
				state = inBackticks
				normalized.WriteRune(backtick)
			default:
				state = notInString
				normalized.WriteRune(doubleQuote)
				normalized.WriteRune(c)
			}
		default:
			panic("unknown state")
		}
	}

	// If reached the end of input unsure whether to unquote a string, do so now
	switch state {
	case maybeEndDoubleQuote:
		normalized.WriteRune(singleQuote)
	case maybeEndSingleQuote:
		normalized.WriteRune(singleQuote)
	case maybeEndBackticks:
		normalized.WriteRune(doubleQuote)
	default: // do nothing
	}

	return normalized.String()
}

Now we can write some tests to verify this works as expected:

/ Test converting MySQL strings to Postgres strings
func TestNormalizeStrings(t *testing.T) {
	type test struct {
		input    string
		expected string
	}
	tests := []test{
		{
			input:    "SELECT \"foo\" FROM `bar`",
			expected: `SELECT 'foo' FROM "bar"`,
		},
		{
			input:    `SELECT "fo""""o"`,
			expected: `SELECT 'fo""o'`,
		},
        // snipped many test cases
		{
			input:    "SELECT \"foo\" from `bar` where `bar`.`baz` = \"qux\"",
			expected: `SELECT 'foo' from "bar" where "bar"."baz" = 'qux'`,
		},
	}

	for _, test := range tests {
		t.Run(test.input, func(t *testing.T) {
			actual := normalizeStrings(test.input)
			require.Equal(t, test.expected, actual)
		})
	}
}

It all works, so we can wire this logic directly into our test harness to run our MySQL dialect tests (including any we write in the future) against our Postgres dialect database.

Conclusion

Getting all Dolt's tests to run on Doltgres is a critical step in making it production quality. There's still a lot more work to be done to get there, but we're working hard every day and making good progress.

Have questions on MySQL v. Postgres quoting? Or maybe you are curious about the world's first version-controlled SQL database? Join us on Discord to talk to our engineering team and meet other Dolt and Doltgres users.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.