Quoting differences between MySQL and PostgreSQL, and converting between them
Introduction
We're hard at work building Doltgres, a Postgres-compatible database with git-inspired version control features. Doltgres is built on top of Dolt, which uses MySQL's syntax and wire protocol. Dolt has tens of thousands of tests of its SQL engine, tests we would love to re-use to make sure Doltgres works the same way. So we're in the process of porting those tests to Doltgres.
One of the first challenges we encountered in this process is that many of our tests use MySQL's rules for quoting identifiers and string literals, and Postgres's rules are quite different.
Quoting differences between MySQL and PostgreSQL
There are two things you might want to quote in your SQL query: identifiers, like table and column names; and string literals. For example, here's a MySQL query that quotes things:
INSERT INTO `mydb`.`mytable` VALUES ('hello', "goodbye");
This query works fine in MySQL, but won't parse in Postgres. Instead, it needs to look like this:
INSERT INTO "mydb"."mytable" VALUES ('hello', 'goodbye');
This is because the quote characters (`
, "
, '
) have different meanings in the two dialects.
Quote character | MySQL | Postgres |
---|---|---|
Double quote (" ) |
String literal | Identifier |
Single quote (' ) |
String literal | String literal |
Backtick (` ) |
Identifier | N/A |
In other words: MySQL lets you input string literals with either single or double quotes, and uses backticks for quoting identifiers. Postgres is stricter: you must use only single quotes for string literals and double quotes for identifiers, and backticks have no special meaning.
Dolt's SQL engine tests use both single and double quotes interchangeably, and there are tens of thousands of them. We want all those tests to run on Doltgres, but we don't want to rewrite them all, or maintain two different verions for the two databases.
What if we could convert MySQL's quoting syntax to Postgres's automatically?
Converting from MySQL quoting to Postgres quoting
It doesn't seem like it, but converting between these two formats is actually a semi-challenging algorithm to write, because of escaping. Escaping is how you embed a quote character inside a string. The standard way to do this in SQL (and many other languages) is to double-up the quoted character, like this:
SELECT "This is a ""string literal""";
Each of the pairs of consecutive double quote characters reduce to a single character without breaking the literal. So it produces the result:
This is a "string literal"
Accommodating escapes makes it difficult to use a solution like regular expressions (which bring their own difficulties). There are many ways to approach the problem, but I chose to use a simple state machine. Here it is as an ASCII-art diagram.
┌───────────────────*─────────────────────────┐
│ ┌─*─┐ *
│ ┌───┴───▼──────┐ ┌────┴─────────┐
│ ┌────"───►│ In double │◄───"────┤End double │
│ │ │ quoted string│────"───►│quoted string?│
│ │ └──────────────┘ └──────────────┘
├─────(──────────────────*───────────────────┐
┌─*──┐ ▼ │ *
│ ├─────────┴┐ ┌─*─┐ │
└───►│ Not in │ ┌───┴───▼─────┐ ┌───┴──────────┐
│ string ├───'───►│In single │◄────'────┤End single │
────────►└─────────┬┘ │quoted string│─────'───►│quoted string?│
START ▲ │ └─────────────┘ └──────────────┘
└─────(──────────────────*───────────────────┐
│ ┌─*──┐ *
│ ┌───┴────▼────┐ ┌───┴──────────┐
└───`───►│In backtick │◄─────`────┤End backtick │
│quoted string│──────`───►│quoted string?│
└─────────────┘ └──────────────┘
Now, there's actually a small complication here: MySQL provides a second way to escape quote
characters, by prefacing them by a backslash (\
) character. (Postgres doesn't allow backslash
escapes with the standard server settings). Rather than doubling the number of states in our state
machine, we'll just add a separate state-tracking variable for backslashes and patch the logic into
the state machine.
In Go, it looks like this:
type stringParserState byte
const (
notInString stringParserState = iota
inDoubleQuote
maybeEndDoubleQuote
inSingleQuote
maybeEndSingleQuote
inBackticks
maybeEndBackticks
)
const singleQuote = '\''
const doubleQuote = '"'
const backtick = '`'
const backslash = '\\'
// normalizeStrings normalizes a query string to convert any MySQL syntax to Postgres syntax
func normalizeStrings(q string) string {
state := notInString
lastCharWasBackslash := false
normalized := strings.Builder{}
for _, c := range q {
switch state {
case notInString:
switch c {
case singleQuote:
state = inSingleQuote
normalized.WriteRune(singleQuote)
case doubleQuote:
state = inDoubleQuote
normalized.WriteRune(singleQuote)
case backtick:
state = inBackticks
normalized.WriteRune(doubleQuote)
default:
normalized.WriteRune(c)
}
case inDoubleQuote:
switch c {
case backslash:
if lastCharWasBackslash {
normalized.WriteRune(c)
}
lastCharWasBackslash = !lastCharWasBackslash
case doubleQuote:
if lastCharWasBackslash {
normalized.WriteRune(c)
lastCharWasBackslash = false
} else {
state = maybeEndDoubleQuote
}
case singleQuote:
normalized.WriteRune(singleQuote)
normalized.WriteRune(singleQuote)
lastCharWasBackslash = false
default:
lastCharWasBackslash = false
normalized.WriteRune(c)
}
case maybeEndDoubleQuote:
switch c {
case doubleQuote:
state = inDoubleQuote
normalized.WriteRune(doubleQuote)
default:
state = notInString
normalized.WriteRune(singleQuote)
normalized.WriteRune(c)
}
case inSingleQuote:
switch c {
case backslash:
if lastCharWasBackslash {
normalized.WriteRune(c)
}
lastCharWasBackslash = !lastCharWasBackslash
case singleQuote:
if lastCharWasBackslash {
normalized.WriteRune(c)
normalized.WriteRune(c)
lastCharWasBackslash = false
} else {
state = maybeEndSingleQuote
}
default:
lastCharWasBackslash = false
normalized.WriteRune(c)
}
case maybeEndSingleQuote:
switch c {
case singleQuote:
state = inSingleQuote
normalized.WriteRune(singleQuote)
normalized.WriteRune(singleQuote)
default:
state = notInString
normalized.WriteRune(singleQuote)
normalized.WriteRune(c)
}
case inBackticks:
switch c {
case backtick:
state = maybeEndBackticks
default:
normalized.WriteRune(c)
}
case maybeEndBackticks:
switch c {
case backtick:
state = inBackticks
normalized.WriteRune(backtick)
default:
state = notInString
normalized.WriteRune(doubleQuote)
normalized.WriteRune(c)
}
default:
panic("unknown state")
}
}
// If reached the end of input unsure whether to unquote a string, do so now
switch state {
case maybeEndDoubleQuote:
normalized.WriteRune(singleQuote)
case maybeEndSingleQuote:
normalized.WriteRune(singleQuote)
case maybeEndBackticks:
normalized.WriteRune(doubleQuote)
default: // do nothing
}
return normalized.String()
}
Now we can write some tests to verify this works as expected:
/ Test converting MySQL strings to Postgres strings
func TestNormalizeStrings(t *testing.T) {
type test struct {
input string
expected string
}
tests := []test{
{
input: "SELECT \"foo\" FROM `bar`",
expected: `SELECT 'foo' FROM "bar"`,
},
{
input: `SELECT "fo""""o"`,
expected: `SELECT 'fo""o'`,
},
// snipped many test cases
{
input: "SELECT \"foo\" from `bar` where `bar`.`baz` = \"qux\"",
expected: `SELECT 'foo' from "bar" where "bar"."baz" = 'qux'`,
},
}
for _, test := range tests {
t.Run(test.input, func(t *testing.T) {
actual := normalizeStrings(test.input)
require.Equal(t, test.expected, actual)
})
}
}
It all works, so we can wire this logic directly into our test harness to run our MySQL dialect tests (including any we write in the future) against our Postgres dialect database.
Conclusion
Getting all Dolt's tests to run on Doltgres is a critical step in making it production quality. There's still a lot more work to be done to get there, but we're working hard every day and making good progress.
Have questions on MySQL v. Postgres quoting? Or maybe you are curious about the world's first version-controlled SQL database? Join us on Discord to talk to our engineering team and meet other Dolt and Doltgres users.