Postgres is a performant and reliable Open Source database engine. It covers similar use cases to MySQL and MariaDB.

I gave a talk on this subject at Symfony Con Vienna in December 2024. This article is based on that talk.

What do we even mean by the term NoSQL?

NoSQL is understood to mean "non-SQL", or often "not only SQL" databases. What we actually care about is not so much the query language though, but the fact that the database is not designed in the classical relational style.

Even Wikipedia only offers a negative definition:

A database that focuses on data that is modeled in means other than tabular relations

Thus, NoSQL is not a standard, but a vague concept and sometimes buzzword. I understand it as a database that is not oriented towards strictly structured relational data, but instead built for storing other data: Json or XML documents, whole graphs of data or simple key-value pairs. The goal is usually to store heterogeneous data. In the case of Json and XML, a schema can be used to define what the data looks like, but many systems also work without a schema.

The reason why key-value stores, Json and XML databases can be scaled horizontally so well is that the data is not interconnected. Because joins are not supported, distributing the database onto multiple servers - sharding - is much simpler. Additional database servers can be added without needing complicated synchronization processes between the nodes. The query languages reflect this by consciously not providing the powerful join constructs of SQL.

Storing key-value pairs in a regular Postgres table works fine unless you have very high request rates or huge amounts of data, at which point you are better off moving to a dedicated key-value store. If you need a graph database, you could check out the "Property Graph Query Language" (PGQ) defined in the SQL:2023 standard, or look into Neo4j, for example.
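
To illustrate the first point: a minimal key-value table could look like this sketch (table and key names are made up for illustration):

-- a simple key-value store in a regular table
CREATE TABLE kv_store (
    key   VARCHAR(255) PRIMARY KEY,
    value JSONB NOT NULL
);

-- insert or overwrite a value
INSERT INTO kv_store (key, value)
VALUES ('user:42:settings', '{"theme": "dark"}')
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;

SELECT value FROM kv_store WHERE key = 'user:42:settings';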

Side note: Don't confuse graph databases with GraphQL, the concept for building APIs. You can provide a GraphQL API from a relational database.

For this article, I will focus on Json. XML is supported in Postgres as well.

Storing data in a Json column has advantages for:

  • Nested data that there is no reason to normalize, or that has a dynamic structure that would be difficult to map to relational tables.
  • Sparse data: if there can be many properties, but most of them usually have no value in your data, you can avoid creating lots of mostly-empty columns.

SQL Json Syntax

To define a column that you can use with Json, use the JSONB type:

CREATE TABLE auction_jsonb (
    id SERIAL NOT NULL,
    title VARCHAR(255) NOT NULL,
    current_price INT DEFAULT NULL,
    item JSONB NOT NULL,
    PRIMARY KEY(id)
);

Postgres offers two types of Json columns. The JSONB type stores binary Json for better performance. The JSON type stores the data verbatim, leading to less efficiency but preserving the original formatting.

Unless you rely on non-standard behaviour or need to preserve original formatting, use JSONB. The full differences are:

JSON: untyped, stored raw
  • Preserves whitespace
  • Preserves object key order
  • Preserves duplicate keys (strictly speaking invalid Json)

JSONB: stored structured
  • Parsed Json
  • Faster for querying

Storing a whole Json object in a JSON(B) column instead of normalizing it into tables and columns can lead to higher storage consumption. A column has its name defined only once, whereas in Json, every object repeats each of its keys, and Postgres does not optimize this away. According to this post on heap.io, when a key appears in at least one out of every 80 Json objects, a separate column uses less disk space.

Queries filtering on the data inside a Json column may also be less well optimized, because the query planner has less information to work with than for a regular column.

Extracting Values from a Json Object

item->'author' is "David"

item->>'author' is David

For objects in a JSON(B) column, provide the key as a string. If you have a Json array, you can access elements by index by providing a number.

Notice the difference between item-> and item->>. We will encounter the single- and double-arrows in other constructs too, and they always have the same meaning. A single arrow gives you the Json element, while the double arrow converts the Json element directly into a string, number or null. If the target is a sub-object or array, it is converted to the string representation of the Json fragment. If you need to do further operations on the Json, you therefore want to use the single arrow.
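
For example, if the item contained a nested author object (a hypothetical structure, different from the sample data used in the measurements below), you would navigate with single arrows and only use the double arrow for the final value:

-- navigate with ->, extract the final value as text with ->>
SELECT item->'author'->>'name' AS author_name
FROM auction_jsonb;

-- item->>'author' here would give the text '{"name": "David"}',
-- which could not be navigated further with -> operators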

You can use this construct to filter rows by their data in JSON(B):

SELECT COUNT(*)
FROM auction_jsonb
WHERE item->>'author' = 'David';

This is potentially computationally expensive. I did some measurements with test data to get an idea of the amounts of data up to which such a query remains viable:

Rows      JSON     JSONB
10k        136        31
100k       319        46
1M       1'280       157
10M     15'505     2'768

The above table shows the query times in milliseconds with a JSON and a JSONB column, for 10'000 to 10 million rows.
You can find the code I used for these measurements in my repository github.com/dbu/postgres-json.

Don't get hung up on the precise numbers: they obviously depend strongly on the computing power of your database server, and you can't have my laptop as your server :-)

The interesting part is the relative differences. We can see that JSONB makes quite a difference, being 5-10 times faster. However, even with JSONB, 10x more rows make the query at least 10x slower, indicating at least linear growth of the query time. For data sets of less than 100'000 rows, JSONB seems viable - at least if you don't need to run such queries very frequently.

I noticed a significant difference between querying item->>'author' = 'David' and item->'author' = '"David"'. While semantically equivalent, the single arrow query took about three times as long as the double arrow one.
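
Spelled out, the two variants look like this:

-- double arrow: compare the extracted text
SELECT COUNT(*)
FROM auction_jsonb
WHERE item->>'author' = 'David';

-- single arrow: compare the Json fragment (about 3x slower in my measurements)
SELECT COUNT(*)
FROM auction_jsonb
WHERE item->'author' = '"David"';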

If you know that you will query on a specific path in the JSON(B) frequently, you should create an index. This works the same as an index on a column, and brings the same performance benefits:

CREATE INDEX json_author
ON auction_json_indexed ((item->>'author'));

Rows      JSON    JSON indexed     JSONB    JSONB indexed
10k        136              25        31               20
100k       319              12        46               13
1M       1'280               3       157                3
10M     15'505               7     2'768                6

The above table shows query times in milliseconds, and shows that indexes on JSON(B) columns are just as amazing as regular indexes. As is to be expected, with an index there is no difference between the JSON and the JSONB type.

To define useful indexes, you of course need to know in advance what path you will query on. When you can't know that in advance, you can't create indexes.

Side Note: Tuning the Query Planner

At 10 million rows, Postgres stopped using the index on the JSON(B) column and instead went for a full table scan. It guessed that accessing many random pages through the index would be more expensive than loading everything sequentially. Of course that was not the correct guess. After some research and then asking for help on Stack Overflow, Frank Heikens pointed me to the right solution:

-- Value <= 1.15 makes postgres use the index
ALTER TABLESPACE pg_default
SET (random_page_cost=1.1);

Lowering the value of random_page_cost from its default value of 4 makes Postgres prefer the index in more cases.

Extracting a Column

Postgres supports "generated columns", which are filled automatically with the instruction you provide. You can use this to extract a value from a path inside the JSON(B) into a standalone column:

CREATE TABLE auction_generated_fields (
    id SERIAL NOT NULL,
    author VARCHAR(255) GENERATED ALWAYS
        AS (item->>'author') STORED,
    item JSONB NOT NULL,
    PRIMARY KEY(id)
);

Let's compare the query times:

Rows    JSONB indexed    generated    generated indexed
10k                20           21                   18
100k               13           30                   10
1M                  3           88                    3
10M                 6          786                    4

As we can see from these query times in milliseconds, the index performs the same as the one on the JSON(B) column. Without an index, querying the generated column is faster than querying with a Json path.
This is also an example of the query planner working better with regular columns: it used the index on the generated column directly, without me having to tune the configuration.

Side Note: Extract a Date-time from Json

In my example, I additionally wanted to extract a date-time column. This turned out to be way more complicated than expected.
The expressions to compute the generated columns need to be immutable, which makes sense. The problem is that Postgres does not have an immutable function to convert a string to a date. A date can have a timezone, and the timezone is by definition mutable because it can be set on each connection. Postgres is missing functions to explicitly ignore the timezone, or to convert a date with a timezone to one without, in an immutable way.

The solution I found thanks to jamie@zomglol.wtf on Mastodon is to define a function that makes sure to convert the date without a timezone:

CREATE FUNCTION text_to_timestamp(text) RETURNS TIMESTAMP
LANGUAGE sql IMMUTABLE AS
$$
SELECT CASE
WHEN $1 ~ '^\d{4}-\d{2}-\d{2}[ |T]\d{2}:\d{2}:\d{2}(\.\d+)?(\+00:00)?$'
    THEN CAST($1 AS timestamp without time zone)
END
$$;

CREATE TABLE auction_generated_fields (
    id SERIAL NOT NULL,
    start_date TIMESTAMP(0) GENERATED ALWAYS
        AS (text_to_timestamp(item->>'startDate')) STORED,
    item JSONB NOT NULL,
    PRIMARY KEY(id)
);

Other Ways to Look into Json

There are several very similar ways to extract elements from a JSON(B) column:

item->'author'->'name'
JSON_QUERY(item, '$.author.name')
item #> '{author,name}'
item['author']['name']

JSON_QUERY uses the very powerful Json path specification, allowing selection with complicated rules, similar to CSS selectors. The #> and array access syntaxes, however, seem a bit superfluous to me. Without an index, the #> and array syntaxes are quite a bit slower, though you can define indexes on them as well.
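
For example, a sketch of an index for queries using the #>> accessor on the table from above:

-- expression index on the #>> accessor (illustrative)
CREATE INDEX json_author_path
ON auction_json_indexed ((item #>> '{author}'));

SELECT COUNT(*)
FROM auction_json_indexed
WHERE item #>> '{author}' = 'David';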

Important to note: The different accessors, even when semantically equivalent, do not share indexes.
I recommend you pick the syntax you like and stick to that one.

Additional Functions for JSONB

With JSONB only, you have additional constructs:

jsonb_path_exists(item, '$.author')
jsonb_path_query(item, '$.author')
-- to get the value in older Postgres versions
jsonb_path_query(item, '$.author')->>0
-- to get the value in Postgres 17 or newer
JSON_VALUE(item, '$.author')

We can also query the existence of objects or keys inside a JSONB structure (but not in a JSON structure):

-- contains
item @> '{"author": "David"}'

-- key exists (short and long form)
item ? 'author'
jsonb_exists(item, 'author')

-- at least one of the keys exists (short and long form)
item ?| array['author', 'foo']
jsonb_exists_any(item, array['author', 'foo'])

-- all keys exist (short and long form)
item ?& array['author', 'title']
jsonb_exists_all(item, array['author', 'title'])

For comparing performance, I used the same query for the author name as with the others. However, the contains query can be used for multiple properties at once, and even for nested objects.
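
A sketch of such queries; the type and dimensions keys are made up for illustration:

-- several top-level properties at once
SELECT COUNT(*)
FROM auction_jsonb
WHERE item @> '{"author": "David", "type": "book"}';

-- matching inside a nested object
SELECT COUNT(*)
FROM auction_jsonb
WHERE item @> '{"dimensions": {"unit": "cm"}}';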

To improve performance of the contains queries, we can create a Generalized Inverted Index (GIN). This index speeds up the contains queries, but unfortunately not other queries.

CREATE INDEX auction_json_gin_idx
ON auction_jsonb_gin USING GIN (item);

SELECT COUNT(*)
FROM auction_jsonb_gin
WHERE item @> '{"author": "David"}';

Let's compare response times in milliseconds for this search:

Rows    ->> indexed    @> no index    @> with index    @> with GIN
10k              20              5                5              5
100k             13             20               19              5
1M                3            144              149             13
10M               6          1'239            1'235             76

Side Note: Value Distribution

The statistical properties of the data itself can make a significant difference. I generated test data with different numbers of variants for the author value. The query times in milliseconds on 10 million rows for queries using an index are quite different - with few distinct values, an index is less efficient:

Variants      JSON    JSON idx    JSONB    JSONB idx    GIN ->>    GIN @>      gen    gen idx
10          12'372         784    1'895          676      1'882     1'364    1'030         44
1'000       12'424           6    1'583            6      1'260        58      931          3
100'000     13'197           3    1'710            3      1'501         5      838          3

This is a reminder that you should test your database performance with real, or at least realistic data.

Insert Performance

I wanted to see the impact of the data type and indexes on insert performance. This table shows insert times in milliseconds for dummy data generated directly in the SQL query.

Rows      JSON    JSON idx     JSONB    JSONB idx        GIN       gen    gen idx
10k         60          90        90          120        130        94        143
100k       550         740       820          990      2'110       890      1'000
1M       5'150       8'500     7'670       11'340     24'000     8'440     11'773
10M     55'000      96'000    97'000      115'000    256'000    84'000    116'000

Parsing for JSONB, indexes and generated columns does not come for free. Looking at the table rows, we see the impact of the data format and indexes. Looking at the columns, we can see that each of those setups scales roughly linearly with the amount of data being inserted.

JSONB adds about 40% over JSON on inserts. Having one index again costs about 40% more. Generating text columns from JSON(B) increases the necessary storage, but does not cost significant CPU time. (However - not in the table above - generating the date columns with a custom date conversion function nearly doubles the insert time.)

I used large numbers of rows to average out the time. Except during imports, most applications will not insert large amounts of data at the same time. Per record, these numbers are really low, and it's quite likely that your application takes longer to prepare the data than Postgres takes to store it.
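
For reference, dummy rows can be generated directly in SQL, roughly like this sketch with generate_series and jsonb_build_object (the actual generator in the linked repository may differ):

-- generate 100'000 dummy rows with a small Json document each
INSERT INTO auction_jsonb (title, current_price, item)
SELECT
    'Auction ' || i,
    (random() * 1000)::INT,
    jsonb_build_object(
        'author', 'Author ' || (i % 1000),
        'startDate', '2024-01-01 12:00:00'
    )
FROM generate_series(1, 100000) AS i;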

The Impact of the Size of your Json

The test numbers above are all based on a relatively small Json object with only a few properties, some 2-3 kb in total. To check how Postgres copes with larger documents, I generated some large data sets and found that:

  • Postgres processing time scales linearly with the size of the Json
  • Storage consumption scales linearly too, no optimization is happening
  • With many rows, the GIN index can become larger than the table itself

Updating Json in SQL

Instead of replacing a JSON(B) column with completely new data, we can also manipulate the value in SQL.

-- update only, ignore if not exists
UPDATE auction_jsonb
SET item = jsonb_set(item, '{author}', '"value"', false)
WHERE id=42;

-- update or create
UPDATE auction_jsonb
SET item = jsonb_set(item, '{author}', '"value"', true)
WHERE id=42;

These queries are not faster than replacing the whole JSON(B) field, which makes sense, as we have to re-assign the new Json object to the field after manipulating it.
However, this can still be useful to update JSON(B) columns with data you fetch in the same query. Another use case is when your application knows exactly what changed: it can then avoid parsing the whole Json object and re-serializing it for the database.
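
For example, a hypothetical query that copies the regular current_price column into the Json document, without the application having to read, modify and write back the whole object (the lastPrice key is made up):

UPDATE auction_jsonb
SET item = jsonb_set(item, '{lastPrice}', to_jsonb(current_price), true)
WHERE id = 42;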

We can delete keys from a Json object:

-- delete attribute
UPDATE auction_jsonb
SET item = item - 'key'
WHERE id=42;

-- delete attribute alternate
UPDATE auction_jsonb
SET item = jsonb_set_lax(item, '{author}', null, true, 'delete_key')
WHERE id=42;

-- delete multiple
item - array['key1', 'key2']

jsonb_set_lax is a versatile function. You can tell it how to behave if the new value is null (for example when the value is the result of a query):

  • use_json_null: Set property to null (default)
  • delete_key: Delete the key on null value
  • raise_exception: Fail if value is null
  • return_target: Leave previous value when new is null

Don't confuse use_json_null with delete_key. Many of the Postgres Json functions return a null value when a key is not found, but for example the check whether a key exists behaves differently for a key that exists with a null value than for a key that does not exist.
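
A small illustration of that difference:

-- key present with a Json null value: ? still finds it
SELECT '{"author": null}'::jsonb ? 'author';              -- true
SELECT '{"author": null}'::jsonb ->> 'author';            -- SQL NULL

-- key deleted: ? no longer finds it
SELECT ('{"author": null}'::jsonb - 'author') ? 'author'; -- false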

Other Postgres Json Constructs

With ||, you can concatenate arrays and merge objects. When merging objects, values from the second parameter win in case of key collisions. There is no deep merge of colliding keys.
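
For example:

-- merging objects: the value from the right-hand side wins
SELECT '{"author": "David", "type": "book"}'::jsonb
       || '{"type": "ebook"}'::jsonb;
-- author stays "David", type becomes "ebook"

-- concatenating arrays
SELECT '[1, 2]'::jsonb || '[3, 4]'::jsonb;
-- [1, 2, 3, 4]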

You can create a Json object on the fly. The parameters are alternately the key and its value:

json_build_object(
   'type', 'book',
   'author', 'David'
)

Or you can write a Json string and convert it to JSONB with jsonb('{"type":"book", "author": "David"}') or '{"type":"book", "author": "David"}'::jsonb

You can convert a Json object to a string with serialize:

json_serialize(item)
jsonb_serialize(item)

You can expand a Json object into a table with a key and a value column, one row per top-level property:

json_each(item)
jsonb_each(item)

SELECT *
FROM jsonb_each(
    (SELECT item FROM auction_jsonb WHERE id = 1)
);

Json Schema Validation

With the default installation, Postgres cannot validate that data matches a schema. There are third-party extensions that add this functionality, e.g. pg_jsonschema or the older extension postgres-json-schema. If you run queries that manipulate the data, it is a good idea to add a schema to protect you from errors in those queries.
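
A sketch of what that can look like, assuming the pg_jsonschema extension is installed and provides a jsonb_matches_schema function (check the extension's documentation for the exact API):

-- assumes pg_jsonschema and its jsonb_matches_schema function
ALTER TABLE auction_jsonb
ADD CONSTRAINT item_matches_schema CHECK (
    jsonb_matches_schema(
        '{
            "type": "object",
            "required": ["author"],
            "properties": {
                "author": {"type": "string"}
            }
        }',
        item
    )
);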

When you serialize value objects in your application and only insert Json that way, your Json is already valid, so the schema is a bit redundant. Adding a schema gives up flexibility in favor of robustness of your application.

Some Notes about MySQL / MariaDB

There are similar functions to work with Json available in MySQL/MariaDB. SQL:2016 defines Json functionality on string fields, SQL:2023 defines the JSON data type and operations on it.

MySQL uses a dedicated type for Json, MariaDB does not. Operators like ->> are not part of the standard; MySQL supports them as well, MariaDB does not.

I would love feedback from somebody who worked with Json in MySQL/MariaDB about their experiences and what is possible and what not.

Scaling/Replicating Postgres

In the introduction, I mentioned that one upside of NoSQL databases is their easy scalability. Postgres is powerful and can work with a lot of data without needing multiple servers.

Before looking into replication strategies, I would recommend investigating whether the Postgres configuration can be optimized.
You can clear out old data that is no longer relevant, or create a new table per time interval, to avoid endless growth.
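
If you go for a table per time interval, Postgres' declarative partitioning can manage that for you; a minimal sketch with made-up names:

CREATE TABLE auction_partitioned (
    id BIGSERIAL NOT NULL,
    created_at TIMESTAMP NOT NULL,
    item JSONB NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE auction_2024 PARTITION OF auction_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE auction_2025 PARTITION OF auction_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- dropping an old partition is much cheaper than deleting its rows
DROP TABLE auction_2024;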

As a next step, you can put unrelated tables into different databases and run the databases on separate servers. If the load is evenly spread, this allows you to scale without the complexity of synchronization.

Replicas

You can create read replicas. By enabling the "hot-standby" mode, additional Postgres servers can sync all the changes from the main server. However, those standby servers can only answer read queries; any write queries must go to the main server. This adds complexity to your database layer.

The functionality is all or nothing: If you enable replication, all databases of the whole server with all their content are replicated.

Logical replicas are more flexible. They mirror data-changing queries from the main server to the replicas. Contrary to the hot standby, you can limit which tables, columns and even specific rows get replicated (or not). If not all data is replicated, you can configure "Foreign Data Wrappers" (FDW) to have the replica query missing data from the upstream server.
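
A sketch of the two sides of such a setup (connection details and names are placeholders, and the table structure has to exist on the replica already):

-- on the main server: publish only the tables you want to replicate
CREATE PUBLICATION auction_pub FOR TABLE auction_jsonb;

-- on the replica: subscribe to that publication
CREATE SUBSCRIPTION auction_sub
    CONNECTION 'host=main-server dbname=app user=replicator'
    PUBLICATION auction_pub;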

With logical replicas, you can scale Postgres horizontally, albeit with a bit more complexity and more particularities than with a typical NoSQL database.

For more details, I recommend this pgdash.io blogpost.

Conclusions

Postgres has powerful support for Json, both for unstructured data and for nested, structured data. Because of the subtle differences, it will be harder to switch between Postgres and MySQL/MariaDB (though for most projects, that is not an important consideration).
Of course there are some caveats - but you will run into surprises when learning a new NoSQL database technology as well.

Scaling the database only becomes an issue for fairly busy or very specific applications. Designing a clean schema, tweaking the configuration, crafting your queries well and properly leveraging indexes can get you quite a long way with a relational database.

And basically, if you have less than 10'000 rows in your data set and don't expect this to grow quickly, do whatever you want. Choose the functionality that is easiest to develop with. Performance of the database won't be an issue.

  • Use JSONB unless something prevents you from it.
  • Put important data into dedicated columns. It can make sense to use a generated column, certainly if you keep the same field in the JSON(B) column as well.
  • Don't take things for granted (ever, with databases).
  • Performance depends a lot on your data structure and on how your data looks: test on real, or at least realistic, data in realistic quantities!

Scaling Postgres is possible, but first look into tuning the configuration and optimizing your queries.

If you do heavy searching in your application, consider additionally indexing the data in a dedicated search engine, e.g. Elasticsearch.

Notes

I have been experimenting with using the Postgres Json functionality with the PHP object-relational mapper Doctrine ORM. As this post is already long, I wrote about that in a separate post.

Have a look at the experiments and sample code in my repository github.com/dbu/postgres-json.