To make this possible, we used a generated column to create an index on a field inside a JSON column. Doctrine does not know about generated columns, so we had to hide the column and the index from the Doctrine schema tool.

Our use case is that we have a table with orders. We have to build a report that shows sums of the orders by region (zip codes in our case). The address on an order is not allowed to change: the goal is to record to which address an order was actually shipped. Rather than linking to a separate table with a foreign key, we decided to denormalize the address on the order into a MySQL JSON field.
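For illustration, such a denormalized address could be mapped along the following lines. This is a minimal sketch, not the actual entity from our application; the json DBAL type requires Doctrine DBAL 2.6 or newer, and the field names simply match the columns used in the queries below.

<?php

namespace App\Entity;

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 * @ORM\Table(name="orders")
 */
class Order
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;

    /**
     * @ORM\Column(type="string")
     */
    private $state;

    /**
     * Denormalized shipping address, e.g. {"street": "...", "zip": "8005", "city": "..."}.
     *
     * @ORM\Column(name="delivery_address", type="json")
     */
    private $deliveryAddress;
}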

The first approach queried the zip codes table and then looped over the zip codes, querying the orders table for each of the 3 different sums the report contains. This of course leads to 3 * n queries. On top of that, each query is highly inefficient: it needs a full table scan because one criterion accesses the zip code inside the JSON field with MySQL JSON functions. At some point we started hitting timeout limits for the web request that downloads the export...
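Roughly sketched, the first approach looked like this (not our actual code; $areas and getZip() are illustrative, $connection is a Doctrine DBAL connection, and fetchOne() exists since DBAL 2.11):

// One query per area and per sum: 3 * n queries in total, each of which
// scans the whole orders table because of the JSON function in the WHERE.
foreach ($areas as $area) {
    $confirmed = $connection->fetchOne(
        "SELECT COUNT(id) FROM orders
         WHERE state = 'confirmed'
           AND JSON_UNQUOTE(JSON_EXTRACT(delivery_address, '$.zip')) = ?",
        [$area->getZip()]
    );

    // ... repeated for 'delivered' and the third sum
}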

Using Subqueries

This is one place where using the ORM for reading is a trap. Writing direct SQL is a lot easier. (You can achieve the same with DQL or the Doctrine Query Builder and hydrating to an array.)
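Executing the raw SQL on the DBAL connection and working with plain arrays could look like this. A minimal sketch, assuming $reportSql holds the query shown below; fetchAllAssociative() exists since Doctrine DBAL 2.11.

// Bypass the ORM for this read-only report: run the SQL directly on the
// DBAL connection and get the result rows as associative arrays.
$rows = $entityManager->getConnection()->fetchAllAssociative($reportSql);

foreach ($rows as $row) {
    // $row['zip'], $row['confirmed'], $row['delivered'], ...
}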

We converted the report into a single query with subqueries for the computed fields. Instead of looping over the result of one query and issuing further queries for each row of that result, we unified everything into one statement:

SELECT 
    a.zip,
    (
        SELECT COUNT(o.id) 
        FROM orders AS o
        WHERE o.state = 'confirmed'
          AND JSON_CONTAINS(a.zip, JSON_UNQUOTE(JSON_EXTRACT(o.delivery_address, '$.zip'))) = 1
    ) AS confirmed,
    (
        SELECT COUNT(o.id) 
        FROM orders AS o
        WHERE o.state = 'delivered'
          AND JSON_CONTAINS(a.zip, JSON_UNQUOTE(JSON_EXTRACT(o.delivery_address, '$.zip'))) = 1
    ) AS delivered,
    ...
FROM areas AS a
ORDER BY a.zip ASC

Each subquery still needs to do a table scan for each row to determine which orders belong to which region. We found no fundamentally better way to avoid selecting over all orders for each row in the areas table. If you have any input, please use the comments at the bottom of this page. What we did improve was to give those subqueries an index to work with.

MySQL Generated Columns

Since version 5.7, MySQL supports “Generated Columns”: a column whose value is computed from an expression over the other columns of the row. Among other things, generated columns are a neat workaround for creating an index on a value stored inside a JSON data field. The MySQL configuration is nicely explained in this article. For our use case, we have something along the following lines:

ALTER TABLE orders
     ADD COLUMN generated_zip CHAR(4) GENERATED ALWAYS AS
        (JSON_UNQUOTE(JSON_EXTRACT(delivery_address, '$.zip')));

CREATE INDEX index_zip ON orders (generated_zip);

With that, our query becomes both more readable and able to use the index:

SELECT 
    a.zip,
    (
        SELECT COUNT(o.id) 
        FROM orders AS o
        WHERE o.state = 'confirmed'
          AND o.generated_zip = a.zip
    ) AS confirmed,
    (
        SELECT COUNT(o.id) 
        FROM orders AS o
        WHERE o.state = 'delivered'
          AND o.generated_zip = a.zip
    ) AS delivered,
    ...
FROM areas AS a
ORDER BY a.zip ASC

So far so good: this makes the query much more efficient. The rest of this blog post does not add further improvements, but explains how to make this solution work when using the Doctrine Schema tool / Doctrine Migrations.

Working around Doctrine

While Doctrine is an awesome tool that helps us a lot in this application, it deliberately does not support generated columns. This is a fair decision and does not stop us from using them for queries like the one above.

However, we use Doctrine Migrations to manage our database changes. The migrations do a diff between the current database and the models, and produce the code to delete columns and indices that do not exist on the models.
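The generated column and index from above therefore have to be created in a hand-written migration that the diff will never produce. A minimal sketch of such a migration, assuming Doctrine Migrations 3 (the version class name is a placeholder):

<?php

namespace App\Migrations;

use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\AbstractMigration;

/**
 * Adds the generated column and its index; Doctrine cannot generate this.
 */
final class Version20180101000000 extends AbstractMigration
{
    public function up(Schema $schema): void
    {
        $this->addSql("ALTER TABLE orders ADD COLUMN generated_zip CHAR(4) GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(delivery_address, '$.zip')))");
        $this->addSql('CREATE INDEX index_zip ON orders (generated_zip)');
    }

    public function down(Schema $schema): void
    {
        $this->addSql('DROP INDEX index_zip ON orders');
        $this->addSql('ALTER TABLE orders DROP COLUMN generated_zip');
    }
}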

It would help us if this issue got implemented. Meanwhile, inspired by an answer on Stack Overflow, we use a Doctrine schema listener to hide the column and index from Doctrine.

Our listener looks as follows:

<?php

namespace App\EventListener;

use Doctrine\Common\EventSubscriber;
use Doctrine\DBAL\Event\SchemaColumnDefinitionEventArgs;
use Doctrine\DBAL\Event\SchemaIndexDefinitionEventArgs;
use Doctrine\DBAL\Events;

/**
 * The orders.generated_zip column and orders.index_zip index have been created
 * with a manually crafted migration as Doctrine does not support generated
 * columns. This listener prevents migrations from wanting to remove the field
 * and index.
 */
class DoctrineSchemaListener implements EventSubscriber
{
    public function onSchemaColumnDefinition(SchemaColumnDefinitionEventArgs $eventArgs)
    {
        if ('orders' === $eventArgs->getTable()) {
            if ('generated_zip' === $eventArgs->getTableColumn()['Field']) {
                $eventArgs->preventDefault();
            }
        }
    }

    public function onSchemaIndexDefinition(SchemaIndexDefinitionEventArgs $eventArgs)
    {
        if ('orders' === $eventArgs->getTable() 
            && 'index_zip' === $eventArgs->getTableIndex()['name']
        ) {
            $eventArgs->preventDefault();
        }
    }

    /**
     * Returns an array of events this subscriber wants to listen to.
     *
     * @return string[]
     */
    public function getSubscribedEvents()
    {
        return [
            Events::onSchemaColumnDefinition,
            Events::onSchemaIndexDefinition,
        ];
    }
}
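The subscriber then has to be registered with the connection's event manager. A minimal sketch; in a Symfony application you would typically register the class as a service and tag it with doctrine.event_subscriber instead of wiring it up manually:

// Register the subscriber on the DBAL connection's event manager so that the
// onSchemaColumnDefinition / onSchemaIndexDefinition events reach it.
$connection->getEventManager()->addEventSubscriber(
    new \App\EventListener\DoctrineSchemaListener()
);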