|Indexing||Because each database entity is written to Elasticsearch as a separate document in a single index, you can stream database entries directly into the index.||Because children are written to Elasticsearch as fields of the parent entity, streaming database entities into Elasticsearch may require separate logic that enriches the parent with data each time a new child event arrives from the database.|
|Structure||All entities are written to a single index. The relationship between a parent and its children is established through a “join” field. A child cannot have more than one parent.||Children are written as an “array” field of their parent.|
|Querying||Allows retrieving a single entity, e.g. just one message from a conversation. Cannot retrieve parent and child in a single query. When multiple parent-child relationships are present, querying can get complicated.||A nested query allows searching nested objects because they are indexed as separate documents.|
|Update||Since you write each entity as a separate document in an index, the update is quite straightforward: if you update a child or a parent, only that document is rewritten.||Since Elasticsearch documents are immutable, you cannot update only the child of an entity. You have to update the whole document, which can create a reindexing overhead.|
|Routing||Routing is enforced. For performance reasons, a parent and its children have to reside on the same shard. When custom routing is used, you cannot route to an index partition to mitigate an uneven distribution of documents among shards.||Routing is not enforced.|
|Scoring||If you are querying a parent by its children (“has_child” query), you can hit multiple children for a single parent. In this case, the _score will be an aggregation of the individual scores of each matched document. The aggregation can be, for example, min. For more advanced scoring, a function_score query can be used.||Out of the box.|
|Performance||Each join field, has_child or has_parent query adds a significant tax to your query performance.||In general, it should be at least as performant as the parent-child approach. In some cases, it can outperform the parent-child approach by 100x or more.|
|Maintenance||Hard if many parent-child relationships involved.||Easy.|
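To make the two structures in the table more concrete, here is a minimal sketch of what the index bodies and queries could look like, written as plain Python dictionaries. The field names (`text`, `relation`, `messages`) and the relation names (`conversation`, `message`) are assumptions chosen for illustration, not our actual schema, and the bodies assume Elasticsearch 7-style typeless mappings.

```python
# Hypothetical index layouts for conversations and their messages.
# All field and relation names below are illustrative assumptions.


def parent_child_mapping():
    """Single index; a join field links each message to its conversation."""
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "relation": {
                    "type": "join",
                    "relations": {"conversation": "message"},
                },
            }
        }
    }


def nested_mapping():
    """Messages are stored as a nested array field on the conversation."""
    return {
        "mappings": {
            "properties": {
                "messages": {
                    "type": "nested",
                    "properties": {"text": {"type": "text"}},
                }
            }
        }
    }


def message_document(text, conversation_id):
    """Child document plus the routing value that keeps it on the parent's shard."""
    body = {
        "text": text,
        "relation": {"name": "message", "parent": conversation_id},
    }
    return body, conversation_id


def has_child_query(term, score_mode="min"):
    """Find conversations by their messages; child scores are aggregated."""
    return {
        "query": {
            "has_child": {
                "type": "message",
                "score_mode": score_mode,
                "query": {"match": {"text": term}},
            }
        }
    }


def nested_query(term):
    """Find conversations by searching inside their nested messages."""
    return {
        "query": {
            "nested": {
                "path": "messages",
                "query": {"match": {"messages.text": term}},
            }
        }
    }
```

The `message_document` helper illustrates the routing row of the table: the child has to be indexed with the parent's ID as its routing value so that both end up on the same shard.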
These theoretical pros and cons made us cautious about both approaches, and we did not want to make a choice before we saw how they would behave in practice. So, we set up a test cluster consisting of four r5.large.elasticsearch data nodes with 50GB of space and 16GB RAM each, and created a test plan to see how fast queries would be executed against the two different indices using the respective data structures.
In particular, we planned to run a script containing search inputs of different lengths, from 2 to 251 characters (251 being the maximum input length our search allows), with the number of search terms growing proportionally to the length. We would run it against indices of two different sizes, 10GB and 25GB, to see how each data structure performs as the index grows.
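A test plan like this can be sketched roughly as follows. This is not the script we actually ran; the function names, the sample vocabulary, and the `search` callable (which stands in for whatever client call sends the query to the cluster) are all assumptions made for illustration.

```python
import time

MIN_CHARS = 2
MAX_CHARS = 251  # maximum input length the search accepts, per the text above


def build_search_inputs(vocabulary, lengths):
    """Build one input string per target length by appending whole words.

    Longer inputs therefore contain proportionally more search terms.
    No input exceeds its target length.
    """
    inputs = []
    for target in lengths:
        words = []
        i = 0
        while True:
            next_word = vocabulary[i % len(vocabulary)]
            candidate = " ".join(words + [next_word])
            if len(candidate) > target:
                break
            words.append(next_word)
            i += 1
        text = " ".join(words)
        # If even the first word is too long, fall back to a truncated word.
        inputs.append(text if text else vocabulary[0][:target])
    return inputs


def time_queries(search, inputs, threshold_ms=1000):
    """Run search(text) for each input.

    Returns a list of (input_length, elapsed_ms, within_threshold) tuples.
    """
    results = []
    for text in inputs:
        start = time.perf_counter()
        search(text)
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append((len(text), elapsed_ms, elapsed_ms <= threshold_ms))
    return results
```

Running the same list of inputs several times in a row against each index is what makes a pattern such as slow first queries visible, because the per-query timings can be compared by position as well as by input length.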
The first run showed very interesting results. When both indices were about 10GB in size, both approaches performed well within the 1000ms threshold that we deemed tolerable for our search. The difference in performance became visible once both indices reached about 25GB. While the nested approach gave stable results across all queries, the parent-child approach always performed much worse on the first few search queries (even though they were the shortest), then stabilized and improved after the first dozen queries. It looked like it needed to “warm up” before it started to work efficiently. The bigger the index, the more obvious the difference in the time Elasticsearch needed to find the data. We made second and third runs, and this behavior persisted and left us quite puzzled.
Querying 10GB index:
Querying 25GB index:
We started to read the literature on Elasticsearch, trying to understand what could cause this phenomenon. After hours of research, we found that the parent-child approach relies heavily on the computation of so-called “global ordinals”. These are in-memory data structures that help optimize terms aggregations, and Elasticsearch also uses them, among other things, when resolving interrelated, denormalized documents, such as finding the conversation that a given message belongs to.
We also found that we can move the computation of the global ordinals from the search to the index time. It made sense for our use case because our main concern was the search time performance, while the indexing time was not really an issue for us.
To do this, we had to enable eager computation of global ordinals. We also had to scrutinize the documentation's fine print to realize that we had to set the index refresh interval explicitly: enabling eager loading of global ordinals shifts their recomputation to shard refresh time, and a refresh, by default, happens once every second if the index has received a search request in the last 30 seconds. Our search traffic can have gaps of more than 30 seconds between requests, but whenever requests do arrive, refreshing the index shards (and recomputing the ordinals) every second would be both unnecessarily frequent and potentially too costly given our index size.
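The configuration described above could look roughly like the index body below. This is a sketch under assumptions: the join field name `relation`, the relation names, and in particular the `30s` refresh interval are placeholders rather than the values we used. `eager_global_ordinals` and `index.refresh_interval` are the Elasticsearch settings involved; depending on the Elasticsearch version, eager loading may already be the default for join fields, in which case setting it explicitly simply documents the intent.

```python
import json


def eager_ordinals_index_body(refresh_interval="30s"):
    """Index body that builds global ordinals at refresh time.

    The refresh interval default here is an illustrative assumption;
    pick a value that fits your own indexing and freshness requirements.
    """
    return {
        "settings": {
            "index": {
                # An explicit interval replaces the default refresh behavior.
                "refresh_interval": refresh_interval,
            }
        },
        "mappings": {
            "properties": {
                "relation": {  # hypothetical join field name
                    "type": "join",
                    "relations": {"conversation": "message"},
                    # Build ordinals on refresh instead of on first search.
                    "eager_global_ordinals": True,
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps(eager_ordinals_index_body(), indent=2))
```

The trade-off is that each refresh becomes more expensive, which is why a longer, explicit refresh interval goes hand in hand with eager loading.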
The Final Decision
When we did another test with the new configuration, we found that the search performance stabilized, and the performance was within the acceptable threshold of 1000ms. Thus, we decided to adopt the parent-child approach as our new data structure for conversations. Once this was decided, we opened a bottle of Marchese Manodori from 2015 to celebrate. A very decent wine, by the way! We fully endorse it for any upcoming data structure celebrations you may have planned.
Querying 25GB index 2nd attempt*:
*With computation of global ordinals for the parent-child approach shifted from the search to the index time.
A special thanks to Edin Dagasan for assisting with this research.