Choosing the Right Elasticsearch Index Data Structure

Due to its extensive growth, our platform often faces cardinal changes. One of them, which is currently underway, is the introduction of a new search interface, which will change the way users interact with our search. We wanted to share our decision-making process as we explored the pros and cons of different data structures, as well as the results of our performance testing, in hopes that it could prove useful to someone.

The Old Way

In the old version of the search, a user was not able to search for a specific document type by a type related to it. To give you an example, in the context of Dixa, the main entity is a conversation that contains messages. The old functionality did not allow the user to find conversations by their messages, and, vice versa, to find messages by the parent conversations. The new design unlocks this by allowing inter-domain searches. So you can easily find all conversations where the user “Iev Strygul” is mentioned in the messages.

In addition, it also aims to improve the user experience by introducing a “split-view”, so the user does not need to jump from a search page to a conversation page, but can simply click on the conversation, opening it on the same page in a separate view.

Old Design

New Design

To be able to implement it on the backend, we realized that we needed to do a redesign of the structure of our primary Elasticsearch indices.

Choosing a data structure is not a trivial task. You will not only build your code around it, but it can also have a significant impact on search performance. In this article, we want to share our experience choosing an index data structure, as well as show the results of the performance testing that we conducted in the course of our investigation.

In Dixa, we have three closely related entities that come from separate database tables: conversations, messages and notes. The former consists of one or more messages as well as zero or more notes.

The new design integrates these objects into a split view, allowing the user to search for conversations by either messages or notes. As these entities were previously decoupled in our search engine, we needed to link them together to implement the new design. After digging through the documentation on Elasticsearch, we decided that two data structures could satisfy our needs: Nested datatype and Join datatype (parent-child). Both structures seemed to have advantages and disadvantages. Some of the properties of these data structures seemed to be more relevant to our design than the others and therefore deserve a separate mention.

Indexing

Our data resides in AWS DynamoDB and comes to Elasticsearch through a lambda function that receives a stream of events from the database and inserts documents in Elasticsearch. The parent-child approach would allow us to store the entities in Elasticsearch as separate documents in a single index. The parent-child design offers a denormalized document structure, connecting related documents through a so-called join field, thereby greatly simplifying the indexing process. It would save us the effort of writing new lambda functions to enrich a single document with data that comes from other database entities. We could just stream them from the database as we did it before, just in a single index instead of three.

Update

In addition to the overhead with lambdas, the nested approach would require us to find a workaround to update the nested fields. Documents are immutable in Elasticsearch, and if you update a single field, Elasticsearch completely removes the old document and writes the new one with the updated field. This could create an indexing pressure if you have a field which contains an array of documents which are constantly being updated. In our case, we would have a conversations document and messages as an array field of this document. Creation of new messages happens constantly in our instant messenger — this would mean rewriting the whole conversation document each time a new message is added which is obviously not optimal.

Performance

It seemed that the parent-child approach would save us from many of the issues linked to the nested approach. Reading technical blogs and official documentation made us quite cautious though, as it mentioned that this data structure could lead to significant performance problems. When you read that it could make your search one hundred times slower, it certainly makes you question adoption.

Routing

There were also other downsides to the parent-child approach. To make the data structure efficient, the parent and its children should reside on the same shard. Therefore, Elasticsearch enforces routing at the index time for this data structure. Routing could be a strong instrument to optimize searches, but you should be very careful with it because it could cause unequal distribution of data among your shards, and hurt performance rather than improving it. Elasticsearch allows for normalizing the distribution by configuring the routing in a way that custom routing values go to a subset of shards instead of a single shard. This configuration is, however, not available for the parent-child data structure.

The summary of join (parent-child) versus nested approach properties that we used during our decision-making process:

Type	Join approach (parent-child)	Nested approach
Indexing	Since each database entity is being written in Elasticsearch as a separate document in a single index, you can stream the database entries directly into the index.	Since children are being written in Elasticsearch as fields of the parent entity, if you want to stream your database entities into Elasticsearch, you might need to create a separate logic that will enrich the parent with data each time a new child event comes from a database.
Structure	All entities are written in a single index. The relationship between a parent and its children is established through a “join” field. A child cannot have more than one parent.	Children are being written as an “array” field of their parent.
Querying	Allows retrieving of a single entity, e.g. only one single message from a conversation. Cannot retrieve parent and child in a single query. When multiple parent-child relationships are present, querying can get complicated.	Nested query allows for searching of nested objects as they are indexed as separate documents.
Update	Since you write each entity as a separate document in an index, the update is quite straightforward: if you update a child/parent, it rewrites only this document.	Since Elasticsearch documents are immutable, you cannot update only the child of an entity, you have to update the whole document, which could create a reindexing overhead.
Routing	Routing is enforced. For performance considerations, parents and children have to reside on a single shard. Cannot use routing to an index partition to mitigate uneven distribution of documents among shards in case of using routing.	Routing is not enforced.
Scoring	If you are querying a parent by its children (“has_child” query), you can hit multiple children for a single parent. In this case, the `_score` will be an aggregation of the individual scores of each matched document. Aggregation can be `avg`, `sum`, `max`, `min`. For more advanced scoring, a `function_score` query can be used.	Out of the box.
Performance	Each join field, `has_child` or `has_parent` query adds a significant tax to your query performance.	In general, it should be at least as performant as parent-child approach. In some cases, it can outperform the parent-child approach by 100x plus.
Maintenance	Hard if many parent-child relationships involved.	Easy.

Performance testing

These theoretical pros and cons made us cautious about both approaches, and we did not want to make a choice before we saw how they would behave in practice. So, we set up a test cluster consisting of four r5.large.elasticsearch data nodes with 50GB of space and 16GB RAM each, and created a test plan to see how fast queries would be executed against the two different indices using the respective data structures.

In particular, we planned to execute a script that would contain search terms of different sizes, from 2 to 251 chars (which is the maximum number of chars that we allow in our search as input; with a proportionally increasing amount of search terms), and run it against indices of two different sizes, 10GB and 25GB, to see how the data structures perform on indices of different sizes.

The first run showed very interesting results. When both indices were about 10GB in size, both approaches performed very well within the time threshold of 1000ms that we deemed tolerable for our search. The difference in performance started to be visible once both index sizes reached about 25GB. While the nested approach seemed to give stable results across all queries, it seemed that the parent-child approach was always performing much worse for the first few search queries (despite being the shortest), while stabilizing and improving after the first dozen queries. It looked like it needed to “warm up” before it started to work efficiently. The bigger the index, the more obvious the difference between the time that Elasticsearch needed to find the data. We made second and third runs, and this behavior persisted and left us quite puzzled.

Querying 10GB index:

Querying 25GB index:

We started to read the literature on Elasticsearch, trying to understand what could cause this phenomenon. After hours of research, we found that the parent-child approach is quite heavy on computation of so-called ‘global ordinals’. They are in-memory data structures which help with optimizing terms aggregations and that is what Elasticsearch applies, amongst others, when resolving interrelated, denormalized documents — like finding the conversation corresponding to a given message.

We also found that we can move the computation of the global ordinals from the search to the index time. It made sense for our use case because our main concern was the search time performance, while the indexing time was not really an issue for us.

To do this, we had to enable the eager computation of global ordinals and to scrutinize the documentation fine print to realize that we explicitly had to set the index refresh interval — because enabling eager loading of global ordinals shifts its recomputation to shard refresh time, which in turn, by default, happens once every second if it has received a search request in the last 30 seconds. While potentially having gaps between search requests of more than 30s, refreshing the index shards (and recomputing the ordinals) every second would both be unnecessarily frequent, and, potentially too costly given our index size.

The Final Decision

When we did another test with the new configuration, we found that the search performance stabilized, and the performance was within the acceptable threshold of 1000ms. Thus, we decided to adopt the parent-child approach as our new data structure for conversations. Once this was decided, we opened a bottle of Marchese Manodori from 2015 to celebrate. A very decent wine, by the way! We fully endorse it for any upcoming data structure celebrations you may have planned.

Querying 25GB index 2nd attempt*:

*With computation of global ordinals for the parent-child approach shifted from the search to the index time.