Choosing the Right Elasticsearch Index Data Structure

Due to its rapid growth, our platform often undergoes fundamental changes. One of them, currently underway, is the introduction of a new search interface that will change the way users interact with our search. We want to share our decision-making process as we explored the pros and cons of different data structures, as well as the results of our performance testing, in the hope that it proves useful to someone.

The Old Way

In the old version of the search, a user could not search for one document type by the content of a related type. To give you an example, in the context of Dixa, the main entity is a conversation that contains messages. The old functionality did not let the user find conversations by their messages or, vice versa, find messages by their parent conversations. The new design unlocks this by allowing inter-domain searches, so you can easily find all conversations where the user “Iev Strygul” is mentioned in the messages.

The new design also aims to improve the user experience by introducing a “split view”, so the user does not need to jump from the search page to a conversation page, but can simply click on a conversation and open it on the same page in a separate view.

Old Design

New Design

To implement this on the backend, we realized we needed to redesign the structure of our primary Elasticsearch indices.

Choosing a data structure is not a trivial task. You will not only build your code around it, but it can also have a significant impact on search performance. In this article, we want to share our experience choosing an index data structure, as well as show the results of the performance testing that we conducted in the course of our investigation.

In Dixa, we have three closely related entities that come from separate database tables: conversations, messages and notes. A conversation consists of one or more messages as well as zero or more notes.

The new design integrates these objects into a split view, allowing the user to search for conversations by either messages or notes. As these entities were previously decoupled in our search engine, we needed to link them together to implement the new design. After digging through the Elasticsearch documentation, we decided that two data structures could satisfy our needs: the nested datatype and the join datatype (parent-child). Both structures have advantages and disadvantages, and some of their properties seemed more relevant to our design than others, so they deserve a separate mention.

Indexing

Our data resides in AWS DynamoDB and comes to Elasticsearch through a lambda function that receives a stream of events from the database and inserts documents into Elasticsearch. The parent-child approach would allow us to store the entities as separate documents in a single index, connecting related documents through a so-called join field, which greatly simplifies the indexing process. It would save us the effort of writing new lambda functions to enrich a single document with data coming from other database entities. We could just stream them from the database as before, only into a single index instead of three.
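To make this concrete, here is a minimal sketch of what such a join mapping and the direct writes could look like. The index name, field names and document ids are illustrative, not our actual production schema:

```python
import requests

ES = "http://localhost:9200"  # assumption: a local test cluster

# Join field mapping: one index holds conversations, messages and notes,
# related through the "relation" field (illustrative field names).
requests.put(f"{ES}/conversations", json={
    "mappings": {
        "properties": {
            "relation": {
                "type": "join",
                "relations": {"conversation": ["message", "note"]}
            },
            "subject": {"type": "text"},
            "body": {"type": "text"}
        }
    }
})

# A conversation event from the stream is written as-is as a parent document...
requests.put(f"{ES}/conversations/_doc/conv-1", json={
    "subject": "Refund request",
    "relation": "conversation"
})

# ...and a message event as a child document, routed to its parent's shard.
requests.put(
    f"{ES}/conversations/_doc/msg-1",
    params={"routing": "conv-1"},
    json={
        "body": "Hi, I would like a refund for order 1234",
        "relation": {"name": "message", "parent": "conv-1"}
    }
)
```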

Update

In addition to the lambda overhead, the nested approach would require a workaround for updating nested fields. Documents are immutable in Elasticsearch: if you update a single field, Elasticsearch removes the old document entirely and writes a new one with the updated field. This can create indexing pressure if a field contains an array of documents that are constantly being updated. In our case, we would have a conversation document with its messages as an array field. New messages are created constantly in our instant messenger, so every new message would mean rewriting the whole conversation document, which is obviously not optimal.
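For contrast, a minimal sketch of the nested alternative (again with illustrative names) shows why updates are painful: appending a message looks like a small change, but the whole conversation document gets rewritten every time:

```python
import requests

ES = "http://localhost:9200"  # assumption: a local test cluster

# Nested mapping: messages live inside the conversation document itself.
requests.put(f"{ES}/conversations_nested", json={
    "mappings": {
        "properties": {
            "subject": {"type": "text"},
            "messages": {
                "type": "nested",
                "properties": {"body": {"type": "text"}}
            }
        }
    }
})

# Appending a new message is a partial update on the surface, but because
# documents are immutable, Elasticsearch reindexes the whole conversation
# document (here assumed to exist as conv-1) on every new message.
requests.post(f"{ES}/conversations_nested/_update/conv-1", json={
    "script": {
        "source": "ctx._source.messages.add(params.msg)",
        "params": {"msg": {"body": "Hi, I would like a refund"}}
    }
})
```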

Performance

It seemed that the parent-child approach would save us from many of the issues linked to the nested approach. Reading technical blogs and the official documentation made us quite cautious, though, as they mentioned that this data structure could lead to significant performance problems. When you read that it could make your search one hundred times slower, it certainly makes you question adoption.

Routing

There were other downsides to the parent-child approach as well. To make the data structure efficient, a parent and its children must reside on the same shard, so Elasticsearch enforces routing at index time for this data structure. Routing can be a powerful instrument for optimizing searches, but you should be careful with it, because it can cause an unequal distribution of data among your shards and hurt performance rather than improve it. Elasticsearch normally lets you smooth out the distribution by configuring routing so that custom routing values go to a subset of shards (an index partition) instead of a single shard. This configuration is, however, not available for the parent-child data structure.
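For illustration, a hedged sketch of how routing plays out with the join mapping above (names are again illustrative):

```python
import requests

ES = "http://localhost:9200"  # assumption: a local test cluster

# When the parent id is known, passing it as the routing value confines the
# search to the one shard holding the conversation and all of its children.
requests.post(
    f"{ES}/conversations/_search",
    params={"routing": "conv-1"},
    json={"query": {"parent_id": {"type": "message", "id": "conv-1"}}}
)

# The usual escape hatch for skewed custom routing, index.routing_partition_size,
# cannot be combined with a join field in the mapping.
```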

Below is the summary of join (parent-child) versus nested approach properties that we used during our decision-making process:

Indexing
Join approach (parent-child): Since each database entity is written to Elasticsearch as a separate document in a single index, you can stream the database entries directly into the index.
Nested approach: Since children are written to Elasticsearch as fields of the parent entity, streaming database entities into Elasticsearch may require separate logic that enriches the parent each time a new child event arrives from the database.

Structure
Join approach (parent-child): All entities are written to a single index. The relationship between a parent and its children is established through a “join” field. A child cannot have more than one parent.
Nested approach: Children are written as an “array” field of their parent.

Querying
Join approach (parent-child): Allows retrieving a single entity, e.g. one message from a conversation. Cannot retrieve parent and child in a single query. When multiple parent-child relationships are present, querying can get complicated.
Nested approach: The nested query allows searching nested objects, since they are indexed as separate documents under the hood.

Update
Join approach (parent-child): Since each entity is a separate document in the index, updates are straightforward: updating a child or a parent rewrites only that document.
Nested approach: Since Elasticsearch documents are immutable, you cannot update only the child of an entity; you have to update the whole document, which can create reindexing overhead.

Routing
Join approach (parent-child): Routing is enforced. For performance reasons, parents and children have to reside on a single shard. Routing to an index partition, which would mitigate an uneven distribution of documents among shards under custom routing, cannot be used.
Nested approach: Routing is not enforced.

Scoring
Join approach (parent-child): If you query a parent by its children (“has_child” query), you can hit multiple children for a single parent. In this case, the _score is an aggregation of the individual scores of each matched child; the aggregation can be avg, sum, max or min. For more advanced scoring, a function_score query can be used.
Nested approach: Works out of the box.

Performance
Join approach (parent-child): Each join field, has_child or has_parent query adds a significant tax to your query performance.
Nested approach: In general, at least as performant as the parent-child approach; in some cases, it can outperform the parent-child approach by 100x or more.

Maintenance
Join approach (parent-child): Hard if many parent-child relationships are involved.
Nested approach: Easy.
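To make the querying and scoring rows more concrete, here is a sketch of the kind of inter-domain query the new design needs, finding conversations whose messages mention a given user, assuming the illustrative join mapping above:

```python
import requests

ES = "http://localhost:9200"  # assumption: a local test cluster

# "Find all conversations where 'Iev Strygul' is mentioned in the messages":
# a has_child query that scores each conversation from its matching messages.
query = {
    "query": {
        "has_child": {
            "type": "message",
            "score_mode": "max",   # one of: none, avg, sum, max, min
            "query": {"match": {"body": "Iev Strygul"}},
            "inner_hits": {}       # also return the matching child messages
        }
    }
}
resp = requests.post(f"{ES}/conversations/_search", json=query)
print(resp.json()["hits"]["total"])
```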

Performance testing

These theoretical pros and cons made us cautious about both approaches, and we did not want to make a choice before we saw how they would behave in practice. So, we set up a test cluster consisting of four r5.large.elasticsearch data nodes with 50GB of space and 16GB RAM each, and created a test plan to see how fast queries would be executed against the two different indices using the respective data structures.

In particular, we planned to execute a script containing search inputs of different sizes, from 2 to 251 characters (the maximum input length we allow in our search), with a proportionally increasing number of search terms, and to run it against indices of two different sizes, 10GB and 25GB, to see how the data structures perform as the index grows.
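The harness itself does not need to be sophisticated; here is a hypothetical sketch of the idea (the endpoint, index names and sample inputs are placeholders, not our actual test plan):

```python
import time
import requests

ES = "http://localhost:9200"                               # placeholder endpoint
INDICES = ["conversations_join", "conversations_nested"]   # hypothetical index names

# Placeholder search inputs of growing length (the real plan went from 2 to
# 251 characters with a proportionally growing number of terms).
inputs = ["hi", "refund order", "refund for order 1234 not received yet"]

for index in INDICES:
    for text in inputs:
        body = {"query": {"multi_match": {"query": text, "fields": ["*"]}}}
        start = time.monotonic()
        requests.post(f"{ES}/{index}/_search", json=body)
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"{index}: {len(text):3d} chars -> {elapsed_ms:.0f} ms")
```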

The first run showed very interesting results. At about 10GB, both approaches performed well within the time threshold of 1000ms that we deemed tolerable for our search. The difference in performance became visible once the indices reached about 25GB. While the nested approach gave stable results across all queries, the parent-child approach consistently performed much worse on the first few search queries (despite them being the shortest), then stabilized and improved after the first dozen queries. It looked as if it needed to “warm up” before it started to work efficiently, and the bigger the index, the more pronounced the difference in query times became. We did second and third runs; the behavior persisted and left us quite puzzled.

Querying 10GB index:

Querying 25GB index:

We started reading the Elasticsearch literature, trying to understand what could cause this phenomenon. After hours of research, we found that the parent-child approach is quite heavy on the computation of so-called ‘global ordinals’. These are in-memory data structures that optimize terms aggregations, and they are what Elasticsearch relies on, among other things, when resolving related documents in a join, such as finding the conversation corresponding to a given message.

We also found that we could move the computation of the global ordinals from search time to index time. This made sense for our use case because our main concern was search-time performance, while indexing time was not really an issue for us.

To do this, we had to enable eager computation of global ordinals, and we had to scrutinize the fine print of the documentation to realize that we also had to set the index refresh interval explicitly. Enabling eager loading of global ordinals shifts their recomputation to shard refresh time, which by default happens once every second as long as the shard has received a search request in the last 30 seconds. Given that our search traffic can have gaps of more than 30 seconds between requests, refreshing the index shards (and recomputing the ordinals) every second would be both unnecessarily frequent and potentially too costly given our index size.
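In configuration terms, the change boils down to two knobs. The snippet below is a sketch: the field names mirror the illustrative mapping above, and the 30-second refresh interval is an example value, not necessarily what we ended up shipping:

```python
import requests

ES = "http://localhost:9200"  # assumption: a local test cluster

# Build global ordinals for the join field eagerly, at refresh (index) time,
# instead of lazily at search time (can also be set at index creation).
requests.put(f"{ES}/conversations/_mapping", json={
    "properties": {
        "relation": {
            "type": "join",
            "relations": {"conversation": ["message", "note"]},
            "eager_global_ordinals": True
        }
    }
})

# Stretch the refresh interval so the ordinals are not rebuilt every second.
# The 30s value is illustrative.
requests.put(f"{ES}/conversations/_settings", json={
    "index": {"refresh_interval": "30s"}
})
```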

The Final Decision

When we ran another test with the new configuration, we found that search performance stabilized and stayed within the acceptable threshold of 1000ms. Thus, we decided to adopt the parent-child approach as our new data structure for conversations. Once this was decided, we opened a bottle of Marchese Manodori from 2015 to celebrate. A very decent wine, by the way! We fully endorse it for any upcoming data structure celebrations you may have planned.

Querying 25GB index 2nd attempt*:

*With computation of global ordinals for the parent-child approach shifted from the search to the index time.

A special thanks to Edin Dagasan for assisting with this research.

Author

Iev Strygul

Software Engineer at Dixa.
