Thursday, January 18, 2018

Introduction to MongoDB indexing: How MongoDB indexes depend on memory and I/O operations



CENTAUREA 18.12.2014
MongoDB is designed for fast reads and is oriented toward read access as much as possible. The main way it achieves this is indexing. Whenever you query data or run an aggregation, MongoDB tries to use indexes as much as possible, whether the query is simple or a complex aggregation pipeline (that’s why proper compound indexes can massively improve aggregation queries).

Let’s assume the following:

We have a collection of posts with id, date, and content:
{
    "id": …,
    "date": …,
    "content": …
}
At moment X we have only the initial index on "id".
Also, let’s assume that our collection is 10 GB and runs on a server with 2 GB of RAM.

Queries on non-indexed fields

In that case, every request that filters on “date” or “content”, e.g.

db.Post.find({"date": {"$gt": "X"}})
will force Mongo to go through every document in the collection, checking whether it satisfies the condition. As we have 10 GB of data and only 2 GB of RAM, Mongo will load 2 GB of data, check it, swap it out, load the next memory-mapped chunk with the next 2 GB of data, check it again, swap, load, and so on until it has checked all of the data. I hope the workflow is clear.

Checking the data itself is quite fast, but memory swapping is slow and expensive. In this scenario you will also see a lot of I/O operations (reading from disk and swapping to disk). So this case is slow and I/O intensive, that much is clear.
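The arithmetic behind this workflow can be sketched in a few lines of plain JavaScript. This is only a back-of-the-envelope model under the assumptions of the example (all RAM is available for data, and data is read in RAM-sized chunks). The function name fullScanCost is made up for illustration and is not a MongoDB API.

```javascript
// Rough model of a full collection scan when the data does not fit in RAM.
// Assumption: all RAM is usable for data and is refilled chunk by chunk.
function fullScanCost(collectionGB, ramGB) {
  // Number of RAM-sized chunks that must be loaded one after another.
  const chunkLoads = Math.ceil(collectionGB / ramGB);
  // Every byte of the collection is read from disk once per scan.
  const readFromDiskGB = collectionGB;
  return { chunkLoads, readFromDiskGB };
}

const cost = fullScanCost(10, 2);
console.log(cost); // { chunkLoads: 5, readFromDiskGB: 10 }
```

In this model every non-indexed query pays the full 10 GB of disk reads again, because the chunk left in memory at the end of one scan is not the chunk the next scan starts with.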

Queries on indexed fields

Let’s index all documents on "date". Now we have an additional index on the "date" field. For the sake of example, let’s assume its size is 200 MB.

Let’s execute a request based on the "date" field, e.g.

db.Post.find({"date": {"$gt": "X"}})
What will Mongo do in this case?

Check whether "date" is an indexed field
Find the proper index
Load it into memory
Do all the matching based only on the index (already loaded in memory, no additional disk operations), and then just return the documents that satisfy the condition.

On the second hit of a query like

db.Post.find({"date":{"$gt": "X"}})
it will work even faster, because the "date" index is already in memory, so there is no need to load it again.

Now you see that with the same server configuration and the same data we get fast performance and few I/O operations.
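To make the difference concrete, here is a toy model in plain JavaScript. It is a sketch, not how MongoDB is implemented: a sorted array with binary search stands in for the B-tree index, the sample dates are invented, and all function names are hypothetical.

```javascript
// Toy model: an index is a sorted list of (key, document position) pairs.
// Assumption: dates are ISO strings, so string order equals date order.
const posts = [];
for (let day = 1; day <= 28; day++) {
  const date = "2014-12-" + String(day).padStart(2, "0"); // invented sample data
  posts.push({ id: day, date: date, content: "post " + day });
}

// Without an index: every document is examined.
function scanGreaterThan(docs, x) {
  let examined = 0;
  const result = [];
  for (const doc of docs) {
    examined++;
    if (doc.date > x) result.push(doc);
  }
  return { result, examined };
}

// Build the index once: entries sorted by date.
function buildDateIndex(docs) {
  return docs
    .map((doc, pos) => ({ key: doc.date, pos }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// With the index: binary search for the first key > x,
// then read only the matching entries.
function indexGreaterThan(docs, index, x) {
  let lo = 0;
  let hi = index.length;
  let searchSteps = 0;
  while (lo < hi) {
    searchSteps++;
    const mid = (lo + hi) >> 1;
    if (index[mid].key > x) hi = mid;
    else lo = mid + 1;
  }
  const result = index.slice(lo).map((entry) => docs[entry.pos]);
  return { result, searchSteps };
}

const dateIndex = buildDateIndex(posts);
const scan = scanGreaterThan(posts, "2014-12-25");
const viaIndex = indexGreaterThan(posts, dateIndex, "2014-12-25");
console.log(scan.result.length, scan.examined);            // 3 28
console.log(viaIndex.result.length, viaIndex.searchSteps); // 3 5
```

Both paths return the same three posts, but the scan examines all 28 documents while the index lookup needs only a handful of comparisons before it reaches the matching range.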

So the short answer to the RAM vs. I/O dilemma is the following:

If you have proper indexes and enough memory to hold them, your system will be very fast and will need few I/O operations.
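This rule of thumb can be turned into a quick check. The sketch below is plain JavaScript with a hypothetical helper name. It ignores the "id" index (its size is not given in the example) and assumes that all RAM is available for indexes, which is optimistic. In a real deployment the sizes could be read from the mongo shell, for example from db.Post.stats().indexSizes.

```javascript
// Do the indexes we rely on fit into the available RAM?
// Sizes are in MB; the helper name is illustrative only.
function indexesFitInMemory(indexSizesMB, ramMB) {
  const totalMB = Object.values(indexSizesMB).reduce((sum, s) => sum + s, 0);
  return { totalMB, fits: totalMB <= ramMB };
}

// The "date" index from the example on a server with 2 GB of RAM.
const check = indexesFitInMemory({ date: 200 }, 2048);
console.log(check); // { totalMB: 200, fits: true }
```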

Back to the real world

But we have more complex problems in real life. Besides simple queries we have compound queries like

db.Post.find({"date": {"$gt": "X"}, "content": "MongoDB rock!" })
And even if we have indexes on most of the fields you query, that does not solve the problem of compound queries. In this case, to avoid I/O operations you need compound indexes, e.g.

{"date": 1, "content": 1}
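Which queries can such an index serve on its own? The sketch below encodes a simplified version of the prefix rule for compound indexes: the queried fields have to cover a leading prefix of the index keys. It is plain JavaScript with a hypothetical function name, and it deliberately ignores sort order and partial index use, so treat it as an approximation of what the query planner does.

```javascript
// Simplified prefix rule: a compound index can resolve a filter by itself
// when the queried fields cover a leading prefix of the index keys.
function indexCanServe(indexKeys, queryFields) {
  const wanted = new Set(queryFields);
  let prefixLength = 0;
  for (const key of indexKeys) {
    if (!wanted.has(key)) break;
    prefixLength++;
  }
  // All queried fields must be covered by that leading prefix.
  return prefixLength > 0 && prefixLength === wanted.size;
}

const compound = ["date", "content"]; // keys of {"date": 1, "content": 1}
console.log(indexCanServe(compound, ["date", "content"])); // true
console.log(indexCanServe(compound, ["date"]));            // true
console.log(indexCanServe(compound, ["content"]));         // false
```

The last line is why a query on "content" alone still needs its own index.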
So let’s get back to our example and assume the following:

Add an index on "content" and a compound index {"date": 1, "content": 1}.
The size of the "content" index is 4 GB, for example.
The size of the compound index is 4.5 GB.
Let’s see how it works now.

In the previous example we queried db.Post.find({"date": {"$gt": "X"}}). OK, we know what is going on: it loads the data, looks through the index, etc. Let’s now check what happens for the query:

db.Post.find({"date": {"$gt": "X"}, "content": "MongoDB rock!"})
OK, this operation needs the "date" and "content" fields, so we look for the corresponding index.
The index is found, but its size is 4.5 GB. We have only 1.8 GB of RAM (200 MB is already spent on the "date" index).
We load 40% of the index, check it, swap it out, load the second 40%, check it, swap, load the remaining 20%, check it, and we are done.

So right now we have the "date" index in memory, along with the last part of the compound index, and we can already see some increase in I/O operations. Then we hit a query like db.Post.find({ "content": "MongoDB rock!" }). We need the "content" index for that, but we have no free memory, so we start by swapping out the previously loaded indexes, then load the "content" index, check it, and return the results.

This case shows that different indexes compete for memory, and it gets worse as the number of indexes and collections grows. So more collections => more indexes => more competition for resources => more disk I/O => slower system.
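The competition between indexes can be simulated with a toy page cache. The sketch below is plain JavaScript and rests on several assumptions that are not in the example above: memory is managed in 100 MB pages, eviction is least-recently-used, every query scans its whole index, and 2 GB is rounded to 2000 MB. All names are hypothetical, and real memory management in MongoDB and the operating system is more sophisticated.

```javascript
// Toy LRU page cache: indexes are split into 100 MB pages that compete
// for a fixed amount of RAM. Index sizes follow the example in the text.
const PAGE_MB = 100;

function makeCache(ramMB) {
  const capacity = Math.floor(ramMB / PAGE_MB);
  const pages = new Map(); // insertion order doubles as recency order
  let diskReads = 0;
  return {
    touch(pageId) {
      if (pages.has(pageId)) {
        pages.delete(pageId); // hit: re-inserted below as most recent
      } else {
        diskReads++; // miss: the page has to come from disk
        if (pages.size >= capacity) {
          const oldest = pages.keys().next().value;
          pages.delete(oldest); // evict the least recently used page
        }
      }
      pages.set(pageId, true);
    },
    reads() {
      return diskReads;
    },
  };
}

// Scanning an index touches each of its pages once.
function scanIndex(cache, name, sizeMB) {
  const before = cache.reads();
  const pageCount = Math.ceil(sizeMB / PAGE_MB);
  for (let i = 0; i < pageCount; i++) cache.touch(name + ":" + i);
  return cache.reads() - before; // pages read from disk during this scan
}

const cache = makeCache(2000);
const runs = [
  ["date", 200],          // first run loads the index
  ["date", 200],          // second run is served from RAM
  ["date_content", 4500], // compound index does not fit
  ["content", 4000],      // evicts what was loaded before
  ["date", 200],          // "date" has to be read from disk again
].map(([name, sizeMB]) => name + ": " + scanIndex(cache, name, sizeMB));

console.log(runs.join(", "));
// date: 2, date: 0, date_content: 45, content: 40, date: 2
```

The second "date" query costs no disk reads, but once the two large indexes have passed through memory, the same query has to load its index from disk again.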

Besides everything mentioned above, you should not forget about memory fragmentation. If you constantly load many differently sized files into RAM, it becomes fragmented over time. You could find that you have 1 GB of free RAM that MongoDB can’t use, because it is spread across the address space in small 20-30 MB chunks while Mongo needs 500 MB in a single chunk.
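The numbers in this paragraph can be checked with a small sketch. It is plain JavaScript with a hypothetical helper name, and the chunk layout (40 chunks of 25 MB) is just one possible way to arrive at roughly 1 GB of fragmented free memory.

```javascript
// Free memory split into many small chunks cannot satisfy one large
// contiguous request, even when the total looks sufficient.
function canAllocateContiguous(freeChunksMB, requestMB) {
  const totalFreeMB = freeChunksMB.reduce((sum, c) => sum + c, 0);
  const largestChunkMB = Math.max(...freeChunksMB);
  return { totalFreeMB, largestChunkMB, ok: largestChunkMB >= requestMB };
}

// 40 free chunks of 25 MB each: 1000 MB free in total.
const freeChunks = new Array(40).fill(25);
const allocation = canAllocateContiguous(freeChunks, 500);
console.log(allocation);
// { totalFreeMB: 1000, largestChunkMB: 25, ok: false }
```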

Conclusion

Taking all these cases into account, I have prepared some recommendations on how to improve MongoDB performance if you are torn over the RAM vs. I/O dilemma:

From the hardware point of view. Increasing either RAM or I/O capacity will have a positive effect on the system, but it is difficult to say which will bring the bigger improvement. It depends heavily on the MongoDB usage pattern. But if you have the chance to choose between a couple more gigabytes of RAM and an SSD, the SSD will be more effective in most cases.

Multiple servers. When spreading data across multiple servers, it is better to keep small and medium-sized collections on a single server (not sharded across multiple servers) and to shard only big collections between multiple nodes. This reduces the number of indexes per server and so reduces the competition for resources.

It is better to run nothing except MongoDB on the database server, to avoid additional contention for resources. Also verify your index model and make sure that it covers the most frequently used cases.
