Introduction to MongoDB indexing: How MongoDB indexes
depend on memory and I/O operations
CENTAUREA 18.12.2014
MongoDB is designed to deliver excellent read performance and is oriented toward read access as much as possible. The main tool for achieving this is indexing. Whenever you query data or run an aggregation, MongoDB tries to use indexes as much as it can, no matter whether it is a simple query or a complex aggregation pipeline (that's why proper compound indexes can massively improve aggregation queries).
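For instance, here is a minimal sketch of an aggregation whose leading $match stage can be satisfied by an index; the Post collection and its fields come from the example below, and the index on "date" is assumed to already exist:

// Sketch: with an index on "date" (or a compound index starting with "date"),
// the leading $match stage can use that index instead of scanning every document.
db.Post.aggregate([
    { "$match": { "date": { "$gt": "X" } } },
    { "$group": { "_id": "$content", "count": { "$sum": 1 } } }
])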
Let’s assume the following:
We have a collection of posts with id, date and content fields:
{
"id": …,
"date": …,
"content": …
}
At moment X we have only the initial index on "id". Let's also assume that the collection is 10 GB in size and that it runs on a server with 2 GB of RAM.
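To make the setup concrete, here is a sketch of a sample document and of how to check which indexes exist at moment X (the values are placeholders):

// A hypothetical Post document
db.Post.insert({ "id": 1, "date": "2014-12-18", "content": "…" })

// At moment X only the initial index exists; getIndexes() lists what is there
db.Post.getIndexes()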
Queries on non-indexed fields
In that case every request based on "date" or "content", e.g.
db.Post.find({"date": {"$gt": "X"}})
forces Mongo to walk through every document in the collection, checking whether it satisfies the query or not. Since we have 10 GB of data and only 2 GB of RAM, Mongo will load 2 GB of data, check it, swap it out to disk, load the memory-mapped file with the next 2 GB of data, check it again, then swap, load, and so on until all of the data has been checked. I hope the workflow is clear.
Checking the data itself is quite a fast operation, but memory swapping is slow and expensive. In this scenario you will also see a lot of I/O (reading from disk and swapping back to disk). So this case is slow and I/O intensive, that much is clear.
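A quick way to confirm that such a query scans the whole collection is explain(); a sketch, with the caveat that the exact output fields differ between MongoDB versions:

// Without an index on "date" the planner has to examine every document
db.Post.find({"date": {"$gt": "X"}}).explain()
// Older shells report "cursor": "BasicCursor", newer ones a "COLLSCAN" stage;
// in both cases the number of scanned documents equals the size of the collection.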
Queries on indexed fields
Let's index all documents by "date", so that we now have an additional index on the "date" field. Just for the sake of the example, assume its size is 200 MB.
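Here is a sketch of how such an index can be created and its size checked; ensureIndex is the older shell helper (createIndex in newer shells), and the 200 MB figure is purely our assumption:

// Build the single-field index on "date"
db.Post.ensureIndex({ "date": 1 })

// Per-index sizes in bytes; we assume "date_1" comes out at roughly 200 MB
db.Post.stats().indexSizes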
Let's execute a request on the "date" field, e.g.
db.Post.find({"date": {"$gt": "X"}})
What will Mongo do in this case?
1. Check whether "date" is an indexed field.
2. Find the proper index.
3. Load it into memory.
4. Do all the matching based on the index alone (it is already in memory, so no additional disk operations are needed), and then return the documents that satisfy the query.
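Running explain() again should now show the index being used instead of a collection scan; a sketch, with the output fields again depending on the version:

db.Post.find({"date": {"$gt": "X"}}).explain()
// Expect "cursor": "BtreeCursor date_1" in older shells, or an "IXSCAN" stage
// over the "date_1" index in newer ones.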
On the second hit of a query like
db.Post.find({"date": {"$gt": "X"}})
it will work even faster, because the date index is already in memory, so there is no need to load it again.
Now you can see that with the same server configuration and the same data we get fast performance and few I/O operations.
So the short answer to the RAM vs. I/O dilemma is the following: if you have proper indexes and enough memory to keep those indexes in RAM, your system will be very fast and will need few I/O operations.
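A sketch of how to compare the total index size with the memory mongod actually has; totalIndexSize() returns bytes, while serverStatus().mem reports megabytes:

// Total size of all indexes on the collection, in bytes
db.Post.totalIndexSize()

// Resident and mapped memory of the mongod process, in MB
db.serverStatus().mem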
Back to the real world
But real life brings more complex problems. Besides simple queries we have compound queries like
db.Post.find({"date": {"$gt": "X"}, "content": "MongoDB rock!"})
and even if we have indexes on most of the fields we query, that does not solve the problem of compound queries. In this case, in order to avoid I/O operations, you need compound indexes, e.g.
{"date": 1, "content": 1}
So let's get back to our example and assume the following (a sketch of creating these indexes is shown below):
- We add an index on "content" and the compound index {"date": 1, "content": 1}.
- The size of the "content" index is, say, 4 GB.
- The size of the compound index is 4.5 GB.
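A sketch of creating both indexes; the sizes above are assumptions about our data, not something these commands report by themselves:

db.Post.ensureIndex({ "content": 1 })              // single-field index, ~4 GB in our example
db.Post.ensureIndex({ "date": 1, "content": 1 })   // compound index, ~4.5 GB in our example
db.Post.stats().indexSizes                         // check the actual sizes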
Let’s see how it works now.
In the previous example we queried db.Post.find({"date": {"$gt": "X"}}). OK, we know what is going on there: it loads the index, looks through it, and so on. Now let's check what happens for the query:
db.Post.find({"date": {"$gt": "X"}, "content": "MongoDB rock!"})
OK, this query involves the "date" and "content" fields, so we look for the corresponding index. The index is found, but its size is 4.5 GB and we have only 1.8 GB of RAM left (200 MB is already spent on the "date" index). So Mongo loads about 40% of the index, checks it, swaps it out to disk, loads the next 40%, checks it, swaps again, loads the rest, checks it, and we are done.
So right now we have the "date" index in memory plus roughly the last 30% of the compound index, and we can already see the number of I/O operations increasing. Then we hit a query like db.Post.find({"content": "MongoDB rock!"}). We need the "content" index for that, but there is no free memory, so Mongo starts by swapping out the previously loaded indexes, then loads the "content" index, checks it and returns the results. This case shows that different indexes compete for memory, and it gets worse as the number of indexes and collections grows: more collections => more indexes => more competition for resources => more disk I/O => slower system.
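One way to watch this competition in practice is to look at page faults while such queries run; a sketch, noting that the exact counters available depend on the platform and storage engine:

// Cumulative page faults since mongod started (available on Linux)
db.serverStatus().extra_info.page_faults

// From the command line, the "faults" column of mongostat shows faults per second:
//   mongostat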
On top of everything mentioned above, you should not forget about memory fragmentation. If you constantly load many differently sized files into RAM, it becomes fragmented over time, and you may end up with 1 GB of free RAM that MongoDB cannot use, because it is scattered across the address space in 20-30 MB chunks while Mongo needs 500 MB in a single contiguous chunk.
Conclusion
Taking all these cases into account, here are some recommendations on how to improve MongoDB performance if you are hesitating over the RAM vs. I/O dilemma:
From the hardware point of view. Increasing either RAM or I/O capacity will have a positive effect on the system, but it is hard to say in advance which upgrade will help more; it depends heavily on your MongoDB usage pattern. Still, if you get to choose between a couple more gigabytes of RAM and an SSD, the SSD will be more effective in most cases.
Multiple servers. When spreading data across multiple servers, it is better to keep small and medium-sized collections whole on a single server (not sharded across multiple servers) and to shard only the big collections across multiple nodes, as the sketch below shows. This reduces the number of indexes per server and the competition for resources.
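A sketch of sharding only the big collection while the small ones stay whole on their primary shard; the database name and shard key here are hypothetical, and a real shard key needs to be chosen carefully:

// Enable sharding for the database; unsharded collections stay on the primary shard
sh.enableSharding("blog")

// Shard only the big Post collection (the shard key field must be indexed)
sh.shardCollection("blog.Post", { "id": 1 })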
Dedicated database server. It is better to have nothing except MongoDB running on the database server, to avoid additional competition for resources. Also verify your index model and make sure it covers the most frequently used queries.
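A sketch of a quick index audit from the shell; the compound query is the one from our example:

// List all indexes and compare them with the most frequent queries
db.Post.getIndexes()

// Confirm that a frequent compound query actually uses one of them
db.Post.find({"date": {"$gt": "X"}, "content": "MongoDB rock!"}).explain()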