The first question to ask, then, when moving from a relational database to MongoDB is, ‘How will this data be accessed?’ Other important questions include:
What is the access pattern? What are you hoping to show to your customers/users? How are you going to write this data?
There is no such straightforward mapping in MongoDB; instead, relationships are modeled using embedded documents and linked (referenced) documents.
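A minimal sketch of the two options, using hypothetical blog data (the field names `comments` and `postId` and the helper `commentsFor` are made up for illustration, not taken from any source above):

```javascript
// Hypothetical blog data; all field names and values are illustrative.

// Embedding: comments live inside the post, so one read returns everything.
const embeddedPost = {
  _id: 1,
  title: "Choosing a shard key",
  comments: [
    { author: "ann", text: "Very helpful." },
    { author: "bob", text: "What about hashed keys?" },
  ],
};

// Linking: comments are separate documents that point back to the post by _id.
const linkedPost = { _id: 1, title: "Choosing a shard key" };
const linkedComments = [
  { _id: 101, postId: 1, author: "ann", text: "Very helpful." },
  { _id: 102, postId: 1, author: "bob", text: "What about hashed keys?" },
];

// With linking, the application resolves the relationship with a second lookup.
function commentsFor(post, comments) {
  return comments.filter((c) => c.postId === post._id);
}

console.log(embeddedPost.comments.length);
console.log(commentsFor(linkedPost, linkedComments).length);
```

Which form fits better depends on the access pattern: embedding favors reads that always need the whole object, linking favors data that grows without bound or is queried on its own.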
In real-life large deployments, the biggest impact on performance is how well the schema design fits the application's needs. The second biggest impact comes from missing indexes, wrong indexes, or far too many indexes.
Sharding too early may be a premature optimization. Not every MongoDB deployment requires sharding.
When you have a very poorly tuned schema or incorrect indexes, sharding won’t solve your problem. Sharding is appropriate when a specific resource becomes a bottleneck on a single machine or replica set.
MongoDB’s big selling points are speed and simplicity
Good Shard Key
- Cardinality
- Write Distribution
- Query Isolation
- Reliability
- Index Locality
You should carefully analyze all options before selecting the shard key, since it can significantly affect your system's performance and it cannot be changed after data is inserted in MongoDB.
Usually you will not shard all collections, only those whose data needs to be distributed over shards to improve read and/or write performance. All unsharded collections are held on a single shard, called the primary shard.
MongoDB supports three types of sharding:
- Range-based sharding
- Hash-based sharding
- Tag-aware sharding
In general, range-based sharding provides better support for range queries that need query isolation while the hash-based sharding supports write operations more efficiently.
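To make the trade-off concrete, here is a rough simulation, not MongoDB's actual placement logic. It assumes four shards, integer keys 0..999, and a stand-in FNV-1a hash (not the hash function MongoDB uses); taking the hash modulo the shard count is also a simplification.

```javascript
const SHARDS = 4;

// Range partitioning: keys 0..999 split into four contiguous ranges of 250.
function rangeShard(key) {
  return Math.floor(key / 250);
}

// Stand-in 32-bit FNV-1a hash; an assumption for illustration only.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Hash partitioning, simplified to "hash modulo number of shards".
function hashShard(key) {
  return fnv1a(String(key)) % SHARDS;
}

// Collect the set of shards a range query lo..hi has to visit.
function shardsTouched(lo, hi, placement) {
  const touched = new Set();
  for (let k = lo; k <= hi; k++) touched.add(placement(k));
  return touched;
}

const byRange = shardsTouched(100, 120, rangeShard);
const byHash = shardsTouched(100, 120, hashShard);
console.log("range-based:", byRange.size, "shard(s)");
console.log("hash-based:", byHash.size, "shard(s)");
```

The range query stays on one shard under range partitioning, while hashing scatters neighbouring keys and the same query has to visit every shard.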
In order to properly select a shard key for your MongoDB sharded cluster, it is important to understand how your application reads and writes data. The main question is:
What is more critical: query isolation, write scaling, or both?
In order to select an optimal shard key for query isolation you must take into consideration the following:
- Analyze what query operations are most performance dependent;
- Determine which fields are used the most in these operations and include them in the shard key;
- Make sure that the selected shard key enables even (balanced) distribution of data across shards;
- A high cardinality field is preferable. Low cardinality fields tend to group documents on a small number of shards, which would require frequent rebalancing of the chunks.
The most common techniques people use to distribute data are:
Ascending key distribution – The shard key field is usually of Date, Timestamp or ObjectId type. This pattern is definitely not good for write scaling, since every new document lands in the chunk that holds the highest key range.
Random distribution – This pattern is achieved by fields that do not have an identifiable pattern in the dataset. It is the preferable pattern for write scaling, since it enables balanced distribution of write operations and data across the shards. However, it does not work well for query isolation if the critical queries must retrieve large amounts of “close” data based on range criteria, in which case the query will be spread across most of the shards in the cluster.
Compound Shard Key – Combine more than one field into a shard key in order to come up with optimal shard key values for high cardinality and balanced distribution of data for an efficient write scaling and query isolation.
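The difference between ascending and random keys can be sketched with a small simulation. It assumes a 32-bit key space cut into four equal ranges (one per shard) and uses a multiplicative scramble as a stand-in for a random-looking key such as a UUID; none of these numbers come from MongoDB itself.

```javascript
const SHARDS = 4;
const KEY_SPACE = 2 ** 32;
const RANGE_WIDTH = KEY_SPACE / SHARDS;

// Range-based placement: each shard owns one contiguous quarter of the key space.
function shardFor(key) {
  return Math.floor(key / RANGE_WIDTH);
}

// Count how many writes each shard receives for a list of keys.
function countWrites(keys) {
  const perShard = new Array(SHARDS).fill(0);
  for (const key of keys) perShard[shardFor(key)] += 1;
  return perShard;
}

const ascendingKeys = [];
const randomLikeKeys = [];
for (let i = 0; i < 1000; i++) {
  // Ascending: consecutive values, like a timestamp or counter.
  ascendingKeys.push(3000000000 + i);
  // Random-like: a multiplicative scramble spreads values over the key space.
  randomLikeKeys.push(Math.imul(i, 2654435761) >>> 0);
}

const ascendingWrites = countWrites(ascendingKeys);
const randomWrites = countWrites(randomLikeKeys);
console.log("ascending:", ascendingWrites);
console.log("random-like:", randomWrites);
```

All 1000 ascending writes hit a single shard, while the random-like keys split roughly evenly across the four.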
High query isolation / high write scaling – A shard key enabling mid-to-high randomness and a relatively even distribution of data. Compound shard keys are usually good candidates.
Shard Key Considerations
With that said, there are five criteria for a good shard key. They are:
- Cardinality
- Write Distribution
- Read Distribution
- Read Targeting
- Read Locality
There are two design patterns that I think work well for shard key selection. The first is using a hashed shard key, based on a field that is usually present in most queries. The other useful design pattern is a compound shard key, composed of a low-cardinality (“chunky”) first part, and a high-cardinality second part, often a monotonically increasing one.
Hashed shard keys can often be a good option: out of the 5 criteria, the only one they don’t provide is Read Locality. If your application doesn’t use range queries, they may be ideal.
Two important things to note about hashed shard keys: the underlying field that they’re based on must provide enough cardinality, and the underlying field must be present in most queries in order to allow for Read Targeting.
Compound shard key, composed of a low-cardinality (“chunky”) first part, and a high-cardinality second part, often a monotonically increasing one. If there are enough distinct values in the first part (at least twice the number of shards) you’ll get good write and read distribution; the high-cardinality second part gets you good cardinality and read locality.
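A small sketch of why this pattern works, assuming two shards and four hypothetical `region` values as the chunky part with an increasing `seq` as the second part (names and placement are illustrative only):

```javascript
// Hypothetical documents: `region` is the low-cardinality ("chunky") part,
// `seq` is a high-cardinality, monotonically increasing part.
const regions = ["ap", "eu", "sa", "us"];
const docs = [];
for (let seq = 1; seq <= 12; seq++) {
  docs.push({ region: regions[seq % regions.length], seq });
}

// Compound keys order by the first field, then by the second.
function compareKeys(a, b) {
  if (a.region !== b.region) return a.region < b.region ? -1 : 1;
  return a.seq - b.seq;
}
const ordered = [...docs].sort(compareKeys);

// Illustrative placement: two adjacent region ranges per shard (2 shards, 4 regions).
function shardOf(doc) {
  return Math.floor(regions.indexOf(doc.region) / 2);
}

// Write distribution: consecutive inserts do not all go to the same shard.
const firstWrites = docs.slice(0, 4).map(shardOf);

// Read locality: one region's documents are contiguous and already in seq order.
const euDocs = ordered.filter((d) => d.region === "eu");

console.log(firstWrites);
console.log(euDocs.map((d) => d.seq));
```

The first four inserts alternate between both shards even though `seq` only ever increases, and a read for one region is served from a single shard in key order.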
For one thing, these five criteria I listed are typically mutually incompatible: it’s very rare to be able to get good write distribution, read distribution, and read locality all with a single shard key.
the only reasonable way to approach MongoDB shard key selection is the way that you approach any other part of MongoDB schema design: you have to carefully consider the requirements arising from all of the different operations your application will perform
Hashed _id: This will distribute reads and writes evenly, and it will ensure that each document has a different shard key so chunks can be fine-grained and small.
It’s not perfect, because queries for multiple documents will have to hit all shards, but it might be good enough.
Multi-tenant compound index
If you want to beat the hashed _id scheme, you need to come up with a way of grouping related documents close together in the index. At Bugsnag we group the documents by project; because of the way our app works, most queries are run in the scope of a project. We can’t just use projectId as a shard key because that leads to jumbo chunks, so we also include the _id to break large projects into multiple chunks. To avoid this problem in the future, we will likely migrate to an index on {projectId: 'hashed', _id: 1}.
In summary
Choosing a shard key is hard, but there are really only two options. If you can’t find a good grouping key for your application, hash the _id. If you can, then go with that grouping key and add the _id to avoid jumbo chunks. Remember that whichever grouping key you use, it needs to also distribute reads and writes evenly to get the most out of each node in your cluster.
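The jumbo-chunk problem comes down to the fact that a chunk can only be split between two different shard key values. A minimal sketch, using 1000 made-up documents from a single large project (the helper `possibleSplitPoints` is hypothetical):

```javascript
// 1000 hypothetical documents that all belong to one large project.
const docs = Array.from({ length: 1000 }, (_, i) => ({ projectId: "p1", _id: i }));

// A chunk can only be split between two different shard key values,
// so the number of possible split points is (distinct key values - 1).
function possibleSplitPoints(documents, keyFields) {
  const distinct = new Set(
    documents.map((d) => JSON.stringify(keyFields.map((f) => d[f])))
  );
  return distinct.size - 1;
}

console.log(possibleSplitPoints(docs, ["projectId"]));        // 0
console.log(possibleSplitPoints(docs, ["projectId", "_id"])); // 999
```

With projectId alone the project is one unsplittable block; adding _id gives the cluster a split point between every pair of documents.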
Ascending shard keys are equivalent to this strategy: ObjectIds, dates, timestamps, auto-incrementing primary keys.
Random Sharding keys : MD5 hashes, UUIDs. If you shard on a random key, you lose data locality benefits.
The efficient operation of your MongoDB database depends on which field in the documents you designate as the shard key. Since you have to select the shard key up front and can’t change it later, you need to give the choice due consideration.
The MongoDB Manual recommends that your shard keys have a high degree of randomness to ensure the cluster’s write operations are distributed evenly, which is referred to as write scaling.
Conversely, when a field has a high degree of randomness, it becomes a challenge to target specific shards. By using a shard key that is tied to a single shard, queries run much more efficiently; this is called query isolation.
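Query isolation can be pictured as a router lookup. The sketch below is not how mongos is implemented; it assumes a hypothetical shard key `userId`, three made-up key ranges, and only handles equality queries.

```javascript
// Hypothetical cluster: shard key is `userId`, with three contiguous key ranges.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 2000, shard: "shardB" },
  { min: 2000, max: Infinity, shard: "shardC" },
];

// Return the shards a router would need to contact for an equality query.
function targetShards(query) {
  if (typeof query.userId === "number") {
    // Shard key present: exactly one range contains the value.
    const chunk = chunks.find((c) => query.userId >= c.min && query.userId < c.max);
    return [chunk.shard];
  }
  // No shard key in the query: every shard must be asked (scatter-gather).
  return [...new Set(chunks.map((c) => c.shard))];
}

console.log(targetShards({ userId: 1500 }));
console.log(targetShards({ email: "a@example.com" }));
```

A query that includes the shard key is answered by one shard; a query on any other field is broadcast to all of them.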
When a collection doesn’t have a field suitable to use as a shard key, a compound shard key can be used, or a field can be added to serve as the key.
Choice of shard key depends on the nature of the collection
The field you choose as your hashed shard key should have a good cardinality, or large number of different values. Hashed keys work well with fields that increase monotonically like ObjectId values or timestamps.
Shard Keys and Cluster Availability
The most important considerations when choosing a shard key are:
- to ensure that MongoDB will be able to distribute data evenly among shards,
- to scale writes across the cluster, and
- to ensure that mongos can isolate most queries to a specific mongod.
The index on the shard key cannot be a multikey index.
3.1.2 Considerations for Selecting Shard Keys
Choosing the correct shard key can have a great impact on the performance, capability, and functioning of your database and cluster. Appropriate shard key choice depends on the schema of your data and the way that your applications query and write data.
A shard key with high degree of randomness prevents any single shard from becoming a bottleneck and will distribute write operations among the cluster.
The challenge when selecting a shard key is that there is not always an obvious choice. Often, an existing field in your collection may not be the optimal key. In those situations, computing a special purpose shard key into an additional field or using a compound shard key may help produce one that is more ideal.
Cardinality, in the context of MongoDB, refers to the ability of the system to partition data into chunks.
While “high cardinality” is necessary for ensuring an even distribution of data, having a high cardinality does not guarantee sufficient query isolation or appropriate write scaling.
Shard Key Selection Strategy
# mongo shell
$ rs.slaveOk()
# mongo
$ db.user.ensureIndex( { _id : "hashed" } )
$ sh.shardCollection("facebook.user", { "_id": "hashed" } )
$ db.user.getShardDistribution()
### Set Sharding chunk size
use config
db.settings.save( { _id:"chunksize", value: 8 } )
### Set Shard Key & Index Key
mongo --host mongo_router_1:27017 <<EOF
use prod;
sh.enableSharding("prod");
db.user.ensureIndex( { slug : 1 } );
sh.shardCollection("prod.user", { "_id": 1 } );
EOF
## check nscanned
db.user.find({slug: '8PwobE4O'}).explain("executionStats")
## Ensure Index
db.user.getIndexes()
db.user.dropIndexes()
db.user.ensureIndex( { slug : 1 }, { unique: true, background: true })
db.user.ensureIndex( { slug : "hashed" }, { background: true } )