Cosmos DB: Lessons from the field
Part 2: To split or not to split, that is the question
The quote “To be or not to be, that is the question..” from Shakespeare’s Hamlet has been the quintessential soliloquy. Funnily enough, many database architects may have pondered this question while buried deep in data modeling sessions. OLTP relational database architects have mostly steered towards the 3rd normal form. As an application grows organically, the database schema adapts to the new requirements and the joins grow accordingly based on evolving query access patterns.
As complexity grows and the relational data model evolves, the ease of making schema changes becomes more challenging leading to an ‘impedance mismatch’. The golden promise of NoSQL datastores has been a schemaless implementation. This frees developers from orchestrating schema changes across all OLTP relational databases and downstream datastores. DBAs are no longer needed to keep hitting the ‘enter’ key repeatedly until they are able to get an exclusive lock in order to rollout an ‘ALTER TABLE…’ statement. Production releases are no longer held hostage to acquiring that exclusive lock.
The ultimate developer nirvana is to simply persist a key value schema into a NoSQL database and be done with it. This is exactly how a lot of application teams start on Cosmos too. Over time the value object acquires a bloat of it’s own. Developers can simply keep adding additional attributes and paths into the JSON document. It’s so simple! The default Cosmos indexing policy indexes every path automatically and life is all set! Well…that state of nirvana bursts soon when managers start seeing an upward trend in the monthly Cosmos bills. Sooner or later there’s a rude awakening when someone in the management chain questions the Cosmos bill.
Let’s consider a very simple Customer entity. The Cosmos collection is partitioned by customerId.
{
"id":"123212121",
"customerId": "C101",
"firstName": "Joseph",
"middleName": "",
"lastName": "Morales",
"suffix": "",
"status": "ACTIVE",
"tier": "GOLD",
"emailAddresses": [
{
"default": "true",
"email": "abc.def@yahoo.com"
},
{
"default": "false",
"email": "abc@bcx.com"
}
],
"addresses": [
{
"default": "true",
"street": "408 Flora Vista Ave",
"city": "Santa Clara",
"state": "CA",
"zipcode": "92009"
},
{
"default": "false",
"street": "408 Flora Vista Ave",
"city": "Santa Clara",
"state": "CA",
"zipcode": "92009"
},
],
"dob": "01/01/2001",
"defaultPaymentMethod": "CREDIT_CARD",
"creditcards": [
{
"default": "true",
"number": "12312112898812121",
"expiryDate": "01/2022",
"cvv": "XYZ",
"type": "VISA"
},
{
"default": "false",
"number": "7346346343934634",
"expiryDate": "03/2024",
"cvv": "ABC",
"type": "DISCOVER"
},
{
"default": "false",
"number": "723543734533493",
"expiryDate": "11/2025",
"cvv": "ABC",
"type": "MASTERCARD"
}
],
"createdDate": "06/21/2021",
"lastUpdated": "06/30/2021"
}
The above Customer entity is a single JSON doc that represents the object’s attributes in entirity. Lets assume that the create operation in Cosmos SQL API consumes 7 RUs (Request Units) with a single index on the customerId partition key. With the default indexing policy, it may consume a few more.
Now when it’s time to update any single attribute or path in the above document, you are going to have to update the entire document. Updates are 2X the cost of a create or a delete operation. So it’ll cost you 10 RUs at the minimum. As the payload size increases, this replace/update RU cost will keep spiralling upwards.
You can benchmark actual CRUD operation RU charges using the opensource CosmosSQLCalipers tool available here.
So how do we bring about cost efficiency in our Customer entity? At this point it becomes critical to understand the access patterns of the cutsomer object.
- Is the Customer entity accessed in entirety or can it be split up?
- Are there scenarios where you just need to update a payment method or an address without retrieving and mutating the entire document? If yes, then read on…
A more optimized way to structure this entity is given below.
{
"id":"123212121",
"customerId": "C101",
"docType": "CUSTOMERHEADER",
"firstName": "Joseph",
"middleName": "",
"lastName": "Morales",
"suffix": "",
"status": "ACTIVE",
"tier": "GOLD",
"dob": "01/01/2001",
"defaultPaymentMethod": "CREDIT_CARD",
"createdDate": "06/21/2021",
"lastUpdated": "06/30/2021"
}{
"id":"63443747",
"customerId": "C101",
"docType": "ADDRESS",
"default": "true",
"street": "408 Flora Vista Ave",
"city": "Santa Clara",
"state": "CA",
"zipcode": "92009"
}
{
"id":"756353223",
"customerId": "C101",
"docType": "ADDRESS",
"default": "false",
"street": "408 Flora Vista Ave",
"city": "Santa Clara",
"state": "CA",
"zipcode": "92009"
}{
"id":"635241",
"customerId": "C101",
"docType": "EMAILADDRESS",
"default": "true",
"email": "abc.def@yahoo.com"
}
{
"id":"85936300",
"customerId": "C101",
"docType": "EMAILADDRESS",
"default": "false",
"email": "abc@bcx.com"
}{
"id":"11001212",
"customerId": "C101",
"docType": "CREDITCARD",
"default": "true",
"number": "12312112898812121",
"expiryDate": "01/2022",
"cvv": "XYZ",
"type": "VISA"
}
{
"id":"7462292",
"customerId": "C101",
"docType": "CREDITCARD",
"default": "false",
"number": "7346346343934634",
"expiryDate": "03/2024",
"cvv": "ABC",
"type": "DISCOVER"
}
{
"id":"2023232",
"customerId": "C101",
"docType": "CREDITCARD",
"default": "false",
"number": "723543734533493",
"expiryDate": "11/2025",
"cvv": "ABC",
"type": "MASTERCARD"
}
The above data model has basically split the Customer entity into multiple documents.
- A CUSTOMERHEADER doc that comprises the customer metadata
- Multiple ADDRESS, CREDITCARD and EMAILADDRESS child documents
- A docType attribute identifies the document category
- All documents comprising the object graph share the same customerId partition key. As a result, the entire object graph will be persisted in the same logical partition.
The entire object graph can be persisted as a single transaction using either a stored procedure or transactional batch feature of the Cosmos SDK. This implementation offers far more advantages compared to the single document implementation.
- Modular mutations and lower RU costs. If you are retrieving and mutating an address document, you will pay lower RU costs for retrieval and the subsequent update.
- Higher scalability. The service is able to scale far better because you end up exchanging smaller documents with Cosmos compared to a single big payload.
- Fewer 429 throttling exceptions and reduced overall monthly costs.
Another alternative that can be considered is to compress the payload on the fly. This is a quick hack to reduce the payload size and hence, lower the RU consumption and storage costs. However, you may have to ensure you have enough cores on your app servers to handle the additional compression/decompression overhead of the payload documents.