Prologue

Once upon a time... A drama begins when our NoSql solution gets requirement to act as a RDBS. We can try to adjust our schemas or remodel, but it may lead to even more complicated or counterproductive results. Let's play!

Dramatis Personae

{
	_id : ObjectId(“....”),
	(... other fields),
	providerId : ObjectId(“....”)
}
  • Providers
{
	_id : ObjectId(“....”),
	(... other fields),
	type : // 2 possible types regular and casual
}
  • Relation
  • $lookup operation - left outer join in MongoDB. See more.
  • $group operation - group by aggregation. See more.

  • Question to answer - how many orders do we have for each provider type?
  • Dataset

Number of orders : 1 000 000

Number of providers : 3


Act I

Scene I

Note:

We are not discussing here changing our model. For more info about possible variants click here and here.
db.orders.aggregate([
{
	$lookup : {
		from: "providers",
		localField: "providerId",
		foreignField: "_id",
		as: "provider"
	}
},
{ $unwind : "$provider" },
{
	$group : {
		_id : "$provider.type",
		countByProviderName: {$sum : 1}
	},
},
{ $out: "sumOfProvidersQuery" }
])
  1. To be able to group by type we need to join provider data which includes our grouping field - type. We use lookup operation.
  2. Unwind transforms provider list returned from lookup into a single field.
  3. Performing grouping.

Query execution time:

~36 s

Act I

Scene II

Our case is quite specific - we have a huge disproportion between orders and providers. In addition number of providers is really small. In the scenario where number of providers is also not very small, we should always consider creating an index on the foreign key. Read more about lookup optimization here.

In our case, if we drop the index or add another field (e.g. providerId in providers collection), we can reduce the execution time to approximately 19 s.

Query execution time:

~19 s

Act II

Scene I

Let’s think for a second about our current model and what may be our bottleneck. We may suspect that lookup operation is performed at least as many times as we have orders (if we don’t have proper indexes it could be much worse in fact). Unfortunately, to group by provider type we need information contained in joined collection. There is solution to this problem - we can group first by providerId and then perform a lookup. After that we can group by type. There is additional group stage but we reduce the number of lookups to the number of providers. If this number is significantly smaller than number of orders we should expect huge improvement in the performance.

db.orders.aggregate([
{
	$group : {
		_id : "$providerId",
		countByProvider : {$sum : 1}
	},
},
{
	$lookup : {
		from: "providers",
		localField: "_id",
		foreignField: "_id",
		as: "provider"
	}
},
{ $unwind : "$provider" },
{
	$group : {
		_id : "$provider.type",
		countByProviderName : {$sum : "$countByProvider"}
	},
},
{ $out: "sumOfProvidersQuery2" }
])

Query execution time:

~1 s

Epilogue

Every model is worth to look up before creating a query to read or update it. The dataset and the model should be considered together to optimize. Check the code to illustrate above optimization on our github.