const results = await db.collection('orders').aggregate([
{ $group: { _id: '$customerId', total: { $sum: '$amount' } } }
]).toArray();
See the bug? There isn't one, technically. It runs fine. But you just grouped every order in the collection, including cancelled ones, test data, and that batch of records from 2019 nobody cleaned up. No $match stage first. This is the number one mistake I see with aggregation pipelines, and it's a good entry point into why the order of stages matters so much.
How the Pipeline Actually Works
Picture a factory assembly line. Raw documents go in one end. Each station along the line does one thing to them: filter some out, group them together, reshape them, sort them. What comes out the other end is your result.
The mental model matters more than you think. I spent my first month with aggregation pipelines treating them like fancy SQL queries, writing the whole thing at once and then wondering why the output was wrong. Once I started thinking of each stage as a standalone transformation -- "what goes in, what comes out" -- everything clicked.
const results = await db.collection('orders').aggregate([
{ $match: { status: 'completed' } }, // Stage 1: Filter
{ $group: { _id: '$customerId', total: { $sum: '$amount' } } }, // Stage 2: Group
{ $sort: { total: -1 } }, // Stage 3: Sort
{ $limit: 10 } // Stage 4: Limit
]).toArray();
Each stage gets the output of the previous one. That's why putting $match first matters. If you group first and match later, you've already done the expensive work on documents you didn't need.
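That flow can be mimicked in plain JavaScript, which is a decent way to internalize the model. This is a toy sketch of the data movement, not how the server actually executes anything:

```javascript
// Toy in-memory version of $match then $group, to show the
// stage-by-stage flow (illustrative data):
const docs = [
  { customerId: 'a', amount: 10, status: 'completed' },
  { customerId: 'a', amount: 5,  status: 'cancelled' },
  { customerId: 'b', amount: 7,  status: 'completed' }
];

// Stage 1: $match -- filter documents out
const matched = docs.filter(d => d.status === 'completed');

// Stage 2: $group -- one output document per customerId
const grouped = Object.values(
  matched.reduce((acc, d) => {
    acc[d.customerId] ??= { _id: d.customerId, total: 0 };
    acc[d.customerId].total += d.amount;
    return acc;
  }, {})
);
// grouped: [{ _id: 'a', total: 10 }, { _id: 'b', total: 7 }]
```

Each variable is the output of one stage and the input of the next, which is exactly how documents move through the real pipeline.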
The Stages You'll Use 80% of the Time
$match filters documents. Put it early. It can use indexes, which nothing else in the pipeline can (except the first $sort).
{ $match: {
createdAt: { $gte: new Date('2026-01-01') },
status: { $in: ['active', 'pending'] },
amount: { $gt: 100 }
}}
One thing that tripped me up early on: $match uses the exact same query syntax as find(). If you know how to write a find() filter, you already know how to write a $match stage. Same operators, same structure. The difference is where it sits in the pipeline and how the optimizer handles it.
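To make that equivalence concrete, the exact same filter object can be handed to both. A small sketch (collection name illustrative):

```javascript
// One filter object, usable in both places because the syntax is identical:
const filter = {
  status: { $in: ['active', 'pending'] },
  amount: { $gt: 100 }
};

// As a find() query:
// const viaFind = await db.collection('orders').find(filter).toArray();

// As the first stage of an aggregation pipeline:
const pipeline = [
  { $match: filter },
  { $sort: { amount: -1 } }
];
// const viaAgg = await db.collection('orders').aggregate(pipeline).toArray();
```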
$group is where the actual aggregation happens. You pick a field to group by and then define what to compute for each group.
{ $group: {
_id: '$category',
totalSales: { $sum: '$amount' },
averagePrice: { $avg: '$price' },
count: { $sum: 1 },
products: { $push: '$name' }
}}
That $sum: 1 trick counts documents per group. $push collects values into an array. Both come up constantly.
A subtlety worth knowing: if you want to group over the entire collection (no grouping key), set _id: null. This is useful for computing collection-wide totals:
{ $group: {
_id: null,
totalRevenue: { $sum: '$amount' },
orderCount: { $sum: 1 },
uniqueCustomers: { $addToSet: '$customerId' }
}}
That $addToSet operator collects unique values only, so you get a deduplicated array of customer IDs. Handy for "how many unique users" type questions without needing a separate distinct() call.
$lookup gives you joins. Yes, in MongoDB. No, it's not relational. But sometimes you need data from two collections and this beats making two separate queries.
{ $lookup: {
from: 'users',
localField: 'authorId',
foreignField: '_id',
as: 'author'
}},
{ $unwind: '$author' }
The $unwind after $lookup is almost always necessary. $lookup returns an array (even for one-to-one relationships), so $unwind flattens it back to a single object.
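If the array-to-documents behavior isn't clicking, here's a rough in-memory sketch of what $unwind does. The helper is mine for illustration, not a driver API:

```javascript
// One output document per array element; all other fields are copied:
function unwind(docs, field) {
  return docs.flatMap(doc =>
    (doc[field] || []).map(item => ({ ...doc, [field]: item }))
  );
}

const joined = [{ _id: 1, title: 'Post', author: [{ name: 'Ada' }] }];
const flat = unwind(joined, 'author');
// flat: [{ _id: 1, title: 'Post', author: { name: 'Ada' } }]
```

This also shows why $unwind on a large array is dangerous: a document with a 1,000-element array becomes 1,000 documents.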
There's also a more powerful form of $lookup with a sub-pipeline. I ignored it for months because the syntax looked intimidating, but it's actually more useful than the basic version:
{ $lookup: {
from: 'comments',
let: { postId: '$_id' },
pipeline: [
{ $match: { $expr: { $eq: ['$postId', '$$postId'] } } },
{ $match: { approved: true } },
{ $sort: { createdAt: -1 } },
{ $limit: 5 }
],
as: 'recentComments'
}}
This fetches only the 5 most recent approved comments for each post, rather than pulling in every comment and filtering later. The performance difference on a collection with millions of comments is enormous. The let / $$ syntax is how you reference fields from the outer document inside the sub-pipeline. Weird syntax, very useful feature.
$project controls which fields show up in the output. You can also create computed fields here.
{ $project: {
title: 1,
author: '$author.name',
readTime: { $ceil: { $divide: ['$wordCount', 200] } },
isRecent: {
$gte: ['$createdAt', new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)]
}
}}
There's also $addFields, which I honestly use more than $project these days. The difference: $project requires you to explicitly include every field you want to keep. $addFields keeps everything and just adds (or overwrites) the fields you specify. For mid-pipeline transformations where you want to tack on a computed value without losing all your other fields, $addFields is much less annoying.
{ $addFields: {
fullName: { $concat: ['$firstName', ' ', '$lastName'] },
ageInDays: {
$divide: [
{ $subtract: [new Date(), '$birthDate'] },
1000 * 60 * 60 * 24
]
}
}}
Real Example: Sales Dashboard Query
This is the kind of pipeline that actually ships to production. It powers a dashboard showing daily revenue broken down by category:
const analytics = await db.collection('orders').aggregate([
{ $match: {
createdAt: {
$gte: new Date('2026-01-01'),
$lt: new Date('2026-02-01')
}
}},
{ $group: {
_id: {
day: { $dayOfMonth: '$createdAt' },
category: '$category'
},
revenue: { $sum: '$amount' },
orders: { $sum: 1 },
avgOrderValue: { $avg: '$amount' }
}},
{ $sort: { '_id.day': 1 } },
{ $group: {
_id: '$_id.day',
categories: {
$push: {
name: '$_id.category',
revenue: '$revenue',
orders: '$orders'
}
},
dailyTotal: { $sum: '$revenue' }
}},
{ $sort: { _id: 1 } }
]).toArray();
Two $group stages. The first groups by day+category. The second regroups by day, nesting the categories into an array. Before I learned this pattern, I was doing two queries and stitching results together in Node. Not great.
$facet and $bucket: Underused and Underrated
Most tutorials stop after $group and $lookup, but there are two stages I want to call out because they solve real problems that are otherwise painful to deal with.
$facet runs multiple sub-pipelines on the same set of documents in a single pass. Think of it as "give me several different views of this data at once." I use it for search results pages where I need the results, the total count, and some filter breakdowns all from one query:
const searchResults = await db.collection('products').aggregate([
{ $match: { $text: { $search: 'wireless headphones' } } },
{ $facet: {
results: [
{ $sort: { score: { $meta: 'textScore' } } },
{ $skip: 20 },
{ $limit: 10 },
{ $project: { name: 1, price: 1, brand: 1 } }
],
totalCount: [
{ $count: 'count' }
],
priceRanges: [
{ $bucket: {
groupBy: '$price',
boundaries: [0, 50, 100, 200, 500, Infinity],
default: 'Other',
output: { count: { $sum: 1 } }
}}
],
topBrands: [
{ $group: { _id: '$brand', count: { $sum: 1 } } },
{ $sort: { count: -1 } },
{ $limit: 5 }
]
}}
]).toArray();
Without $facet, that's four separate queries. With it, it's one round trip to the database. On a page that needs to load quickly, that matters.
$bucket groups documents into ranges, which is perfect for histograms, price ranges, age distributions -- anything where you're binning continuous values. I showed it inside the $facet above, but it works as a standalone stage too. There's also $bucketAuto, which figures out the boundaries for you given a target number of buckets. Useful when you don't know the data distribution ahead of time.
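Here's what a standalone $bucketAuto stage might look like. Collection and field names are illustrative:

```javascript
// Ask for ~4 evenly-populated price buckets instead of
// hand-picking boundaries like $bucket requires:
const histogramPipeline = [
  { $bucketAuto: {
    groupBy: '$price',
    buckets: 4,
    output: {
      count: { $sum: 1 },
      avgPrice: { $avg: '$price' }
    }
  }}
];
// const histogram = await db.collection('products')
//   .aggregate(histogramPipeline).toArray();
```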
Don't Wreck Your Database
Aggregation pipelines can get expensive fast. Some rules of thumb:
- $match goes first. Always. I already said this but it bears repeating. Filter early, filter often.
- Indexes only help at the beginning. The first $match and $sort stages can use indexes. After that, you're working on in-memory documents.
- Drop fields you don't need. A $project early in the pipeline means less data gets carried through every subsequent stage.
- Be careful with $unwind. If a document has an array with 1,000 elements, $unwind turns that one document into 1,000. Your pipeline just got 1,000x bigger.
- Use allowDiskUse for big pipelines. By default, each stage has a 100MB memory cap. Pass { allowDiskUse: true } as an option when you know you're working with large datasets.
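In the Node driver, allowDiskUse goes in the options object passed as the second argument to aggregate(). A sketch (collection and field names illustrative):

```javascript
// With allowDiskUse set, stages that exceed the 100MB per-stage cap
// spill to temporary disk files instead of erroring out:
const pipeline = [
  { $sort: { userId: 1, timestamp: 1 } },  // unindexed sort = memory-hungry
  { $group: { _id: '$userId', events: { $push: '$type' } } }
];
const options = { allowDiskUse: true };
// const report = await db.collection('events')
//   .aggregate(pipeline, options).toArray();
```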
One more performance tip that took me embarrassingly long to learn: use .explain("executionStats") on your pipeline. It works just like explain() on a find() query. It tells you whether your initial $match is actually hitting an index or doing a full collection scan, how many documents each stage processes, and where the bottleneck is. I had a pipeline once that was taking 8 seconds. The explain output showed the $match was scanning 2 million documents because the index I thought existed had been dropped during a migration. Added the index back, query dropped to 40ms.
const explanation = await db.collection('orders').aggregate([
{ $match: { status: 'completed', createdAt: { $gte: new Date('2026-01-01') } } },
{ $group: { _id: '$customerId', total: { $sum: '$amount' } } }
]).explain('executionStats');
console.log(JSON.stringify(explanation.stages[0], null, 2));
When Your Pipeline Returns Nonsense
Debugging a 6-stage pipeline by staring at it doesn't work. Here's what does: strip it down and rebuild it one stage at a time. Run it with just the first stage, check the output. Add the second stage, check again. You'll find the broken stage pretty quickly.
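You can script that add-one-stage-at-a-time approach. This sketch builds every prefix of a pipeline, each capped with a $limit so you only sample a few documents per step (the helper name is mine, not a driver API):

```javascript
// Returns [stage1 + $limit, stage1..2 + $limit, stage1..3 + $limit, ...]
function pipelinePrefixes(stages, sampleSize = 3) {
  return stages.map((_, i) =>
    [...stages.slice(0, i + 1), { $limit: sampleSize }]
  );
}

// Usage sketch -- run each prefix and eyeball the output:
// for (const prefix of pipelinePrefixes(myStages)) {
//   console.log(await db.collection('orders').aggregate(prefix).toArray());
// }
```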
MongoDB Compass has a visual aggregation pipeline builder that lets you see the output of each stage as you add it. Use it. It's not cheating. I still use it for anything past 4 stages.
Another thing: $out and $merge let you write pipeline results into a new collection. Useful for materializing views or pre-computing reports that would be too slow to run on every request.
I want to highlight a distinction between those two. $out replaces the target collection entirely -- it drops whatever was there and writes the new results. $merge is smarter: it can insert new documents, update existing ones, or fail on duplicates, depending on how you configure it. For incremental updates (say, refreshing only this month's data in a reporting collection), $merge is what you want:
{ $merge: {
  into: 'monthly_reports',
  on: '_id', // _id here is the { year, month } object produced by the preceding $group
  whenMatched: 'replace',
  whenNotMatched: 'insert'
}}
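For comparison, here's what the $out form might look like: the whole target collection is dropped and rewritten on every run (collection names illustrative):

```javascript
// Full-replace materialized view: fine for nightly rebuilds,
// wrong for incremental updates -- use $merge for those.
const snapshotPipeline = [
  { $match: { status: 'completed' } },
  { $group: { _id: '$customerId', total: { $sum: '$amount' } } },
  { $out: 'customer_totals_snapshot' }
];
// await db.collection('orders').aggregate(snapshotPipeline).toArray();
```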
Using Aggregation with Mongoose
If you're using Mongoose (and most Node.js projects with MongoDB are), there are a couple of things to know. Mongoose has its own .aggregate() method on models, and it mostly works the same as the native driver version. But there's a catch: Mongoose does not cast types in aggregation pipelines the way it does for find() queries. If your schema defines userId as an ObjectId, a find({ userId: someString }) will auto-cast that string to an ObjectId. An aggregation $match will not. You have to cast it yourself:
const mongoose = require('mongoose');
const results = await Order.aggregate([
{ $match: {
customerId: new mongoose.Types.ObjectId(req.params.customerId)
}},
{ $group: { _id: '$status', total: { $sum: '$amount' } } }
]);
I've seen this bite people who switch from find() to aggregation and suddenly get zero results back. The query looks right, there are no errors, but the string "507f1f77bcf86cd799439011" doesn't match the ObjectId stored in the database. Silent failure. Always cast your ObjectIds explicitly in aggregation pipelines.
Wrapping Up
The aggregation pipeline replaces a lot of code you'd otherwise write in your application layer. Grouping, joining, filtering, reshaping -- all in one database call. If you're still looping over find() results and doing math in JavaScript, you're doing too much work. Start with a two-stage pipeline ($match then $group) and build from there. Once the mental model clicks -- documents in, transformed documents out, one stage at a time -- you'll wonder how you ever got by with find() alone.
That's it. Go build something.