The Aggregation Pipeline: Data Processing at Scale
The Aggregation Pipeline is MongoDB's framework for data transformation. Think of it as a factory line: documents pass through a sequence of "stages", each stage consuming the previous stage's output, to produce a final result.
Core Stages
- $match: Filters documents (place it as early as possible to reduce data volume).
- $group: Groups documents by a key and performs calculations ($sum, $avg).
- $project: Reshapes documents (adding or removing fields).
- $sort: Sorts results.
- $unwind: Deconstructs an array field into multiple documents.
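db.orders.aggregate([
  // Keep only orders with status "A"
  { $match: { status: "A" } },
  // One output document per customer, summing the order amounts
  { $group: { _id: "$cust_id", total: { $sum: "$amount" } } },
  // Largest totals first
  { $sort: { total: -1 } }
])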
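To make the "factory line" idea concrete, the pipeline above can be mimicked in plain JavaScript over an in-memory array. This is only a sketch of what each stage does to the documents, not how the server executes it; the sample orders are hypothetical.

```javascript
// Hypothetical sample documents shaped like the orders in the pipeline above.
const sampleOrders = [
  { cust_id: "c1", amount: 50, status: "A" },
  { cust_id: "c2", amount: 200, status: "A" },
  { cust_id: "c1", amount: 25, status: "A" },
  { cust_id: "c3", amount: 75, status: "D" },
];

// $match equivalent: keep only documents with status "A".
const matched = sampleOrders.filter((doc) => doc.status === "A");

// $group equivalent: one entry per cust_id, summing amount.
const totalsByCustomer = new Map();
for (const doc of matched) {
  const runningTotal = totalsByCustomer.get(doc.cust_id) || 0;
  totalsByCustomer.set(doc.cust_id, runningTotal + doc.amount);
}
const grouped = [...totalsByCustomer].map(([_id, total]) => ({ _id, total }));

// $sort equivalent: order by total, descending.
const sorted = grouped.slice().sort((a, b) => b.total - a.total);

console.log(sorted);
```

Each step only sees what the previous step produced, which is why filtering early shrinks the work for everything downstream.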
Performance: The Indexing Rule
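Stages like $match and $sort can use indexes only while they operate on documents read directly from the collection, which in practice means at the beginning of the pipeline. Stages such as $group, $project, and $unwind emit new, transformed documents, so a $match or $sort placed after them "loses" the index. The optimizer can sometimes move a $match ahead of a $project on its own, but it is safer to write the stages in that order yourself.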
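One way to check the rule is to create a supporting index and ask for the query plan. The sketch below assumes a mongosh-style `db` handle and an `orders` collection with `status` and `amount` fields; the helper name `explainWithIndex` and the index shape are illustrative choices, not something prescribed by the text.

```javascript
// Leading $match and $sort: both can be served by a single compound index.
const indexedPipeline = [
  { $match: { status: "A" } },
  { $sort: { amount: -1 } },
  { $limit: 10 },
];

// Assumed index: the equality field first, then the sort field.
const supportingIndex = { status: 1, amount: -1 };

// `db` is a mongosh-style database handle supplied by the caller.
function explainWithIndex(db) {
  db.orders.createIndex(supportingIndex);
  // Returns the plan and execution statistics instead of the result documents.
  return db.orders.explain("executionStats").aggregate(indexedPipeline);
}
```

In mongosh, calling `explainWithIndex(db)` and looking for an IXSCAN stage rather than a COLLSCAN in the winning plan is a reasonable first check that the index is being used.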
Advanced Transformations: $lookup and $graphLookup
Use $lookup to perform left outer joins between collections. Use $graphLookup for recursive searches (e.g., finding all descendants in a tree structure).
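As a rough illustration, the two stages might look like this. The `customers` and `categories` collections, and the `parent` field that links a category to its parent, are assumptions made for the example.

```javascript
// Left outer join: attach the matching customer documents to each order.
const ordersWithCustomer = [
  {
    $lookup: {
      from: "customers",     // hypothetical collection to join against
      localField: "cust_id", // field on the input (orders) documents
      foreignField: "_id",   // field on the customers documents
      as: "customer",        // output array; empty when nothing matches
    },
  },
];

// Recursive search: collect every descendant of each category.
const categoryDescendants = [
  {
    $graphLookup: {
      from: "categories",       // hypothetical tree collection
      startWith: "$_id",        // begin from the current document's _id
      connectFromField: "_id",  // value carried into the next hop
      connectToField: "parent", // matched against that value
      as: "descendants",
      maxDepth: 5,              // optional cap on recursion depth
    },
  },
];

// In mongosh:
//   db.orders.aggregate(ordersWithCustomer)
//   db.categories.aggregate(categoryDescendants)
```

Because $lookup is a left outer join, orders without a matching customer are still returned, just with an empty `customer` array.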
Pipeline Optimization
- Projection: Only project fields you absolutely need.
- Filtering: Match as early and as strictly as possible.
- Indexes: Ensure the fields in your $match stage are indexed.
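Put together, the three tips could look like the sketch below. The `order_date` field, the cutoff date, and the index shape are hypothetical. The optimizer may already trim unused fields on its own, so treat the explicit $project as illustrative.

```javascript
// Hypothetical cutoff used to make the filter stricter.
const since = new Date("2024-01-01T00:00:00Z");

const leanPipeline = [
  // Filtering: match first, on as many conditions as the query allows.
  { $match: { status: "A", order_date: { $gte: since } } },
  // Projection: keep only the fields the later stages read.
  { $project: { _id: 0, cust_id: 1, amount: 1 } },
  { $group: { _id: "$cust_id", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } },
];

// Indexes: an assumed compound index covering both $match fields.
const matchIndex = { status: 1, order_date: 1 };

// In mongosh:
//   db.orders.createIndex(matchIndex)
//   db.orders.aggregate(leanPipeline)
```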