SSIS parallel processing with LINQ in Biml

First of all, thanks to Ben Weissman and Scott Currie for feedback on this post.

Why should you start using LINQ with your BimlScript? Because LINQ is simply a way of expressing queries inside C#.
Added in C# 3.0, LINQ is general-purpose query language for many kinds of data sources—including plain object collections, XML documents, databases, etc.
Let’s see how LINQ enables us to create a BimlScript template pattern for parallel executions inside a dataflow.

This example assumes you already have collections defined inside RootNode.Tables.

If we were to use the following code inside a Biml file, we would generate one package for each (n)th occurrence of a group.
Each package would contain one dataflow with five OleDbSource tasks.

Revision: 2016-06-22
Current code example provided by Scott Currie.

 

Now let’s start by breaking down the code.

First part consists of defining our boundaries.

“groupSize” defines the limit of how many items we want contained in a group.
Change this value if you’d like to have more or less tasks inside your dataflow.

 

We create a for loop to iterate over the available items inside RootNode.Tables. The loop has 3 actions.

  1. First we declare the initial value for our variable “i”. Only done once upon initializing the loop.
  2. Loop until “i” is greater than our wanted group size. This is the number of items in RootNode.Tables divided by our wanted no of dataflow tasks inside a package. Continues until the statement is false.
  3. Increment “i” by 1 for each iteration. Continues until the statement is false.

 

There’s two steps to actually achieving our parallel tasks inside the dataflow.

The first part is our package and dataflow element . Remember, we are now inside our previously defined for loop.
Our for loop gives us our package groups. So we’d like to name our package and dataflow for the current group.
This is done by the following calculation. i*”groupSize” and (i+1)*”groupSize”. Roughly translated it gives us our start range and end range of the iteration.

 

The second part, now here is where the magic is, it’s by using LINQ over our RootNode.Tables collection and calling .OrderBy, .Skip, .Take that we manage to get our OleDbSources inside the dataflow.

  1. OrderBy (o => o.Name)
    This one contains a lambda expression, roughly translated it does a OrderBy for each item in Ascending order of the property “Name”. The “Name” property isn’t the only way to order which packages could be inside your group. You could have annotations on your tableNodes saying which group it belongs to. You could do a query against the source system to check the row count and sort by the row count.
  2. Skip(i*groupSize)
    Here we tell the foreach loop that we’d like to skip the no of items by the integer value in “groupSize” multiplied by the current iteration.
  3. Take(groupSize)
  4. Here we tell the foreach loop that we’d like to take the no of items defined by the integer value in “groupSize”.

 

And with this simple code example we’ve created a template pattern for parallel execution in your SSIS packages.

Check out the following links for further information on the LINQ methods used.

http://www.dotnetperls.com/orderby
http://www.dotnetperls.com/skip
http://www.dotnetperls.com/take

Revision: 2016-06-21
Old code example:

Leave a Reply

Your email address will not be published. Required fields are marked *