70-448 (070-448) Exam prep, Part 1/3 - SSIS

This section covers about 30% of the exam.  The breakdown, roughly, seems to be SSIS (30%) / SSRS (30%), SSAS (40%).

 

Package Configurations

à make a package more dynamic – set options like server name, data source in a configuration file to make them easier to move from server to server.

http://msdn.microsoft.com/en-us/library/ms141682.aspx

à properties and variables are updated at run time

Can be stored in an XML config file, environment variable, registry, SQL table, or a parent variable inside a package.

If you have no file system access, a SQL server configuration would be required.

BIDs-created packages are stores as .dtsx files; these are XML-formatted files.

Using an environment variable as the pointer to an XML file = indirect configuration

Packages can share SQL Server configurations across servers

Parent package: use for variable-passing

Moving packages from devà prod, changing paths: use indirect config

Multiple entries for 1 SSIS config: XML or SQL server

Data sources are project-level; connections are package-level

Control Flow tasks in SQL Server 2008:

http://www.sql-server-performance.com/2009/sql-server-2008-integration-services-tasks/

 

The ValidateExternalMetadata property on various SSIS objects determines if SSIS will automatically check for the existence of referenced objects during design.

http://siddhumehta.blogspot.com/2009/05/validateexternalmetadata-property.html

Data Profiling Task:

http://www.simple-talk.com/sql/ssis/sql-server-2008--ssis-data-profiling-task/

Generate XML reports of data analysis inside an SSIS package

Analysis includes candidate key recommendations, column length/value distributions

SSIS variables

http://msdn.microsoft.com/en-us/library/ms141085.aspx

User-defined variables are defined by the developer.  System variables are built into SSIS.

SSIS System variables

http://msdn.microsoft.com/en-us/library/ms141788(v=sql.105).aspx

VersionGUID: Unique Identifier for package version.

TaskID: Unique Identifier for Task

Can be set programmatically.

http://msdn.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.variable.aspx

Scope: Package-level variables are available to everything inside the package.

Container-level variables are available to the container and everything inside it.

If the package calls another using the execute package task, variables can be passed to the child package.

Inside a script task, variables must be included in ReadOnlyVariables or ReadWriteVariables to be available inside the scripting engine.

Backup and restore of SSIS packages:

Both SSIS packages and SQL Server agent job information are stored in MSDB.  Backing up a copy of this database will protect your packages.  Some of the testing information incorrectly specifies that they are stored in the master database.

Control Flow

Checkpoints (package property)

http://msdn.microsoft.com/en-us/library/ms140226.aspx

Not to be confused with breakpoints, which are used for debugging,

Can be used to restart a package that fails part way through.  You can restart from the checkpoint.  It’s most often used in situations where a large file download or some other intensive task had occurred previously in the package; checkpoints can avoid running that task again.

CheckPointUsage property: IfExists: use the checkpoint; Always: package won’t run without a checkpoint; checkpoints cannot be used with

Package option/parameter TransactionOption: Required

Breakpoints: pause the control flow

Can be on any task, container, or inside control flow script

http://msdn.microsoft.com/en-us/library/ms141664(v=sql.105).aspx

Transactions in SSIS packages

Great article here:

http://www.mssqltips.com/sqlservertip/1585/how-to-use-transactions-in-sql-server-integration-services-ssis/

Transaction support is built in to SSIS.  The TransactionOption property exists at the package level, container level (e.g. For Loop, Foreach Loop, Sequence, etc.), as well as just about any Control Flow task (e.g. Execute SQL task, Data Flow task, etc.).  TransactionOption can be set to one of the following:

  • Required - if a transaction exists join it else start a new one
  • Supported - if a transaction exists join it (this is the default)
  • NotSupported - do not join an existing transaction

Note also that the SSIS package elements also have an IsolationLevel property with a default of Serializable.

MSDTC must be running.

Auditing tasks should be set to NotSupported so they won’t be rolled back as part of another transaction.

“Supported” makes no sense at the package level outside of nested packages, since it’s the top level and there couldn’t be another transaction running.  Required should be used at the package level to initialize a transaction.

Property Expressions

Update control flow properties in the midst of package execution, including connection properties

http://msdn.microsoft.com/en-us/library/ms141214.aspx

For example, If you have a directoryFile variable that operated on a list of files, you would –

Set the directoryFile variable property EvaluateAsExpression to TRUE

Loop through the files, setting the directoryFile variable to the next filename using an expression

Precedence evaluation options

Constraint (completion status of upstream task)

Expression

Expression and Constraint

Expression or Constraint                                                                                                            

SSIS precedence constraints:

Solid Green: Success

Red: Fail

Blue: Completion

Logical AND: Green (Solid)

Logical OR: Green (Dotted)

Data flow paths and errors

Error Output options: Fail transformation (entire data flow fails); Ignore Failure: Set error values to NULL but continue down green path; Redirect Row: sends rows down a different (Red) path

Different error outputs can be set based on the column the error occurs in

Control flow event handlers

SSIS Designer à Event handlers

Task + Event Handler pairing

OnError: Any error

OnExecStatusChanged: Any change in status for any objects (Success, Failed, in Process)

OnInformation: informational messages

OnPostExecute: task/container completion (success)

OnPostValidate: task/container validation (success)

OnPreExecute

OnPreValidate

OnProgress: any progress of task/container

OnQueryCancel: Manual cancelling of an Execute SQL task

OnTaskFailed

OnVariableValueChanged: If RaiseChangeEvent property = TRUE, raised when variable changes

OnWarning

 

Container/Task-level property DisableEventHandlers controls event handler activations

Fuzzy grouping is a de-duplication tool that runs against a dataset.

http://msdn.microsoft.com/en-us/library/ms141764.aspx

Package protection levels

http://msdn.microsoft.com/en-us/library/ms141747.aspx

explained well here.

You can set with the designer or DTUTIL.

Command-Line / Executables DTUTIL, DTEXEC, DTEXECUI

DTUTIL:

 package management from the command line

Copy, Move, Delete, Exists (check), encrypt, decrypt, sign, fcreate (folders)

 

DTUTIL return codes: 0 = success, 1 = failure, 4 = package not found, 5 = package cannot load, 6 = bad syntax

DTEXEC/ DTEXECUI

Dtexec – execute packages from command line

DtexecUI – visual interface to help build dtexec commands

http://msdn.microsoft.com/en-us/library/ms162810.aspx

 

If a package configuration is not specified, you can apply it with DTEXEC/DTEXECUI at execution time.

DTEXEC for logging on an error: /DumpOnErr

Security

 

Integration services roles/ Package Security on SQL Server

http://msdn.microsoft.com/en-us/library/ms141053.aspx

DB_SSISOPERATOR: view, execute, export packages

DB_SSISLTDUSER: execute, delete, change and export only packages that user imported

DB_SSISADMIN: all permissions

Windows administrator: view/stop packages

 

All roles exist in MSDB.  If a user needs more granular access, custom roles would be required.

Packages can be digitally signed to ensure authenticity.

http://msdn.microsoft.com/en-us/library/ms137947.aspx

The BlockSignatureStates registry entry controls how these signatures are handled.

0: No Administrative Restriction

1: Block Invalid Signatures (But not unsigned packages)

2: Block Invalid and untrusted signatures (not unsigned packages, but does block self-generated signatures)

3: Block Invalid and untrusted signatures and packages

Package Protection levels

DontSaveSensitive: no connection pwds saved; must used configuration for execution if SQL logins are used; window authentication doesn’t require saved pwds

EncryptAllWithPassword

EncryptAllWithUserKey: Can only be opened by same user on same computer as was developed

EncryptSensitiveWithPassword

EncryptSensitiveWithUserKey

Rely on Server Storage – only applies to SQL-deployed packages

SSIS installation/configuration/high availability/monitoring

SSIS service – windows service that executes MsDtsSrv.exe

Connect via SSMS

Running packages: shows which are currently executing

Stored packages: shows packages in the package store in file system and deployed to MSDB database inside SQL Server (sysssispackages)

SSIS is not cluster-aware but can be configured as a cluster resource

Restartability rules apply as normal on a cluster (eg. Checkpoints)

 

Data Mining/Modeling:

Changing chart settings will not affect trends in the lift chart, but will affect trends in the profit chart [ From a .VCE – no context ]

Deployments

Deployment manifests create a package deployment wizard, which simplifies deployment but lacks any flexibility.

From the SSIS Project properties page,  change “CreateDeploymentUtility” to TRUEthen build the project to create the deployment manifest.

 

SSIS LoggingSSIS Menu à Logging

Containers: Package, Container, or task-level logging

Log Provider Type: SQL Server tables (sysssislog), text files, EventLog, Profiler, XML

Log Events: OnError, etc.

 

Data Flow

OLEDB Destination vs SQL Server Destination:

SQL Server Destination uses shared memory and must connect to a data source on the same system as the package is running, making it poor for portability but potentially faster

For bulk processing, OLE DB destination adapters can be configured for “Fast load”.

 

 

Flat File Destination: Text/CSV

 

Error logging for specific tasks:

Decide what file type the destination will be (.DOC, .TXT, etc) – create a destination component of the appropriate type.

Configure the error output to flow to that destination.

 

Control Flow: MaximumErrorCount specifies max errors allowed before package failure

 

 

Data viewers

http://msdn.microsoft.com/en-us/library/ms140318.aspx

allow you to see the data inside a data flow as it moves from one component to the next, and then allow it to continue

 

 

Data Flow transformations

Row-level:

 

Audit: adds auditing columns

http://www.mssqlsage.com/content/audit-transformation

Cache transform: caches data for subsequent transformations (Using full cache mode)

Character Map: string operations, only works on string data types

http://msdn.microsoft.com/en-us/library/ms141694.aspx

 

Copy Column

Data conversion: data type conversions

http://msdn.microsoft.com/en-us/library/ms141706.aspx

Derived column – builds new columns based on expressions/other logic

http://msdn.microsoft.com/en-us/library/ms141069.aspx

Export column – moves data into files; for large objects

http://msdn.microsoft.com/en-us/library/ms139818.aspx

Import columns – moves LOBs into relational data model

Row Count

 

Multi-Input/Output

Conditional Split – moves data in one output or another (exclusively)

Lookup – matches the current data set in the flow to external lookups (SQL tables)

Merge: Combines two sorted data sets into one large dataset.  This isn’t a join and should have been called UNION or something to differentiate it from MERGE JOIN.

Union All: like merge - combines datasets, but can be unsorted and combine 2+ datasets – row count is total of all input data sets

The merge join transformation

http://msdn.microsoft.com/en-us/library/ms141775.aspx

is just a T-SQL join (left, right, inner)

It requires sorted data:

http://msdn.microsoft.com/en-us/library/ms137653.aspx

The easiest way is to set IsSorted = TRUE on any previous components.

Sort transformations can also be used.

 

To insert rows into a table but first check if they exist, a combination of MERGE JOIN and LOOKUP could be used.

 

Multi-Row (aggregations and stats, usually intensive)

Aggregate – SUM, COUNT, MIN, MAX

Percent sampling – takes a sampling of the input data set

Pivot

Row sampling  - fixed sampling

Sort – sorting, allows removal of dupes

Unpivot

 

Data Prep (advanced operations)

The OLEDB Command transformation runs a sql statement for each row in a data flow.

http://msdn.microsoft.com/en-us/library/ms141138.aspx

 

Slowly changing dimension – dimension changes

Data mining query – data mining model predictions

Fuzzy grouping – column à row matching for data cleanup operations

Fuzzy Lookup – column à Dataset matching for data cleanup

Script Component

Term extraction – looks for keywords

Term Lookup – looks for keywords based on an existing dataset

 

 

Lookup transformation caching options in SQL 2008

http://blogs.msdn.com/b/mattm/archive/2008/10/18/lookup-cache-modes.aspx

Full Cache (Default): entire reference set is pulled at pre-execute of data flow.  Most memory consumption, highest initial startup time.  Lookup operations fast.

Partial Cache: New rows are compared to existing cache; if they don’t exist, added to cache

No Cache: Hit DB for every row

 

 

For Loop: Uses an integer as the count

For Each: Iterates over a number of objects, like files

 

Extracting data from multiple files:

Use a ForEach Loop Container.

Add a variable at the package level.

Create a flat file connection manager; use an expression for the connection string.

 

 

 

 

Assemblies / Custom Components

 

It’s possible to develop custom data flow components

http://msdn.microsoft.com/en-us/library/ms136078.aspx

for SSIS packages. 

Questions about this functionality seem well beyond the scope of a general BI exam, but some are included.

Custom assemblies must be moved into the BIDS folder (for example, C:\Program Files\Microsoft SQL Server\100\DTS\ to be visible to BIDS.

In more detail,

http://bennyaustin.wordpress.com/2009/06/30/steps-to-build-and-deploy-custom-ssis-components/

1)      Create strong name key

2)      Sign + Build

3)      Deploy into BIDS folder

4)      Deploy into GAC (probably C:\Windows\Assembly), close/reopen bids

5)      In Bids/Toolbox items, select custom assembly