ETL Testing techniques
The cost of finding software defects escalates significantly the later they are discovered in the development lifecycle. In data warehousing, this is made worse by the added expense of using incorrect data to make important business decisions. Given the importance of early detection, here are some general goals of testing an ETL application:
· Data completeness. Ensures that all expected data is loaded.
· Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.
· Data quality. Ensures that the ETL application correctly rejects invalid data, substitutes default values, corrects or ignores it, and reports what was done.
· Scalability and performance. Makes sure that data loads and queries are executed within anticipated time frames and that the technical design is scalable.
· Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.
· User-acceptance testing. Makes sure that the solution satisfies your current expectations and anticipates your future expectations.
· Regression testing. Makes sure that current functionality stays intact whenever new code is released.
One of the most basic tests of data completeness is to verify that all data loads correctly into the data warehouse. This includes validating that all records, fields, and the full contents of each field are loaded. Strategies to consider include:
· Comparing record counts between source data, data loaded to the warehouse, and rejected records.
· Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
· Utilizing a data profiling tool that shows the range and value distributions of fields in a data set. This can be employed during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.
· Populating the entire contents of every field to verify that no truncation takes place during any step in the procedure. For example, if the source data field is a string(30) ensure it is tested with 30 characters.
· Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts may be too small.
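The count and key-value comparisons above can be automated. Below is a minimal sketch, assuming source, target, and rejected rows are available as lists of dicts; the field names are illustrative.

```python
# Sketch of automated completeness checks: reconcile record counts and
# compare distinct key values between source and target.

def completeness_report(source_rows, target_rows, rejected_rows, key_field):
    """Compare counts and distinct key values between source and target."""
    report = {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "rejected_count": len(rejected_rows),
        # Every source row should be either loaded or rejected.
        "counts_reconcile": len(source_rows) == len(target_rows) + len(rejected_rows),
    }
    source_keys = {r[key_field] for r in source_rows}
    target_keys = {r[key_field] for r in target_rows}
    report["missing_keys"] = source_keys - target_keys      # in source, not loaded
    report["unexpected_keys"] = target_keys - source_keys   # loaded, not in source
    return report

source = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 3}]
target = [{"cust_id": 1}, {"cust_id": 2}]
rejected = [{"cust_id": 3}]
print(completeness_report(source, target, rejected, "cust_id"))
```

In practice the counts and key sets would come from queries against the source system and warehouse rather than in-memory lists.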
Validating that data is transformed correctly according to business rules is the most complex part of testing an ETL application with significant transformation logic. One technique is to select several sample records and “stare and compare” to verify data transformations manually. This can be useful, but it requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validation is a better long-term strategy. Here are some simple automated data movement techniques:
· Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer. This is an excellent requirements elicitation step during design and could also be used as part of testing.
· Create test data that includes all scenarios. Have an ETL developer automate the process of populating data sets from the scenario spreadsheet, which keeps the tests flexible as scenarios change.
· Utilize data profiling results to compare the range and distribution of values in each field between source and target data.
· Validate accurate processing of ETL-generated fields; for example, surrogate keys.
· Validate that the data types within the warehouse are the same as was specified in the data model or design.
· Create data scenarios between tables that test referential integrity.
· Validate parent-to-child relationships in the data. Create data scenarios that test the management of orphaned child records.
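The scenario-spreadsheet technique lends itself to automation: run each input through the transformation and compare the result against the expected value. This sketch uses a purely illustrative business rule (mapping a raw status string to a code); a real test would import the actual transformation logic.

```python
# Sketch of automated scenario validation, assuming the scenario
# spreadsheet has been exported to rows of (input, expected).

def transform_status(raw):
    """Hypothetical business rule: map raw status text to a code."""
    return {"active": "A", "inactive": "I"}.get(raw.strip().lower(), "U")

scenarios = [
    {"input": "Active",    "expected": "A"},
    {"input": " inactive", "expected": "I"},
    {"input": "pending",   "expected": "U"},  # unknown values default to "U"
]

failures = [
    s for s in scenarios if transform_status(s["input"]) != s["expected"]
]
print(f"{len(scenarios) - len(failures)}/{len(scenarios)} scenarios passed")
```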
SQA Solution defines data quality as “how the ETL system deals with data rejection, replacement, correction, and notification without changing any of the data.” To achieve success in testing data quality, we incorporate many data scenarios. Typically, data quality rules are defined during design, for example:
· Reject the record if a certain decimal field has nonnumeric data.
· Substitute null if a certain decimal field has nonnumeric data.
· Validate and correct the state field if necessary based on the ZIP code.
· Compare the product code to values in a lookup table. If there is no match, load anyway; however, report this to our clients.
Depending on the data quality rules of the application being tested, specific scenarios to test could involve duplicate records, null key values, or invalid data types. Review the detailed test scenarios with business clients and technical designers to ensure that everyone is on the same page. Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what’s loaded to the database. For this reason, it is important to ensure that what is done with invalid data is reported to the clients. These data quality reports provide valuable information that in some cases uncovers systematic issues with the source data itself. At times, it may also be beneficial to populate the “before” data in the database for clients to view.
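Rules like those above can be sketched as a function that both corrects a record and reports what it did, so invalid data is never silently discarded. The field names, rules, and lookup values here are illustrative.

```python
# Sketch of applying design-time data quality rules to one record,
# returning the corrected record plus a report of actions taken.

def apply_quality_rules(record, lookup_product_codes):
    actions = []
    out = dict(record)

    # Rule: substitute null if a decimal field has nonnumeric data.
    try:
        out["amount"] = float(out["amount"])
    except (TypeError, ValueError):
        out["amount"] = None
        actions.append(("amount", "substituted null for nonnumeric value"))

    # Rule: if the product code has no match in the lookup table,
    # load anyway but report it.
    if out["product_code"] not in lookup_product_codes:
        actions.append(("product_code", "no match in lookup table"))

    return out, actions

rec, actions = apply_quality_rules(
    {"amount": "12x.50", "product_code": "ZZZ"},
    lookup_product_codes={"ABC", "DEF"},
)
print(rec, actions)
```

Collecting the `actions` list per record is what feeds the data quality report described above.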
Scalability and Performance
As the amount of data in a data warehouse increases, ETL load times may also increase. Consequently, the efficiency of queries should be expected to decline. This could be mitigated by using a sound technical architecture and excellent ETL design. The goal of performance testing is to uncover any potential problems in the ETL design. The following strategies will help discover performance issues:
· Load the database with maximum anticipated production volumes to make certain this amount of data can be loaded by the ETL process in the agreed-upon timeframe.
· Compare these ETL loading times to loading times conducted with a reduced amount of data to anticipate possible issues with scalability.
· Compare the ETL processing times component by component to indicate any regions of weakness.
· Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
· Perform simple and multiple join queries to validate query performance on large database volumes.
· Work together with business clients to formulate test queries and overall performance requirements for every query.
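One way to compare loading times across volumes, as suggested above, is a simple throughput probe: run the same load routine at a small and a large volume and compare rows per second. The `load_rows` function here is only a stand-in for the real ETL load step.

```python
import time

# Sketch of a scalability probe: if throughput drops sharply at the
# larger volume, the design may not scale linearly.

def load_rows(rows):
    staged = []
    for row in rows:
        staged.append(dict(row))   # placeholder for real load work
    return staged

def rows_per_second(n):
    rows = [{"id": i} for i in range(n)]
    start = time.perf_counter()
    load_rows(rows)
    elapsed = time.perf_counter() - start
    return n / elapsed if elapsed > 0 else float("inf")

small, large = rows_per_second(1_000), rows_per_second(100_000)
print(f"small: {small:,.0f} rows/s, large: {large:,.0f} rows/s")
```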
Typically, system testing only includes testing within the ETL application. The input and output of the ETL code constitute the endpoints of the system being tested. Integration testing demonstrates how the application fits into the overall flow of all upstream and downstream applications.
When designing integration test scenarios, we take into account how the overall process could possibly break. Subsequently, we focus on touch points between applications instead of within a single application. We take into account how process breakdowns at each and every step would be managed and how data would be restored or deleted if required.
Most difficulties discovered in the course of integration testing result from incorrect assumptions about the design of another application. Therefore, it is important to integration test with production-like data. Real production data is ideal, but depending on the contents of the data, there could be privacy or security concerns that require certain fields to be randomized before using it in a test environment.
As always, don’t forget the importance of good communication between the testing and design teams of all systems involved. To bridge this communication gap, it’s a good idea to bring team members from all systems together to help create test scenarios and talk about what might go wrong in production. Perform the complete process from start to finish in the exact same order and with the same dependencies as in production. Ideally, integration testing is a combined effort, not the sole responsibility of the team testing the ETL application.
Extract-Transform-Load (ETL) Testing
Extracting data for the data warehouse includes:
- Making ETL Architecture Choices
- Data Mapping
- Extracting data to staging area
- Applying data cleansing transformations
- Applying data consistency transformations
- Loading data
Data Mapping for Data Warehousing and Business Intelligence
A Data Map is a specification that identifies data sources and targets and the mapping between them. The Data Map specification is created and reviewed with input from business Subject Matter Experts (SMEs) who understand the data.
There are two levels of mapping: entity level and attribute level. Each target entity (table) has a high-level mapping description, supported by a detailed attribute-level mapping specification.
| Entity-Level Data Map | |
| --- | --- |
| Target Table Name | dw_customer |
| Target Table Description | High-level information about a customer, such as name, customer type, and customer status. |
| Source Table Names | dwprod1.dwstage.crm_cust |
| Join Rules | crm_cust.custid = ord_cust.cust.cust_nbr |
| Filter Criteria | crm_cust.cust_type not = 7 |
Then for each attribute the attribute level data map specifies:
- Source: table name, column name, datatype
- Target: table name, column name, datatype
- Transformation Rule
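An attribute-level data map can be captured as structured data so it can drive both the ETL code and automated tests. This is only a sketch; the table, column, and rule names are illustrative.

```python
# One way to represent an attribute-level data map:
# source, target, and transformation rule per attribute.

attribute_map = [
    {
        "source": {"table": "crm_cust", "column": "custid", "type": "int"},
        "target": {"table": "dw_customer", "column": "customer_nbr", "type": "int"},
        "rule": "direct move",
    },
    {
        "source": {"table": "crm_cust", "column": "cust_type", "type": "char(1)"},
        "target": {"table": "dw_customer", "column": "customer_type_cd", "type": "char(1)"},
        "rule": "uppercase; default 'U' when null",
    },
]

for m in attribute_map:
    print(f"{m['source']['table']}.{m['source']['column']} -> "
          f"{m['target']['table']}.{m['target']['column']}: {m['rule']}")
```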
Transformations may include:
- Breakout Array Values / Buckets
Extracting Data to Staging Area
Data is first extracted from the source system and placed in a staging area. This staging area is typically formatted like the source system. Keeping data in the same format as the source makes the first extract simple and avoids bogging the source system down.
You most likely will want to process only changed data, to avoid the overhead of reprocessing the entire set of data. This could be done by extracting data based on date/time information on the source system, mining change logs or by examining the data to determine what changed.
- Tip 1: Make sure the source system date/time information is consistently available. Use data profiling to validate.
- Tip 2: Store a copy of the prior version of data in the staging area so that it can be compared to the current version to determine what changed.
- Tip 3: Calculate check sums for both current and prior versions, then compare check sums rather than multiple columns. This speeds up processing.
- Tip 4: Add a source system prefix to table names in the staging area. This helps to keep data logically segregated.
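Tip 3 can be sketched as follows: compute one checksum per row over a fixed column order, so comparing current to prior versions is a single integer comparison instead of column-by-column logic. Column names are illustrative.

```python
import zlib

# Sketch of checksum-based change detection between the current and
# prior versions of a staged row. Column order must be fixed, or the
# checksum will not be stable across runs.

COLUMNS = ("cust_id", "name", "status")

def row_checksum(row):
    payload = "|".join(str(row[c]) for c in COLUMNS)
    return zlib.crc32(payload.encode("utf-8"))

prior   = {"cust_id": 42, "name": "Tom Jones", "status": "active"}
current = {"cust_id": 42, "name": "Thomas Jones", "status": "active"}

changed = row_checksum(prior) != row_checksum(current)
print("row changed:", changed)
```

CRC32 is fine for change detection; a collision only means one changed row is missed, and a stronger hash (e.g. from `hashlib`) can be swapped in if that risk matters.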
Applying Data Transformations
Data is now ready for transformation, which includes cleansing, rationalization, and enrichment. The cleansing process, sometimes called “scrubbing,” removes errors; rationalization removes duplicates and standardizes data; enrichment adds data.
Missing, Incomplete and Wrongly Formatted Data
Common problems that may require correction are missing data, incomplete data, and wrongly formatted data. In the case of missing data, a complete column such as zip code or first name is empty. A tool could correct the zip code based on a lookup of address lines, city, and state. Incomplete data is partially missing, as when an address contains the name of a street without the building number. Tools are available that can correct some of these problems. Finally, data may be in the wrong format: we may want telephone numbers to contain hyphens, and a tool could format them consistently.
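The telephone-number example above might look like this in code: strip everything but digits, then emit one hyphenated format, routing anything that is not ten digits to the reject/report path. This is a sketch for US-style numbers only.

```python
import re

# Sketch of a wrong-format fix: normalize US telephone numbers to a
# consistent hyphenated form; incomplete numbers are flagged with None.

def format_phone(raw):
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) != 10:
        return None          # incomplete data: leave for the reject/report path
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"

print(format_phone("(212) 555 0147"))  # 212-555-0147
print(format_phone("555-0147"))        # None: incomplete data
```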
Applying Data Consistency Transformations
Consistent data is important for “apples to apples” comparisons. For example, all weight measures could be converted to grams or all currency values to dollars. Transformation could be used to make code values consistent such as:
- Gender (“M”, “F”) or (“y”, “n”)
- Boolean (“Y”, “N”) or (1, 0)
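A consistency transformation for code values like these can be a simple mapping applied before load. The map contents and field names below are illustrative.

```python
# Sketch of standardizing inconsistent code values from different
# source systems onto one set of target codes.

GENDER_MAP = {"M": "M", "F": "F", "m": "M", "f": "F"}
BOOLEAN_MAP = {"Y": True, "N": False, "y": True, "n": False,
               1: True, 0: False, "1": True, "0": False}

def standardize(record):
    out = dict(record)
    out["gender"] = GENDER_MAP.get(record.get("gender"), "U")  # "U" = unknown
    out["active"] = BOOLEAN_MAP.get(record.get("active"))
    return out

print(standardize({"gender": "f", "active": "1"}))
```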
More Data Cleansing Issues
| Issue | Example |
| --- | --- |
| Correcting Duplicate Data | Same party with different names (T. Jones, Tom Jones, Thomas Jones) |
| Dummy Data | Dummy data like ‘111111111’ for SSN |
| Mismatched Data | Postal code does not match city / state |
| Inaccurate Data | Incorrect inventory balances |
| Overloaded Attributes | Attributes mean different things in different contexts |
| Meaning Embedded in Identifiers and Descriptions | Such as including price in SKU |
Loading the Data Warehouse
The data warehouse is a mix of atomic and dimensional data. The atomic portion is stored in a normalized, relational format. Data stored in this format can be repackaged in a number of ways for ease of access when moved to the data mart.
Positioning the staged data for direct load to the data warehouse by a bulk utility:
- Is very flexible
- Reduces contention and load time for the data warehouse
Loading the Data Mart
Loading the data mart through efficient and effective methods is the subject of this article. When loading the data mart, dimensions are loaded first and facts are loaded second. Dimensions are loaded first so that the primary keys of the dimensions are known and can be added to the facts.
Make sure that the following prerequisites are in place:
- Data is stored in the data warehouse and ready to load in the data mart
- Data maps have been created for movement from data warehouse to data mart
- Grain is determined for each dimension and fact
Loading Data Mart Dimensions
There are specific prerequisites that must be in place for dimensions:
- Dimensions have surrogate primary keys
- Dimensions have natural keys
- Dimensions have needed descriptive, non-key attributes
- Maintenance strategy is determined for each dimension:
- Slowly Changing Dimension (SCD) Type 1: Overwrite
- SCD Type 2: Insert new row – partitions history
- SCD Type 3: Columns in changed dimension contain prior data
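The SCD Type 2 strategy above can be sketched as: when a tracked attribute changes, end-date the current row and insert a new row with a new surrogate key, preserving history. Column names and the tracked attribute are illustrative.

```python
from datetime import date

# Sketch of an SCD Type 2 maintenance step for a customer dimension
# held as a list of dicts (a stand-in for the dimension table).

def scd2_apply(dimension_rows, natural_key, incoming, next_key, today=None):
    today = today or date.today()
    current = next(
        (r for r in dimension_rows
         if r["cust_nbr"] == natural_key and r["end_date"] is None),
        None,
    )
    if current and current["status"] == incoming["status"]:
        return dimension_rows            # no change: nothing to do
    if current:
        current["end_date"] = today      # close out the old version
    dimension_rows.append({
        "cust_key": next_key,            # new surrogate key
        "cust_nbr": natural_key,         # natural key carried on every version
        "status": incoming["status"],
        "start_date": today,
        "end_date": None,                # open-ended = current version
    })
    return dimension_rows

rows = [{"cust_key": 1, "cust_nbr": 100, "status": "bronze",
         "start_date": date(2020, 1, 1), "end_date": None}]
scd2_apply(rows, 100, {"status": "gold"}, next_key=2)
print(len(rows), "versions of customer 100")
```

Facts loaded after this change pick up surrogate key 2, while historical facts keep pointing at key 1, which is how Type 2 "partitions history."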
Some dimensions are loaded one time at the beginning of the data mart project such as:
- Calendar Date
- Calendar Month
- US State
- US Zip Code
| Date Dimension | |
| --- | --- |
| Description | Dates of the year |
| Grain | A single day |
| Primary Key | Date_Key (generated integer) |
| Descriptive Attributes | Multiple date formats are stored, plus week, month, quarter, year, and holidays. Both numeric dates and spelled-out dates are included. |
| Maintenance Strategy | The date dimension is loaded once, at the beginning of the data mart project. It may require updates to correct problems or to change attributes such as company_holding_ind. |
Loading Data Mart Facts
Data mart fact tables consist of three types of columns:
- Primary key
- Dimensional (foreign) keys
- Measures (the facts themselves)
In the data warehouse, there will be natural keys that can be joined with dimensions to obtain dimensional keys. For example:
| Description | Data Warehouse | Data Mart |
| --- | --- | --- |
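The natural-key-to-surrogate-key join can be sketched as a lookup built from the already-loaded dimensions; each warehouse row's natural keys are swapped for dimensional keys as the fact row is built. Names and key values are illustrative.

```python
# Sketch of resolving dimensional keys during the fact load.

customer_dim = {100: 1, 101: 2}            # natural key -> surrogate key
date_dim = {"2024-03-01": 20240301}        # ISO date -> date key

def build_fact(wh_row):
    return {
        "cust_key": customer_dim[wh_row["cust_nbr"]],
        "date_key": date_dim[wh_row["order_date"]],
        "amount": wh_row["amount"],        # the measure
    }

fact = build_fact({"cust_nbr": 101, "order_date": "2024-03-01", "amount": 25.0})
print(fact)
```

This is why dimensions are loaded first: the lookup cannot be built until the dimensions' surrogate keys exist.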
Consider whether these performance improvement methods are used:
- Turn off database logging to avoid the overhead of log insertions
- Load using a bulk load utility which does not log
- Primary keys should be single integers
- Drop relational integrity (RI) / foreign keys – restore after load is complete
- Drop indexes and re-build after load
- Partition data leaving data loaded earlier unchanged
- Load changed data only – use “delta” processing
- Avoid SQL Update with logging overhead – possibly drop rows and reload using bulk loader
- Do a small number of updates with SQL Update, then use bulk load for inserts
- Use Cyclic Redundancy Checksum (CRC) to detect changes in data rather than brute force method of comparing each column
- Divide SQL Updates into groups to avoid creating a large rollback log
- Use an ETL tool that supports parallelism
- Use an ETL tool that supports caching
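The batched-update tip in the list above can be sketched as a small driver that commits one bounded batch at a time; `apply_update` stands in for the real SQL Update plus commit.

```python
# Sketch of dividing updates into groups so each transaction covers a
# bounded batch instead of building one huge rollback log.

def chunked(rows, size):
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def apply_updates_in_batches(rows, size, apply_update):
    batches = 0
    for batch in chunked(rows, size):
        apply_update(batch)   # one transaction / commit per batch
        batches += 1
    return batches

applied = []
n = apply_updates_in_batches(list(range(10)), 4, applied.extend)
print(n, "batches,", len(applied), "rows updated")
```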