Welcome to Informatica Data Quality Tutorials. The objective of these tutorials is to provide an in-depth understanding of Informatica Data Quality.
In addition to free Informatica Data Quality tutorials, we will cover common interview questions, issues, and how-tos for Informatica Data Quality.
Informatica Data Quality is a suite of applications and components that you can integrate with Informatica PowerCenter to deliver enterprise-strength data quality capability in a wide range of scenarios.
The core components are:
Data Quality Workbench: Use to design, test, and deploy data quality processes, called plans. Workbench allows you to test and execute plans as needed, enabling rapid data investigation and testing of data quality methodologies. You can also deploy plans, as well as associated data and reference files, to other Data Quality machines. Plans are stored in a Data Quality repository. Workbench provides access to fifty database-based, file-based, and algorithmic data quality components that you can use to build plans.
Data Quality Server: Use to enable plan and file sharing and to run plans in a networked environment. Data Quality Server supports networking through service domains and communicates with Workbench over TCP/IP. Data Quality Server allows multiple users to collaborate on data projects, speeding up the development and implementation of data quality solutions.
You can install the following components alongside Workbench and Server:
Integration Plug-In: Informatica plug-in enabling PowerCenter to run data quality plans for standardization, cleansing, and matching operations. The Integration plug-in is included in the Informatica Data Quality install fileset.
Free Reference Data: Text-based dictionaries of common business and customer terms.
Subscription-Based Reference Data: Databases, sourced from third parties, of deliverable postal addresses in a country or region.
Pre-Built Data Quality Plans: Data quality plans built by Informatica to perform out-of-the-box cleansing, standardization, and matching operations. Informatica provides free demonstration plans. You can purchase pre-built plans for commercial use.
Association Plug-In: Informatica plug-in enabling PowerCenter to identify matching data records from multiple Integration transformations and associate these records together for data consolidation purposes.
Consolidation Plug-In: Informatica plug-in enabling PowerCenter to compare the linked records sent as output from an Association transformation and to create a single master record from these records.
Informatica Data Quality Workbench Matching Algorithms
Informatica offers several implementations of matching algorithms that can be used to identify possible duplicate records. Each implementation is based on determining the similarity between two strings, such as names or addresses. Some implementations are better suited to date strings, and others are ideal for numeric strings. In the coming weeks, we’ll go through an overview of each of these implementations and how to use them to your advantage!
Hamming Distance Algorithm
The Hamming distance algorithm is particularly useful when the position of the characters in the string is important. Examples of such strings are telephone numbers, dates, and postal codes. The algorithm measures the minimum number of substitutions required to change one string into the other, which is the same as the number of positions at which the two strings differ.
The Hamming distance is named after Richard Hamming, an American mathematician whose accomplishments include many advances in information science. Perhaps as a result of Hamming’s time at Bell Laboratories, the Hamming distance algorithm is most often associated with the analysis of telephone numbers. However, the advantages of the algorithm apply to various types of strings and are not limited to numeric strings.
One condition needs to be adhered to when using this algorithm: the strings being analyzed need to be of the same length. Because the Hamming distance algorithm is based on the “cost” of substituting characters position by position, strings of unequal length incur high penalties, since the unmatched positions are compared against null character values.
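The position-by-position comparison described above can be sketched in a few lines of Python. This is only an illustration of the general Hamming distance, not Informatica's implementation: the function names hamming_distance and hamming_score, the normalization to a score between 0 and 1, and the padding of the shorter string with null characters are all assumptions made for this example.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def hamming_score(a: str, b: str) -> float:
    """Hypothetical similarity score in [0, 1]; 1.0 means identical.

    Assumption: the shorter string is padded with null characters, so
    every unmatched position counts as a mismatch (a penalty).
    """
    width = max(len(a), len(b))
    if width == 0:
        return 1.0  # two empty strings are treated as identical
    padded_a = a.ljust(width, "\0")
    padded_b = b.ljust(width, "\0")
    return 1.0 - hamming_distance(padded_a, padded_b) / width


# Two swapped digits in a telephone number give a distance of 2.
print(hamming_distance("5551234", "5551243"))

# Dates that differ in one position still score highly.
print(hamming_score("2023-01-15", "2023-01-16"))

# Unequal lengths are penalized for every padded position.
print(hamming_score("90210", "902101234"))
```

Note how the last call scores much lower than the date comparison even though the first five characters match exactly; this is the penalty for unequal lengths mentioned above.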
Six Measures of Data Quality
The quality of the data records in your datasets can be described according to six key criteria, and an effective quality management system will allow you to assess the quality of your data in areas such as these:
Completeness: Concerned with missing data, that is, with fields in your dataset that have been left empty or whose default values have been left unchanged. (For example, a date field whose default setting of 01/01/1900 has not been edited.)
Conformity: Concerned with data values of a similar type that have been entered in a confusing or unusable manner, e.g. numerical data that includes or omits a comma separator ($1,000 versus $1000).
Consistency: Concerned with the occurrence of disparate types of data record in a dataset created for a single data type, e.g. the combination of personal and business information in a dataset intended for business data only.
Integrity: Concerned with the recognition of meaningful associations between records in a dataset. For example, a dataset may contain records for two or more individuals in a household but provide no means for the organization to recognize or use this information.
Duplication: Concerned with data records that duplicate one another’s information, that is, with identifying redundant records in the data set.
Accuracy: Concerned with the general accuracy of the data in a dataset. It is typically verified by comparing the dataset with a reliable reference source, for example, a dictionary file containing product reference data.
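To make a few of these measures concrete, here is a minimal Python sketch that checks a tiny in-memory dataset for completeness, conformity, and duplication. It does not use any Informatica API: the sample records, the field names, the assumed canonical amount format ($1,000 with comma separators), and the helper functions are all hypothetical and chosen only to mirror the examples given in the list above.

```python
import re
from collections import Counter

# Unedited default date, as in the Completeness example above.
DEFAULT_DATE = "01/01/1900"
# Assumed canonical amount format for this sketch, e.g. $1,000 or $250.
AMOUNT_PATTERN = re.compile(r"^\$\d{1,3}(,\d{3})*$")

# Hypothetical sample records.
records = [
    {"name": "Ann Lee", "joined": "03/14/2015", "amount": "$1,000"},
    {"name": "Bob Roy", "joined": "01/01/1900", "amount": "$1000"},
    {"name": "", "joined": "07/02/2018", "amount": "$250"},
    {"name": "Ann Lee", "joined": "03/14/2015", "amount": "$1,000"},
]


def is_incomplete(record):
    """Completeness: an empty field or an unchanged default date."""
    has_empty_field = any(value == "" for value in record.values())
    return has_empty_field or record["joined"] == DEFAULT_DATE


def is_nonconforming(record):
    """Conformity: the amount does not follow the assumed format."""
    return AMOUNT_PATTERN.match(record["amount"]) is None


def duplicate_count(rows):
    """Duplication: number of records that exactly repeat an earlier one."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


incomplete = sum(is_incomplete(r) for r in records)
nonconforming = sum(is_nonconforming(r) for r in records)
duplicates = duplicate_count(records)

print(f"incomplete: {incomplete}, nonconforming: {nonconforming}, "
      f"duplicates: {duplicates}")
```

Exact-match duplicate detection is the simplest possible choice here; in practice, duplicate identification usually relies on fuzzy matching algorithms such as the Hamming distance discussed earlier.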
Advantages of Informatica Data Quality
-One tool acts as a single platform for data quality; no other tools or extra licenses are required, which cuts license and maintenance costs
-Identify, resolve, and prevent data quality problems, so that data becomes more trusted
-Effective data profiling and more effective ways to share the profiling rules and results with the business; all you need to do is generate a scorecard for the profile and share the URL with the business. This enhances trust with the business.
-Enhance IT productivity with powerful business-IT collaboration tools and a common data quality project environment
-Develop routines such as address standardization, exception handling, and data masking, and integrate them with PowerCenter to use them as components or mapplets; I will discuss these routines in more detail in future posts
-Universal Connectivity to All Data Sources
-Centralized Data Quality Rules for All Applications
-All rules, reference data, and processes can be reused for all types of data integration projects, including data migration, data consolidation, and MDM.