Data integration has been a major IT challenge for a long time. Before the entry of cloud computing, integrating data from disparate sources was relatively easy. The pre-cloud era involved only desktop databases, but now the IT world is not so simple.
One can never imagine the amount of information involved in every IT systems of a huge organization. The volumes of data that a company integrates is significantly higher than anything it has collectively worked altogether before. Beyond the enormous data volume, the vast range of data types also contributes to the complexity. Gone are the days of dealing with structured data, now unstructured data and metadata are being the prime focus in enterprise data integration.
ETL: The Driving Factor behind Data Integration
Every IT organization would be well aware of the Extract, Transform and Load (ETL) procedure from the long gone pre-cloud years. The following is a refresh of what ETL is and how it works behind the scene of data integration.
ETL is the heart of modern data integration. It follows a series of steps to take data from one or more sources to one or more target destinations:
- Extract data from a service or an application
- Then, transform the data to a format understandable by the destination services and applications
- Finally, load the data into the destination service(s) and application(s)
ETL can be performed in different ways. A common example is saving data from an application as a CSV file, then editing the column names through a spreadsheet processor, and finally loading the data to another application. Such a basic ETL operation is followed by many companies in a standard manner. When there is no transformation required in the data integration, the entire process can be automated.
The Transformation Part of Data Integration
API-driven applications are the most preferred choice in IT enterprises, as it offers an easier form of integration. In ETL, data is regularly extracted from the source app with the help of an API. Then, the data is also loaded into the destination app through an API.
In case of not performing direct API-to-API integration, the process requires an intermediate data repository, which can be a file, database, data warehouse, or middleware solution. This is also a common scenario when multiple data sources are merged as a part of the transformation. In this case, the data from multiple sources are combined into a single data warehouse format. This makes it easy to transform the data before loading it into the destination service or application.
Data transformation process involves the below tasks, and not all of them happen in every data integration:
- Complying with business rules: Data transformation can involve meeting the operational requirements of a business. Some of the examples include converting imperial unit data to metric unit data, appending a time stamp, or including a production priority based on the criteria of a secondary data set.
- Cleaning: Data from the source application needs to compatible with the destination application, so it is necessary to alter the data accordingly. Common examples include mapping NULL values to zero, modifying the date and time formats, or mapping values such as “Male” to “M” and “Female” to “F.”
- Filtering: The transformation process can involve just filtering particular columns to facilitate information exchange between services or applications. It is also worthy to note that filtering is always a part of big-scale data transformations.
- Splitting: The incoming data is never neatly organized. Many times the information stored in a column needs to be divided into multiple columns. Some of the examples include dividing a comma-separated array stored in a column into multiple columns, or splitting a date/time column into separate date and time columns.
- Joining: Data points may need to be inter-connected before loading them into a destination service or application. This process is an opposite of splitting, where multiple data columns of a single source are combined into a single column. It always doesn’t involve consolidating data into a single column; it can also include adding data that didn’t exist in the export. A common example is using the GPS data from the source to identify the location and appending that to the destination data.
- Transposing: This is one of the complex types of transformation. Transposing involves changing the relationship between rows and columns. The simplest form of transposition is making each column as a row and each row as a column. A complex type of transposition includes converting a data table into value/key pairs or vice versa.
- Data validation: Validating the data before the loading process is a good option. An example of data validation includes verifying the format of an email address or the validity of a postal code. Data validation is often carried out for security reasons. An example would be attempting to load data into an SQL server, where the server could probably delete all of its contents.
The Types of Data Integration Different Companies Operate On
Data integration revolves on the ETL of data from IT services and applications. And, these services and applications are not always owned by the same organization, nor do they serve the same organization. Based on this point, data integration is broadly divided into two types, namely application-to-application and business-to-business.
Application-to-Application (A2A) data integration is preferred in the case of ETL operations when the services and applications are used by a single organization. Consider a case of connecting ERP and CRM systems. Here, the A2A integrations are performed only at the API level, although there could be integrations with old solutions at a database level.
Direct integrations always happen between popular services and applications. This fuels the need for an intermediate database in case of new A2A integrations. Direct integrations are restricted to a few use cases, but still many enterprises transfer the data into a middleware between extraction and loading. This scenario is common in organizations that also follow business-to-business data integration, since the data from one service or application may be used in multiple ways.
Business-to-Business (B2B) integration involves the exchange of data between multiple entities in multiple enterprises. An example of this type of data integration is business customers sharing their IT data with a logistics company to generate waybills automatically.
Comparing B2B integration to A2A integration, the difference lies in integrating external elements by exchanging external documents. However many organizations are unavoidably reluctant to open up their business-critical APIs for a variety of security reasons. This document exchange usually happens in a loosely couple format, signifying the possibility of a difference in the customer files after each stage of the integration. Considering our logistics company scenario, the company’s IT executives would have to cope with millions of slightly-to-largely different documents from the millions of customers. Ingesting enormously unstructured data from such files would require custom scripts to transform the receiving data formats. But, this problem gets even bigger when the situation scales up in a hurry.
The Issue of Scalability
At the center of the complexities involved in B2B integration, each organization extracting and sending data from their systems employs not only different IT solutions, but also different developers with various skillsets and expertise.
Even if two organizations are using the same system to extract data, one organization may do the data transformation part while the other might not.
In the logistics company example, let us assume that the company transforms data before filtering and validation. The company sends only the data that is necessary, and ensure formatting, so there are no database errors.
An associated organization, however, might transfer raw data dumps including enormous PII (Personally Identifiable Information). This will put forth a great responsibility on the shoulders of the destination enterprise. Additionally, any malicious or irregular data will go through without any change.
When a company receives the data of these two companies and loads these two datasets without any transformation, they would not only look out of order but also have high chances of corrupting the database.
The Need for Customs Data Integration
So, where do all these issues and limitations leave us? Put yourself in the position of that receiving organization, and you’ll find plenty to ponder upon.
Companies that receive data from multiple sources need quick and efficient parsers to perform ETL and the data integration. Each customer may require a separate parser, and each parser will require a set of data transformations, plus a whole lot of data verification involved.
Unfortunately, this is not an easy process. And, the conventional IT systems are simply not designed to perform integrations at this scale, nor do they have the necessary tools to handle such complex data transformations. Also, many enterprises have a false assumption that if their application is capable of doing API integration, then it can also integrate files. This leaves IT organizations designing complex and custom ETL solutions to the front end their in-house middleware, their data warehouse, or both. Such a custom data integration solution is required by companies with capabilities of real-time transformation. And, it is wise to consider experienced data integration vendors to meet your custom data requirements.