Data Integration
Data integration has long been a major IT challenge. Before the arrival of cloud computing, integrating data from disparate sources was comparatively simple: most data lived in desktop and on-premises databases. Today, the IT world is not so simple.
It is hard to overstate the amount of information flowing through the IT systems of a large organization. The volume of data a company integrates today is far greater than anything it has handled before. Beyond sheer volume, the wide range of data types also contributes to the complexity. Gone are the days of dealing only with structured data; unstructured data and metadata are now a prime focus of enterprise data integration.
Every IT organization is familiar with the Extract, Transform, Load (ETL) procedure from the pre-cloud years. The following is a refresher on what ETL is and how it works behind the scenes of data integration.
ETL is the heart of modern data integration. It follows a series of steps to move data from one or more sources to one or more target destinations: data is extracted from the source systems, transformed into the structure and format the destination expects, and loaded into the target systems.
ETL can be performed in different ways. A common example is saving data from an application as a CSV file, editing the column names in a spreadsheet application, and finally loading the data into another application. Many companies perform this kind of basic ETL as a routine task. When no transformation is required, the entire process can be automated.
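To make that concrete, here is a minimal Python sketch of such a CSV-based ETL flow; the file names and column mapping are hypothetical placeholders, and a real export would need its own mapping.

```python
import csv

# Hypothetical column mapping: source column names -> names the destination expects
COLUMN_MAP = {"Cust Name": "customer_name", "Cust Email": "customer_email"}

def etl_csv(source_path, target_path):
    """Extract rows from an exported CSV, rename columns, and write a file
    the destination application can import."""
    with open(source_path, newline="") as src:        # extract
        rows = list(csv.DictReader(src))

    transformed = [                                    # transform: rename columns
        {COLUMN_MAP.get(key, key): value for key, value in row.items()}
        for row in rows
    ]

    if transformed:                                    # load
        with open(target_path, "w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=list(transformed[0].keys()))
            writer.writeheader()
            writer.writerows(transformed)

etl_csv("source_export.csv", "destination_import.csv")
```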
API-driven applications are the preferred choice in IT enterprises because they offer an easier form of integration. In an ETL flow, data is typically extracted from the source application through its API and then loaded into the destination application through an API as well.
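As a rough illustration, the sketch below shows an API-to-API ETL flow in Python using the requests library; the endpoints, token, and field names are hypothetical, and any real API will differ in paths, authentication, and payload shape.

```python
import requests

# Hypothetical endpoints and token; real APIs will differ.
SOURCE_URL = "https://source.example.com/api/v1/customers"
TARGET_URL = "https://target.example.com/api/v1/contacts"
HEADERS = {"Authorization": "Bearer <token>"}

# Extract: pull records from the source application's API
records = requests.get(SOURCE_URL, headers=HEADERS, timeout=30).json()

for record in records:
    # Transform: map source fields to the schema the destination API expects
    payload = {"name": record["customer_name"], "email": record["customer_email"]}
    # Load: push each record into the destination application's API
    requests.post(TARGET_URL, json=payload, headers=HEADERS, timeout=30).raise_for_status()
```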
When direct API-to-API integration is not possible, the process requires an intermediate data repository, which can be a file, a database, a data warehouse, or a middleware solution. This is also common when multiple data sources are merged as part of the transformation: data from the various sources is combined into a single warehouse format, which makes it easier to transform before loading it into the destination service or application.
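A minimal sketch of that staging step, assuming two hypothetical sources and a SQLite file standing in for the intermediate repository:

```python
import sqlite3

# Normalize two hypothetical sources into one common schema in an intermediate store.
staging = sqlite3.connect("staging.db")
staging.execute(
    "CREATE TABLE IF NOT EXISTS customers (source TEXT, customer_id TEXT, email TEXT)"
)

crm_rows = [("CRM-001", "alice@example.com")]      # pulled from source A (hypothetical)
billing_rows = [("B-7781", "bob@example.com")]     # pulled from source B (hypothetical)

staging.executemany("INSERT INTO customers VALUES ('crm', ?, ?)", crm_rows)
staging.executemany("INSERT INTO customers VALUES ('billing', ?, ?)", billing_rows)
staging.commit()

# Transformations (deduplication, formatting, enrichment) can now run against the
# single staging table before the data is loaded into the destination system.
```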
The data transformation process can involve several tasks, not all of which happen in every data integration: filtering out unneeded records, validating values, reformatting fields and renaming columns to match the destination schema, and merging data from multiple sources.
Data integration revolves around the ETL of data from IT services and applications. These services and applications are not always owned by the same organization, nor do they always serve the same organization. On that basis, data integration is broadly divided into two types: application-to-application and business-to-business.
Application-to-Application (A2A) data integration covers ETL operations where the services and applications are used by a single organization, for example connecting its ERP and CRM systems. A2A integrations are typically performed at the API level, although integrations with legacy solutions may still happen at the database level.
Direct integrations usually exist only between popular services and applications, which fuels the need for an intermediate database in newer A2A integrations. Even where direct integration is possible, it is restricted to a few use cases, so many enterprises still move the data into middleware between extraction and loading. This is especially common in organizations that also perform business-to-business data integration, since the data from one service or application may be used in multiple ways.
Business-to-Business (B2B) integration involves the exchange of data between entities in different enterprises. An example of this type of data integration is business customers sharing their IT data with a logistics company so that waybills can be generated automatically.
Compared to A2A integration, B2B integration brings in external parties by exchanging documents rather than connecting APIs directly, since many organizations are understandably reluctant to open up their business-critical APIs for a variety of security reasons. This document exchange usually happens in a loosely coupled format, which means the customer files can differ after each stage of the integration. In our logistics company scenario, the company's IT staff would have to cope with millions of slightly-to-wildly different documents from millions of customers. Ingesting largely unstructured data from such files requires custom scripts to transform the incoming data formats, and the problem only gets bigger when the situation scales up quickly.
At the center of the complexity in B2B integration is the fact that each organization extracting and sending data from its systems employs not only different IT solutions but also different developers with varying skill sets and expertise.
Even if two organizations use the same system to extract data, one may transform the data while the other may not.
In the logistics company example, let us assume that one sending company transforms its data, filtering and validating it before transfer: it sends only the data that is necessary and ensures proper formatting, so no database errors occur on the receiving side.
Another organization, however, might transfer raw data dumps containing large amounts of PII (Personally Identifiable Information). This places a heavy responsibility on the destination enterprise, and any malicious or irregular data passes through unchanged.
When the receiving company loads these two datasets without any transformation, the results not only look inconsistent but also stand a good chance of corrupting the database.
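A minimal sketch of the kind of defensive transformation the receiving side might apply before loading; the field names are hypothetical.

```python
# Receiver-side defenses, assuming hypothetical field names.
REQUIRED_FIELDS = {"order_id", "destination_address"}
PII_FIELDS = {"ssn", "date_of_birth"}      # fields the receiver never wants to store

def sanitize(record: dict):
    """Validate an incoming record and strip PII before it reaches the database."""
    # Validation: reject irregular records instead of letting them corrupt the database
    if not REQUIRED_FIELDS.issubset(record):
        return None
    # PII handling: drop sensitive fields the sender should not have included
    return {key: value for key, value in record.items() if key not in PII_FIELDS}

incoming = [
    {"order_id": "A1", "destination_address": "10 Main St", "ssn": "000-00-0000"},
    {"destination_address": "22 Side St"},   # missing order_id: rejected
]
clean = [r for r in (sanitize(rec) for rec in incoming) if r is not None]
```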
So, where do all these issues and limitations leave us? Put yourself in the position of the receiving organization, and you will find plenty to ponder.
Companies that receive data from multiple sources need fast, efficient parsers to perform ETL and data integration. Each customer may require a separate parser, each parser requires its own set of data transformations, and a great deal of data verification is involved.
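As a rough illustration, the sketch below shows one way per-customer parsers could be organized; the customer names, formats, and verification rules are entirely hypothetical.

```python
import csv
import json

def parse_acme(path):
    """Hypothetical customer ACME sends CSV exports with its own column names."""
    with open(path, newline="") as f:
        return [{"order_id": row["Order No"], "weight_kg": float(row["Weight"])}
                for row in csv.DictReader(f)]

def parse_globex(path):
    """Hypothetical customer Globex sends JSON dumps with a different schema."""
    with open(path) as f:
        return [{"order_id": item["id"], "weight_kg": item["weight"] / 1000}
                for item in json.load(f)]

# Registry: each customer gets its own parser that normalizes into one internal schema
PARSERS = {"acme": parse_acme, "globex": parse_globex}

def ingest(customer, path):
    records = PARSERS[customer](path)
    # Verification: every parser's output is checked against the same rules before loading
    return [r for r in records if r["order_id"] and r["weight_kg"] > 0]
```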
Unfortunately, this is not an easy process. Conventional IT systems are simply not designed to perform integrations at this scale, nor do they have the tools to handle such complex data transformations. Many enterprises also falsely assume that if their application can do API integration, it can integrate files as well. This leaves IT organizations designing complex, custom ETL solutions to front-end their in-house middleware, their data warehouse, or both. Companies that need real-time transformation capabilities require such custom data integration solutions, and it is wise to consider experienced data integration vendors to meet custom data requirements.