TLDR
Data integration is synchronizing information between one system and another.
Outline
What is it?
Why do you need it?
How do you implement it?
Who is it for?
Future and next steps
What is it?
Data integration is the process of loading data from an external source and then exporting it to an internal destination. The process can also run in the other direction, from internal systems to an external destination (known as reverse ETL).
Scenario
You regularly collect gold from scattered kingdoms that have pledged loyalty to your throne. You collect their gold payments through an external payment service.
Your treasury (aka finance team) wants to better understand which demographics of kingdoms pay well and on time.
They need the gold payment data from the payment source in your data warehouse so that they can combine it with the kingdom census data. Once combined, they can analyze the data and report it to the high council.
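To make the scenario concrete, here is a minimal sketch of loading payment records into a warehouse and combining them with census data. It assumes an in-memory SQLite database as a stand-in for the data warehouse; the table names, column names, kingdoms, and values are all hypothetical and only for illustration.

```python
import sqlite3

# Hypothetical records, standing in for what a payment source's API might return.
payments = [
    {"kingdom": "Northmarch", "gold": 120, "days_late": 0},
    {"kingdom": "Eastvale", "gold": 80, "days_late": 12},
    {"kingdom": "Northmarch", "gold": 100, "days_late": 2},
]


def load_payments(conn, records):
    """Load source records into a destination table in the warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gold_payments "
        "(kingdom TEXT, gold INTEGER, days_late INTEGER)"
    )
    conn.executemany(
        "INSERT INTO gold_payments VALUES (:kingdom, :gold, :days_late)", records
    )
    conn.commit()


def payments_by_region(conn):
    """Combine payments with census data once both live in the same place."""
    return conn.execute(
        """
        SELECT c.region, SUM(p.gold), AVG(p.days_late)
        FROM gold_payments AS p
        JOIN kingdom_census AS c ON c.kingdom = p.kingdom
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")  # in-memory SQLite stands in for the warehouse
conn.execute("CREATE TABLE kingdom_census (kingdom TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO kingdom_census VALUES (?, ?)",
    [("Northmarch", "north"), ("Eastvale", "east")],
)
load_payments(conn, payments)
report = payments_by_region(conn)  # one row per region: total gold, average lateness
```

The join is only possible because the integration put both datasets in the same destination, which is the point of the exercise.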
Why do you need it?
Use cases for integrating external data with internal systems include (but aren’t limited to):
Combine data from multiple sources to build comprehensive data models for business use cases
Analyze data from external SaaS tools
Combine data for personalization
Forecasting
Example use cases for loading data from your internal systems and then exporting it to an external system include (but aren’t limited to):
Getting internal application data into marketing tools for outreach campaigns, ads, etc.
Syncing user data with CRMs
Exporting internal data to task management software for ops
Scenario
You’re having a big tournament to celebrate your next heir to the throne. You put out lots of ad posters all over the land. Many knights sign up to compete in your tournament.
You build a data integration with all the ad service providers, ingesting data from their APIs and storing it in your data warehouse.
Your team uses the ad campaign performance data, knight sign-ups, tournament results, and spectator spending to calculate the ROI of ad campaigns. These models will help your team promote the next tournament more efficiently.
How do you implement it?
There are 3 ways you can accomplish data integration:
Software-as-a-service (SaaS), including reverse ETL tools
Open-source software
Custom code, written from scratch or with open-source libraries
Some benefits of using SaaS or open-source software are the out-of-the-box features you get, such as (not a comprehensive list):
When a 3rd party API gets updated, maintainers help update the integration
When there is a duplicate record, you can automatically ignore it or update an existing record
Automatically track progress and status of synchronizations
Retry failed synchronizations
And more…
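If you write the code yourself, you take on these features too. The sketch below shows roughly what duplicate handling, status tracking, and retries might look like. It is not how any particular tool implements them: the `payments` table, the `flaky_fetch` source, and the status fields are hypothetical, and SQLite stands in for the destination.

```python
import sqlite3
import time


def upsert_records(conn, records):
    """Insert new records; replace the existing row when the id already exists."""
    conn.executemany(
        "INSERT OR REPLACE INTO payments (id, amount) VALUES (:id, :amount)",
        records,
    )
    conn.commit()


def sync_with_retry(fetch, conn, max_attempts=3, delay_seconds=0.0):
    """Run one synchronization, retrying on failure, and return its status."""
    status = {"state": "running", "attempts": 0, "records_processed": 0}
    for attempt in range(1, max_attempts + 1):
        status["attempts"] = attempt
        try:
            records = fetch()
            upsert_records(conn, records)
        except ConnectionError:
            time.sleep(delay_seconds)  # wait before the next attempt
            continue
        status["state"] = "succeeded"
        status["records_processed"] = len(records)
        return status
    status["state"] = "failed"
    return status


calls = {"count": 0}


def flaky_fetch():
    """Hypothetical source call: fails once, then returns a duplicated id."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("source temporarily unavailable")
    return [
        {"id": 1, "amount": 100},
        {"id": 1, "amount": 150},  # same id again: updates the row above
        {"id": 2, "amount": 75},
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")
status = sync_with_retry(flaky_fetch, conn)
rows = conn.execute("SELECT id, amount FROM payments ORDER BY id").fetchall()
```

Here the first attempt fails, the second succeeds, and the duplicated id ends up as a single row holding the latest amount.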
Who is it for?
Typically, data engineering owns data integrations. Data engineers can choose how they’re implemented (e.g. buy SaaS or implement open-source software).
Marketing, sales, operations, etc. can help influence which 3rd party sources need data integration or which 3rd party APIs need data synced to them from internal data warehouses.
Depending on which sources are required by specific business use cases, different solutions are required.
Companies use SaaS or self-hosted open-source software to integrate data from common sources. When an uncommon source isn’t supported out of the box by a SaaS provider, teams typically write the code themselves to handle the synchronization.
In addition, when a team needs to export data from their internal systems to 3rd party APIs (e.g. Salesforce) and has very customized needs, they also typically write custom code to handle their specific use case.
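As a rough idea of what that custom export code can look like, here is a minimal sketch that reads rows from an internal table, maps them to the shape an external system expects, and sends them in batches. Everything here is an assumption for illustration: the `users` table, the payload fields, and the `send_batch` callable, which stands in for a real API client and does not reflect Salesforce’s or any other vendor’s actual API.

```python
import sqlite3


def to_contact_payload(row):
    """Map an internal row to the (hypothetical) shape an external CRM expects."""
    user_id, email, plan = row
    return {"external_id": f"user-{user_id}", "email": email.lower(), "tier": plan}


def export_users(conn, send_batch, batch_size=2):
    """Read users from the warehouse and push them out in batches."""
    cursor = conn.execute("SELECT id, email, plan FROM users ORDER BY id")
    sent = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        send_batch([to_contact_payload(row) for row in rows])
        sent += len(rows)
    return sent


conn = sqlite3.connect(":memory:")  # stands in for the internal data warehouse
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "Ada@example.com", "pro"),
        (2, "bob@example.com", "free"),
        (3, "cy@example.com", "pro"),
    ],
)

outbox = []  # stands in for the third-party API; a real sender would make HTTP calls
total = export_users(conn, outbox.append)
```

Passing the sender in as a function keeps the mapping and batching logic testable without touching a real third-party API.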
Future and next steps
Once data is integrated into a common destination, you can begin to combine it with data from other sources, analyze it, and then make predictions from it.
To accomplish that, you’ll first need to clean, wrangle, and transform the data, all through a data pipeline. Managing this process in a testable, repeatable, and observable way is critical.
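One way to keep such a pipeline testable, repeatable, and observable is to write each stage as a small function that can be tested on its own and to report what every stage produced. The sketch below assumes hypothetical record fields (`kingdom`, `gold`) and uses a plain `print` as the simplest possible form of observability.

```python
def clean(records):
    """Drop records that are missing a kingdom or a gold amount."""
    return [r for r in records if r.get("kingdom") and r.get("gold") is not None]


def wrangle(records):
    """Normalize kingdom names and coerce gold amounts to integers."""
    return [
        {"kingdom": r["kingdom"].strip().title(), "gold": int(r["gold"])}
        for r in records
    ]


def transform(records):
    """Aggregate total gold per kingdom."""
    totals = {}
    for r in records:
        totals[r["kingdom"]] = totals.get(r["kingdom"], 0) + r["gold"]
    return totals


def run_pipeline(records, steps=(clean, wrangle, transform)):
    """Run each step in order and report how many items it produced."""
    data = records
    for step in steps:
        data = step(data)
        print(f"{step.__name__}: {len(data)} items")
    return data


raw = [
    {"kingdom": " northmarch ", "gold": "120"},
    {"kingdom": "Northmarch", "gold": 100},
    {"kingdom": None, "gold": 50},  # dropped by clean: no kingdom
    {"kingdom": "eastvale", "gold": None},  # dropped by clean: no amount
]
totals = run_pipeline(raw)
```

Because every step is a pure function of its input, running the pipeline again on the same records gives the same result.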
In this data integration series, upcoming content will include:
The data engineering community standard for writing data integrations
How to write your own data integration
How to build an end-to-end data pipeline to sync data