Data sources episode 1: Common data sources in modern pipelines

First published on March 22, 2023

Last updated on April 28, 2023


12 minute read

Guest post by Shashank Mishra, Data Engineer @ Expedia

TLDR

Databases aren’t the only sources of data; modern data pipelines can consume data from a wide variety of sources.

Outline

  • What’s a Data Pipeline?

  • What’s the difference between a Data Pipeline and ETL (Extract, Transform, Load)?

  • What does "Data Source" mean?

  • Common data sources in modern data pipelines

  • Why is it important to know your data source?

  • Conclusion

What’s a Data Pipeline?

A data pipeline is a series of interconnected steps that extract, transform, and load data from different sources into a target destination, such as a data warehouse. It enables organizations to process and access data in a useful format for analysis, reporting, or other purposes.

A simple example of a data pipeline could be a process of extracting sales data from multiple sources, such as point-of-sale (POS) systems and online sales platforms. The data is then transformed and cleaned to remove errors or inconsistencies and aggregated to create a single view of all sales. Finally, the transformed data is loaded into a data warehouse where it can be analyzed to identify trends, patterns, or insights that can inform business decisions.
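As a rough illustration, here is a minimal sketch of that sales pipeline in Python. The file names, column names, and the use of SQLite as a stand-in warehouse are all illustrative assumptions, not a prescribed implementation:

```python
# A sketch of the sales pipeline above. File names, column names, and the
# SQLite "warehouse" are illustrative assumptions, not a fixed implementation.
import csv
import sqlite3

def extract(paths):
    """Read raw sales rows from several CSV exports (e.g., POS and online)."""
    for path in paths:
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

def transform(rows):
    """Clean and normalize rows, skipping records with missing amounts."""
    for row in rows:
        if not row.get("amount"):
            continue  # drop inconsistent records
        yield {"store": row["store"].strip(), "amount": float(row["amount"])}

def load(rows, conn):
    """Load cleaned rows into a warehouse table (SQLite stands in here)."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (store TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:store, :amount)", list(rows))
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract(["pos_sales.csv", "online_sales.csv"])), conn)
```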


What’s the difference between a Data Pipeline and ETL (Extract, Transform, Load)?

Data Pipeline and ETL are both essential concepts in data processing and integration, but they serve different purposes and have unique characteristics. Here's a comparison of Data Pipeline and ETL:

| Aspect | Data Pipeline | ETL (Extract, Transform, Load) |
| --- | --- | --- |
| Purpose | Move and process data from various sources to destinations | Extract data from sources, transform it, and load it to a target |
| Scope | Broader; includes ETL as a subset | A specific type of data pipeline |
| Flexibility | Can handle various types of data processing | Primarily focused on structured data transformation |
| Data Processing | Real-time or batch processing | Typically batch processing |
| Data Transformation | Optional; can be performed in various ways | Integral part of the process |
| Data Flow | Unidirectional or bidirectional | Unidirectional |
| Data Types | Handles structured, semi-structured, and unstructured data | Primarily handles structured and semi-structured data |
| Use Cases | Data replication, streaming, analytics, machine learning | Data warehousing, data migration, data integration |
| Processing Complexity | Can range from simple to complex | Can range from simple to complex, but usually complex |
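To make the distinction concrete, here is a plain-Python sketch with illustrative data: a pipeline may simply move records unchanged (replication), while ETL always transforms before loading.

```python
# Plain-Python stand-ins: a pipeline may simply copy records as-is
# (replication), while ETL always applies a transformation before loading.
def replicate(source_rows, sink):
    """A pipeline with no transform step: copy records unchanged."""
    for row in source_rows:
        sink.append(row)

def etl(source_rows, sink):
    """ETL: extract each record, apply a transformation, then load it."""
    for row in source_rows:
        sink.append({**row, "amount_usd": row["amount"] * row["fx_rate"]})

source = [{"amount": 10.0, "fx_rate": 1.1}]
raw_copy, warehouse = [], []
replicate(source, raw_copy)  # a data pipeline without transformation
etl(source, warehouse)       # transformation is integral to ETL
```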

What does "Data Source" mean?

In the context of a data pipeline, a "Data Source" refers to the origin or starting point from which data is collected, extracted, or ingested before it’s processed, transformed, and either stored or utilized in the pipeline.


Common data sources in modern data pipelines

Data sources can be diverse and include various types of systems, databases, applications, or files where raw data is generated, stored, or managed. Here are the popular ones:

  1. Relational databases: A relational database organizes data into one or more tables, each consisting of rows and columns. Each table represents a single entity or concept, such as customers, orders, products, or employees. Relational databases use Structured Query Language (SQL) to manipulate and retrieve data from the tables. Examples: MySQL, PostgreSQL.
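Here is a minimal sketch of reading from a relational source, using Python’s built-in sqlite3 module as a stand-in for a production database; the table and column names are illustrative assumptions:

```python
# Reading from a relational source with SQL. Python's built-in sqlite3 module
# stands in for MySQL/PostgreSQL; the table and columns are assumptions.
import sqlite3

conn = sqlite3.connect("orders.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'alice')")
conn.commit()

# The extraction step of a pipeline: pull only the rows of interest via SQL.
for order_id, customer in conn.execute(
    "SELECT id, customer FROM orders WHERE customer = ?", ("alice",)
):
    print(order_id, customer)
```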

  2. NoSQL databases: A NoSQL database differs from traditional relational databases in its data model and approach to storing and retrieving data. NoSQL databases are often designed to handle unstructured or semi-structured data, such as social media posts, documents, and sensor data. Unlike relational databases, NoSQL databases don’t use tables with fixed schemas to store data. Instead, they use various data models, such as key-value, document, graph, or column-family models. NoSQL databases are highly scalable and can handle large volumes of data and high levels of traffic. Example: MongoDB.

    Data representation in MongoDB (Source: MongoDB)
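A hedged sketch of reading documents from MongoDB as a pipeline source; it assumes the pymongo package and a local server, and the database, collection, and field names are illustrative:

```python
# Reading documents from MongoDB as a pipeline source. Assumes the pymongo
# package and a local server; database/collection/field names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["social"]["posts"]

posts.insert_one({"user": "alice", "text": "hello", "likes": 3})

# No fixed schema: documents are filtered with a query document, not SQL.
for doc in posts.find({"likes": {"$gte": 1}}):
    print(doc["user"], doc["text"])
```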

  3. Data warehouses: A data warehouse is a large, centralized repository of data that is specifically designed for business intelligence and analytics. It’s used to store, manage, and analyze data from multiple sources to support decision-making and reporting in an organization. Examples: Snowflake, Amazon Redshift.
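A sketch of an analytical query against a warehouse, assuming the snowflake-connector-python package; the credentials and table names are placeholders, and any DB-API-style warehouse driver follows the same connect/cursor/execute pattern:

```python
# An analytical query against a warehouse. Assumes snowflake-connector-python;
# credentials and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="USER", password="PASSWORD", account="ACCOUNT"  # placeholders
)
cur = conn.cursor()
# Warehouses are optimized for aggregations over large historical tables.
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)
```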

  4. File systems: In the context of data pipelines, this usually means a distributed file system or object store: a scalable storage system designed to handle large volumes of data across multiple nodes or servers. It provides a way to store, manage, and access data in a distributed environment, enabling high availability and fault tolerance. Examples: HDFS (Hadoop Distributed File System), Amazon S3.

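A sketch of reading a file from object storage with boto3 (assumed installed and configured with credentials); the bucket and key names are illustrative:

```python
# Reading a file from object storage with boto3 (assumed installed and
# configured with credentials); the bucket and key names are illustrative.
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="raw-data", Key="exports/2023/sales.csv")
body = obj["Body"].read().decode("utf-8")  # fetch the object's contents
print(body.splitlines()[0])  # e.g., inspect the CSV header row
```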

  5. APIs: An API (Application Programming Interface) acts as a mediator between different software applications, allowing them to communicate and exchange data with each other. APIs can be used to retrieve data, initiate actions or workflows, and enable integrations between different software applications. Example: REST APIs.

    Working example of a REST API call and its response
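A minimal sketch of pulling data from a REST API with the requests package; the endpoint is a placeholder and the JSON response shape is an assumption:

```python
# Pulling data from a REST API with the requests package. The endpoint is a
# placeholder, and the JSON response shape is an assumption.
import requests

resp = requests.get(
    "https://api.example.com/v1/orders",  # placeholder endpoint
    params={"status": "paid"},
    timeout=10,
)
resp.raise_for_status()    # fail fast on HTTP errors
for order in resp.json():  # assumes the API returns a JSON array
    print(order)
```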

  6. Messaging queues: A messaging queue allows software applications to communicate asynchronously by exchanging messages or data using a publish-subscribe or point-to-point messaging model. It improves the scalability, reliability, and fault tolerance of software systems by decoupling message sending and receiving. Messaging queues are commonly used in distributed systems, cloud computing, and microservices architectures. Examples: Apache Kafka, RabbitMQ.

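A sketch of consuming from a queue with the kafka-python package (assumed installed); the broker address, topic name, and JSON payloads are illustrative:

```python
# Consuming from a queue with the kafka-python package (assumed installed);
# the broker address, topic name, and JSON payloads are illustrative.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sales-events",                      # topic (assumption)
    bootstrap_servers="localhost:9092",  # broker (assumption)
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
# Messages arrive asynchronously as producers publish them to the topic.
for message in consumer:
    print(message.value)
```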

  7. Social media platforms: Social media platforms such as Facebook, Twitter, Instagram, and LinkedIn can serve as valuable data sources for businesses and researchers. They provide large volumes of user-generated content, including text, images, and videos, that can be analyzed to gain insights into user behavior, sentiment, trends, and preferences.
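A hypothetical sketch of collecting social posts over HTTP with requests; the endpoint, token, and response shape are all placeholders, since real platform APIs require registered credentials and enforce rate limits:

```python
# Collecting social posts over HTTP with requests. The endpoint, token, and
# response shape are all placeholders; real platform APIs require registered
# credentials and enforce rate limits.
import requests

resp = requests.get(
    "https://api.example.com/v2/posts/search",       # placeholder endpoint
    params={"query": "data engineering"},
    headers={"Authorization": "Bearer YOUR_TOKEN"},  # placeholder credential
    timeout=10,
)
resp.raise_for_status()
for post in resp.json().get("data", []):  # response shape is an assumption
    print(post.get("text"))
```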


  8. IoT devices and sensors: IoT devices and sensors generate vast amounts of data that can be used to monitor and analyze physical processes, environments, and activities. This data can be leveraged for predictive maintenance, process optimization, and real-time decision-making in various industries such as manufacturing, healthcare, transportation, and agriculture. Examples: smart meters, temperature sensors, GPS trackers.

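A sketch of ingesting sensor readings over MQTT, a common IoT protocol, using the paho-mqtt 1.x package (assumed installed); the broker address and topic are illustrative:

```python
# Ingesting sensor readings over MQTT with the paho-mqtt 1.x package (assumed
# installed); the broker address and topic are illustrative.
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # Each message carries one reading, e.g., b"21.5" from a temperature sensor.
    print(msg.topic, msg.payload.decode("utf-8"))

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)        # broker host/port are assumptions
client.subscribe("sensors/temperature")  # topic is an assumption
client.loop_forever()                    # block and process readings as they arrive
```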

Why is it important to know your data source?

Understanding your data source at a technical level matters because it affects how the data is captured, stored, processed, and analyzed. Different data sources may have different data formats, data quality, data volume, and data velocity, all of which influence the design of the data pipeline.

For example, if the data source generates large volumes of data in real-time, such as IoT sensors, the data pipeline must be designed to handle the high data velocity and ensure timely processing and analysis. Similarly, if the data source has poor data quality or inconsistent data formats, data cleaning and transformation steps must be included in the pipeline to ensure accurate and reliable data analysis.
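Such a cleaning step might look like the following sketch, using pandas (assumed installed); the file and column names are illustrative:

```python
# A cleaning/transformation step sketched with pandas (assumed installed);
# the file and column names are illustrative.
import pandas as pd

df = pd.read_csv("raw_sales.csv")
df = df.drop_duplicates()                                    # remove repeated rows
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # coerce bad formats to NaN
df["sold_at"] = pd.to_datetime(df["sold_at"], errors="coerce")
df = df.dropna(subset=["amount", "sold_at"])                 # drop unusable rows
```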

Understanding the technical aspects of the data source can also help identify potential data integration or compatibility issues with other data sources or systems, which can impact the overall performance and effectiveness of the data pipeline.


Conclusion

In conclusion, data pipelines play a crucial role in processing and integrating data from various sources. The most common data sources in modern pipelines include relational databases, NoSQL databases, data warehouses, file systems, APIs, messaging queues, social media platforms, and IoT devices and sensors. By harnessing data from these diverse sources, businesses and organizations can gain valuable insights, optimize their operations, and drive data-driven decision-making. As the volume, variety, and velocity of data continue to increase, the importance of robust and scalable data pipelines cannot be overstated, making them a fundamental component of today's data-centric world.

In episode 2 of the data sources series, we’ll cover an open-source platform that syncs data from various data sources.