What Is a Data Lake? Some Key Things to Know

Are you asking the question: what is a data lake? If yes, you should check out our guide here on the important things to understand.

Pilotcore Nov 22, 2020 6 min read

As datasets grow ever larger, how can companies hope to analyze all that information?

Cloud computing has solved the issue of scalable data storage. Businesses can store anything from images to complex data structures in the Cloud. Yet they face the problem of how to pull all that data together and make sense of it.

The answer to analyzing and working with big data is to form a data lake.

In this article, we'll ask what a data lake is and what it's used for.

We examine the architecture behind the storage system and how it differs from a data warehouse. You'll learn how data ingestion works and how AWS provides the tools you need to gather and work with huge datasets.

What Is a Data Lake?

Cloud servers enable your business to add all sorts of digital information.

The rise of NoSQL databases like Amazon DynamoDB offers a way to store data of any type and size. Even non-database objects, like application source code from a SaaS platform, get saved in the Cloud.

All this data is constantly changing. And that presents a challenge.

How do you analyze information that changes all the time? Especially if it's spread amongst multiple systems?

The answer is to bring it all together into a central repository called a data lake.

Traditional vs Data Lake Data Reporting

Traditional SQL databases like SQL Server used to make reporting simple.

Dedicated tools were built into the software, and with the right structured query, reports would arrive in your inbox.

But those vertically scaled storage systems could only handle so much data. Enter Cloud storage.

The Cloud can scale to hold huge amounts of data. Yet the system's strength is also a weakness as it's fragmented in nature.

Querying constantly evolving data is difficult to manage. Data lakes make analysis easier because the data sits in one central place instead of being scattered across systems.

Data lakes bring together raw data into one massive collection or lake.

You can store gigabytes, terabytes, even petabytes of information. The data doesn't have to be of one type either. Unlike the rigid structure of a SQL database, a data lake lets you keep data in whatever form it arrives.

Data Lake Architecture

Pilotcore advocates Amazon Web Services (AWS) because it offers complete data lake solutions.

AWS managed services offer a straightforward way to import, store, process, and then analyze any type of data, whether structured or unstructured. AWS combines an automated referencing system with a user-friendly console for searching large datasets.

Data Ingestion Process

Data ingestion describes how data gets added to the AWS platform.

Sources can be almost anything, from in-house applications and spreadsheets to web-scraped content. Ingestion is the backbone of data lake architecture because of the golden rule of computing: GIGO, or garbage in, garbage out.

In other words, what you put into a system is what you'll get out of it.

There are two types of data ingestion:

Batch ingestion
Streaming ingestion

Batch ingestion sees data get periodically added to the data lake, usually through a scheduled event. It's generally easier and less expensive than real-time or streaming ingestion.

Streaming ingestion adds live information to the data lake in real time.

Data isn't grouped, and systems need to monitor the different sources constantly, so it's processing-heavy. Streaming ingestion isn't usually required unless analyzing real-time data is essential.
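To make the difference concrete, here is a minimal Python sketch. It uses a plain in-memory list to stand in for the data lake, and the function names (batch_ingest, stream_ingest, sensor_feed) are hypothetical, made up for illustration rather than taken from any AWS API.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def batch_ingest(lake: List[Record], pending: List[Record]) -> int:
    """Batch: add everything collected since the last scheduled run in one go."""
    count = len(pending)
    lake.extend(pending)
    pending.clear()  # the staging area is emptied once the batch has landed
    return count


def stream_ingest(lake: List[Record], source: Iterable[Record],
                  on_arrival: Callable[[Record], None]) -> int:
    """Streaming: store each record as it arrives and notify a listener."""
    count = 0
    for record in source:
        lake.append(record)
        on_arrival(record)  # downstream code can react immediately
        count += 1
    return count


def sensor_feed() -> Iterator[Record]:
    """Hypothetical live source that yields one reading at a time."""
    for reading in (21.5, 21.7, 22.1):
        yield {"sensor": "s1", "temp_c": reading}


lake: List[Record] = []
pending: List[Record] = [{"order": 1}, {"order": 2}]
batch_count = batch_ingest(lake, pending)

seen: List[Record] = []
stream_count = stream_ingest(lake, sensor_feed(), seen.append)
```

The batch function touches the lake once per run, while the streaming function does work for every single record. That per-record overhead is why streaming tends to cost more.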

AWS Data Lake Architecture

A data lake solution architecture in AWS can use Node.js or Python to manage tags, and to search, share, and transform data. It can create subsets of data for one specific type or region and even combine it with other external sources if required.

With AWS CloudFormation, a data lake solution can bring together several AWS services that may include:

AWS Lambda - run code on serverless processes
Amazon Elasticsearch Service - robust search facility
Amazon Cognito - user authentication
AWS Glue - data transformation
Amazon Athena - data analysis

Amazon S3 provides secure storage for the catalogue of datasets. DynamoDB can work with S3 to handle the related metadata.
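The idea of keeping metadata alongside stored objects can be sketched in a few lines of Python. This is an illustration only: the MetadataCatalog class, the object keys, and the tag names are all assumptions, and a dictionary stands in for a real metadata table.

```python
from collections import defaultdict


class MetadataCatalog:
    """Maps object keys to tags and keeps a reverse index for searching."""

    def __init__(self):
        self._tags_by_key = {}
        self._keys_by_tag = defaultdict(set)

    def register(self, object_key, tags):
        # Drop stale index entries if this key was registered before.
        for item in self._tags_by_key.get(object_key, {}).items():
            self._keys_by_tag[item].discard(object_key)
        self._tags_by_key[object_key] = dict(tags)
        for item in tags.items():
            self._keys_by_tag[item].add(object_key)

    def search(self, **tags):
        """Return the sorted keys whose metadata matches every given tag."""
        matches = [self._keys_by_tag.get(item, set()) for item in tags.items()]
        if not matches:
            return []
        return sorted(set.intersection(*matches))


catalog = MetadataCatalog()
catalog.register("raw/sales/2020-11-01.csv", {"region": "eu", "type": "sales"})
catalog.register("raw/sales/2020-11-02.csv", {"region": "us", "type": "sales"})
catalog.register("raw/logs/app.json", {"region": "eu", "type": "logs"})

eu_sales = catalog.search(region="eu", type="sales")
```

Searching by tag never opens the objects themselves, which is the point of holding metadata separately from the data.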

Querying Data in AWS

Data can be queried through Amazon Athena or a custom console.

Managing IAM permissions ensures only authorized users get access to the results, a security feature that's native to all Amazon's services.
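As a rough illustration, an IAM policy is a JSON document listing allowed actions and resources. The sketch below builds a read-only policy in Python. The bucket name and prefix are hypothetical, and a real deployment would likely need further statements (for Athena itself, for example).

```python
import json


def read_only_policy(bucket, prefix):
    """Build a policy that allows listing and reading one prefix of a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                # Restrict listing to keys under the chosen prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


# "example-data-lake" and "query-results/" are placeholder names.
policy_json = json.dumps(read_only_policy("example-data-lake", "query-results/"), indent=2)
```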

AWS data lakes can log all API calls as well as latency and error rates. Amazon CloudWatch enables you to access this information directly from your account screen. You can also implement audit logging for compliance purposes.

Data Lakes vs Data Warehouses

Data lakes and data warehouses sound similar but are quite different.

A data warehouse pre-processes data before it's integrated. All data has a specific use-case and adheres to strict governance.

A data lake allows any type of data into its store. It doesn't require pre-processing and is often used by data engineers to study big data.

Data warehousing uses traditional structured queries, whereas data lakes are more open-ended. Both have their place in analyzing and producing reports and datasets.
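One common way to describe this difference is schema-on-write versus schema-on-read. The Python sketch below is a simplified illustration with made-up names and an in-memory list for each store: the warehouse rejects anything that doesn't fit its schema, while the lake accepts everything and applies structure only when the data is read.

```python
import json

# Hypothetical warehouse schema: column name -> required Python type.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}


def load_into_warehouse(table, row):
    """Schema-on-write: validate the row before it is stored."""
    if set(row) != set(WAREHOUSE_SCHEMA):
        raise ValueError("row columns do not match the schema")
    for column, expected in WAREHOUSE_SCHEMA.items():
        if not isinstance(row[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(row)


def store_in_lake(lake, raw):
    """The lake keeps whatever arrives, exactly as it arrives."""
    lake.append(raw)


def read_amounts(lake):
    """Schema-on-read: interpret records at query time, skipping misfits."""
    amounts = []
    for raw in lake:
        try:
            record = json.loads(raw)
            amounts.append(float(record["amount"]))
        except (ValueError, TypeError, KeyError):
            continue  # not JSON, or no usable "amount" field
    return amounts


lake = []
store_in_lake(lake, '{"order_id": 1, "amount": "19.99"}')
store_in_lake(lake, "free-form log line")
store_in_lake(lake, '{"sensor": "s1"}')
amounts = read_amounts(lake)

warehouse = []
load_into_warehouse(warehouse, {"order_id": 1, "amount": 19.99})
try:
    load_into_warehouse(warehouse, {"order_id": "1", "amount": 19.99})
    rejected = False
except ValueError:
    rejected = True
```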

What Is a Data Lake Used For?

Any user that requires access to vast volumes of unstructured data can benefit from data lakes.

That includes data scientists, machine learning and AI programmers, and enterprise users. Basically, anyone wanting to study an enormous data dump.

That's not to say business intelligence (BI) tools like Amazon QuickSight can't benefit from a data lake.

Financial trading systems can use them to forecast future market trends. They analyze existing data to spot patterns that impact the likes of the Dow Jones.

However, a data warehouse is usually more applicable to most businesses.

Warehouses structure data, which makes processing information far more efficient. Data lakes are better suited to exploratory work, where the questions aren't yet fully defined. That could entail analyzing real-world systems like weather patterns.

Big Data on AWS

Adopting Cloud services opens a world of possibilities when working with big data.

Amazon Elastic MapReduce or EMR offers a web service to process large amounts of data in a short timeframe. It does this by distributing data and processing across resizable clusters of EC2 instances.
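The underlying idea of splitting data across workers and then merging their partial results can be shown in miniature. This Python sketch runs everything in one process, and the sample log lines and function names are invented for illustration. On EMR, the chunks would be processed on separate machines.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, workers):
    """Deal the records out round-robin, one chunk per worker."""
    return [records[i::workers] for i in range(workers)]


def map_chunk(chunk):
    """Map step: each worker counts requests per region in its own chunk."""
    return Counter(line.split(",")[0] for line in chunk)


def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Hypothetical "region,status" log lines.
log_lines = ["eu,200", "us,200", "eu,500", "eu,200", "us,404"]

partials = [map_chunk(chunk) for chunk in split_into_chunks(log_lines, workers=2)]
totals = reduce(merge, partials, Counter())
```

Because each chunk is counted independently, adding more workers shortens the map step without changing the final totals.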

Amazon Redshift is a data warehouse solution that also works with data lakes.

It queries structured and semi-structured content using SQL. Results can be saved back into the S3 data lake in open formats such as Apache Parquet. That data, in turn, can be analyzed through Athena, SageMaker, etc.

Redshift costs less than most data warehousing platforms. And Pilotcore can optimize all your costs through our AWS cost optimization service.

We've briefly mentioned Amazon DynamoDB, but it pays to examine this NoSQL system further for storing big data.

DynamoDB is a key-value and document database that can house multiple types of data. According to AWS, it can handle more than 10 trillion requests per day and peaks of more than 20 million requests per second.

Big businesses like Airbnb, Samsung, and Toyota use DynamoDB because it:

Is fully scalable
Works with all modern applications
Has low-latency data access
Is schemaless, so it's extremely versatile

The Internet of Things (IoT) uses NoSQL solutions like DynamoDB to pave the way for future data storage and distribution.

Analyzing Data Lakes on AWS

Analyzing streaming data is managed through Amazon Kinesis.

Kinesis allows you to collect and process any real-time data. It's cost-effective and scalable, so you can ingest datasets of any size, including:

Video and Audio
Application logs
IoT machine learning data
Website clickstreams

Instead of waiting for harvested data to get imported, you can react instantly as new information arrives. Amazon Kinesis Data Analytics lets you process those streams with SQL or with applications built on the open-source Apache Flink framework.
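A typical streaming calculation is a windowed count: group events into fixed time slots and emit a result as each slot closes. The Python sketch below shows the idea on a small, made-up clickstream. It is not Kinesis or Flink code, and it assumes the events arrive in time order.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) each time a fixed-size window closes.

    Assumes events are (timestamp_seconds, page) pairs in time order.
    """
    current_start = None
    counts = Counter()
    for timestamp, page in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            yield current_start, counts  # the previous window is complete
            current_start, counts = window_start, Counter()
        counts[page] += 1
    if current_start is not None:
        yield current_start, counts  # flush the final, partial window


# Hypothetical website clickstream: (seconds since start, page visited).
clicks = [(0, "/home"), (12, "/pricing"), (59, "/home"), (61, "/home"), (130, "/blog")]
windows = list(tumbling_window_counts(clicks, window_seconds=60))
```

Results for the first minute are available as soon as the first event of the second minute arrives, without waiting for the whole stream to finish.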

However, if you'd rather query data at rest using standard SQL, try Amazon Athena.

Athena is an interactive SQL service that works with Amazon S3.

It's fully managed, which means you don't need to manage any infrastructure. You also only pay for the queries you run.

It's simple too.

You point Athena at your S3 storage, then define your schema. Enter your SQL queries and let them run. Athena uses standard SQL, so queries written for a relational database often need only minor changes.
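Those two steps, defining a schema over files in S3 and then querying it, might look something like the sketch below. The Python helper only builds the SQL text and doesn't talk to AWS. The table name, columns, and bucket are hypothetical, the data is assumed to be JSON, and the inputs are assumed to be trusted (nothing is escaped).

```python
def create_table_ddl(table, columns, s3_location):
    """Build an Athena-style CREATE EXTERNAL TABLE statement for JSON files."""
    column_lines = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_location}'"
    )


# Step 1: describe the files that already sit in S3 (placeholder names).
ddl = create_table_ddl(
    "clickstream",
    {"page": "string", "visited_at": "string"},
    "s3://example-data-lake/clickstream/",
)

# Step 2: query the table with ordinary SQL.
query = (
    "SELECT page, COUNT(*) AS views "
    "FROM clickstream "
    "GROUP BY page "
    "ORDER BY views DESC "
    "LIMIT 10"
)
```

The statements could then be pasted into the Athena console or submitted through an AWS SDK. The table definition only describes the files. No data is copied or moved.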

Results typically arrive in seconds, not hours. Amazon's Cloud computing network crunches the numbers, which frees up your internal server, if you still have one. Because Athena queries the data where it sits in S3, there's often no need for complex ETL (extract, transform, load) jobs to prepare it first.

AWS Solutions Architects

In this article, we've asked what a data lake is and what advantages it can bring to your business.

We've looked at some of their major uses, and how machine learning is breaking new ground by analyzing big data.

If you want to benefit from Cloud computing, specifically AWS, then it's time to call the experts.

Pilotcore offers a full Cloud architecture design service to our clients.

From evaluation to migration and deployment, we can help you make the move to AWS. We also ensure you never pay more than you need to and will guide you on the Cloud roadmap to success.

Our team is dedicated to AWS architecture and has over 20 years of experience. That means you can rely on Pilotcore with your data regardless of its size.
