Streaming Model Training Without the Need for Another Data Lake

October 22, 2021

Machine learning and tiered storage together enable you to build a single scalable, reliable, and simple infrastructure for all machine learning applications using Apache Kafka. Kafka is increasingly used to build ML infrastructure for data ingestion, real-time streaming, and, most importantly, model training.

In this blog, we look at how Kafka supports model training without needing another data lake.

Kafka as an ingestion layer into a data lake: the traditional way

A data lake is a system that stores data in its raw format. The most commonly used storage technologies are HDFS, Amazon S3, and Google Cloud Storage, typically paired with processing tools like Apache Spark.

Apache Kafka is an event streaming platform that collects, processes, and stores streams of data in a real-time and fault-tolerant manner. The Kafka broker stores the data in a distributed, highly available infrastructure. Consumers read the events in real time.

A data lake is a very common pattern for building ML infrastructure, but it comes with certain limitations and drawbacks. The main problem with a data lake is its batch nature. When the core system is batch, you cannot add real-time processing on top of it. This means you lose the benefits of Kafka and have to manage two different systems with different access patterns.

Another drawback of this old approach is that you have to run a data lake just for storing data. This burdens the overall architecture with additional costs and operational effort. But do you still need a data lake if you already have the data in Kafka? Will it offer any advantage?

Interestingly, more businesses are moving away from one central data lake and choosing the right data store for each need. For example, a real-time consumer is a better fit when business applications must process the data as it arrives.

So, rather than spending time on the old approach, Ksolves advises you to go ahead with the new approach of not having a data lake. Let us see how.

Kafka for streaming ML without data lake: the newest approach

Let's look at the new approach to model training and predictions, which doesn't need a data lake in the first place. Instead, we use streaming machine learning. Most machine learning models do not support online model training, so the training application (built with TensorFlow, for example) takes a batch of the consumed events at once to train an analytic model.
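To make the batching idea concrete, here is a minimal sketch in plain Python. It is not TensorFlow code and it does not talk to Kafka: `event_stream` is a hypothetical stand-in for a consumer, and the tiny linear model is an assumption chosen only to keep the example self-contained. The point is the shape of the loop, where events are grouped into small batches and each batch drives one training step.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, float]  # (feature x, label y); a hypothetical event shape


def micro_batches(events: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Group a (possibly unbounded) event stream into fixed-size batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def train_on_batch(w: float, b: float, batch: List[Event], lr: float = 0.5) -> Tuple[float, float]:
    """One gradient-descent step for y ~ w*x + b using mean squared error."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - lr * grad_w, b - lr * grad_b


def event_stream(n: int) -> Iterator[Event]:
    """Stand-in for a Kafka consumer: yields events that follow y = 2x + 1."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0


# Train incrementally: the full stream is never held in memory at once.
w, b = 0.0, 0.0
for batch in micro_batches(event_stream(2000), batch_size=10):
    w, b = train_on_batch(w, b, batch)
```

In a real setup, the generator would be replaced by a consumer loop and the training step by your framework's batch-training call. The batch size then becomes a trade-off between model freshness and training throughput.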

The major difference between the new and the traditional way is that the new way requires no additional data storage like HDFS or S3.

Kafka is used as the data lake and the single source for all events. This means that the core system stores all events in an event-based manner and doesn't use data storage like HDFS. And since the data is stored as events, you can add different consumers, such as real-time and near-real-time ones, each with its own system and access pattern.
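The idea of several consumers reading the same events at their own pace can be sketched with a toy in-memory log. This is only an illustration of the access pattern, not how Kafka is implemented: the `EventLog` class and the group names `fraud-alerts` and `model-training` are hypothetical.

```python
from typing import Dict, List


class EventLog:
    """Toy append-only log; each consumer group tracks its own read offset."""

    def __init__(self) -> None:
        self._events: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, event: str) -> int:
        """Append an event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group: str, max_events: int) -> List[str]:
        """Return the next events for this group and advance only its offset."""
        start = self._offsets.get(group, 0)
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch


log = EventLog()
for i in range(6):
    log.append(f"event-{i}")

# A real-time consumer reads one event at a time...
realtime = log.poll("fraud-alerts", max_events=1)
# ...while a batch-style consumer reads everything it has not seen yet.
training = log.poll("model-training", max_events=100)
# Reading in one group does not affect the position of the other.
realtime_next = log.poll("fraud-alerts", max_events=1)
```

Because reading does not remove events, adding a new consumer later does not disturb the existing ones.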

With ML, you can use streaming data directly for model training and predictions.

Kafka is not a data lake, or is it?

You may be wondering whether it is a good idea to use Kafka for long-term data storage. We say it is. Storing data in Kafka for the long term lets you easily implement use cases in which you want to process data in event-based order.

Modern architecture design patterns like event sourcing build on Kafka because it provides the infrastructure that event-driven architectures require.

Let's now look at how tiered storage in Kafka can help simplify ML infrastructure.

Data ingestion and data preprocessing

Long-term storage in Kafka allows data scientists to work with historical datasets: they can consume data either from the beginning or for a specific time span. This enables rapid prototyping and data preprocessing.
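Reading "from the beginning" or "for a specific time span" comes down to translating timestamps into offsets. The sketch below shows that lookup on an in-memory list with a binary search. It assumes records are stored in non-decreasing timestamp order, and the `Record` shape and function names are hypothetical. Kafka consumers offer a comparable timestamp-to-offset lookup, but this code does not use a Kafka client.

```python
from bisect import bisect_left
from typing import List, Tuple

Record = Tuple[int, str]  # (timestamp in ms, payload); a hypothetical record shape


def offset_for_time(log: List[Record], timestamp_ms: int) -> int:
    """Return the first offset whose timestamp is >= timestamp_ms."""
    timestamps = [ts for ts, _ in log]
    return bisect_left(timestamps, timestamp_ms)


def read_time_span(log: List[Record], start_ms: int, end_ms: int) -> List[Record]:
    """Read records with start_ms <= timestamp < end_ms."""
    start = offset_for_time(log, start_ms)
    end = offset_for_time(log, end_ms)
    return log[start:end]


history = [(1000, "a"), (2000, "b"), (3000, "c"), (4000, "d")]

from_beginning = history[offset_for_time(history, 0):]  # replay everything
one_span = read_time_span(history, 2000, 4000)          # only a time window
```

A data scientist could use the first form to rebuild a full training set and the second to test a model against, say, a single day of events.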

Model training and model management, with or without a data lake

The next step after data preprocessing is model training. You can either ingest the processed event streams into a data lake or train the model directly with streaming machine learning.

If you use tiered storage, you might consider storing the models in a dedicated Kafka topic, where different versions can coexist. Alternatively, you can choose a compacted topic to retain only the most recent version of each model.
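The effect of a compacted topic on model versions can be sketched in a few lines. This is a simplified model of compaction semantics (keep the latest value per key), not Kafka's actual cleaner, and the model names and payload strings are made up for illustration.

```python
from typing import Dict, List, Tuple


def compact(log: List[Tuple[str, str]]) -> Dict[str, str]:
    """Keep only the most recent value for each key, as compaction eventually does."""
    latest: Dict[str, str] = {}
    for key, value in log:
        latest[key] = value  # a later record with the same key replaces the earlier one
    return latest


# Keyed by model name: older versions of the same model are superseded.
model_log = [
    ("churn-model", "v1-weights"),
    ("fraud-model", "v1-weights"),
    ("churn-model", "v2-weights"),
]
current_models = compact(model_log)

# Keyed by name and version: every version survives compaction.
versioned_log = [
    ("churn-model:v1", "v1-weights"),
    ("churn-model:v2", "v2-weights"),
]
all_versions = compact(versioned_log)
```

The choice of record key therefore decides whether versions coexist or whether only the newest one is kept.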

Model deployment for real-time predictions

There are several ways to deploy your models. Models are either deployed to a model server or embedded directly into event streaming applications. In most applications, analytic models are embedded directly into the event streaming application, which makes it more robust since predictions need no call to a separate model server.

The model predictions are stored in another Kafka topic, with tiered storage turned on if the topic needs to be retained for longer. Any application can consume them from there.

Reusing the data ingestion and preprocessing pipeline

Keep in mind that data ingestion and preprocessing are required for both model training and model inference.

This means we can reuse the data ingestion and preprocessing pipeline that we built for model training. The same pipeline can also be used for real-time predictions.
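One simple way to picture this reuse is a single preprocessing function that both the training path and the prediction path call. The sketch below is an assumption for illustration only: the event fields (`amount`, `country`), the feature choices, and the linear scoring are all hypothetical.

```python
from typing import Dict, List


def preprocess(event: Dict[str, str]) -> List[float]:
    """Turn a raw event into model features; shared by training and inference."""
    amount = float(event["amount"])
    is_foreign = 1.0 if event["country"] != "US" else 0.0
    return [amount / 1000.0, is_foreign]


def build_training_set(events: List[Dict[str, str]]) -> List[List[float]]:
    """Training path: preprocess a batch of historical events."""
    return [preprocess(event) for event in events]


def predict(weights: List[float], event: Dict[str, str]) -> float:
    """Inference path: preprocess one live event, then score it."""
    features = preprocess(event)
    return sum(w * f for w, f in zip(weights, features))


raw_events = [
    {"amount": "250", "country": "DE"},
    {"amount": "500", "country": "US"},
]
training_rows = build_training_set(raw_events)
score = predict([2.0, 0.5], raw_events[0])
```

Sharing one function this way helps avoid a mismatch between the features a model was trained on and the features it sees in production.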

Real-time monitoring and analytics

Model training and model deployment are just two parts of machine learning. Monitoring, testing, and analysis of the whole machine learning infrastructure are critical, but they are much harder to do than for a traditional system. A streaming machine learning architecture helps here, because the following can all be monitored in real time:

  • Data used for model training 
  • Preprocessed data and model features
  • Data used for model predictions
  • Errors
  • Infrastructure monitoring
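As one small example of monitoring errors on a stream, the sketch below keeps a rolling mean absolute error over the most recent predictions. The `RollingError` class and the window size are hypothetical, and a production setup would more likely publish such metrics to a monitoring system than keep them in memory.

```python
from collections import deque
from typing import Deque


class RollingError:
    """Mean absolute prediction error over the last `window` observations."""

    def __init__(self, window: int) -> None:
        self._errors: Deque[float] = deque(maxlen=window)

    def record(self, predicted: float, actual: float) -> None:
        """Store the absolute error; the oldest entry drops out when the window is full."""
        self._errors.append(abs(predicted - actual))

    def mean(self) -> float:
        """Return the current rolling error, or 0.0 if nothing was recorded yet."""
        if not self._errors:
            return 0.0
        return sum(self._errors) / len(self._errors)


monitor = RollingError(window=3)
for predicted, actual in [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (5.0, 1.0)]:
    monitor.record(predicted, actual)

rolling_error = monitor.mean()  # averages only the three most recent errors
```

A rising rolling error is one possible signal that a model has drifted and needs retraining on newer events.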

Get started with model training with Ksolves

In this blog post, we saw how model training can be done with Kafka and without any data lake. This streaming ML infrastructure is reliable, scalable, and future-ready, and it is built on leading technologies. If you are ready to take a step forward, Ksolves is the right platform to do so. Our Kafka services are known for their seamless deployment and round-the-clock support. Our cross-functional expertise makes us who we are. Let's connect today and discuss more about model training. Our low-latency Kafka models offer great performance with minimal delay.

If you want to know more about Apache Kafka, write your comments below or give us a call.

AUTHOR

Anil Kushwaha

Anil Kushwaha, Technology Head at Ksolves, is an expert in Big Data and AI/ML. With over 11 years at Ksolves, he has been pivotal in driving innovative, high-volume data solutions with technologies like Nifi, Cassandra, Spark, Hadoop, etc. Passionate about advancing tech, he ensures smooth data warehousing for client success through tailored, cutting-edge strategies.