Project Name

Data Mapping Optimisation Through Apache Spark

Industry

Information Technology

Technology

Apache Spark, Kubernetes

Overview

Our client manages massive amounts of data from different sources, at a scale of around 10,000 records per minute. Each JSON file contains 30–40 data entities, which makes the data complicated to manage and poses multiple challenges. The client was looking for a robust solution to handle this data influx.

Challenges

Adapting to variations in the data with little to no code changes.
Parsing nested, hierarchical JSON and mapping it to the Teradata tables instantly.
Accommodating a high-volume flow of data year-on-year.
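
To make the parsing challenge concrete, here is a minimal sketch in plain Python (not the Spark code used in the project) of how a nested JSON record can be flattened into dotted key paths that are easier to map to table columns. The record and its field names are hypothetical and chosen only for illustration.

```python
import json


def flatten(node, prefix=""):
    """Yield (path, value) pairs for every leaf in a nested JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            # Extend the dotted path with the current key.
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            # List elements are addressed by position, e.g. items[0].
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, node


# Hypothetical record; the structure is an assumption for illustration.
record = json.loads(
    '{"order": {"id": 7, "customer": {"name": "Ada"},'
    ' "items": [{"sku": "A1"}, {"sku": "B2"}]}}'
)
flat = dict(flatten(record))
```

Here `flat` holds one entry per leaf value, keyed by paths such as `order.customer.name` and `order.items[1].sku`.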

Our Solution

The Ksolves team provided the client with a robust approach that includes:

First, we prepared mapping files to facilitate instant mapping between hierarchical JSON keys and database tables, including column names and types for data forecasting.
Then, we ran Apache Spark on multi-node clusters on Kubernetes and implemented the Kubernetes operator, which provides scalability for future data needs.
The implementation organizes the data by date and time to prevent unnecessary data reprocessing.
The mapping files read by the Apache Spark system are written in CSV text format, so they are easy to modify and allow swift alterations.
Separate mapping files for each data type effectively manage the data from diverse sources.
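
The metadata-driven idea behind the mapping files can be sketched as follows. This is a simplified, plain-Python illustration under stated assumptions, not the project's Spark implementation: the CSV column layout (`json_path`, `table`, `column`, `type`), the table names, and the helper functions are all hypothetical.

```python
import csv
import io
import json

# Hypothetical mapping file: one row per JSON key path.
# The column layout is an assumption, not the client's actual format.
MAPPING_CSV = """json_path,table,column,type
order.id,orders,order_id,int
order.customer.name,customers,customer_name,str
order.total,orders,total_amount,float
"""

# Type names used in the mapping file and the casts they stand for.
CASTS = {"int": int, "float": float, "str": str}


def load_mapping(text):
    """Parse the CSV mapping text into a list of rule dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def lookup(record, dotted_path):
    """Follow a dotted key path through nested dicts; None if absent."""
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def map_record(record, mapping):
    """Build {table: {column: typed_value}} from one JSON record."""
    rows = {}
    for rule in mapping:
        value = lookup(record, rule["json_path"])
        if value is None:
            continue  # Key missing in this record; skip the column.
        cast = CASTS[rule["type"]]
        rows.setdefault(rule["table"], {})[rule["column"]] = cast(value)
    return rows


# Hypothetical input record for illustration.
record = json.loads(
    '{"order": {"id": "7", "total": "19.90", "customer": {"name": "Ada"}}}'
)
rows = map_record(record, load_mapping(MAPPING_CSV))
```

In this sketch, supporting a new field or a renamed column means editing a row in the CSV text rather than the code, which is the property the mapping files are meant to provide.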

Conclusion

With its Apache Spark implementation, the Ksolves team addressed the client’s challenges in managing massive amounts of data. By leveraging Apache Spark and a metadata-driven approach, we delivered a comprehensive solution for handling data variations instantly.