Everyone who has worked with Apache Spark probably knows the importance of Structured Streaming. Spark is an analytics engine for large-scale data processing that can run certain workloads up to 100x faster than traditional MapReduce. It is known for its ease of use: you can write applications in Java, Scala, Python, R, and SQL. Another big reason Spark is so widely used is that it runs almost anywhere, on Hadoop, on Kubernetes, or in the cloud. Today, however, we will talk about the latest Apache Spark 3.1 release, which brings impressive improvements to Spark streaming.
In this blog, we discuss the key features and improvements that come with the Apache Spark 3.1 release. But first, let us understand what Structured Streaming is.
Apache Spark Structured streaming: Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on Spark SQL. It provides end-to-end stream processing that integrates with storage systems and batch jobs in a fault-tolerant way.
The streaming model is based on the DataFrame and Dataset APIs, so you can apply SQL queries or Scala operations to streaming data in much the same way as to static data.
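To make this concrete, here is a minimal word-count sketch in Scala, based on the standard Structured Streaming quick example; the socket host and port are illustrative assumptions (for example, fed by `nc -lk 9999`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredStreamingOverview")
  .getOrCreate()

import spark.implicits._

// The incoming socket data is treated as an unbounded table of text lines
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Ordinary Dataset operations work directly on the streaming DataFrame
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously print the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

The later sketches in this post assume a `SparkSession` named `spark` and its implicits are already in scope, as in `spark-shell`.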
Let us now understand the new and improved features in the Apache Spark 3.1 release and how they address shortcomings of previous Spark versions that made streaming more complex.
New streaming table APIs
In Structured Streaming, a continuous data stream is treated as an unbounded table, so tables are a natural and convenient way to express streaming queries. The Apache Spark 3.1 release adds table support to DataStreamReader and DataStreamWriter.
- Users can use the new table APIs to read and write streaming DataFrames (see the sketch after the lists below).
- End users can transform the source dataset and write the result to a new table.
You can also use the Delta Lake format with the streaming table APIs, which lets you:
- Compact small files
- Maintain exactly-once processing with more than one stream
- Efficiently discover new files when a table is used as the source for a stream
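Here is a minimal Scala sketch of the new table APIs; the table names, filter column, and checkpoint location are illustrative assumptions, and the tables are assumed to be in a format that supports streaming reads and writes (for example, Delta Lake):

```scala
// Read a stream from an existing table (DataStreamReader.table, new in Spark 3.1)
val input = spark.readStream.table("raw_events")

// Apply ordinary DataFrame transformations to the streaming data
val cleaned = input.filter($"event_type".isNotNull)

// Write the stream out to another table (DataStreamWriter.toTable, new in Spark 3.1)
val query = cleaned.writeStream
  .option("checkpointLocation", "/tmp/checkpoints/clean_events")
  .toTable("clean_events")
```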
Support for Stream-Stream join
Before the Apache Spark 3.1 release, only three stream-stream joins were supported, namely inner, left outer, and right outer. The latest release adds full outer and left semi stream-stream joins, which makes Structured Streaming useful in several more scenarios.
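Below is a hedged Scala sketch of a left semi stream-stream join using the built-in rate source; the column names, watermark delays, and time constraint are illustrative assumptions:

```scala
import org.apache.spark.sql.functions._

// Impressions and clicks as two streaming DataFrames built from the rate source
val impressions = spark.readStream
  .format("rate").option("rowsPerSecond", 10).load()
  .select(($"value" % 100).as("impressionAdId"), $"timestamp".as("impressionTime"))
  .withWatermark("impressionTime", "10 minutes")

val clicks = spark.readStream
  .format("rate").option("rowsPerSecond", 5).load()
  .select(($"value" % 100).as("clickAdId"), $"timestamp".as("clickTime"))
  .withWatermark("clickTime", "15 minutes")

// Keep only the impressions that receive a matching click within one minute
val matchedImpressions = impressions.join(
  clicks,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 minute
  """),
  "left_semi"
)

matchedImpressions.writeStream.format("console").start()
```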
Improvements in Kafka data source
In the latest Apache Spark 3.1 release, the Kafka dependency has been upgraded to 2.6.0. This lets users migrate to the new API for fetching offsets; with the older version, the offset fetch could end up waiting indefinitely.
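For reference, here is a minimal Scala sketch of reading from Kafka with Structured Streaming; the broker address and topic name are illustrative assumptions, and the spark-sql-kafka connector is assumed to be on the classpath:

```scala
// Read records from a Kafka topic as a streaming DataFrame
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "latest")
  .load()

// Kafka keys and values arrive as binary; cast them for downstream processing
val events = kafkaStream.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value",
  "topic", "partition", "offset", "timestamp"
)
```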
Validation of schema
Schemas are critical for Structured Streaming queries. In the new Apache Spark 3.1 release, schema validation logic has been added both for user input and for the internal state store.
Validate schema across query restarts
- Key and value schemas are stored in schema files.
- At query restart, they are verified against the existing key and value schemas.
- If the number of fields and their data types are the same, the state schema is considered compatible.
- This prevents queries from running with an incompatible schema and, as a result, reduces nondeterministic behaviour (a small configuration sketch follows this list).
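As a hedged sketch, the check can be toggled through a SQL configuration; the exact key below is our assumption of the Spark 3.1 setting and is reported to default to true, so you would normally only change it to opt out during a migration:

```scala
// Assumed Spark 3.1 configuration key for state schema validation; verify against
// the official documentation for your exact version before relying on it.
spark.conf.set("spark.sql.streaming.stateStore.stateSchemaCheck", "true")
```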
Validate schema for streaming state store
- In the new Apache Spark version, checkpoints that were written to the StateStore by the older version are reused.
- Without schema validation, a bug fix that touched the state format could previously cause random exceptions.
- Now, Spark validates the checkpoint and throws InvalidUnsafeRowException if corruption is detected when the checkpoint is reused during the migration process.
Enhanced Structured Streaming UI
The Structured Streaming UI was introduced in Spark 3.0. The Apache Spark 3.1 release adds history server support for the Structured Streaming UI, along with more information about the status of the streaming runtime.
State information
The metrics added are as follows:
- Aggregated number of total state rows
- Aggregated number of updated state rows
- Aggregated state memory used in bytes
- Aggregated number of state rows dropped by watermark
These metrics are useful for tasks such as capacity planning of stateful queries.
Watermark gap information
The watermark needs to be tracked for stateful queries. Knowing the gap between the wall clock and the watermark derived from the input data helps you set expectations about when output will be emitted.
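As a hedged illustration of where the watermark comes from, here is a small windowed aggregation in Scala; the source, column names, lateness threshold, and window size are assumptions:

```scala
import org.apache.spark.sql.functions._

// Build a simple event stream from the built-in rate source
val events = spark.readStream
  .format("rate").option("rowsPerSecond", 20).load()
  .select($"timestamp".as("eventTime"), ($"value" % 10).as("deviceId"))

// The watermark shown in the UI is derived from eventTime minus the allowed lateness
val counts = events
  .withWatermark("eventTime", "5 minutes")                 // tolerate 5 minutes of late data
  .groupBy(window($"eventTime", "1 minute"), $"deviceId")  // 1-minute tumbling windows
  .count()

counts.writeStream.outputMode("update").format("console").start()
```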
State custom metrics information
- State store providers can expose custom metrics, and which of them are shown in the UI is controlled through a configuration.
FileStreamSource Enhancement
There are several improvements to FileStreamSource:
- Earlier, FileStreamSource would fetch all the available files in every batch, process some of them, and ignore the rest.
- With the improvements in the Apache Spark 3.1 release, it caches the fetched but unread files and reuses them in later batches.
- Previously, all the entries in the metadata log were deserialized into the Spark driver's memory.
- Now Spark reads and processes the metadata log in a streamed fashion, keeping driver memory usage under control.
There is also a new option in Apache Spark 3.1 Release to configure the retention of metadata log files for structured streaming queries.
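The exact name of that retention option is best checked against the official release notes; for context, here is a minimal Scala sketch of a file-based streaming source, where the directory layout, schema, and option values are illustrative assumptions:

```scala
import org.apache.spark.sql.types._

// A schema must be provided up front for file-based streaming sources
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val fileStream = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", 100)   // cap the number of files processed per micro-batch
  .json("/data/incoming/events")

fileStream.writeStream
  .format("parquet")
  .option("path", "/data/output/events")
  .option("checkpointLocation", "/data/checkpoints/events")
  .start()
```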
Other updates in Apache Spark 3.1 Release
Beyond the streaming features discussed above, you get several additional improvements. The release focuses on usability and stability and resolves over 1,500 tickets. Other features include:
- ANSI compliance: runtime errors, CREATE TABLE syntax, etc.
- Execution: node decommissioning, partition pruning, etc.
- Python: dependency management
- Performance: shuffle removal, predicate pushdown, etc.
- Others: a search function in the Spark documentation, the ability to ignore hints, etc.
Conclusion
We hope we have covered all the important improvements in the Apache Spark 3.1 release. The next releases will also focus on improving the performance and usability of Apache Spark Structured Streaming. Though we have tried to cover as many features as possible, there are still a lot of features not mentioned here that are highly beneficial for Structured Streaming.
If you want to get started with the Apache Spark 3.1 release, Ksolves is the right place for you. We have a proven track record of delivering high-quality Apache Spark services and have proved our mettle as an Apache Spark development company in India and the USA. Try out our services and get started within minutes. For any further information on Apache Spark or similar big data services, feel free to write to us in the comment section or give us a call for your free demo now.