New features available with Kedro

We’ve added datasets and documentation enhancements to the recent 0.18.4 release of Kedro

Jo Stichbury, Technical Writer, QuantumBlack Labs

Kedro has passed some major milestones since its inception, from being open-sourced in 2019 to being donated to the Linux Foundation.

Kedro is under constant development, and the latest release, made in December 2022, brings a raft of changes, which the rest of this post describes.

The image shows the silhouette of a person in front of a yellow, green and blue “jellyfish” representation of data.
“Data Represented in an Interactive 3-D Form” by Idaho National Laboratory is licensed under CC BY 2.0

Dataset enhancements

The new release of Kedro (0.18.4) focuses on improving datasets to enhance input and output in a data and machine-learning pipeline.

Kedro datasets are used in combination with the Kedro Data Catalog, which is the registry of all data sources. The catalog maps the names of node inputs and outputs to datasets, which are specialised classes that handle a range of data storage types. For example:

# Load a Spark DataFrame on S3
flight_patterns:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/flight_patterns*
  credentials: dev_s3
  file_format: csv

# Save an image created with Matplotlib on Google Cloud Storage
results_plot:
  type: matplotlib.MatplotlibWriter
  filepath: gcs://your_bucket/data/08_results/plots/output_1.jpeg
  fs_args:
    project: my-project
  credentials: my_gcp_credentials

Kedro provides numerous built-in datasets for various file types and file systems, including Pandas, Spark, Dask, NetworkX, Pickle, and more, to save you from having to write the logic for reading or writing data.
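
To make the idea of a catalog concrete, here is a toy sketch in plain Python. It is not Kedro's implementation: ToyCSVDataSet, ToyCatalog and demo are hypothetical names invented for illustration, and the sketch only shows the general pattern of looking up a dataset object by name and delegating loading and saving to it.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Protocol


class DataSet(Protocol):
    """Anything with load() and save() can be registered in the toy catalog."""

    def load(self) -> Any: ...

    def save(self, data: Any) -> None: ...


class ToyCSVDataSet:
    """Hypothetical stand-in for a built-in dataset: rows are plain dicts."""

    def __init__(self, filepath: str) -> None:
        self._path = Path(filepath)

    def load(self) -> List[Dict[str, str]]:
        with self._path.open(newline="") as handle:
            return list(csv.DictReader(handle))

    def save(self, data: List[Dict[str, str]]) -> None:
        # Assumes at least one row, whose keys become the CSV header.
        with self._path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(data[0]))
            writer.writeheader()
            writer.writerows(data)


class ToyCatalog:
    """Maps dataset names (as used by pipeline nodes) to dataset objects."""

    def __init__(self, datasets: Dict[str, DataSet]) -> None:
        self._datasets = datasets

    def load(self, name: str) -> Any:
        return self._datasets[name].load()

    def save(self, name: str, data: Any) -> None:
        self._datasets[name].save(data)


def demo() -> List[Dict[str, str]]:
    """Round-trip one row through the catalog using a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        catalog = ToyCatalog({"flight_patterns": ToyCSVDataSet(f"{tmp}/flights.csv")})
        catalog.save("flight_patterns", [{"origin": "LHR", "dest": "JFK"}])
        return catalog.load("flight_patterns")
```

The code that calls catalog.load("flight_patterns") never needs to know where the data lives or how it is stored, which is the benefit that built-in datasets are meant to provide.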

As we mentioned in “Keeping up with Kedro”, the upcoming Kedro 0.19.0 release (expected in early 2023) will move Kedro’s datasets from the extras directory to a separate package called Kedro-Datasets.

In preparation, within this recent release, we’ve added framework code that prioritises datasets from the kedro_datasets namespace over kedro.extras.datasets (because kedro_datasets is the namespace for the new package).
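
One way to picture this prioritisation is a resolver that tries each namespace in turn and returns the first match. The following is a minimal sketch under that assumption, not Kedro's actual code; resolve_class and SEARCH_PREFIXES are hypothetical names.

```python
import importlib
from typing import Optional, Sequence

# Search order taken from the post: the new namespace first, then the legacy one.
SEARCH_PREFIXES = ("kedro_datasets.", "kedro.extras.datasets.")


def resolve_class(
    type_name: str, prefixes: Sequence[str] = SEARCH_PREFIXES
) -> Optional[type]:
    """Return the first class found for type_name, trying prefixes in order.

    type_name looks like "spark.SparkDataSet": a module path, then a class name.
    """
    module_part, _, class_name = type_name.rpartition(".")
    for prefix in prefixes:
        module_path = (prefix + module_part).rstrip(".")
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # Package or module not installed: fall through to the next prefix.
            continue
        candidate = getattr(module, class_name, None)
        if candidate is not None:
            return candidate
    return None
```

Because the sketch is generic, it can be tried without Kedro installed. For example, resolve_class("decoder.JSONDecoder", ["not_a_real_package.", "json."]) skips the missing package and falls back to the standard library's json.decoder module.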

We’ve also added some datasets and extended an existing one:

  • svmlight.SVMLightDataSet to work with svmlight/libsvm files using the scikit-learn library
  • video.VideoDataSet to read and write video files from a filesystem
  • video.video_dataset.SequenceVideo to create a video object from an iterable sequence to use with VideoDataSet
  • video.video_dataset.GeneratorVideo to create a video object from a generator to use with VideoDataSet
  • pandas.SQLQueryDataSet now takes the optional argument execution_options to reduce memory usage when dealing with large datasets.
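
This post doesn't say which options help with memory. One plausible reading, and it is only an assumption, is that they allow query results to be streamed in batches rather than held in memory all at once. The sketch below illustrates that general idea with Python's built-in sqlite3 module instead of Kedro or SQLAlchemy; stream_rows and total_amount are hypothetical names.

```python
import sqlite3
from typing import Iterator, List, Tuple


def stream_rows(
    conn: sqlite3.Connection, query: str, batch_size: int = 1000
) -> Iterator[List[Tuple]]:
    """Yield query results in batches of at most batch_size rows."""
    cursor = conn.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def total_amount(batch_size: int = 2) -> int:
    """Sum a column batch by batch, never holding the full result set."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (amount INTEGER)")
    conn.executemany("INSERT INTO payments VALUES (?)", [(n,) for n in range(1, 6)])
    total = 0
    for batch in stream_rows(conn, "SELECT amount FROM payments", batch_size):
        total += sum(amount for (amount,) in batch)
    conn.close()
    return total
```

Only one batch of rows is held at a time, so peak memory depends on batch_size rather than on the size of the table.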

Finally, we’ve updated the MatplotlibWriter dataset docs with working examples.

Documentation improvements

To accelerate the process of getting Kedro up and running, we’ve made some changes to our documentation to improve it for new users.

We have revised the early sections of the documentation to simplify them and clarify the learning path. The spaceflights tutorial is now more straightforward, and we’ve moved advanced materials into more appropriate sections. We’ve improved the experience by streamlining the navigation between pages. The table of contents is now sticky, to make it easier to find your way around.

Contributions from the Kedro community

The release also includes some configuration improvements and numerous bug fixes and minor enhancements in response to reports from our users on Kedro’s Slack organisation. Take a look at the full release notes on GitHub for details. We’re proud of the fact that 14 of the PRs included in this release are contributions by members of Kedro’s open-source community. We’d particularly like to thank the following GitHub users:

jstammers, FlorianGD, yash6318, carlaprv, dinotuku, williamcaicedo, avan-sh, Kastakin, amaralbf, BSGalvan, levimjoseph, daniel-falk, clotildeguinard and picklejuicedev (for comments and input to documentation changes).

Our standardised contribution workflow means that anyone can join Kedro’s continued development and eventually progress into becoming an official maintainer on the project, writing code to improve the framework. In return for a weekly time commitment, maintainers may join Kedro’s Technical Steering Committee and help shape product strategy and roadmap decisions through regular voting.

Our community is thriving, as can be seen from the proliferation of third-party plugins that the Kedro community has recently created.

For more insights into the Kedro community, check out this recording of the October 2022 Kedro Showcase, which includes more information about the changes to datasets, and new features in the 0.18.x releases, as well as community updates.

A recording of the October 2022 Kedro showcase online event

To ask us questions, meet the community and stay up to date with Kedro news, why not join our Slack organisation?

QuantumBlack, AI by McKinsey

An advanced analytics firm operating at the intersection of strategy, technology and design.