The first stable version of Google’s open-source tool Magika for AI-powered file type detection is now available. For version 1.0, the application has been rewritten in Rust and supports more than 200 different file types, twice as many as last year’s alpha version.
According to Google, the Rust rewrite brings significant performance gains: on a MacBook Pro with an M4 chip, Magika processes nearly 1,000 files per second. The tool uses ONNX Runtime for fast AI inference and Tokio for asynchronous parallel processing. In addition to the new native command-line client, modules for Python and TypeScript are available.
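For readers who want to try the Python module, here is a minimal sketch. It assumes the `magika` package from PyPI and its `Magika().identify_bytes()` call; the exact field names on the result object have changed between releases, so the sketch simply prints the output object as a whole.

```python
# Minimal sketch of using Magika's Python module (assumes the `magika`
# package from PyPI, installable via `pip install magika`).
from magika import Magika

m = Magika()  # loads the bundled deep-learning model

# Identify a file's type from raw bytes rather than a path on disk.
result = m.identify_bytes(b"# Heading\nSome *markdown* text.\n")

# The result wraps the predicted label plus metadata; field names
# (e.g. `ct_label` vs. `label`) differ between releases, so the whole
# output object is printed here instead of a single attribute.
print(result.output)
```

The native command-line client covers the same ground without any code: `magika <file>` prints the detected type for a given file, and `magika --help` lists the available options.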
From Jupyter Notebooks to WebAssembly
The extended type detection also covers specialized formats from various areas: data science formats such as Jupyter Notebooks, NumPy arrays, or PyTorch models are included, as are modern programming languages (Swift, Kotlin, TypeScript, Dart, Solidity, Zig) and DevOps configuration files (Dockerfiles, TOML, HashiCorp HCL). Magika can also distinguish more accurately between similar formats—for example, between JSON and JSONL or between C and C++ code.
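As an illustration of the finer-grained distinctions mentioned above, the following sketch feeds a JSON and a JSONL payload to the same assumed Python API; the exact label strings Magika reports are not guaranteed here.

```python
# Sketch: telling apart similar formats such as JSON and JSONL.
# Assumes the `magika` Python package; the expected labels
# (e.g. "json" vs. "jsonl") are illustrative assumptions.
from magika import Magika

m = Magika()

json_bytes = b'{"name": "magika", "version": "1.0"}'
jsonl_bytes = (
    b'{"event": "start", "ts": 1}\n'
    b'{"event": "stop", "ts": 2}\n'
)

for name, payload in (("single JSON object", json_bytes),
                      ("line-delimited JSONL", jsonl_bytes)):
    result = m.identify_bytes(payload)
    print(f"{name}: {result.output}")
```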
By its own account, Google had to overcome two challenges in training the extended model: the training dataset grew to over 3 terabytes, which required the in-house SedPack library for efficient streaming. For rare or specialized file types with too few real-world examples, the company relied on generative AI: Google’s Gemini model generated synthetic training data by translating code and structured files between different formats.
Magika can be installed on Linux, macOS, and Windows. Developers can also integrate the tool as a library into Python, TypeScript, or Rust projects. According to Google, the project has seen over one million downloads per month since the alpha release.
(fo)
This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.