Let me see if I understand this right.
Vortex is a file format. In their canonicalized, uncompressed form, Vortex files are simply Apache Arrow IPC files with some bits and bobs moved around (enabling transformation to/from Arrow), plus some extra metadata about types, summary statistics, data layout, etc.
The Vortex spec supports fancy strategies for compressing columns, the ability to store summary statistics alongside data, and the ability to specify special compute operations for particular data columns. Vortex also specifies the schema of the data as metadata, separately from the physical layout of the data on disk. All Arrow arrays can be converted zero-copy into Vortex arrays, but not vice-versa.
Vortex also supports extensions in the form of new encodings and compression strategies. The idea here is that, as new ways of encoding data appear, they can be supported by Vortex without creating a whole new file format.
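To make the "logical type decoupled from physical encoding" point concrete, here's a toy sketch in plain Python. This is not the Vortex API, just an illustration of the idea: the same logical int64 column can sit behind two different physical encodings (plain vs. run-length), and readers dispatch on an encoding tag while the logical type stays fixed. New encodings can be added without changing the logical schema, which is the extensibility story described above.

```python
# Toy illustration (not the Vortex API): one logical column, two
# interchangeable physical encodings, logical type carried as metadata.

def encode_plain(values):
    return {"logical_type": "int64", "encoding": "plain", "data": list(values)}

def encode_rle(values):
    # Run-length encode as [value, run_length] pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return {"logical_type": "int64", "encoding": "rle", "data": runs}

def decode(column):
    # Readers dispatch on the encoding tag; the logical type never changes.
    if column["encoding"] == "plain":
        return list(column["data"])
    if column["encoding"] == "rle":
        return [v for v, n in column["data"] for _ in range(n)]
    raise ValueError(f"unknown encoding: {column['encoding']}")

values = [7, 7, 7, 7, 1, 2, 2]
assert decode(encode_plain(values)) == decode(encode_rle(values)) == values
```

Registering a third encoding here would just mean another branch in `decode` — no change to files already written with the old encodings, which is (as I understand it) the shape of the extensibility claim.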
Vortex-serde is a serde library for Vortex files. In addition to classic serialization/deserialization, it supports giving applications access to all those fancy compute and summary statistics features I mentioned above.
You say "Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire," but that's kind of like saying "MKV is a toolkit for working with compressed AVI and WAV files." It sounds like Vortex is a flexible file spec that lets you:
1. Work with Arrow arrays on disk with options for compression.
2. Create files that model data that can't be modeled in Arrow due to Arrow's hard coupling between encoding and logical typing.
3. Utilize a bunch of funky and innovative new features not available in existing data file formats and probably only really interesting to people who are nerds about this (laypeople will be interested in the performance improvements, though).
Imagine explaining to a newcomer that you write your app using Vert.x, it consumes AI models from GCP Vertex and uses Vortex for its high-performance columnar file structure.
Thank God this file format is written in Rust, otherwise I'd be extremely skeptical.
I'm not an expert in the space at all, but it does seem like people are exploring new file and table formats, which is really cool!
How does this compare to Lance (https://lancedb.github.io/lance/)?
What do you think the key applied use case for Vortex is?
I do applaud this kind of work. Better, faster tooling for data as files and moving across runtimes is sorely needed.
Two things I would hope to see before I'd start using Vortex are geospatial data support (there's already GeoParquet [1]) and WASM support for in-browser Arrow processing. Things like Lonboard [2] and Observable Framework [3] rely on Parquet, Arrow, and DuckDB files for their powerful data analytics and visualisation.
“Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire.”
So it’s a toolkit written in Rust. It is not a file format.
Does this fragment columns into rowgroups like Parquet, or is it more of a pure columnstore? IME a data warehouse works much better if each column isn't split into thousands of fragments.
Very cool! Any plans to offer more direct integrations with DataFusion, e.g. a `VortexReaderFactory`, hooks for pushdowns, etc.?
This is awesome, you folks are doing great work. I also really enjoyed your blog posts on FSST[1] and FastLanes[2].
[1] https://blog.spiraldb.com/compressing-strings-with-fsst/ [2] https://blog.spiraldb.com/life-in-the-fastlanes
There are a bunch of these, including fst in the R ecosystem, JDF.jl in the Julia ecosystem, etc.
Can one edit it in place?
That’s the main thing currently irritating me about Parquet.
Question: if Vortex can "canonicalize" Arrow vectors, why doesn't Arrow incorporate this feature?
how does this compare to lance?
> One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to the file format specification.
That is quite interesting. One general challenge with Parquet and Arrow in the otel / observability ecosystem is that the shape of span data is not fully known up front. Spans carry arbitrary attributes, and those attributes can change. To the best of my knowledge no particularly great solution exists today for encoding this. I wonder to what degree this system could be "abused" for that.
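The "layout in the footer" idea quoted above can be sketched in miniature. This is a toy, not the actual Vortex format: column bytes are written first, then a JSON footer describing the layout, then a fixed-size trailer giving the footer's length (Parquet uses a similar trailer trick for its metadata). The point for the otel use case is that a reader needs no out-of-band schema — columns added late, like arbitrary per-span attributes, are discovered by reading the footer.

```python
# Miniature self-describing file: [column bytes...][JSON footer][8-byte len].
import io
import json
import struct

def write_file(columns):  # columns: {name: bytes}
    buf = io.BytesIO()
    layout = []
    for name, data in columns.items():
        layout.append({"name": name, "offset": buf.tell(), "length": len(data)})
        buf.write(data)
    footer = json.dumps({"columns": layout}).encode()
    buf.write(footer)
    buf.write(struct.pack("<Q", len(footer)))  # trailer: footer length
    return buf.getvalue()

def read_file(blob):
    # Read the trailer, then the footer, then slice out each column --
    # including columns the reader never knew existed at write time.
    (footer_len,) = struct.unpack("<Q", blob[-8:])
    footer = json.loads(blob[-8 - footer_len:-8])
    return {c["name"]: blob[c["offset"]:c["offset"] + c["length"]]
            for c in footer["columns"]}

blob = write_file({"trace_id": b"abc123", "custom.attr": b"anything"})
assert read_file(blob)["custom.attr"] == b"anything"
```

Whether the real Vortex footer tolerates per-file schema drift this freely I can't say, but this is the mechanism that would make the "abuse" plausible.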