I laugh every time I have to explain that "Apache Arrow format is more efficient than JSON. Yes, the format is called 'Apache Arrow.'"
I like Arrow for its type system. It's efficient, complete, and does not have "infinite precision decimals". Considering Postgres's decimal encoding, using i256 as the backing type is a much saner approach.
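For example (a minimal sketch with pyarrow, values just illustrative): decimal256 sits on that 256-bit backing integer, so precision and scale are explicit and bounded rather than "infinite".

```python
import pyarrow as pa

# decimal256 is backed by a 256-bit integer, so precision and scale are
# explicit and bounded (up to 76 digits) rather than "infinite"
ty = pa.decimal256(50, 10)
arr = pa.array([None], type=ty)
print(ty.precision, ty.scale, arr.type)
```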
We contributed the first JS impl and were helping with the nvidia gpu bits when it was starting. Some of our architectural decisions back then were awful as we were trying to figure out how to make Graphistry work, but Arrow + GPU dataframes remain gifts that keep giving.
It's nice to see useful, impactful interchange formats getting the attention and resources they need, and ecosystems converging around them. Optimizing serialization/deserialization might seem like a "trivial" task at first, but when you're moving petabytes of data it quickly becomes the bottleneck. With common interchange formats, the benefits of these optimizations are shared across stacks. Love to see it.
Intuitively appreciating that these "boring fundamentals" are the default bottlenecks is a sign of senior+ swe capability.
stupid question: why hasn't apache arrow taken over to the point where we are no longer dealing with json?
Because it's a binary format?
I think a big reason (aside from inertia) is that arrow is designed for tables. JSON sends a lot more than just that and can support whatever octagonal jiu-jitsu squid-shaped data you want to fit into it.
Also, a good proportion of web APIs are sending pretty small data sizes. En masse there might be an improvement if everything were more efficiently represented, but evaluating on a case by case basis, the data size often isn't the bottleneck.
I read that entire page and I could not tell you what Apache Arrow is, or what it does.
All you had to do was click the logo to go to the homepage
The post celebrates Apache Arrow's 10-year anniversary, so it's assuming you already know what it is and what it does, which I think is fair. If you don't, you can always refer to the docs.
We use Apache Arrow at my company and it's fantastic. The performance is so good. We have terabytes of time-series financial data and use arrow to store it and process it.
stumbled upon it recently while optimizing parquet writes. It worked flawlessly and 10-20x'd my throughput
We use Apache Arrow at my company too. It is part of a migration from an old in-house format. When it works it’s good. But there are just way too many bugs in Arrow. For example: a basic arrow computation on strings segfaults because the result does not fit in Arrow’s string type, only the large string type. Instead of casting it or asking the user to cast it, it just segfaults. Another example: a different basic operation causes an exception complaining about negative buffer sizes when using variable-length binary type.
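The workaround we landed on is casting up front (a hedged sketch with pyarrow; it may or may not map to whichever implementation you're on): the plain string type uses 32-bit offsets, while large_string uses 64-bit offsets.

```python
import pyarrow as pa

# hedged workaround sketch: the plain string type uses 32-bit offsets, so a
# result past ~2 GiB of total string data can't be represented; casting to
# large_string (64-bit offsets) up front avoids that ceiling
arr = pa.array(["some", "strings"])      # type: string
big = arr.cast(pa.large_string())        # type: large_string
print(arr.type, big.type)
```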
Hey, Arrow developer here. If you get a segfault with our codebase, then please report an issue on our GitHub issue tracker.
(if you have already done so and it wasn't resolved, feel free to ping me on it)
This will obviously depend on which implementation you use. Using the rust arrow-rs crate you at least get panics when you overflow max buffer sizes. But one of my enduring annoyances with arrow is that they use signed integer types for buffer offsets and the like. I understand why it has to be that way since it's intended to be cross-language and not all languages have unsigned integer types. But it does lead to lots of very weird bugs when you are working in a native language and casting back and forth from signed to unsigned types. I spent a very frustrating day tracking down this one in particular https://github.com/apache/datafusion/issues/15967
I had to look up what Arrow actually does, and I might have to run some performance comparisons vs sqlite.
It's very neat for some types of data to have columns contiguous in memory.
yeah not necessarily compute (though it has a kernel)!
it's actually many things: IPC protocol, wire protocol, database connectivity spec, etc. etc.
in reality it's about an in-memory tabular (columnar) representation that enables zero copy operations b/w languages and engines.
and, imho, it all really comes down to standard data types for columns!
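a rough sketch of what that looks like from python (pyarrow here, but the layout is the same from any language with an arrow implementation):

```python
import pyarrow as pa

# minimal sketch: an arrow table is just named columns with standard types,
# each column stored in contiguous buffers
table = pa.table({"ts": [1, 2, 3], "price": [10.0, 10.5, 9.9]})
print(table.schema)                       # ts: int64, price: double
print(table.column("price").num_chunks)   # 1 -> one contiguous array
```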
Take a look at parquet.
You can also store arrow on disk but it is mainly used as in-memory representation.
If I recall, Arrow is more or less a standardized representation in memory of columnar data. It tends to not be used directly I believe, but as the foundation for higher level libraries (like Polars, etc.). That said, I'm not an expert here so might not have full info.
You can absolutely use it directly, but it is painful. The USP of Arrow is that you can pass bits of memory between Polars, Datafusion, DuckDB, etc. without copying. It's Parquet but for memory.
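For example, a minimal sketch of the handoff (whether a given conversion is truly zero-copy depends on the column types and library versions involved, but either way there's no serialize/parse step):

```python
import pyarrow as pa
import polars as pl

# sketch of the handoff between libraries; no re-serialization in between
tbl = pa.table({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})
df = pl.from_arrow(tbl)    # Polars can wrap the same Arrow buffers
back = df.to_arrow()       # and hand them back again
```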
This is true, and as a result IME the problem space is much smaller than Parquet, but it can be really powerful. The reality is most of us don't work in environments where Arrow is needed.
>> some performance comparisons vs sqlite.
That's not really the purpose; it's really a language-independent format so that you don't need to change it for, say, a Python dataframe or R. It's columnar because for analytics (where you do lots of aggregations and filtering) this is way more performant; the data is intentionally stored so the target columns are contiguous. You probably already know, but the analytics equivalent of SQLite is DuckDB. Arrow can also eliminate the need to serialize/de-serialize data when sharing (ex: a high performance data pipeline) because different consumers / tools / operations can use the same memory representation as-is.
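As a small illustration (a hedged sketch; recent DuckDB Python versions can pick up the Arrow table from the local variable name via a replacement scan):

```python
import duckdb
import pyarrow as pa

trades = pa.table({"sym": ["A", "A", "B"], "px": [1.0, 2.0, 3.0]})

# DuckDB scans the Arrow table where it sits in memory; no import/copy step
print(duckdb.sql("SELECT sym, avg(px) FROM trades GROUP BY sym").fetchall())
```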
Thanks! This is all probably me using the familiar sqlite hammer where I really shouldn't.
> Arrow can also eliminate the need to serialize/de-serialize data when sharing (ex: a high performance data pipeline) because different consumers / tools / operations can use the same memory representation as-is.
Not sure if I misunderstood, what are the chances those different consumers / tools / operations are running in your memory space?
Not an expert, so I could be wrong, but my understanding is that you could just copy those bytes directly from the wire to your memory and treat those as the Arrow payload you're expecting it to be.
You still have to transfer the data, but you remove the need for a transformation before writing to the wire, and a transformation when reading from the wire.
Arrow supports zero-copy data sharing - check out the Arrow IPC format and Arrow Flight.
If you are in control of two processes on a single machine instance, you could share the memory between a writer and a read-only consumer.
The key phrase though would seem to be "memory representation" and not "same memory". You can spit the in-memory representation out to an Arrow file or an Arrow stream, take it in, and it's in the same memory layout in the other program. That's kind of the point of Arrow. It's a standard memory layout available across applications and even across languages, which can be really convenient.
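A minimal sketch of that handoff in pyarrow (the file name is made up; "Feather v2" and the Arrow IPC file format are the same thing):

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"x": [1, 2, 3]})

# write the in-memory layout out as an Arrow IPC file ("Feather v2")...
with ipc.new_file("shared.arrow", table.schema) as writer:
    writer.write_table(table)

# ...and another program (or another language entirely) can memory-map it:
# the buffers are used in place, there is no parse/decode step
with pa.memory_map("shared.arrow") as source:
    same_layout = ipc.open_file(source).read_all()
print(same_layout)
```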
If I could go back to 2015 me, who had just found the feather library and was using it to power my unhinged topic-modeling-for-PowerPoint-slides work, and explain what feather would become (arrow) and the impact it would have on the data ecosystem, I would have looked at 2026 me like he was a crazy person.
Yet today I feel it was 2016 dataders who is the crazy one lol
Indeed. feather was a library to exchange data between R and pandas dataframes. People tend to bash pandas but its creator (Wes McKinney) has changed the data ecosystem for the better with the learnings coming from pandas.
I know pandas has a lot of technical warts and shortcomings, but I'm grateful for how much it empowered me early in my data/software career, and the API still feels more ergonomic to me due to the years of usage - plus GeoPandas layering on top of it.
Really, prefer DuckDB SQL these days for anything that needs to perform well, and feel like SQL is easier to grok than python code most of the time.
> Really, prefer DuckDB SQL these days for anything that needs to perform well, and feel like SQL is easier to grok than python code most of the time.
I switched to this as well and it's mainly because explorations would need to be translated to SQL for production anyway. If I start with pandas I just need to do all the work twice.
chdb's new DataStore API looks really neat (drop in pandas replacement) and exactly how I envisioned a faster pandas could be without sacrificing its ergonomics
Do people bash pandas? If so, it reminds me of Bjarne's quip that the two types of programming languages are the ones people complain about and the ones nobody uses.
People also love to hate R but data.table is light years better than pandas in my view
polars people do - although I wouldn't call polars something that nobody uses.
I also use polars in new projects. I think Wes McKinney also uses it. If I remember correctly I saw him commenting on some polars memory related issues on GitHub. But a good chunk of polars' success can be attributed to Arrow which McKinney co-created. All the gripes people have with pandas, he had them too and built something powerful to overcome those.
I saw Wes speak in the early days of Pandas, in Berkeley. He solved problems that others just worked around for decades. His solutions are quirky but the work was very solid. His career advanced a lot IMHO for substantial reasons.. Wes personally marched through swamps and reached the other side.. others complain and do what they always have done.. I personally agree with the criticisms of the syntax, but Pandas is real and it was not easy to build it.
The creator of Pandas even bashes it: https://wesmckinney.com/blog/apache-arrow-pandas-internals/
He missed talking about the poor extensibility of pandas. It's missing some pretty obvious primitives to implement your own operators without whipping out slow for loops and appending to lists manually.
have these 'improvements' been backported to pandas now? i would expect it to close the gap over time.
Yes (mostly) is the answer. You can use arrow as a backend, and I think with v3 (recently released) it's the default.
The harder thing to overcome is that pandas has historically had a pretty "say yes to things" culture. That's probably a huge part of its success, but it means there are now about 5 ways to add a column to a dataframe.
Adding support for arrow is a really big achievement, but shrinking an oversized api is even more ambitious.
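To make the first point concrete, a hedged sketch (the file name is hypothetical, and whether pyarrow-backed dtypes are the default depends on your pandas version):

```python
import pandas as pd

# opt in to the Arrow-backed dtypes when reading
df = pd.read_parquet("trades.parquet", dtype_backend="pyarrow")
print(df.dtypes)   # e.g. int64[pyarrow], string[pyarrow]
```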
What's the difference between feather and parquet in terms of usage? I get the design philosophy, but how would you use them differently?
parquet is optimized for storage and compresses well (=> smaller files)
feather is optimized for fast reading
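in practice you just write the same table both ways and pick per use case (a minimal sketch, file names made up):

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = pa.table({"ts": list(range(1000)), "px": [float(i) for i in range(1000)]})

feather.write_feather(table, "ticks.feather")   # fast to write and (mmap-)read
pq.write_table(table, "ticks.parquet")          # heavier encoding, smaller on disk
```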
And now there's Lance! https://lance.org/
Given the cost of storage is getting cheaper, wouldn't most firms want to use feather for analytic performance? But everyone uses parquet.
You can, still, gain a lot of performance by doing less I/O.
There's definitely a "everyone uses it because everyone uses it" effect.
Feather might be a better fit for some use cases, but parquet has fantastic support and is still a pretty good choice for the things that feather does.
Unless they're really focussed on eking out every bit of read performance, people often opt for the well supported path instead.
Storage getting cheaper did not really reach the cloud providers and for self-hosting it has recently gotten even more expensive due to AI bs.
What people have done in the face of cheaper storage is store more data.
Storage is cheap but bandwidth is not.
Feather (Arrow IPC) is zero copy and an order of magnitude simpler. Parquet has a lot of compatibility issues between readers and writers.
Arrow is also directly usable as the application memory model. It’s pretty common to read Parquet into Arrow for transport.
When you say compatibility issues, you mean they are more problematic or less?
> It’s pretty common to read Parquet into Arrow for transport.
I'm confused by this. Are you referring to Arrow Flight RPC? Or are you saying distributed analytic engine use arrow to transport parquet between queries?
Not the OP, but Parquet compatibility issues are usually due to the varying support of features across implementations. You have to take that into account when writing Parquet data (unless you go with the defaults which can be conservative and suboptimal).
Recently we have started documenting this to better inform choices: https://parquet.apache.org/docs/file-format/implementationst...
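For instance, a writer will often pin the options it relies on explicitly instead of trusting defaults (a hedged sketch with pyarrow; the values are illustrative, not recommendations):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"v": [1, 2, 3]})

# feature support varies across readers, so pin the options you rely on
pq.write_table(
    table,
    "out.parquet",
    version="2.6",         # target Parquet format version
    compression="zstd",    # not every reader handles every codec
    use_dictionary=True,
)
```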
I read that. But afaik, feather format is stable now. Hence my confusion. I use parquet at work a lot, where we store a lot of time series financial data. We like it. Creating the Parquet data is a pain since it's not append-able.
Have you considered something like iceberg tables?
Yes, but parquet hates small files.
You can't compact? i.e. iceberg maintenance
We might be doing something wrong, but we saw significant performance degradation for both ingestion and query when doing compaction when it comes to finance data during trading hours.
Generally Parquet files are combined in an LSM style, compacting smaller files into larger ones. Parquet isn't really meant for the "journal" of level-0 append-one-record style storage, it's meant for the levels that follow.
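A hedged sketch of what one compaction pass can look like with pyarrow (the paths are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# read several small level-0 files and rewrite them as one larger file
small_files = ["part-000.parquet", "part-001.parquet", "part-002.parquet"]
merged = pa.concat_tables([pq.read_table(p) for p in small_files])
pq.write_table(merged, "compacted.parquet")
```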
So feather for journaling and parquet for long term processing?
I still don't understand what happened to using Apache Avro [1] for row-oriented fast write use cases.
I think by now a lot of people know you can write to Avro and compact to Parquet, and that is a key area of development. I'm not sure of a great solution yet.
Apache Iceberg tables can sit on top of Avro files as one of the storage engines/formats, in addition to Parquet or even the old ORC format.
Apache Hudi[2] was looking into HTAP capabilities - writing in row store, and compacting or merge on read into column store in the background so you can get the best of both worlds. I don't know where they've ended up.
You basically can't do row by row appends to any columnar format stored in a single file. You could kludge around it by allocating arenas inside the file, but that still means huge write amplification: instead of writing a row in a single block you'd have to write a block per column.
Agreed.
There is room still for an open source HTAP storage format to be designed and built. :-)
You can do row-by-row appends to a Feather file (Arrow IPC; the naming is confusing). It works fine. The main problem is that the per-append overhead is kind of silly: it costs over 300 bytes (IIRC) per append.
I wish there was an industry standard format, schema-compatible with Parquet, that was actually optimized for this use case.
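Roughly what that looks like with pyarrow's IPC stream writer (a hedged sketch; the file name and schema are made up, and each one-row batch pays its own framing overhead, hence the ~300 bytes):

```python
import pyarrow as pa
import pyarrow.ipc as ipc

schema = pa.schema([("ts", pa.int64()), ("px", pa.float64())])

# append one row at a time as tiny record batches on an IPC stream
with open("journal.arrows", "wb") as sink, ipc.new_stream(sink, schema) as writer:
    for ts, px in [(1, 10.0), (2, 10.5), (3, 10.25)]:
        batch = pa.record_batch(
            [pa.array([ts], type=pa.int64()), pa.array([px], type=pa.float64())],
            schema=schema,
        )
        writer.write_batch(batch)
```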
Creating a new record batch for a single row is also a huge kludge leading to a lot of write amplification. At that point, you're better off storing rows than pretending it's columnar.
I actually wrote a row storage format reusing Arrow data types (not Feather), just laying them out row-wise instead of columnar. The validity bits of the different columns are collected into a shared per-row bitmap, and fixed offsets within a record allow extracting any field in a zero-copy fashion. I store those in RocksDB, for now.
https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...
https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...
https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...
> Creating a new record batch for a single row is also a huge kludge leading to a lot of write amplification.
Sure, except insofar as I didn’t want to pretend to be columnar. There just doesn’t seem to be something out there that met my (experimental) needs better. I wanted to stream out rows, event sourcing style, and snarf them up in batches in a separate process into Parquet. Using Feather like it’s a row store can do this.
> kantodb
Neat project. I would seriously consider using that in a project of mine, especially now that LLMs can help out with the exceedingly tedious parts. (The current stack is regrettable, but a prompt like "keep exactly the same queries but change the API from X to Y" is well within current capabilities.)
Frankly, RocksDB, SQLite or Postgres would be easy choices for that. (Fast) durable writes are actually a nasty problem with lots of little detail to get just right, or you end up with corrupted data on restart. For example, blocks may be written out of order so on a crash you may end up storing <old_data>12_4, and if you trust all content seen in the file, or even a footer in 4, you're screwed.
Speaking as a Rustafarian, there's some libraries out there that "just" implement a WAL, which is all you need, but they're nowhere near as battle-tested as the above.
Also, if KantoDB is not compatible with Postgres in something that isn't utterly stupid, it's automatically considered a bug or a missing feature (but I have plenty of those!). I refuse to do bug-for-bug compatible, and there's some stuff that's just better left unimplemented in this millennium, but the intent is to make it be I Can't Believe It's Not Postgres, and to run integration tests against actual everyday software.
Also, definitely don't use KantoDB for anything real yet. It's very early days.
> Frankly, RocksDB, SQLite or Postgres would be easy choices for that. (Fast) durable writes are actually a nasty problem with lots of little detail to get just right, or you end up with corrupted data on restart. For example, blocks may be written out of order so on a crash you may end up storing <old_data>12_4, and if you trust all content seen in the file, or even a footer in 4, you're screwed.
I have a WAL that works nicely. It surely has some issues on a crash if blocks are written out of order, but this doesn’t matter for my use case.
But none of those other choices actually do what I wanted without quite a bit of pain. First, unless I wire up some kind of CDC system or add extra schema complexity, I can stream in but I can’t stream out. But a byte or record stream streams natively. Second, I kind of like the Parquet schema system, and I wanted something compatible. (This was all an experiment. The production version is just a plain database. Insert is INSERT and queries go straight to the database. Performance and disk space management are not amazing, but it works.)
P.S. The KantoDB website says "I've wanted to … have meaningful tests that don't have multi-gigabyte dependencies and runtime assumptions". I have a very nice system using a ~100 line Python script that fires up a MySQL database using the distro mysqld, backed by a Unix socket, requiring zero setup or other complication. It's mildly offensive that it takes mysqld multiple seconds to do this, but it works. I can run a whole bunch of copies in parallel, in the same Python process even, for a nice, parallelized reproducible testing environment. Every now and then I get in a small fight with AppArmor, but I invariably win the fight quickly without requiring any changes that need any privileges. This all predates Docker, too :). I'm sure I could rig up some snapshot system to get startup time down, but that would defeat some of the simplicity of the scheme.
And I have a system that launches Postgres in a container as part of a unit test (a little wrapper around https://crates.io/crates/pgtemp). It's much better than nothing, but the test using Postgres takes 0.5 seconds when the same business logic run against an in-memory implementation takes 0.005s.