I recently had to create a reproducible version of the incredibly complicated and messy R concoctions our data scientists came up with.
I did it in pandas, despite having little experience with it, with a lot of AI help (essentially to fill in the blanks the data scientists had left, since they only had to run the calculation once).
I then created a Polars version that uses LazyFrames. It ended up being about 20x faster than the pandas version. I also tried some optimizations by hand to make the execution planner work even better, which I believe paid off.
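For context, the lazy pattern looks roughly like this (a minimal sketch; the file and column names are made up):

```python
import polars as pl

# Hypothetical file and columns, purely illustrative.
result = (
    pl.scan_parquet("events.parquet")  # lazy scan: nothing is read yet
    .filter(pl.col("amount") > 0)      # predicate pushdown can apply this at scan time
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()                         # the planner optimizes the whole pipeline here
)
```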
If you have to do a large, non-interactive analytical calculation (i.e. not in a notebook), Polars seems to be way ahead, imo!
I do wish it were just as easy to use as a Rust library though... the focus, however, seems to be mainly on being competitive in Python land.
OG polars announcement: https://news.ycombinator.com/item?id=23768227
Love it!
Still don't get why one of the biggest players in the space, Databricks, is overinvesting in Spark. For startups, Polars or DuckDB are completely sufficient. Other companies like Palantir already support bring-your-own-compute.
Been a polars fan for a loooong time. Happy to see the team ship their product and I hope it does well!
Polars is certainly better than pandas for doing things locally. But that is a low bar. I've not had a great experience using Polars on large enough datasets; I almost always end up using DuckDB. And if I am using SQL at the end of the day, why bother starting with Polars? With AI these days, it's ridiculously fast to put together performant SQL. Heck, you can even make your own grammar and be done with it.
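For anyone who hasn't tried it, the DuckDB route can be as little as one SQL statement over files (a minimal sketch; file and column names are made up):

```python
import duckdb

# DuckDB can query Parquet files directly, no load step needed.
con = duckdb.connect()
totals = con.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM read_parquet('events.parquet')
    GROUP BY customer_id
""").df()  # fetch the result as a pandas DataFrame
```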
I don't understand. Can I use distributed Polars with my own machines, or do I have to buy cloud compute to run distributed queries (I don't want that)? If not, is this planned?
Polars is great, absolute best of luck with the launch
Hmm, so how does the Polars SQLContext stack up against DuckDB? And can both cope with a distributed Polars?
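For reference, SQLContext runs SQL over registered (Lazy)Frames; a minimal sketch with made-up data:

```python
import polars as pl

lf = pl.LazyFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# Register the frame under the name "events" and query it with SQL.
ctx = pl.SQLContext(events=lf)
out = ctx.execute(
    "SELECT customer_id, SUM(amount) AS total FROM events GROUP BY customer_id"
).collect()  # execute() returns a LazyFrame by default
```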
It feels like we are on the path to reinventing BigQuery.
Out of curiosity, and because I don't want to create a test account right now:
How does billing with "Deploy on AWS" work? Do I need to bring my own AWS account, with Polars paid for the image through AWS, or am I billed by Polars and they pass a share on to AWS? In other words, do I have a contract primarily with AWS or with Polars?
Cool. But abstract away the infra knowledge about actual instance types. I'd expect the Polars Cloud abstraction to find me the most cost-effective (spot) instance that meets my CPU, memory, and disk requirements. Why, looking at the example, do I have to give it the AWS instance type?
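Under the hood, such an abstraction could be fairly mechanical; a hedged sketch with boto3 (the region, thresholds, and the omitted spot-price join are illustrative assumptions, not anything Polars Cloud actually does):

```python
import boto3

# Illustrative requirements; a real tool would take these as user input.
need_vcpus, need_mem_mib = 16, 64 * 1024

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region
candidates = []
for page in ec2.get_paginator("describe_instance_types").paginate():
    for it in page["InstanceTypes"]:
        if (it["VCpuInfo"]["DefaultVCpus"] >= need_vcpus
                and it["MemoryInfo"]["SizeInMiB"] >= need_mem_mib):
            candidates.append(it["InstanceType"])

# A real implementation would then join current spot prices
# and pick the cheapest candidate automatically.
```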
Is there any distributed Polars outside Polars Cloud?
EDIT: never mind, see the same question elsewhere in this thread. The answer is no!
How does Polars compare to FireDucks?
Maybe it's just me, but for anyone else who was confused:
- Polars (Pola.rs) - the DataFrames library that now has a cloud version
- Polar (Polar.sh) - a payments and merchant-of-record (MoR) service built on top of Stripe
Snowflake, Polars, DuckDB, Firebase, FireDucks... I guess the next product will be IceDuck.
What is wrong with you DB people :))).
Can I run a distributed computation in pola.rs cloud on my own AWS infra, or do I need to run it on-prem?
So competing with Snowflake?
Can you dive a bit deeper into the comparison with Spark RDDs?
Having done a bit of data engineering in my day, I'm growing more and more allergic to the DataFrame API (which I used 24/7 for years). From what I've seen over the past ~10 years, 90+% of use cases would be better served by SQL, from the development perspective as well as for debugging, onboarding, sharing, migrating, etc.
Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it differs from c6g.large.
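To make the contrast concrete, here is the same toy aggregation both ways (made-up data; DuckDB can pick up an in-scope Polars DataFrame via its replacement scans):

```python
import duckdb
import polars as pl

df = pl.DataFrame({"region": ["eu", "eu", "us"], "revenue": [1.0, 2.0, 3.0]})

# DataFrame API version
via_api = df.group_by("region").agg(pl.col("revenue").sum().alias("total"))

# SQL version; `df` is resolved from the local Python scope
via_sql = duckdb.sql(
    "SELECT region, SUM(revenue) AS total FROM df GROUP BY region"
).pl()
```

Anyone who knows SQL can read the second version; the first assumes Polars-specific vocabulary.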