There is a basic skill that every analyst or engineer has to master. If you cover this one you will make your colleagues lives much easier and remove a good piece of boilerplate and preparation when getting access to data. This skill is SQL.
As described in this post (Accessing S3 Data through SQL with presto) we have a particular setup inside Schibsted. We have multiple company and operations that cannot always share data, and terabytes of data are already stored on AWS S3.
This drove some of the decisions about technology choices we are listing here. We also defined the query engine as one piece of the puzzle that integrates our SQL data query service. So the final solution had to fit properly inside this puzzle or let us blend the connection points to make it fit. Take it into account when evaluating your own solution: There is always a BUT!
Looking for candidates
We already had some strong candidates in mind before starting the project. Old players like Presto, Hive or Impala have in this times good competitors like Athena, Google BigQuery or Redshift Spectrum. But we also did some research and gathered feedback from colleagues and come with this list:
- Redshift Spectrum
- Google BigQuery
- Apache Kylin
- Apache Ignite
- Stratio Data Centric
We quickly discarded everything below Snowflake for disparate reasons: They either didn’t really belong to the query engine scenario or they served different approaches. And, to be honest, we needed to cut the list somewhere and start implementing the actual solution.
We had been managing Redshift for a while, so it sounded natural to try to get the best from both worlds. It is a traditional columnar database working at scale inside AWS and with all the benefits of being an AWS product when all your stack is running there.
BUT! It doesn’t work properly with JSON files and doesn’t work either with nested schemas in parquet. And we need to manage the infrastructure part from redshift and recreate our authentication method. This extra cost and having no big competitive advantage compared to Athena made us save it as an alternative in case the rest of solutions didn’t work. But not our first choice.
Athena is in concept what we need. SQL query engine on top of S3 data. It provides JDBC drivers to connect there from wherever you need: DBeaver, Tableau, … You can start creating tables and query them right away, practically no setup and zero infrastructure boilerplate as it is serverless. And we can reuse our already existing access granting system inside AWS.
BUT! It is running some old presto version and doesn’t let you adapt it to your specific needs. And we have some particularities:
- We use nested schemas on our data.
- This schemas change slightly from one provider to another and through time
- All our historical data is stored in this way
Athena doesn’t tolerate schema evolution, if one hour’s partition has 2 nested fields inside the object column, and the next one doesn’t have those very same fields, you won’t be able to use that data. Is that a big problem? Wel, that depends. When you have up to 600 column/fields that randomly appear and disappear, and combined with the fact that you need to define ALL nested fields inside a column if you want to use it, then it’s a big problem. Another frequently used thing was missing. You cannot easily create temporary tables as you would do in traditional RDBMS-s.
"name": "Iker is Cool"
"description": "Keeping schemas in synch is hard"
Schema Evolution is a hard problem 🙁
Anyway, for a fast ramp-up we choose Athena and today, we are still using it. Response time is great, and especially, time to data is great (Time since I find the need to query a dataset and to actually getting data from it)
After Athena, we started looking for other solutions that allowed us more flexibility. We found Presto a very interesting piece of technology. It’s built in EMR, so creating a cluster with it preinstalled is really easy. It has a wide community and big corporation adoption (Facebook, Uber, Netflix), and its the core query engine behind Athena.
We already had the experience from our colleagues in OLX Brasil working with it, so we started a parallel long-term track to build over presto all the missing features and put it up to the standards of Athena. Presto also gives us a competitive advantage, we could now join our DataSets with the ones some of our colleagues have on their own. We could be the HUB of all the company data warehouse and data lakes, and make them convergence in our presto cluster.
BUT! we need to do some pre-work for that. We have to implement user-based Auth (Authorisation & Authentication). We also need to work on having a strong infrastructure setup, we are not serverless any more, and this means we have some work ahead finding the specific tuning for memory, CPU, nodes, etcetera.
Hive was very promising. It is where all started, first SQL tables on top of HDFS back then and we were very excited to test it.
BUT! It gives basically the same features as presto, but it was 10x slower in our benchmarks. So we abandoned it very quickly.
We had had good experiences with it some time ago (years ago) in a different context and tried it for that reason. It gives similar features to Hive and Presto and it will be fair to compare their performance.
BUT! AWS doesn’t support it on the newest EMR versions and that made us suspicious. On the other hand our colleagues in Brasil, Facebook, Uber, Netflix, Athena… they all use Presto. This is very important for us as it demonstrates the strong community and long-term support Presto might have compared to Impala.
Some of our colleagues were very disappointed when we didn’t even benchmark BigQuery. The reason is very obvious: In times of GDPR we cannot really keep moving data around.. We need to protect our users’ privacy, therefore we need to minimise the cost (risk, time, work and $$$) of moving data around. To run BigQuey you need to store your data in GoogleCloud, and, as said, we use AWS.
BUT! come the time where you can query data from AWS S3 with BigQuery without the need to copy it across accounts… who knows what we would do then.
So… Which one is fastest?
Among the ones benchmarked and our specific non-nested parquet datasets, Athena is fastest. When reading a lot of files it behaves faster than Spectrum or Presto. But when reading few files Presto is faster. Looks like Athena has some warmup time to manage access and getting resources.
Obviously, this is a totally unfair comparison, Athena has the whole power of AWS behind the scenes, while Presto had just a 10 xlarge machine cluster.
My point is, you need to choose the tool which has a good balance between features, performance, cost and lifetime. Because of the flexibility and extensibility it provides, the community adoption, the reasonable performance, and the future options it opens in our roadmap we have chosen Presto as our long-time bet. BUT! once more, this is a piece of the puzzle, so if the data we have changes, or if the puzzle grows, we are not afraid to change again our query engine and adopt the next big player to come.
Have we made the right design and architecture choices? Let’s continue the discussion in the comments!