The shift from MySQL to Snowflake

Madhu Ramiah
5 min readApr 13, 2020

--

I have always been using mysql from the time I started my under-graduate studies. At those times I was always running mysql queries with smaller datasets. I never actually looked into how much time it took to execute a query. All I was mainly focussing on was, is my query result correct. Once I graduated and started a job, I actually dealt with real big datasets. Only then I realized, just like optimizing algorithms in data structures we could actually optimize queries in Mysql. I learnt better ways to write queries using joins and tried to improve performance using indexing. But slowly when the size of the data kept growing, even the most optimized query(dealing with multiple tables) in mysql took a long time to run. So, companies started moving towards cloud based services. One such cloud based service is Snowflake.

Snowflake is a cloud based data warehouse. It offers storing and analyzing data using cloud based hardware and software. Snowflake is running on Amazon S3 since 2014 and Microsoft Azure since 2018. It is also a platform that allows users to share data securely. Some of the main benefits of Snowflake are

  1. Multi cluster shared architecture across the cloud- We can easily scale up and down the clusters based on the usage and performance. It provides us the necessary concurrency and scale that any organization requires.
  2. Near to Zero Maintenance- We can chose any type of infrastructure, Snowflake would maintain all the workloads across all regions efficiently for us. We don’t have to do any maintenance at all.
  3. Secure Data Sharing- We can securely share any amount of data among users in the same organization or customers without having to port data to new database systems.

I have been using Snowflake for 2 years now. It has been an extremely useful tool. It allows us to provide different permissions for different users. The UI is super friendly and allows us to write queries in worksheets. Some of the cool features I like about this tool are

  1. It leverages SQL query language
  2. It allows us to preview data from any table without actually running a query.
  3. We can easily switch roles through the UI (very useful since you don’t have to write queries)
  4. We can spin up different size of clusters if needed (They have XS, S, M, L , XL, XXL, XXXL, XXXXL depending on the organization requirements)
  5. We can easily see existing tables, views, creation dates, queries used to create tables, table description, etc in the UI
  6. We can connect to Snowflake through other visualization tools like Looker, Tableau or Mode. It’s a tool that all data analysts, data engineers and data scientists can use.
  7. Supports a lot of data formats- including JSON, Avro, XML, Parquet and ORC. It supports both structured and unstructured data formats

The way in which snowflake handles data is quite different. It uses micro-partitioning and clustering for most of the database operations.

Micro-Partitioning:

All the tables are automatically divided into micro-partitions. Each micro-partition is in between 50 and 500 MB in size and it is uncompressed. Groups of rows are combined into micro-partitions and are stored in a columnar format. By using this method, we can convert a very huge database into several smaller micro-partitions. Below are the main benefits,

  1. They store columnar data separately (columnar storage), and only the necessary columns in a query are scanned
  2. Their size is very small, and enables faster pruning for several queries
  3. They are derived automatically and these need not be explicitly defined

Data Clustering:

Snowflake automatically does clustering based on natural dimensions like dates, regions, names, etc. This clustering is needed for faster performance in queries, especially on larger datasets. Whenever, data is added/updated in Snowflake clustering is done on the metadata of the micro-partitions of data. Snowflake uses this process during query execution in such a way that it doesn’t have to scan the micro-partitions unnecessarily. This is the main reason snowflake queries are faster.

You can find more information here

In terms of queries, there are some places where MySQL performs better and some where snowflake performs better. Am going to discuss each of those scenarios in detail.

SELECTIONS:

Let’s consider a table whose size is about a few hundred gigabytes. Let’s write a query to select all rows that meet a specific condition like below

select * from table1 
where id=1234
and date>='2020-03-01'
and date<='2020-03-06'

Since snowflake processed queries in the cloud, I thought that the above query would take just a few secs in snowflake. But on the contrary, it took more than 5 minutes to process. And, mysql actually processed the query in few seconds.

The main reason for this is because mysql supports indexing. So, if the table was indexed on id and date, then this query on mysql would be very fast. But, snowflake doesn’t support indexing. It uses micro-partitioning and data clustering. So, for very simple select statements on large datasets mysql with indexing may perform better than snowflake.

JOINS

When you are performing joins on 2+ tables like below the computation time on Snowflake is significantly lower than on Mysql. The main reason for this is that snowflake always breaks data into micro-partitions which are of size 50 to 500 MB and then perform the join.

select x.id, x.date, x.amount, y.date, z.user_id
from table1 as x
left join table2 as y
on x.id=y.id
left join table3 as z
on y.id1=z.id

OTHER FUNCTIONS:

For all other functions like aggregate functions and window functions, snowflake performs faster than mysql.

For scalar functions like logarithmic, trigonometric and rounding functions, there is not a significant difference in the performance between snowflake and mysql.

Cloud based data warehouses would perform better in almost every scenario if the database sizes were very large except for a few cases. It would vary depending on the scenario you are using it.

Hope this blog gave you some insights onto why data warehouses are gaining a lot of importance nowadays and how they would be beneficial for your organization too.

Contact me via LinkedIn if you have any questions. Thanks for reading through and give a thumbs up if you liked the blog!

--

--