Monday, 10 November 2014

Why shouldn't you use SELECT * FROM TABLE

Software development, like database development is a strange discipline. They usually start teaching you with simple things like SELECT * and then spend the next 10 years telling you not to use it. Likewise in software engineering, you get taught quick and dirty coding and then told to code properly!

I'm sure this is to try and get people productive more quickly but it doesn't serve the industry well. A lot of people don't even know why they shouldn't use SELECT *.

Why you WOULD use it: The only reason you would use SELECT * FROM TABLE is because it is quick (to write) and easy. (You might use it if your database table is dynamically modified but this is not common)

There are two main reasons why you SHOULD NOT use SELECT * FROM TABLE.

ONE reason, which might or might not be significant is that pulling the data from all the columns of a table, if you are not using them, wastes network and disk bandwidth. This doesn't just mean the 'phone line', which might easily support the bandwidth but on a larger system, there might be several (or more) web servers all using a single database server via a single network card and one or more network switches. This card is probably also transmitting lots of other network data so hitting it with unnecessary data is not good on principle and does not scale.

The SECOND, more important reason is related to the way that databases index data to improve performance. Imagine you have a table with 50 columns and 10 million rows and you use SELECT * FROM...WHERE even though you only need 5 of those columns in your result set. The entire table needs to be loaded into memory, possibly in blocks, a load of row or table locks might be required while this is happening and only after the entire table has been scanned, can the result set be returned. Clearly, some tables are several GB in size - this is not a good option. By adding an index to the table, that only contains the 5 columns you are using (including the column(s) in the WHERE clause), only the index needs to be loaded into memory, which is probably considerably smaller and therefore faster, and depending on the index, if it is ordered by the column(s) in the WHERE clause, the database might be able to access the index directly rather than scanning the whole thing.

The differences here can literally be hundreds or thousands of times quicker than simply table scanning but this optimisation cannot happen if you use SELECT *.

A fair question is, "what if my table is really small, why bother?". Firstly, it is not a lot of extra work to only select columns you need and secondly, it is a good habit to get into. There is no obvious line between when it will improve performance and when not but selecting specific columns will always be the same or quicker than select *.

"What if I am selecting the same number of columns as my table has, I might as well use SELECT * right?". The answer is the same as above but there are other advantages to using named columns, even when you are selecting all of them. Firstly, if you use SELECT * and the table has other columns added, your statement will slow down. Unless you test every statement after the modification (I'll assume you don't) then you don't know what you might make unusable after modifying the table. Another advantage is much clearly visibility of what you are selecting and the ability to rename columns you are selecting to make their names more meaningful to the code that has called the SELECT statement. If you are joining two tables that both have a rowid, SELECT * will fail anyway but what if one is called rowid and the other row_id? This is hidden inside SELECT * but could lead to coding errors if the wrong one is accidentally used.

Another question is, "Why not use indexes for everything then?". There is a trade off if the number of indexes starts getting high. What needs to happen is that when you insert or update, some or all of the indexes will need to be updated as well as the table itself. This performance hit will obviously vary depending on the size of the indexes, how much they are fragmented and how many of them there are as well as which columns are indexed and where the new data is added/updated. Generally speaking, data that is massively read-only and only occasionally modified suits indexing very well. The more updates/inserts you do, the more impact will be noted. You might easily have 10 or more indexes on a table with no real performance problems.

To be honest, if you get to the point where your indexes are starting to slow your system down, you are probably in the area where you need to employ some experienced DBAs to look at your specific requirements and suggest either a design change or potentially even some advanced techniques like sharding and replication.

You don't need an index per-select statement. If you have two selects one that selects columns A, B and C and another that selects A, B, C and D then you could simply have a single index on all 4 columns, that both select statements would automatically use.

Another gotcha is two select statements. One that selects A, B, C and D and another that selects C, D, E and F. One index or two? In this situation, it would depend on the size of the unwanted columns in the index as to whether these could simply become a single index. If E and F are large VARCHAR or TEXT columns, then there is not necessarily much performance benefit to the first SELECT by sharing an index. If, however, all 6 columns are fairly small types, then a single index might be acceptable.

One thing I've learned about databases is that they don't always behave as expected so test it! There are various tools that can create dummy data for a database table to give you, say, 10 million rows, which you can then experiment with. The closer to the real usage you can get, the better, but any reasonable testing is better than nothing.

One example of an optimisation I managed on a database was decreasing the time to run a "Lookup address" stored procedure from 25-30 seconds down to less than 1/4 of a second! (clearly it should never have taken 30 seconds, if it was written properly but hey!)
Post a Comment