Thursday, 30 December 2010

Your approach to query optimising

Optimising a query in SQL Server is a bit like trying to pimp your car. You might think, logically, that putting in an engine with more horsepower will increase the performance of your car, when in reality other factors might limit the gain, or something else might simply break! In the same way, using the execution plan, you can concentrate on changing a table scan into an index seek and feel victorious, only to discover it has made little difference to the bottom line. This can happen for a number of reasons:
  1. If you have many joins in the execution plan, then one of these might cost only 0.5%, and although decreasing it to 0.25% is a big improvement for that one object, it adds up to very little in the overall scheme of things. Consider simplifying the query, or creating a smaller query to use in the specific places that need it.
  2. If the number of rows or columns that you are querying is small, then there might not be a great deal of difference in performance between various indexes or the table itself.
  3. If you join on text/varchar/etc columns, you can incur a large overhead. This includes not just the extra memory/IO required to compare more bytes, but also the fact that you cannot put large text columns in the 'index columns' if doing so would push the total key size over 900 bytes. If you instead add the column to the 'included columns', which is permitted, the index will not be used when joining on that text column. Try to use integer ids wherever possible.
  4. You might see "Index Seek" and think this is the best you can get, but if that Index Seek is carried out on a large index, it might be much slower than it needs to be. You can test this by creating the bare minimum index for what you need and running the query again.
  5. Use the cost % information to decide where your bottlenecks are. If you have hundreds of joins all of less than 1%, then you are probably only able to achieve small improvements, whereas if one or more costs are significant (>10%), especially under a table or clustered index scan, you can probably achieve large time savings by optimising these.
  6. As a last resort (or a first resort if you can!), there are two things you can do. Design things differently so you can simplify the query (reduce the number of joins, rows or columns), or run the query from a job and dump the results into a cache table. That table can then be used in areas that can avoid running the query and can live with data that is a number of minutes or hours out of date.
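For the last point, a minimal sketch of the cache-table approach might look like this (all table and column names here are invented for illustration; the real query and schedule would depend on your system):

```sql
-- Hypothetical cache table: a SQL Server Agent job periodically dumps an
-- expensive query's results here so readers can query it cheaply.
IF OBJECT_ID('dbo.QuoteSummaryCache') IS NULL
    CREATE TABLE dbo.QuoteSummaryCache
    (
        QuoteId     INT NOT NULL PRIMARY KEY,
        TotalValue  MONEY NOT NULL,
        RefreshedAt DATETIME NOT NULL DEFAULT GETDATE()
    );
GO

-- Body of the job step, run every few minutes or hours:
BEGIN TRANSACTION;
    DELETE FROM dbo.QuoteSummaryCache;
    INSERT INTO dbo.QuoteSummaryCache (QuoteId, TotalValue, RefreshedAt)
    SELECT q.QuoteId, SUM(l.Price), GETDATE()
    FROM dbo.Quote q
    JOIN dbo.QuoteLine l ON l.QuoteId = q.QuoteId  -- integer join keys
    GROUP BY q.QuoteId;
COMMIT;
```

Readers then select from dbo.QuoteSummaryCache instead of running the expensive aggregate, accepting that the data may be slightly stale.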

Wednesday, 29 December 2010

Which SQL indexes are faster?

Ever wondered what sort of index lookups are the best for your queries? What is the preferred speed order? Well, of course, it depends. In some cases there is little difference, and it usually depends on whether the query optimiser thinks it will only need a subset of data from an index, and on how much data is in the index compared to the main table. Bear in mind that just because you get an Index Seek/Scan, it doesn't mean that's the fastest you can go. Using a large index for an index seek might be tens of times slower than creating a smaller index with just the required columns. Anyway, in order from fastest to slowest, this is the best I can come up with:
Index Seek - a subset of an index. Its speed relates to how few columns you can include to get what you need, and you will need some sort of WHERE clause for the optimiser to reduce rows as it joins. Make sure you put your WHERE clause columns and any JOIN columns in the 'index columns'; you can then put other columns you want to select in the 'included columns' area.
Index Scan - this scans the whole index and might be fast if it contains few columns and/or rows; otherwise it is little better than a clustered index scan or table scan. If you have indexes that contain most columns from the underlying table, you are probably better off removing the index or redesigning your queries to not require so many columns. Also, ensure that you apply WHERE clauses as early as possible in your queries (i.e. not all at the outer level), which might promote an Index Scan to an Index Seek.
Key/RID Lookup - this might be very slow since it requires an index seek/scan followed by an index/clustered index/table lookup in which case it is not much better than a clustered index scan. It might otherwise be somewhere closer to the speed of an Index Scan if it comes from an Index Seek for a small subset of data. Generally best avoided by modifying/creating other indexes.
Clustered Index Scan - this is usually not much faster than a table scan, if at all, since the only difference is the clustered index is ordered by something which may or may not be useful for a particular query to use. The clustered index itself is however very helpful to speed up joins in general so don't delete the clustered index just because it isn't much better than a table scan. In the case of a small table with few columns/rows this index is probably as fast as it will get since more specific indexes will not be significantly smaller than the underlying table/clustered index. In this case, do not seek to replace Clustered Index Scans with Index Scans.
Table Scan - the base table is scanned, every row. The slowest, and best avoided, since it not only affects performance by itself but can also slow joins to other tables/views. It increases the amount of IO required even if you don't need all of the columns returned. This is because the data is paged into memory as it is being queried; a smaller set of data in an index takes up less memory and therefore fewer pages = faster.
I have also seen Clustered Index Seek and Table Seek. These are better than their Scan cousins but are still best avoided, if possible, by using an index. The only time clustered or table seeks might be the fastest you can get, as with clustered and table scans, is if the underlying table is small or basic.
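To illustrate the 'large index vs small index' point above, compare a wide index with a minimal covering one. This is a hedged sketch; the Orders table and its columns are hypothetical:

```sql
-- Both indexes can satisfy the query below with an Index Seek, but the
-- narrow one reads far fewer pages per row.
CREATE INDEX IX_Orders_Wide
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, Status, Notes, ShippingAddress, TotalValue);

CREATE INDEX IX_Orders_Narrow
    ON dbo.Orders (CustomerId)
    INCLUDE (TotalValue);

-- This query needs only CustomerId and TotalValue, so the narrow index
-- is the better fit even though both produce an "Index Seek" in the plan:
SELECT TotalValue FROM dbo.Orders WHERE CustomerId = 42;
```

If the wide index exists only to serve queries like this one, the narrow index is usually the better trade.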

Making your queries faster

If you run SQL Server Management Studio, you can include the execution plan with your query (Ctrl-M). It displays a network of how the tables and views join and, more importantly, what types of scans are used to get the data. It also shows lines whose thickness is proportional to the number of rows being operated on. If you can see any big lines, then possibly your join condition is too generous (and you are using the WHERE clause, incorrectly, to fix up the join), or maybe you really do want to select tons of data, in which case big lines might be acceptable.
As far as table and index scans are concerned, there are really two levels of understanding: one will help massively, and the other is for when you want to squeeze the last drops of performance out of the query once you have made the major changes.
Let's first consider the easier deductions from the execution plan. If you see any "table scan" elements on your diagram, that is generally bad, and unless you have a good reason you need to create an index on the table that can be used (even if your table is small - a table scan can decrease join performance noticeably). A table scan means the query has to scan all columns of the whole table to find and match rows. Virtually every table should contain a primary key column(s), and this creates an implicit clustered index, so at the very least you should be scanning the clustered index. If you do not have a primary key (suspicious!), you can still create indexes, and it is these that will remove table scans, as well as removing clustered index scans, which are slightly better than table scans but still include all columns and can also be slow.

By creating an index you can specify a few key columns, which dictate the sort order (and can help with join performance), but also (from SQL Server 2005) you can add 'included' columns which the query can use but which will not unduly slow the creation and update of the index, since they do not dictate ordering. If you have an index on TABLE covering, say, columns A and B, then a query like SELECT A FROM TABLE WHERE B = 2 will use the index directly. Ideally you will get an index seek (a subset of rows from the full index, less than about 15%), which is very fast, or perhaps an index scan, which has to read all rows but is still preferable to a table scan since the amount of data being read is a subset of the table data.
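The example above could be set up something like this (TABLE, A and B as in the text; the usual shape is to put the WHERE column in the index key and the selected column in INCLUDE, though the right key depends on your queries):

```sql
-- Key column B supports the WHERE clause; A rides along as an included
-- column (SQL Server 2005+) so the query never touches the base table.
CREATE INDEX IX_Table_B ON dbo.[TABLE] (B) INCLUDE (A);

-- With the index in place, this should plan as an Index Seek:
SELECT A FROM dbo.[TABLE] WHERE B = 2;
```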
The second level of performance comes when your indexes are not quite right and you get strange items like "Bookmark Lookup" or "RID Lookup", which are similar and caused by the same basic problem. Imagine in the example above you did SELECT A, C FROM TABLE WHERE B = 2. Although the index can be used to find the matching rows, which is fast, in order to return column C the query needs to find the primary key in the index (it is included by default) and then use it to scan the clustered index, or the base table if there is no clustered index, to find the data for column C. This is a bookmark lookup (or an RID lookup if the base table has no clustered index - it is a 'heap'). These are also slow, though possibly faster than a clustered index scan or a table scan. You might find that the query optimiser sometimes decides not to bother with the index at all and goes straight to the base table. To fix this issue, you would need to include column C in your index or create a new index with A, B and C in it.
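That fix can be sketched as follows, using the same hypothetical columns; adding C as an included column lets the seek cover the whole query and removes the lookup:

```sql
-- Before: SELECT A, C FROM dbo.[TABLE] WHERE B = 2 produces an
-- Index Seek plus a Key/RID Lookup for column C.
-- After creating this covering index, the seek alone answers the query:
CREATE INDEX IX_Table_B_Covering
    ON dbo.[TABLE] (B)
    INCLUDE (A, C);
```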
Bear in mind that there is always a trade-off between the number of indexes and the insert/update/delete performance of the table, as well as the amount of memory required, which depends on the size of the index and therefore on the size of the underlying table. Generally, it is better to add a column to an existing index rather than create a whole new index which is only slightly different. If one index has the same columns as another (even if the other index has a few more), then it is redundant; e.g. Index1 = A, B, C and Index2 = A, B means that Index2 is redundant, since all of its information is already available in Index1.
Also, the actual performance gain depends on the indexed columns relative to the base table columns. If your index has 5 of the 6 available columns from the table, then there will be little difference between an index scan and a table scan. And if you are selecting a text field from a table but using a WHERE clause which is NOT based on the text field, you might find it quicker to split the index: one index without the text field (which reduces the size of the index and speeds it up), and a second index with just the text field, which the query should be able to use to get the resulting text field for each matching row.
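A rough sketch of that split, with invented table and column names (whether it wins in practice depends on row counts and text sizes, so treat this as a starting point to test, not a rule):

```sql
-- Slim index: serves the WHERE clause without dragging the text along.
CREATE INDEX IX_Documents_Category
    ON dbo.Documents (CategoryId)
    INCLUDE (Title);

-- Second index: keyed on the primary key so matching rows' text can be
-- fetched without scanning the whole base table.
CREATE INDEX IX_Documents_Body
    ON dbo.Documents (DocumentId)
    INCLUDE (Body);
```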

Friday, 17 December 2010

Why Knowledge and Design is important for Software

My company bought a new company in the last year and we have created a new web application to provide some functionality based on an existing system, but created from scratch so we could use as many best practices as we could from the start.
Currently the new company only uses a small amount of the existing functionality, but plans are afoot to increase this. One of the things I have been working on is the quotation part of the system, where the business selects the equipment for a quotation, adds certain services and then produces tender documentation. Our current system has around 22,000 quotes (some of which are test quotes), and one of the luxuries this provides is a set of genuine test data to try out any designs we create for the new company, to see whether they scale well. Let's be honest, how many times have we implemented something by putting in a few rows of test data and then deployed it into a system that might end up with tens or hundreds of thousands of rows?
Anyway, two things have been obvious with our major re-design of the table layout, and they have produced an increase in performance of around 5 times for the same data set running on a virtual test server with much less memory and CPU than the live server - pretty impressive.

The first thing is that you need knowledge to write good software! This might seem obvious, but someone might understand databases at a basic level, yet without understanding things like indexes, foreign keys and referential integrity, at best they will write something average; at worst it will simply not perform or scale and they won't understand why. Foreign and primary keys are not just for integrity but allow certain queries to perform much faster because of assumptions the system can make about the number of rows that will join in a query.

The second thing is that we spent time in design. To be honest it was a Visio diagram that was sent round and which we all looked at. We considered the various objects in the current system and how they would work in the new system, which led to additional columns and the like, but also a large reduction in the number of tables required, and therefore a reduction in joins and a performance increase in queries. Why? Well, although you can partially think and design while you implement, in reality you will never make large changes and see the big picture while adding tables into a database. If we had dived straight in, we would probably have something much closer to the old system, whereas the new design is radically different and more efficient in the process. The few days we spent checking out the design have saved some time in implementation, a lot of time in potential rework, and loads of performance. It seems counter-intuitive to be designing things and not just getting on and doing it, but to be honest, nowadays implementation only needs to take a few days once the design has been completed.
Knowledge and Design - don't take shortcuts.

Wednesday, 15 December 2010

Poor passwords

A news story today warned of thousands of internet users being asked to change passwords after a hack exposed login details. What was worrying was how many people used passwords such as 123456 or "password" which are clearly not secure in the slightest. What was interesting was that it took a hack to make various providers ask their users to change their passwords rather than simply not allowing them in the first place.
Quite clearly a brute force attempt at someone's password will start with around 20 common words, phrases or number combinations; it would appear that thousands of people would fall foul of that.
What do we do? Firstly, if we allow people to choose their own passwords, we either blacklist certain words like "password" and/or require a minimum password strength, such as at least one letter, one number, one capital etc (although this can annoy people if taken to an extreme).
The other thing we should do is run a tool against our current users' passwords, work out which ones are not secure, and then reset those accounts, sending an email or a page message saying that the password is too insecure.
The third thing, which seems sadly lacking, is simple intrusion detection so that, for example, 3 attempts at a password locks out the account, either for a period or permanently until reset. Even if you use a simple password, if the tool only gets 3 chances to crack it, it is unlikely to succeed.
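A minimal sketch of that kind of lockout in T-SQL, with an invented tracking table (a real system would also need the insert-if-missing path, password hashing, auditing and so on):

```sql
-- Hypothetical schema: lock the account after 3 failed attempts.
CREATE TABLE dbo.LoginAttempt
(
    UserName    NVARCHAR(100) NOT NULL PRIMARY KEY,
    FailedCount INT NOT NULL DEFAULT 0,
    LockedUntil DATETIME NULL   -- NULL means not locked
);
GO

-- On a failed password check (assumes @UserName is a proc parameter):
UPDATE dbo.LoginAttempt
SET FailedCount = FailedCount + 1,
    LockedUntil = CASE WHEN FailedCount + 1 >= 3
                       THEN DATEADD(MINUTE, 30, GETDATE()) END
WHERE UserName = @UserName;

-- On a successful login, reset the counter:
UPDATE dbo.LoginAttempt
SET FailedCount = 0, LockedUntil = NULL
WHERE UserName = @UserName;
```

The login check itself then refuses any account whose LockedUntil is in the future, regardless of whether the password is correct.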
Come on people, let's be proactive and not be tomorrow's headline.

Tuesday, 14 December 2010

Don't just secure your apps, manage their security

Assuming you are one of the few people who genuinely secure their web apps using all the best-known security practices, like validation, authentication, authorisation etc, there is a good chance that you have ignored the common but important factor called Time.
Securing an app is fine, but over time two things happen: things change and things are forgotten. Why is this important for a web app particularly? Well, if you used SHA-0 to hash your passwords a few years back because you knew it was the latest and greatest hashing algorithm, what are you going to do now it has been found to be less secure? Sure, you might have heard about the weaknesses and you might know that your system needs updating, but how does your organisation know that this issue will be picked up and addressed in the future?
Most organisations rely on people just knowing these things. They rely on the experience of developers or architects who have been around a while but this is simply not good enough. It is not good enough to be secure now when you consider how things can change.
In a similar way, people forget things, like the fact that login X was used to access database Y for reason Z, so when someone else comes across it, they might re-use the login for something else, elevate its privileges for some reason and weaken the overall security. How would they know to leave it alone?
A management system is required (any management system to begin with), but something which is adapted over time to be useful and usable. Make it unusable and guess what? People won't use it. This system needs to record information about your systems, including but not limited to security information. You might list the logins, their purpose and what access they should or shouldn't have. It could list securables and secrets on the file system and what security they have been set up with. It should be expressly prohibited for ANYONE to modify a security setting without a review and without recording it in the management system.
It could honestly be set up on a wiki, or anything as simple as that, which allows regular review and any work arising to be allocated. You can even flag things which we know will become issues in the future, such as encryption systems and hashing mechanisms, all of which become weaker over time.
If you don't manage security in your apps, your app is at best good for now but could break down rapidly as new exploits are discovered and no mechanisms exist to trace these to your own systems.

Tuesday, 7 December 2010

KDE4 Network Manager fails after upgrade

I'm not sure exactly when, but I did an upgrade to KDE4 and the kernel, and in the process my networking stopped working, wireless and wired. It worked fine in recovery mode, and I did another update in case there was a fix, but no dice.
Fortunately I have a friend who knows the score; he told me to:
sudo nano /etc/NetworkManager/nm-system-settings.conf

and set managed=true instead of false. I then had to:
sudo nano /var/lib/NetworkManager/NetworkManager.state

and set NetworkingEnabled=true. I then had to restart the service:
sudo service network-manager restart

and it was all good again. Well done, whoever made that happen in a package upgrade! Most people, including me, would not easily find out what had happened and would conclude that Windows works better, even if we hate M$.

Working on a live database

Even if you have a test system to try out some new functionality, at some point it needs to be deployed to live. If you are lucky, this is done by copying the up-to-date test system over the top of live, but in reality most of us have to modify the live database directly, perhaps during a small time-slot, and then it is expected to just work.
There are a number of problems associated with live databases, specifically SQL Server in my case, that are worth knowing about before you go and break something.
  1. If you can, avoid altering or removing any tables/columns in the first place. If you can avoid this, perhaps by leaving a few dead things in place, you are less likely to break something.
  2. Use tools like Red Gate to search for references to your objects; this can help you track down what might break. Red Gate also makes a tool for moving objects from a test to a live database in a GUI rather than manually.
  3. For any objects that are accessed by something external to the database, mark these with schemas so you know that there might be knock on effects to other systems. e.g. webapp.procUserFetch
  4. Avoid selecting * from anything since this hides whether a specific column is referenced or not.
  5. If you are adding a new column, make sure it is nullable, even if it ultimately won't be. This avoids any existing inserts failing due to the lack of a value for the new column.
  6. Try not to rename columns if you really don't need to. You can always alias them in a select statement.
  7. Adding new tables, procs and views is safe.
  8. If you change a table structure, you must run sp_refreshview on the views that reference it; otherwise their cached metadata can be out of date, and you won't know until some funny data is returned by the select.
  9. If you are deleting objects, rename them first or move them to a dead schema so that they can quickly be restored if there are any problems. You can put the rename date in their new name and after a week or so, delete them permanently.
  10. If you can, script the new or changed objects and store them in a repository so you can track and blame changes that are made.
  11. Use the format procNameVerb when creating stored procedures so that like procs are listed together e.g. procUserInsert rather than procInsertUser.
  12. Have a backup of the database if you are making major changes so that you have a last-resort way of falling back to a known-working copy.
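A few of the points above (5, 8 and 9) as T-SQL sketches, with invented object names:

```sql
-- 5. Add new columns as nullable so existing INSERTs keep working:
ALTER TABLE dbo.Quote ADD ApprovedBy INT NULL;

-- 8. Refresh any view that references the changed table so its cached
--    metadata matches the new structure:
EXEC sp_refreshview 'dbo.vwQuoteSummary';

-- 9. 'Delete' an object by moving it to a dead schema (created once with
--    CREATE SCHEMA dead) so it can be restored quickly if needed:
ALTER SCHEMA dead TRANSFER dbo.procQuoteFetchOld;
```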

Thursday, 25 November 2010

Writing good software systems

Writing good software systems cannot be reduced to a simple set of rules; it is a complex and wide-reaching subject where some problems have more than one solution, one of which might or might not be better than the others. It is full of compromises and various technologies. However, there are certain fundamentals of good software that are not negotiable, certain frameworks that must be in place if you do not want to survive on luck alone, or on the fact that one of your engineers happens to be very good at their job and you get quality by default rather than by design.

  1. Training - If you employ people who are not trained/qualified/experienced, you cannot expect good software. You would not employ an unqualified doctor, and software is not something so easy that any kid can do it. You can write bad software easily, but knowledge is required to write good software. You must consider ongoing training as well, partly to adapt to new technologies and partly to challenge the way we get used to working.

  2. Specifications - Coders tend to enjoy coding much more than they enjoy writing specs. This is understandable, but it is essential that functionality is specified, either by the programmer or, in larger outfits, by people specifically employed for this purpose. Every hour spent on a spec can save 50 hours of wasted development. Get used to really grilling the functionality in the spec; think of details such as security, validation and error handling (both expected and unexpected), although much of this can be pulled out into common specs that apply to multiple systems. Get the specs signed off by your customer, even if they are internal. If someone then turns around and says something is wrong in the spec, it becomes a development item (low priority), not a defect to fix immediately. It also makes people take their analysis to the correct stage. "No spec" means we've considered the easy 90% and not the hard 10%; a spec should force these harder details to be considered, and potentially to affect the things we thought were easy to do.

  3. Quality control - People make mistakes and are sometimes lazy; if you expect them not to be, then you should not be a manager. The trick here is to put things in place that prevent mistakes (automating tasks) and that pick them up when they are made (such as code reviews). Although this is a learning exercise and might differ depending on the type of work being carried out, how many people are involved etc, the key is to start with the basics and learn from your own and others' mistakes. When something goes wrong, ask yourself why, how it could be avoided, and whether the extra cost/time is worth it compared to the likelihood of the mistake being made, and then bring in a process (or modify a process) to learn from your mistake. This really is not rocket science, and it will help you to become Quality Accredited if you seek this in the future.

  4. Regular Reviews - Do not think that quality is a one-person thing. Have regular reviews with your teams and value their feedback. People close to the coalface know what is happening and do not appreciate having all decisions made by people they do not feel understand the problems. Create an objective atmosphere in these meetings so that you can measure the value of suggestions e.g. "your idea will take another hour per day but for this kind of defect, which we've seen once, I would suggest it is not worth it." This avoids arguments between two people who both know what the right answer is.

  5. Get the foundations right - It might often be the case that timescales are tight and engineers aren't given enough time to do things properly so getting the foundations right is essential. Having a solid security model, for instance, is much easier to build in from the start and extremely difficult to fit later on. This comes back to having enough input from your team so that they buy into this foundation.

  6. Do not forecast the future - You cannot pre-judge every change that might be made to a system in advance so do not attempt to build in everything at the beginning. If your code is written well then extending it later should not be too much of a challenge. There are however certain things that are reasonable to assume such as a web site might want its look and feel changed in the future. It is reasonable to factor these into the design. Also, people are not paying for all the future possibilities, if you keep the system simple, it will be more reliable and quicker to build. If the customer then turns round and wants a load more functionality, they can pay for the work arising from this.

Friday, 19 November 2010

Using static in .Net

How often do you use the static keyword? Do you disagree with it from a purist viewpoint, on the grounds that it demonstrates the hiding of functionality that belongs in a separate class? Do you even know what it means?
A static item essentially means there is only one copy. In C#, a static class cannot be instantiated (like a Singleton) and can only contain static members. A normal class can have static members (methods or fields) but these cannot be called on instances of the class, only on the class itself.
As with many things in life, static can be used well and it can be misused. However, I want to suggest that its use, where correct, is good both for performance and for reliability! How do I know? Well, look at the following code:

public class Car
{
    private static int carCount = 0;

    public static Car BuildCar() { carCount++; return new Car(); } // build car and return it
}

This is an example of lazy or misinformed programming. Why? Because why should a car 'know' about the number of cars that have been built? In reality, you should probably create a factory class like the one below, and then the count wouldn't have to be static, right?

public class Car
{
    // etc
}

public class CarFactory
{
    private int carCount = 0;

    public Car BuildCar() { carCount++; return new Car(); } // build car and return it
}

We could then keep a single instance of this class somewhere and access it to make our cars. This is OK, but it misses a trick when it comes to performance. In simple systems, to be fair, it is unlikely to be noticeable, but in the above example, although we only intend there to be one CarFactory instance, the compiler doesn't know this, and therefore every call to one of the instance methods has to resolve which instance (i.e. which block of memory) is being accessed. By using static, we tell the compiler that we only want one 'instance', and therefore the class is loaded when the assembly/program is loaded and is instantly available - a bit like a direct pointer to methods/fields. This is good!
The same basic truth also applies to methods that could be static, i.e. any methods that do not access instance variables of the class. Although they do not have to be static, by making them so we avoid the resolution of the instance. Look at the following:

public class CarFactory
{
    public void PaintCar(ref Car car, Color colour) { car.Colour = colour; }
}

The method does not have to be marked static, but in this case it only operates on the values passed into it, and therefore should be marked static for performance reasons. Since, like all good programmers, we never modify methods after they are released, we would never need to change it back to non-static; we would simply add another non-static method to do the work differently! In some cases, classes have static methods that operate on passed-in values and instance methods that do the same thing on the instance itself:

public class Car
{
    public static void PaintCar(ref Car car, Color colour) { car.Colour = colour; }
    public void PaintCar(Color colour) { this.Colour = colour; }
}

The second, and not to be ignored, benefit of using static is that the compiler will enforce that a static method cannot access instance fields and that a static class cannot be instantiated. Why is this important? Because it improves reliability. Take the code snippet above. If our Car class had 20 member fields (instance, not static) and we wanted to test the two functions above, how many combinations would we need to test? In reality, we might look at a function and see that it doesn't access member fields, but what if it calls other methods which call other methods? Ideally we would need to consider every possible combination of instance fields as well as the parameters passed in. For the static method? We know that we cannot accidentally read or write any instance fields, so we simply test all valid permutations of the input (in our case we might test a null Car, a null Color, a valid Car and a valid Color to ensure our code is robust).
Of course we always ignore reliability because we think we are good engineers and won't make stupid mistakes. To be perfectly honest, we make plenty and need as much help as we can get to avoid them. That's why you should use static wherever possible!

What do you use Service Broker for?

SQL Server has a system called Service Broker, which is a queued, asynchronous system enabling calls to be made to procs, or simply messages to be posted onto a queue from which they can be removed at will and processed. I have seen various people ask when you would use Service Broker, with its complexity, over a simple call to a stored proc.
The two key words here are queue and asynchronous. A queue ensures that things are processed in order, so if, for example, you are running an important financial system, presumably various things must happen in order even if there are thousands happening all the time. In other words, if you have hundreds of people making changes, you might want to ensure that the changes are placed in a queue as a single point of entry, which will ensure that the first-come is first-served. You have an amount of control over the queue: whether messages are retained, the way in which the conversation is carried out etc; it is a way of controlling parallel environments.
The second, and perhaps most important, is the asynchronous nature of the Service Broker. Why wait for something that you do not need to wait for? If you are sending a letter to someone, you don't post it and then wait at the post box until the recipient tells you they have received it. You either don't care, or if you do, you set up a conversation with the recipient, who can tell you if/when it arrives. Many scenarios like this exist in computing. If I want to cancel my credit card account, once I have requested the closure, the Customer Services agent should not have to wait for the whole process to execute, but simply start the ball rolling and then assume it has worked unless the other end fails and reports back. This is especially important if the sequence is fired from a user interface which is waiting for the code to complete before updating the screen.
SQL Server allows you to invoke a stored proc on receipt of a message into the queue, which might be useful, but you can also leave the queue to fill and service it from a job every so many hours, which might help avoid server load or meet various Quality of Service (QoS) requirements (such as all accounts always being updated at 6pm).
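To make that concrete, here is a minimal sketch of a queue with an activation proc. All object names here are illustrative, not taken from a real system:

```sql
-- Illustrative Service Broker setup: a message type, a contract,
-- a queue that activates a proc as messages arrive, and a service.
CREATE MESSAGE TYPE [AccountUpdateMessage] VALIDATION = WELL_FORMED_XML;

CREATE CONTRACT [AccountUpdateContract]
    ([AccountUpdateMessage] SENT BY INITIATOR);

CREATE QUEUE dbo.AccountUpdateQueue
    WITH STATUS = ON,
    ACTIVATION (
        PROCEDURE_NAME = dbo.procProcessAccountUpdate, -- runs automatically on receipt
        MAX_QUEUE_READERS = 1,  -- a single reader preserves strict ordering
        EXECUTE AS OWNER);

CREATE SERVICE [AccountUpdateService]
    ON QUEUE dbo.AccountUpdateQueue ([AccountUpdateContract]);
```

To service the queue from a scheduled job instead, create the queue without the ACTIVATION clause and have the job RECEIVE from it on whatever schedule suits.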
My general rule of thumb for software is to keep it simple. Although you can sometimes guess how intensive an operation will be and go straight for something like Service Broker, our assumptions can be wrong (a view with 10,000 rows runs in a few seconds whereas another, seemingly simpler, view takes 5 minutes), so it is best to take an 'if it ain't broke, don't fix it' approach, since it is reasonably easy to retrofit Service Broker in place of a stored proc call at a later date if needed.

Service Broker Queue not calling Activation Procedure

More fun with the Service Broker today on SQL Server 2005. I had got everything working on a test database and wanted to move the changes into the live database, both on the same server. I scripted all the stuff up and then realised Service Broker wasn't running. I found this because when I selected from sys.transmission_queue (which should be empty if messages are being delivered to their queues), the exception column told me so. I tried ALTER DATABASE to enable Service Broker but it wouldn't work with users logged in and I didn't want to boot all few hundred of them off! So I wrote a SQL Server job that ran the enable code at 5:50am:

CREATE PROC [dbo].[procEnableBroker]

I needed the second part in the catch, which was something about the database needing a new Service Broker ID. The ROLLBACK boots out any uncommitted transactions and then the ALTER goes through.
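A sketch of what such a proc looks like (a hedged reconstruction based on the description above, with the database name as a placeholder):

```sql
-- Sketch only: enable the broker, falling back to NEW_BROKER in the catch.
-- [MyDatabase] is a placeholder name.
CREATE PROC [dbo].[procEnableBroker]
AS
BEGIN
    BEGIN TRY
        -- ROLLBACK IMMEDIATE boots out any uncommitted transactions
        ALTER DATABASE [MyDatabase] SET ENABLE_BROKER WITH ROLLBACK IMMEDIATE;
    END TRY
    BEGIN CATCH
        -- Needed when the database requires a new Service Broker ID
        -- (typically after a restore or attach)
        ALTER DATABASE [MyDatabase] SET NEW_BROKER WITH ROLLBACK IMMEDIATE;
    END CATCH
END
```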
This all seemed OK so the messages were now being put on the queue but the queue was not calling the activation proc. If I ran the proc manually, it worked OK so I ended up thinking it was something to do with permissions. I followed some MS advice which as usual is buried amongst a lot of smoke and mirrors and did the following:

EXEC [dbo].[procQueueActiviationProc]

in order to match the user that it would normally run as (EXECUTE AS OWNER on the queue) and then I got the old, "the remote server could not be accessed because the current security context is not trusted". All I had to do then was to call:
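The call in question is the standard fix for that error: marking the database as trustworthy (a sketch, assuming suitable rights; the database name is a placeholder):

```sql
-- Allows EXECUTE AS contexts originating in this database
-- to be trusted outside it
ALTER DATABASE [MyDatabase] SET TRUSTWORTHY ON;
```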


And then after a little while, and after poking the queue again, it all came to life. I quite like the idea of asynchronous calls for some things; this is a proc that updates a load of read-only fields and takes up to about 10 seconds, something we needn't wait for.

Friday, 12 November 2010

Cannot retrieve message from Service Broker Queue

I thought I had followed the examples to the letter and yet when I tried to pass a simple message, it seemed to send but was never received by the queue. Eventually I found a view called sys.transmission_queue which holds all messages until they are delivered onto a queue (I thought it was just a log, so wasn't suspicious that it was still full of messages!). In the transmission_status column for all my messages was "Error 15517 State 1: Cannot execute as the database principal because the principal 'dbo' does not exist, this type of principal cannot be impersonated, or you do not have permission."
It was caused because I was working on a restored database backup, where the security doesn't really map back up properly since I am sysadmin on the server. Anyway, I eventually realised that all I had to do was call
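(The snippet below is a reconstruction of that call; the database name is a placeholder.)

```sql
-- Give the database a valid owner so the 'dbo' principal maps correctly
ALTER AUTHORIZATION ON DATABASE::[MyDatabase] TO [sa];
```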
which I presume sets the database owner to sa, and then I was able to make it work. All the previous messages were re-serviced and all ended up on my receiving queue.

Monday, 8 November 2010

Why web security is simply not good enough

The Royal Navy have egg on their face today as a hacker grabbed private information from one of their sites and released it to the public. The trouble is that the site will now be secured more tightly and we will "ensure it doesn't happen again", but these responses never address the underlying issue.
There are organisations like OWASP who are trying to push for a common framework for web application security, but everyone needs to be on board if it is to be implemented. We are talking about service providers, software vendors, the public and governments; if only a small fraction of these are involved, it won't happen.
Do people really understand the problems though? Although the common weaknesses in web sites are well known, it would appear that many people are either unaware of these weaknesses or else unable or unwilling to do anything about them. Just to give you a heads-up, here are some of the common problems which lead to insecure web application processes:

  1. There are simply not enough skilled people for the volume of web sites

  2. The web allows people to experiment and build their own sites which are not easy to distinguish from good sites. The end-user trusts each of these to more-or-less the same degree even though the amateur sites are unlikely to follow any good practices.

  3. You can buy or obtain off-the-shelf products to build your own sites and simply assume they are secure by default. These systems are not necessarily built well but you have to rely on experience and a wide user base to find and/or fix any insecure areas.

  4. There is no way to physically stop people from either not implementing security or making a mistake when they code a site.

  5. Many sites are fixed and updated regularly, so unless there is a secure deployment regime it is likely that at some point a security hole will be created.

  6. People tend to go with the easy approach rather than the secure approach. For instance, people use the database admin login to access the database, meaning a hacked site is as good as an open door to all your data.

  7. Even when using secure practices, it is possible to use them incorrectly (such as using a flawed regular expression when testing for valid user input). Some of these might have inherent bugs or incomplete functionality.

  8. People who write and deploy web applications do not require formal training of any sort to at least demonstrate they have been taught about secure practices, even if they cannot be forced to use them.

  9. Although frameworks like ASP.Net and PHP have various secure controls available, they are easy to ignore, whereas these frameworks could actually insist on secure programming even if some people don't like it. For instance, by default a text box control should not allow unusual punctuation; you should have to enable it manually on a per-item basis if required. Safety by default.

  10. Whoever has written the site, there is no worldwide-accepted certification that ensures a site is secure. There are mechanisms to prove who a site belongs to, and specific bodies like the Payment Card Industry have their own audit procedures, but there is nothing required for the general population.

  11. By default, once someone has connected their computer to the internet, they are both a potential menace and a potential weakness. It is like a load of criminals being moved into a residential area without telling anyone and then being surprised that the crime rate has increased.

Obviously any discussion about what could be done would be met with disapproval and accusations of censorship or the like, but in reality the system does not need everybody's approval, since it could be run as an "exclusive club". You would need a way to achieve and prove certification, including proving that your certification is valid (which could use digital certificates); this would involve an audit of procedures, processes and functionality - possibly against a specific release, possibly for your organisation in general, or possibly for you as an individual. This certificate then qualifies you for the 'green flag club', which indicates to end users that a site is about as good as it can be. That in turn would allow people to choose to run in an ultra-protected mode (for their own safety/benefit), which would push site authors to get their sites certified so that they can access the club (and provide the benefit of a more secure site). The un-certified sites can live in the badlands!
Right, when do we start?

Tuesday, 2 November 2010

Devil in the detail - customising the ASP BulletedList

When you do standard easy programming, you can cover a lot of ground very quickly. Modern tools and class libraries are very useful and do most things very well. When it comes to more specific programming, the devil is very much in the detail. For example, the other day I wanted to replace some hard-coded HTML generation, which made a tabbed divider, with a more standard ASP control. Since our old tab control was based on a <ul>, I found the closest control was an ASP .Net BulletedList.
It was easy at first since by setting various options on the list, I could get it to generate a <ul> with items and using the link button mode, it all looked pretty good. I applied the original CSS styles and I thought I was 99% there.
Oh, I thought, I really need to distinguish the selected item in the list: since the items represent tabs, I want to style the selected one differently. Since the bulleted list is supposed to be, well, a bulleted list, all the item selection from the base class ListControl has been overridden with NotImplementedExceptions, and even if an item is selected by getting the individual ListItem and setting Selected = true, the item is not rendered any differently.
Well, this wasn't a problem, since OO allows us to inherit and extend functionality. I implemented my own version of RenderContents() in a new class inherited from BulletedList, which renders a class attribute on the LI if it is selected. This was all great - OO is amazing! I then wanted it to remember the selected item between postbacks and to fire ListControl's SelectedIndexChanged event when the tab changed. This is where it all started kicking off. For a start, since BulletedList has overridden the ListControl SelectedIndex and SelectedValue setters, I could not call the original versions, which were fine; I had to use the Red Gate Reflector tool to find the original code and duplicate it in my class, overriding the overridings! I then found various references to internal and private functions which, again, I needed to duplicate in my code in order to use the code I had copied from the internal workings of ListControl (I only wanted to make small changes!). I then realised that since BulletedList already handled the click event and did not fire the SelectedIndexChanged event from ListControl, I would not easily be able to do what I wanted.
In the end, I decided the easiest thing was to inherit directly from ListControl, which would give me all the item selection code, and then add in anything from the BulletedList class that I needed (fortunately not all of it!). Then I could make my own class handle the postback event when it was clicked and fire the OnSelectedIndexChanged event from ListControl.
I removed any of the display modes and bullet types that I was not using to make it neater and got rid of the start bullet number. I also got rid of the various local variables used to "cache" values for loops. It seems that MS did not consider very well how their code would be specialised in sub-classes, which is why there is a hotchpotch of public, protected and private/internal functions, meaning that trying to base your code on the original does not work without lots of duplication (unless there are other public utilities to do the same things).
Anyway, if you're interested in the code:

/// <summary>
/// A specialisation of the list control which is based on link buttons and tab styles
/// </summary>
public class SelectableTabControl : ListControl, IPostBackEventHandler
{
    /// <summary>
    /// Apply the tab list style when the control initialises
    /// </summary>
    protected override void OnInit(EventArgs e)
    {
        base.OnInit(e);
        this.CssClass = "tablist";
    }

    /// <summary>
    /// Handle the data being bound
    /// </summary>
    /// <remarks>Ensure a default item is selected</remarks>
    protected override void OnDataBound(EventArgs e)
    {
        base.OnDataBound(e);
        if (SelectedIndex == -1 && Items.Count > 0)
        {
            SelectedIndex = 0;
        }
    }

    /// <summary>
    /// Add attributes of this control to the HTML
    /// </summary>
    /// <param name="writer">The HTML output stream</param>
    protected override void AddAttributesToRender(HtmlTextWriter writer)
    {
        string uniqueID = this.UniqueID;
        if (uniqueID != null)
        {
            writer.AddAttribute(HtmlTextWriterAttribute.Name, uniqueID);
        }
        base.AddAttributesToRender(writer);
    }

    /// <summary>
    /// Render the HTML for this control
    /// </summary>
    /// <param name="writer">The HTML output stream</param>
    protected override void Render(HtmlTextWriter writer)
    {
        // Don't render an empty <ul>
        if (this.Items.Count != 0)
        {
            base.Render(writer);
        }
    }

    /// <summary>
    /// Prevent this control from being given child controls
    /// </summary>
    protected override ControlCollection CreateControlCollection()
    {
        return new EmptyControlCollection(this);
    }

    /// <summary>
    /// Render the individual elements for this list
    /// </summary>
    /// <remarks>The default implementation does not render a class for selected items</remarks>
    protected override void RenderContents(HtmlTextWriter writer)
    {
        for (int i = 0; i < this.Items.Count; i++)
        {
            if (this.Items[i].Attributes != null)
            {
                this.Items[i].Attributes.AddAttributes(writer);
            }
            writer.AddAttribute(HtmlTextWriterAttribute.Class, this.Items[i].Selected ? "selecteditem" : String.Empty);
            writer.RenderBeginTag(HtmlTextWriterTag.Li);
            this.RenderBulletText(this.Items[i], i, writer);
            writer.RenderEndTag();
        }
    }

    /// <summary>
    /// Return the tag to use for the overall control
    /// </summary>
    protected override HtmlTextWriterTag TagKey
    {
        get { return HtmlTextWriterTag.Ul; }
    }

    /// <summary>
    /// Render the individual items from the list
    /// </summary>
    /// <remarks>Replacement for the base class function which references inaccessible values. This only includes the LinkButton style</remarks>
    protected void RenderBulletText(ListItem item, int index, HtmlTextWriter writer)
    {
        if (!this.Enabled || !item.Enabled)
        {
            writer.AddAttribute(HtmlTextWriterAttribute.Disabled, "disabled");
        }
        else
        {
            writer.AddAttribute(HtmlTextWriterAttribute.Href, this.GetPostBackEventReference(index.ToString(CultureInfo.InvariantCulture)));
        }
        if (AccessKey.Length != 0)
        {
            writer.AddAttribute(HtmlTextWriterAttribute.Accesskey, AccessKey);
        }
        writer.RenderBeginTag(HtmlTextWriterTag.A);
        HttpUtility.HtmlEncode(item.Text, writer);
        writer.RenderEndTag();
    }

    #region Code duplicated from internal .Net classes

    /// <summary>From System.Web.UI.WebControls.BulletedList</summary>
    private string GetPostBackEventReference(string eventArgument)
    {
        if (this.CausesValidation && (this.Page.GetValidators(this.ValidationGroup).Count > 0))
        {
            return ("javascript:" + GetClientValidatedPostback(this, this.ValidationGroup, eventArgument));
        }
        return this.Page.ClientScript.GetPostBackClientHyperlink(this, eventArgument, true);
    }

    /// <summary>From System.Web.UI.Util</summary>
    private static string GetClientValidatedPostback(Control control, string validationGroup, string argument)
    {
        string str = control.Page.ClientScript.GetPostBackEventReference(control, argument, true);
        return (GetClientValidateEvent(validationGroup) + str);
    }

    /// <summary>From System.Web.UI.Util</summary>
    private static string GetClientValidateEvent(string validationGroup)
    {
        if (validationGroup == null)
        {
            validationGroup = string.Empty;
        }
        return ("if (typeof(Page_ClientValidate) == 'function') Page_ClientValidate('" + validationGroup + "'); ");
    }

    /// <summary>From System.Web.UI.Control but minus SupportsEventValidation which is far too esoteric (and inaccessible)</summary>
    internal void ValidateEvent(string uniqueID, string eventArgument)
    {
        if (this.Page != null)
        {
            this.Page.ClientScript.ValidateEvent(uniqueID, eventArgument);
        }
    }

    #endregion

    /// <summary>
    /// Gets whether the All tab is selected (which is always the last one)
    /// </summary>
    public bool AllTabSelected
    {
        get { return SelectedIndex == Items.Count - 1; }
    }

    #region IPostBackEventHandler Members

    /// <summary>
    /// Handle a postback event occurring.
    /// </summary>
    /// <param name="eventArgument">The value of the event argument</param>
    /// <remarks>This event is raised when the page is posted back; if the postback was caused by this control then the eventArgument is the index of the tab that was clicked</remarks>
    public void RaisePostBackEvent(string eventArgument)
    {
        ValidateEvent(this.UniqueID, eventArgument);
        if (this.CausesValidation)
        {
            this.Page.Validate(this.ValidationGroup);
        }
        // Select the clicked tab and fire the ListControl event
        this.SelectedIndex = Convert.ToInt32(eventArgument, CultureInfo.InvariantCulture);
        this.OnSelectedIndexChanged(EventArgs.Empty);
    }

    #endregion
}

Sunday, 24 October 2010

wget doesn't get all levels

I had a problem trying to grab a whole directory structure of files from an ftp site using wget. When I tried to get the files directly they were fine but when trying to get the whole site, it didn't get certain files which definitely existed.

It turns out that wget defaults to 5 levels of depth, even when downloading physical files over FTP (rather than following links). I changed it so that it uses 10 levels instead with -l10 and it was fine. The number of levels makes sense when you see the default output structure of servername/dir1/dir2/dir3/dir4, even though your files might appear to be at level 4.

I'm not sure if rsync is easier to use for this purpose but the wget command is easy enough:

wget -r -N -l10 ftp://username:password@servername/dir

Where -r is recursion, -N only gets items that have been modified and -l10 increases the number of levels to 10 (That is 'L' 10, not 110!)

Tuesday, 19 October 2010

Document Generation with Word 2007 and ASP.NET - part 2

In the first part, we looked at how to build a document ready to accept dynamic data which can be used to generate dynamic reports. Whether we are planning to use a blank template, individual read/write documents or merging documents together, the basic idea is the same:

  1. Get data from data source

  2. Create XML part for this data

  3. Insert this data into the document

  4. Return the result to the client

Getting data from the data source is beyond the scope of this document but in my case, I have an existing database access layer which I call to return a single row of data (a DataRow).
Creating the XML part is quite easy using the XmlWriter class as in the following function:

private void GetData(Stream stream, string quoteRef)
{
    // DataRow dr = etc..
    XmlWriter writer = XmlWriter.Create(stream);
    // The root element matches the XPath used by the content controls ("/Quote/...")
    writer.WriteStartElement("Quote");
    writer.WriteAttributeString("Reference", quoteRef);
    writer.WriteElementString("QuoteName", "Test New Quote");
    writer.WriteElementString("TotalSellingPrice", Convert.ToDecimal(dr["TotalSellingPrice"]).ToString("N2"));
    writer.WriteElementString("TotalCost", Convert.ToDecimal(dr["TotalCost"]).ToString("N2"));
    writer.WriteEndElement();
    writer.Close();
}

Since the XML is text, it is convenient to format numbers at this point rather than playing with it in the document (but you can if you need to).
To insert this into the document, we need to use the System.IO.Packaging classes which allow you to work on the zip file (which is the docx). You might need to reference WindowsBase.dll if you haven't already to get these classes.
My function is then:

private void InsertCustomXml(MemoryStream memoryStream)
{
    // Open the document in the stream and replace the custom XML part
    Package pkgFile = Package.Open(memoryStream, FileMode.Open, FileAccess.ReadWrite);
    PackageRelationshipCollection pkgrcOfficeDocument = pkgFile.GetRelationshipsByType(strRelRoot);
    foreach (PackageRelationship pkgr in pkgrcOfficeDocument)
    {
        if (pkgr.SourceUri.OriginalString == "/")
        {
            // Add a custom XML part to the package
            Uri uriData = new Uri("/customXML/item1.xml", UriKind.Relative);
            if (pkgFile.PartExists(uriData))
            {
                // Delete template "/customXML/item1.xml" part
                pkgFile.DeletePart(uriData);
            }
            // Load the custom XML data
            PackagePart pkgprtData = pkgFile.CreatePart(uriData, "application/xml");
            GetData(pkgprtData.GetStream(), "QUO-016952");
        }
    }
    // Close the file
    pkgFile.Close();
}

This function takes a stream which represents a docx document and adds the custom XML in. There is nothing in this function which you would change (except the string parameter passed to GetData), but you do need a const defined which matches the correct namespace:

private const string strRelRoot = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

The reason I use streams is because I use the system in different ways. To retrieve a single document:

public MemoryStream RetrieveDocument(string fileName, bool addCustomXml)
{
    // Read the file into memory - default constructor is expandable
    MemoryStream memoryStream = new MemoryStream();
    byte[] buffer = File.ReadAllBytes(fileName);
    memoryStream.Write(buffer, 0, buffer.Length);
    // If we want to add in the XML, do it here otherwise we might want to add it at top level
    if (addCustomXml)
    {
        InsertCustomXml(memoryStream);
    }
    return memoryStream;
}

NOTE the comment that you need to use the default constructor for the memory stream otherwise it will not be expandable when you insert your custom XML.
If I am merging several documents (in my case using the Aspose Words dlls from Aspose) I merge the docs and THEN add the custom XML to the main document:

public Document MergeDocuments(List<MemoryStream> docs, string templateDoc, string outputPath)
{
    Document dstDoc = new Document(templateDoc);

    foreach (MemoryStream Doc in docs)
    {
        Document srcDoc = new Document(Doc);
        dstDoc.AppendDocument(srcDoc, ImportFormatMode.UseDestinationStyles);
    }
    // Add in the custom XML via the memory stream
    MemoryStream ms = new MemoryStream();
    dstDoc.Save(ms, SaveFormat.Docx);
    InsertCustomXml(ms);
    dstDoc = new Document(ms);
    return dstDoc;
}

Aspose merges actual docs whereas I need a memory stream for my functions, so I use the Aspose Document class to append documents and then convert the result to a memory stream to add the custom XML; once that is done, I return a new Aspose Document so it can be saved appropriately. One of the things that is easier in docx is that to merge docs you can simply merge the content and keep a single set of styles, custom XML and the rest of it, but it does mean that any custom XML/styles in the individual docs are lost when the documents are merged.
In my web layer then, I call these functions and return the result to the browser using content-disposition to hint that it can be saved instead of viewed inline:

protected void Page_Load(object sender, EventArgs e)
{
    TenderGenerator gen = new TenderGenerator();
    const string TemplateFile = @"~/App_Data/QuoteTemplate.docx";
    List<MemoryStream> myList = new List<MemoryStream>();
    myList.Add(gen.RetrieveDocument(@"c:\work\QuoteDocuments\QUO-016952\Documents\UniqueReport.docx", false));
    myList.Add(gen.RetrieveDocument(@"c:\work\QuoteDocuments\QUO-016952\Documents\Financial Summary.docx", false));
    Document doc = gen.MergeDocuments(myList, Server.MapPath(TemplateFile), "");
    doc.Save(Response, "CustomerDocument.docx", ContentDisposition.Attachment, SaveOptions.CreateSaveOptions(SaveFormat.Docx));
}

In this case, I have a test page so the doc paths are hard-coded and the whole thing is driven from the page load event, in real life, this will be driven from a button press and database driven list of docs. Note that I use the template file stored in App_Data which already has the custom XML linked to it so that the top level document can have the custom XML replaced (in the previous post I mentioned that for some reason, if the custom XML is not present, the new data is not added in its place). Hopefully this is all easy enough to understand.

Document Generation with Word 2007 and ASP.NET

I had been asked to look at this for a project and spent many an hour trawling the interwebs to find out about solutions. The thing that makes it hard is that MS have a tendency to change technologies frequently (although generally for the better) and their MSDN articles are galaxian in size so trawling the various guides, APIs, white papers, support and other pages takes time.
Anyway, I've found some helpful articles and thought I would write a very easy to follow guide to getting the basics working.
If you want to generate a document on-the-fly, chances are that you want to populate the document with dynamic data from some form of database or other data source (the actual source is not important). Currently, this has to be done with various horrible COM technologies or otherwise by directly hacking Word 2003 XML files.
Word 2007 has the ability to link data controls on the page to XML data in the document, which can be generated dynamically. A Word 2007 document is actually a ZIP file (rename it and see!) which contains the various related XML files that form the overall document. The XML which we will use to generate docs is custom XML and can follow any schema or none. In my case it is a very simple XML file with nothing more than the XML declaration and a single entity with 3 elements. Because this custom XML is separate from the rest of the document, I can change it at my leisure without affecting any formatting.
Now when adding these data controls, known as Content Controls, there are certain things that are not that obvious. Firstly, the controls can only be linked easily to the XML using code. There is a whole schema thing where you can define a schema and tell Word about it, but I went another route: creating a Ribbon add-in for Word and using C# code to add a correctly linked Content Control to a document. These can be added without the XML being present (they just won't have any data until the XML appears at some point). If you create a Word add-in project in Visual Studio and add a Ribbon (XML) component, which is fairly easy, you can then add the code, which will look something like this (note the public modifier on the button handler):

// In the handler for a certain button - Ribbon1.cs
public void OnSellingPriceButton(Office.IRibbonControl control)
{
    Globals.ThisAddIn.CreateBoundDataItem("Total Sale", "TotalSellingPrice");
}

The reason this calls a function via the add-in is because the add-in has access to the underlying document whereas the ribbon doesn't (although I think it is possible to do something weird to gain access). The add-in code in my case is then:

public void CreateBoundDataItem(string theTitle, string theProperty)
{
    Microsoft.Office.Tools.Word.Document doc = Globals.ThisAddIn.Application.ActiveDocument.GetVstoObject();
    // Count is a static int used to generate a unique control name
    Tools.PlainTextContentControl plainTextControl1 = doc.Controls.AddPlainTextContentControl("plainTextControl" + Count.ToString());
    Count++;
    plainTextControl1.Title = theTitle;
    plainTextControl1.XMLMapping.SetMapping("/Quote/" + theProperty, null, null);
}

Special notes here! You CANNOT bind XML to the rich text control, so use a plain text control. Also, you need to get the VSTO object in order to get the correctly-typed document that exposes the Controls collection. And when adding the control you have to provide a unique name, so I use a static int variable in this class (Count) to generate one.
Obviously the SetMapping code (XPath) will be unique to your XML format, mine, as I said, is very simple. Also, the title property is what appears on the control tab when it is inserted into the document (as an aide-memoire). You can set other properties here like whether the data is readonly and whether the control can be deleted.
The cool thing about the add-in is that once it is built, you simply copy the output from the bin/relase directory into your AddIns directory in AppData (search for AddIns) and Word will automatically show the ribbon.
Once the ribbon is installed, it will create content controls pointing to whatever part of the data you need. At this point there may or may not be any XML in the custom parts of the document, but if you are using the Packaging classes in C# (which we use at the server end), it appears that they can replace the XML but not create it (might be some permissions thing?), so you will need to make sure that the document starts with a correctly-formed custom XML part, even if the element data in it is blank or invalid. You can do this by running some VBA in Word 2007 with your document open (you will need to enable the Developer tab in Word Options); in the Immediate window, type the following line and press Enter to invoke it:

ActiveDocument.CustomXMLParts(4).Load ("c:\CustomerData.xml")

You need to use the index 4 (the doc already has 3 built-in ones) but your path to the XML will obviously be wherever it is. The system will create an item called item1.xml in the customxml directory inside your docx zip structure and the fields will immediately be available to your content controls.
That is the document part done. You have a couple of alternatives now, depending on what you are doing, but both work in the same way. You can either keep a single document with the custom XML present but no content and use this as a basis for any generated documents, or you can save a bunch of different documents, each with the custom XML in place, and then work on these individually or together. It depends on whether your generation is one-way or whether documents will be loaded and saved.
The next post will be the server side which will look at the various options.

Thursday, 14 October 2010

Embedding Excel cells into Word 2007

I'm trying to write something that generates Word documents from a database which can then be modified by a user to include optional fields - things that will change and need updating in a document.
I'm having a bit of a struggle working out how to do it, but I thought about old-school DDE and embedding LINK fields into Word. It is not ideal since it requires an absolute file path, but I thought I would try it out. I could then create building blocks based on these LINKs and allow users to add them in. I attempted to copy the example in the Word help and, surprise surprise, it didn't work at all. I kept getting various errors, all amounting to the fact that the field was invalid. I tried all sorts of options and had almost given up when someone mentioned an easier way to do it!
Copy the cell(s) in Excel that you want to use and then use Paste Special in Word to paste as a link (it's an option on the left-hand side of the dialog). It will automatically create the correct format for the LINK field (in my case some slightly different switches) and should all be good.

Friday, 1 October 2010

When a SQL JOIN is slower than it should be

I've experienced the situation a few times where select * from viewA runs very fast and select * from viewB is very fast, but when they are joined together they are very slow (like 10 times the execution time).
Today I also had the scenario where joining an old view was fine, but joining a new one, which was not only virtually identical but had fewer columns and one less join, took 30 minutes instead of 10 seconds!
I have experienced minor improvements by changing join orders and trying to use more efficient WHERE clauses, avoiding SELECTs in the WHERE clause etc, but this was a whole other league of performance issue. When you only join two views you don't have many options, but I tried simplifying the ON clause (no difference), removed the GROUP BY (no difference), removed the column which was summing an amount (no difference), and then I realised the only thing remaining was the JOIN. I changed this from an INNER JOIN to a LEFT OUTER JOIN and all of a sudden I was back up and running.
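To illustrate the shape of the change (the view and column names here are invented for the example):

```sql
-- Before: dire performance
SELECT a.AccountID, SUM(b.Amount) AS Total
FROM dbo.viewCustomers a
INNER JOIN dbo.viewBalances b ON b.AccountID = a.AccountID
GROUP BY a.AccountID;

-- After: the same results in this case (every row in viewCustomers
-- has a match in viewBalances), but a far better plan
SELECT a.AccountID, SUM(b.Amount) AS Total
FROM dbo.viewCustomers a
LEFT OUTER JOIN dbo.viewBalances b ON b.AccountID = a.AccountID
GROUP BY a.AccountID;
```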
As it happens, there are no rows in view 1 which do not appear in view 2, so the INNER JOIN isn't actually required here, but I was still surprised. I looked into a few web posts and the best answer was "an inner join is an implicit outer join plus a where-column-is-not-null filter", but I don't actually believe this and still wouldn't expect the problem on my 500-row dataset. Obviously somehow it is a problem (although this is by FAR the worst case of it I have ever seen).

Monday, 20 September 2010

Process is more important than Software

I have worked in various places over my nearly 20 years of working life and what surprises me is that it is often the basics that let us down, not subtleties or unusual scenarios. How many times do utility companies mess up a change of name or address? How many times does an airline lose luggage or mess up reservations?
I think most of us understand an extraordinary situation causing problems for any company but how do so many let the basics make them look like idiots?
One of the basics is the idea of a process. If you work out how to make the perfect cookie, do you try and guess the ingredients the next time you make some? Of course not, it goes without saying that you write down the recipe and follow it on the basis that if it worked the first time, it will work the second, all things being equal. I think it is the last part that seems to make many businesses consider processes either ineffective or maybe not even essential for business except maybe on a production line.
Consider my humble industry of Software Development. The first time you write a piece of software and get bitten by, for instance, somebody typing in dodgy input which makes the system fall over, you would think, "Ah, if I add that to a checklist for code reviews then it won't happen again". OK, we have to accept that we won't/can't necessarily retro-fix all the other places in code (although we should certainly consider it) but at least all NEW code will be OK. Although weaknesses do get found all the time, the total number in a given scenario is not massive so we can get to a place that only the newest exploits can affect our code.
This also works at a service level. Think of a support centre: it doesn't take much to look at the number of calls, the type of calls and the amount of time people aren't doing anything, and work out how to improve the situation and free up resources. You might decide that you need a frequently-asked-questions page, or a current status web page or phone message that avoids calls from lots of people about something you already know about. The improvements to process and the lessons you learn don't change every day, so over time you can hopefully employ fewer people to do more.
The title alludes to people who want to buy in software to replace a current manual system but without thought to the process itself and what could be improved. Since computers and people are good at different things, it is safe to assume that a computer-dominated process is going to be different from how things are done manually (a computer can run air-traffic control with fewer people because it can constantly detect and even correct conflicting air movements, something that is much more complicated to do by hand).
Anyway, consider your processes and what you are actually trying to gain as a result, and you might find a whole load of people who are employed simply to support an unnecessary business practice. You can then redeploy them to places that need extra resource at no extra cost to the company, or you can let people leave and not need to replace them! Wonderful.

The 'Beta' ASP Calendar Control

We are having problems on our web site with the jQuery datepicker when used inline to look like a calendar. The problem only exists in IE8 (surprise, surprise) and it is fine in Firefox, but anyway, I was recommended the ASP.NET Calendar control which, although a server control, might do everything we needed it to.
In terms of functionality, I think it provides all the things we need: you can say whether you are allowed to select months, weeks and/or days, and you can also style various aspects. I must admit there were a few things which made it look a bit amateur though.
The colour for the day table cells is hard-coded to black so you cannot override this in the styles. Also, you cannot set the styles for the selected day because, although it is theoretically possible to do, the styles are not used and are not reported back to the code-behind either apparently (very sloppy Microsoft, do you have testing procedures?). The other thing that you can't do is apply a margin to the day cells (which I was trying to do to match the jQuery calendar). This is presumably because the day cells also have a hard-coded width of 14%, which takes precedence over the request for a margin, which would require wider cells.
My conclusion? Well, to be pragmatic, this control works where the jQuery one doesn't despite being a server control which is less than ideal. Also, because it is a server control, things like date selection all work with the correct types and I don't have to cast between strings and dates in the code-behind. I guess I'll use it.

Wednesday, 8 September 2010

Error 401: Unauthorised when calling web service

I was calling a web service from code and got the dreaded 401 error. The web service was fine from a browser (but I couldn't execute anything since the test page only works on the local server and this one was remote). I used the Visual Studio "Add Web Reference" dialog and logged into the HTTPS web service, and it all seemed fine.
In the end, I ran up Fiddler (the web debugging proxy) and accessed the web service in the browser. The response header told me that the web service was expecting "Negotiate/NTLM" authentication while I was attempting to authenticate using Basic. It now works and here is the code I had to use:

// Requires: using System.Net;
MyWebService service = new MyWebService();

// Build a credential cache offering Negotiate (Kerberos/NTLM) for the service URL
CredentialCache myCredentials = new CredentialCache();
myCredentials.Add(new Uri(service.Url), "Negotiate", new NetworkCredential("MyUserName", "MyPassword"));

service.Credentials = myCredentials;
service.PreAuthenticate = true;   // send the credentials with the first request
// Call functions etc

Note, I did not require any special certificate code which some people have suggested and this works on an HTTPS connection. I also did not need to specify the domain in the Add() function, just the username and password.

Friday, 3 September 2010

How to speed up SQL queries

One of my pet hates is people going to a computer forum and asking something like, "I have a query that takes 50 seconds, how can I reduce it to 10?" without a copy of their code, and with all the telltale signs that they are nowhere near qualified or experienced enough, expecting people to simply do their work for them. Anyway, I want to partially mitigate this by giving some beginners' advice on improving the performance of queries. After reading these, you should be able to write good queries in the first place; it is definitely easier than rewriting something and hoping you haven't changed its behaviour.

Structural Improvements, i.e. always do this

  1. Avoid carrying out functions like LEFT, LEN etc on columns in queries; applying a function to a column in a WHERE or ON clause usually stops SQL Server using an index on that column. These probably betray a poor design (like joining on varchar columns) but may sometimes be unavoidable

  2. Use WHERE and ON correctly. People sometimes get these confused. Your ON clause should ensure that the rows that get joined are logically matched and then a WHERE clause is to optionally reduce the result set. If you have an ON which is not complete enough, you will multiply the join significantly and then have to reduce it after joining using your WHERE clause.

  3. Do NOT use CURSORS; in most (if not all) cases you can use a set-based JOIN to avoid them.

  4. Do NOT use correlated selects in a WHERE clause. It means every row that is matched needs to call another select (300,000 rows = 300,000 selects)

  5. Test your query on large datasets before deciding it performs acceptably.

  6. Only select columns you need, not *

  7. Design your DB tables to link using ID numbers and not text/varchar columns
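A minimal sketch of points 2 and 4 above, using hypothetical Orders and Customers tables: the complete ON clause keeps the join logically matched, and the correlated sub-select is replaced with a derived table that is computed once.

```sql
-- Hypothetical schema: Customers(CustomerId, Region), Orders(OrderId, CustomerId, Total)

-- Poor: join condition buried in WHERE, plus a correlated sub-select per row
SELECT o.OrderId
FROM Orders o, Customers c
WHERE o.CustomerId = c.CustomerId
  AND o.Total > (SELECT AVG(Total) FROM Orders o2 WHERE o2.CustomerId = o.CustomerId);

-- Better: explicit ON for the logical match, averages computed once in a derived table
SELECT o.OrderId
FROM Orders o
INNER JOIN Customers c ON c.CustomerId = o.CustomerId
INNER JOIN (SELECT CustomerId, AVG(Total) AS AvgTotal
            FROM Orders
            GROUP BY CustomerId) a ON a.CustomerId = o.CustomerId
WHERE o.Total > a.AvgTotal;
```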

Design Improvements, i.e. you might need to do this

  1. Use the query analyser to suggest index creation. Understand that indexes usually result in faster selects but slower insert/update/delete so depending on how often a table is read compared to how often it is updated, this may or may not be acceptable. The alternative is an indexed cached table (updated every X minutes)

  2. Select the data you need from tables into temp tables or table variables before joining, especially if the dataset is large, the table is in another database or you need to join on it multiple times. You might also want to create indexes on these temp tables. Consider joining a table with 500,000 rows and 80 columns that takes up 180MB: creating a temp table with 8 of those columns, and perhaps using a WHERE clause to reduce it to 100,000 rows, will obviously perform much faster.

  3. Use cached data if acceptable rather than joining raw tables. If the data you need does not need to be up-to-the-second correct, you might write a proc that updates a cached table every 15 minutes locally and then use this in your join
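Point 2 above might look like this (the table and column names are hypothetical):

```sql
-- Pull only the columns/rows needed from the wide table into a temp table
SELECT BigId, Name, CreatedDate
INTO #Slim
FROM OtherDb.dbo.BigTable
WHERE CreatedDate > DATEADD(MONTH, -3, GETDATE());

-- Index it if you will join on it repeatedly
CREATE CLUSTERED INDEX IX_Slim_BigId ON #Slim (BigId);

SELECT s.Name, x.SomeColumn
FROM #Slim s
INNER JOIN dbo.LocalTable x ON x.BigId = s.BigId;

DROP TABLE #Slim;
```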

Tuesday, 17 August 2010

Why are specs important?

When I worked at my previous company, I always felt that specs were annoying and inconvenient and didn't really serve much purpose other than ticking boxes for quality auditors or whatever. Then I came to my new company, which has almost zero procedures or specifications, and I now miss them. For those of you out there who are not convinced, or those faux experts who claim "at the end of the day you have to start coding and see what happens", here are reasons why paperwork is important:

  • It is much quicker to define and review a concept or system on paper than it is to write the code. It is almost impossible to review a completed system in code because of the volume of information and any changes are generally impossible (at least to achieve in a robust way)

  • A document is easier to review as a team, including with non-technical people/managers who can understand basic system information but not the detail

  • It gives visibility of intent to the customer, especially if there is more than one customer (this includes product managers and stakeholders). It is easier to argue and agree functionality when it is a single paragraph of text than to rip up a load of work later on in code and attempt to rewrite it without breaking things.

  • It allows changes to be measured against what was originally agreed, which means it's easier to say, "no, we're not changing it because it was agreed" and equally easier for a customer to say, "this is not working compared to what was agreed, please fix it".

  • Changes to the functionality can be formally agreed and recorded electronically so no-one is in a position to easily argue that the system is "wrong" which usually means they personally think it should be different.

  • Often people will disagree about the intent or content of certain functionality, perhaps as a balance between management requirements and the additional workload required, but since these issues are agreed up front, either the manager can accept a reduction in functionality to appease the worker, or they will insist that the worker does what the system requires of them even if they don't personally agree.

  • Various interworkings and calculations are much easier to write and review on paper before the sometimes significant amount of work required to actually implement those calculations.

  • By having processes that govern the production of specifications, if something bites you in the released product or perhaps in test, a note can go into the process that might say something like, "ensure you have considered any additional security that might be required" or "Consider the fact that you will interact with new data and it will probably require validation to be defined"

Wednesday, 23 June 2010

SQL Server Security - Roles, Logins, Users, Schemas and Permissions

When you first approach the area of security in SQL Server 2005, it can seem very complicated. If you search on the internet you will find thousands of sites which all seem to assume that you mostly know what you're doing, but if you're like me, it is hard to know where to start.
Well here is a brief and hopefully newbie-friendly guide to SQL Server security.
Firstly, the 'why'. Many programmers appear to take the lazy route when setting up security: you take the path of least resistance and highest privilege so that your application simply works and doesn't hassle you with security errors. What is wrong with that approach? Well, if you are the only person who works on the system and it is on a private network, that might be the most practical solution causing the least work, but what if your system starts being used by more people? What if you want to expose it to the www? You can retro-fit security, but it is much more time-consuming than thinking about it at the start. The 'why' in the bigger sense comes down to the following:

  1. You do not want to accidentally change data in the DB

  2. You do not want somebody to do more (i.e. higher privilege) than they are supposed to do either accidentally or deliberately

  3. You do not want unauthorised people to do anything in your system

For 1. you might have some configuration data for your system stored in a table. Although you are strictly allowed to change it, you generally do not want to and do not want to accidentally change it. You could set the security to be 'select' only. For 2. you might have a table of 'items' which must only be updated by an elevated user and selected by normal users, you would need to set 'select' for a normal user and perhaps 'select, update, delete, insert' for an elevated user. For 3. you need to ensure that everyone who accesses the database is authenticated and no-one who is not can therefore do anything. This might include people who work for your company but are not allowed into this specific database.

The first design you need to undertake is to decide what schemas you want to create. A schema is a logical way to group objects (tables, views, procs) so that you can set permissions on the whole group rather than on individual items. A good working definition of a schema here is a group that does, or might, need different access permissions depending on the role of the currently logged-in user. Do not create too many since this can make maintenance messy. I have used schemas like admin (select for everyone, CRUD for admin users), writeonly (for error logs and the like, only allow select and insert), general (anyone can have CRUD access), readonly (items pulled in from elsewhere and which are not updated in this DB, also for views), locked (admin tables that are important and unlikely to have rows deleted so have no delete permission) and cache (tables that are dropped and recreated for caching things). If you notice, each of these has a unique set of permissions applied, both for system resilience and for security.

Once I have designed my set of schemas, I need to create them for the database under /security/schemas and then I either need to transfer over all the objects I foolishly created in the dbo schema or create new objects into these new schemas.

I now have to consider the sort of database roles for my users. This is NOT an authorisation system for a web application, so I don't need to distinguish at too low a level (e.g. user A can access table A and B but user B can only access table B), otherwise I end up with a lot of maintenance for no benefit; the authorisation at page level should be done in the web app. What I might have, however, are roles for admin, general, readonly and perhaps a couple of specials like finance, who might be able to read stuff that most people can't. These are created under /security/roles/database roles. Once you have created the roles, you can, if you want, select the schemas and add permissions for the newly created roles. For instance, you might give the cache schema full permissions for everyone, whereas the admin schema gives full access to the admin role and only 'select' to general and readonly users.
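As a rough T-SQL sketch of the roles and schema permissions described above (the role and schema names are the ones from my example):

```sql
CREATE ROLE admin;
CREATE ROLE general;
CREATE ROLE readonly;

-- Everyone gets full access to the cache schema
GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA::cache TO general;
GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA::cache TO readonly;

-- The admin schema: full access for admins, select-only for everyone else
GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA::admin TO admin;
GRANT SELECT ON SCHEMA::admin TO general;
GRANT SELECT ON SCHEMA::admin TO readonly;
```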

Everybody who logs into the database needs a login. You can have a single login and get your application to log in as this single user. The problem is that you then lose the ability to make the most of roles and permissions, because everybody appears as one 'person' to the DB, which will then need access to everything for all users, which defeats the purpose. Ideally you should use Windows integrated security if you can, and remove the anonymous login ability on the web server. You will then need to create a login for each user on the database server under /security/logins that includes the domain name, e.g. DOMAINNAME\username, which will connect into Active Directory and which avoids the need to manage user security in your app. Failing the ability to use Windows authentication, you can use SQL Server logins, which will still need creating and which you will then need to manage in your web app, including a secure login page to ensure the login details cannot be 'sniffed'. You create the login in the same place but select "SQL Server Authentication" rather than "Windows Authentication". You would then need to set the password policy information, which again is important if you have opted for SQL Server logins.

In order to use these logins in your roles, you need to create a user IN EACH DATABASE to link to the login. The database user can then be added to a role when it is created. Under /security/users, right-click and select "New User". This will bring up a simple dialog which allows you to give the user a name (related to the database, it does not have to match the login), tell it what login this user relates to, e.g. DOMAINNAME\username, and then tick what roles this user is in at the bottom. In more complex setups you might assign the user to own certain schemas, giving them the ability to control and alter these schemas, but for now ignore that. You do not have to add the user to a role at this point.

So for user BrinerL you might have:

server login: MYDOMAIN\brinerl
database user: brinerl -> MYDOMAIN\brinerl
belongs to role: admin
admin has permissions: Insert/Update/Select/Delete/Execute on all objects.

It's quite straightforward really. You can automate the adding of the logins with a stored proc if required, so that perhaps the first time somebody logs into the web site, it creates them a login and user and gives them the basic readonly role. If they want more, then a DB admin would need to either change their roles in Management Studio or log in to a page that can carry out the necessary work from some sort of front end.
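The BrinerL example above can also be scripted rather than clicked through in Management Studio (the domain, user and role names come from the example; the database name is hypothetical):

```sql
-- Server-level login (Windows authentication)
CREATE LOGIN [MYDOMAIN\brinerl] FROM WINDOWS;

-- Database user mapped to that login, created inside the target database
USE MyDatabase;
CREATE USER brinerl FOR LOGIN [MYDOMAIN\brinerl];

-- Add the user to the admin role
EXEC sp_addrolemember 'admin', 'brinerl';
```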

Wednesday, 12 May 2010

Ajax, javascript and postbacks

The number of times I have had this problem! You create a page that uses UpdatePanels and MS Ajax which, to be fair, is pretty good and works with little understanding of what's going on. Then, however, we also use javascript libraries for added coolness and to carry out work on the client, and need to manually call a postback on the server after carrying out work in javascript. You manually call __doPostBack and you might get a postback but it doesn't call the event handler, you might get a premature postback, or maybe no postback at all. To make things worse, the name and id of the control have been mangled by the placeholders, so the btnExit id might now be ctl00_MyContentPlaceHolder_btnExit. It all gets confusing. Well, here's what you do:

Create your button as usual and allow its names to be mangled (not sure if you have any choice!). Then add a click handler, either using OnClientClick or by binding one in jQuery. If using jQuery, call e.preventDefault() on the passed-in event, which will ensure you don't get the auto-postback which you get with submit buttons, and then carry out your javascript functionality, possibly with the result deciding whether to postback. When you're done, call __doPostBack() with the NAME of the button control (the one with dollar signs usually) and a blank string for the arguments. You should have no problems. e.g.

__doPostBack('ctl00$BodyContentPH$ButtonAddLocation', '');

Thursday, 15 April 2010

MediaWiki and Blank Edit Tab

I installed MediaWiki at work not long ago and can't remember anything being amiss which is why I can't really remember it. Anyway, I installed it at home and the installation all appeared to be smooth, the database was created correctly and the page told me it was all successful but when I clicked on the Edit tab, the page was blank. No HTML at all, no nothing.
I couldn't find any specific issue on the web which surprised me but I assumed it must be my configuration options which it didn't like. I eventually found a post talking about upping the memory limit otherwise certain operations would fail.
It is slightly confusing because there is an entry in /etc/php5/apache/php.ini which I had already changed to 32M and which hadn't worked but there is also an entry in the LocalSettings.php of the wiki site itself and this also needs to be upped. I changed it, didn't have to restart anything, refreshed the page and voila - it worked.
A bit surprising that the setting in the download was not even sufficient to edit the main page which has hardly anything on it, it would have been more of an issue editing some massive page perhaps.
Well, never mind, the fix was easy after all that. Note that there have been issues reported about international character handling causing the same symptoms.

Wednesday, 14 April 2010

Not another "security breach"

I was reading today about hackers who compromised Apache.org's servers and obtained the passwords of various users. An interesting part of the article is this: "The breach, the second to hit in eight months, also exposed a much larger list of passwords belonging to people who accessed the site's bug-tracking section. While the databases used a one-way hash to disguise the passwords, two of the lists are vulnerable to dictionary attacks because Atlassian, the maker of issue-tracking software used by Apache, failed to add "random salt" to them."

Now in short, a hash is supposed to change some text, say a password, into something that doesn't bear any obvious relationship to the original password. When someone logs in, the password they type is hashed in the same way and compared to the stored hash without ever having to actually know the password. This is all well and good except that hashing algorithms are publicly available, which means it is possible for a hacker to run loads of dictionary words through the hash and store the results in a database table. Then if you know the hash, you can do a lookup in your database table and, assuming it is a normal enough word, you will probably find the original password. Couple this with people who use the same passwords for lots of systems and disaster is afoot (assuming you can gain access to the hashed passwords). So hashing is not in itself a silver bullet security system. HOWEVER, it is possible to make it more secure with something called salting which I have covered in an earlier post.
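As a rough sketch of salting (in T-SQL, since most of this blog deals with SQL Server; the sizes and algorithm choice are illustrative, and CRYPT_GEN_RANDOM needs SQL Server 2008 or later):

```sql
DECLARE @password NVARCHAR(128) = N'correct horse';

-- A fresh random salt per user defeats precomputed dictionary tables
DECLARE @salt VARBINARY(8) = CRYPT_GEN_RANDOM(8);

-- Hash salt + password together; store BOTH @salt and @hash against the user
DECLARE @hash VARBINARY(20) = HASHBYTES('SHA1', @salt + CAST(@password AS VARBINARY(256)));

-- To verify a login attempt, re-hash the supplied password with the stored salt
-- and compare the result to the stored hash.
```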

Again, the main point is not that there was a failure, but that all these vulnerabilities are already known about yet they still occur, even in high-profile organisations like Apache. At what point is something like OWASP going to become a mandatory badge for web software, so that it includes tickboxes like "I have added at least 8 bytes of salt to passwords before hashing"? Whether the badge is voluntarily added or verified by an auditor, at least when failures occur, companies can be prosecuted for not adhering to the checks. It seems so obvious but sadly I can't see it happening for a while.

Tuesday, 23 March 2010

Website images that suck!

I don't know how much longer I can take this. Why can people not learn to use the correct image formats for the type of picture they are displaying on a web site (or any other media for that matter)?
I just visited a web site with a solid coloured logo all covered in strange artifacts because it was saved as a jpeg - it looked terrible and even the uninformed should know that it is wrong. It is then a matter of finding out why and fixing it. I also saw an advert for, amongst other things, a printing company whose graphics were messy and again covered in jpeg artifacts and when I had some business cards printed, they managed to make the small font look messy because obviously they converted the image I sent them to a jpeg and made the nice sharp card look stupid.
Right, here it is, a simple introduction to what image formats you should use and why.
It all comes down to compression, or making an image take up less space by cleverly reducing the amount of information stored on your computer. Why? Well, although your computer might have loads of space, an uncompressed image (called a bitmap) can be, let's say, of the order of 5 to 10 megabytes in size. While this is not massive, when it is being downloaded by a web browser it can be very slow, and it doesn't need to be this large.
How do we compress? There are two ways. The best way is called lossless compression and involves looking at areas of the same colour and rather than storing every dot (or pixel), we simply save something like '20 dots of black' (obviously not in readable text like that but in 1s and 0s), the reason this is lossless is that when we uncompress the image to print it on the screen, we get back exactly the same image we started with. Sort of like deflating a football and pumping it back up again. Logos with big solid colours will compress really well using lossless compression. Unless you have other reasons not to, use the PNG format. This handles millions of colours and can even store transparency so your image can go over something else and you can see through it!
So what else is there then? If this all sounds too good to be true, unfortunately it is when it comes to photographs. Quite simply there are not enough of the same colour pixels to compress it well using lossless compression e.g. PNG (it will save but won't be very small). For this reason, a Joint Photographic Experts Group worked out how you could compress photographs even though they have millions of dots. The result is the famous jpeg format. This format is lossy. In other words, if I compress using jpeg and then uncompress, I do NOT get the original picture but something that is acceptably close - it is a compromise. In fact, you can set the quality level so that poorer quality makes for a smaller image. This might be used for thumbnails. The main thing to remember here is that jpeg is for photographs NOT solid colour logos. If you save a solid image using jpeg, when it is uncompressed, you get those horrible feathered edges and dotty artifacts in the empty areas - something that wouldn't be noticeable on photos.
You now have no excuses:
photos = jpeg/jpg and solid colour images = png

Monday, 22 March 2010

Unable to start debugging the web server 500.19

My debugging in Visual Studio (VS2008) stopped working this morning; I think it was as a result of Windows Update on Windows 7 adding some "security updates". I run out-of-the-box VS2008 (SP1) and the IIS7 built in to Windows 7, and I didn't even get the browser, just some non-resizable window with what looked like the HTML source for a browser error and 500.19. After being annoyed, I decided to try it in a browser with no debugger and got some more useful information, but still didn't understand what had changed (an "invalid config data because I can't read it" type of error).
Anyway, eventually I added the IIS_IUSRS group to my code directories with read access (only) and it all started working again.
Not sure whether this permission was not required before, or was implied, or whether one of the updates removed its permissions from everywhere, but it was a bit of a pain, and it was not obvious which 'user' the system was trying to access the data with. I assumed the debugger was running as me and I had full permission, but obviously somewhere it gets changed from me to IIS_IUSRS!

Thursday, 18 March 2010

Prevention vs Detection

I read another story about government (or in this case the Police) adding audit trail functionality to the Police National Computer. This is being done to help detect abuse of the system.
Now audit trails are great and should be used in many databases to know who did what and when; however, what the government still fails to grasp is that prevention is better than detection. For instance, if you were running a banking system, you wouldn't rely on an audit trail to find out who transferred £10 million out of the bank to someone else's account; you would have a system that requires authorisation to make the transaction in the first place. Simply knowing who committed the crime will not help in the majority of cases (the accused has fled, the money is already spent, the information already used to their benefit). This is as well as the fact that catching someone then requires legal action that may or may not work and will cost even more money.
This is a crucial principle and it is what needs to be recognised for databases that contain identity information. Saying that "we have training" or "we audit all access to the system" is quite bluntly not good enough. Once that information is stolen and sold (usually), it cannot be recovered and the damage cannot be undone; someone is left to pick up the pieces, or a guilty person masquerades as an innocent person and can commit all manner of crimes.
Dear Government, this is Computer Security 101. Prevention is better than detection. If you can't prevent, don't deploy the system!

Monday, 15 March 2010

Why security is important on Intranet Apps

I have heard people say that security is not important when dealing with internal web applications, that is, ones that exist behind a firewall and which are only used by employees, all of whom will have a user login. The argument is that we trust our employees by and large, and that the firewall means we do not have to be concerned with anonymous hackers who might attack apps exposed to the www. Unfortunately this logic is ill-conceived and simply wrong. Here are some reasons why this approach is bad.
  1. Just because an app is behind a firewall doesn't make it unreachable from the net. Firewalls can sometimes be hacked, or one might be disabled accidentally long enough to get to your web app, at which point damage can be caused. It is like having a castle whose only defence is an outside wall. However hard that wall might be to penetrate, once it is, the castle falls.
  2. Although there is a general trust of employees in most companies, it is not correct to decide security policy based on this. Even employees can act maliciously or curiously and can do anything from deliberate damage to accidental damage so the system must treat them as untrusted, only giving them what they need. If something was to be changed then at least it can be narrowed down to the few people who have access to a certain part of the app. If a certain area has common access, audit all important operations so that you can tell quickly and easily who has done something. Maybe they acted ignorantly and need training, maybe they were trying to damage the system.
  3. Adding security code to a reasonable degree is not massively difficult. Most of the information exists on the web and lots of code has already been written. Why leave it out, only to try and retro-fit it later when exposing the app to the www, with all the work and risk that involves? Why not put it in at the beginning? There are several reasons why an app might be exposed to the www, so why base policy on the assumption that it never will be?
  4. Effective defence means defence-in-depth. It means you rely on several measures to protect a resource, like several walls in your castle, so that even if the outer wall is broken, you have time to counter the attack before the inner walls are broken - this is just common sense. Relying on a single defence and assuming that there is little chance of an attack is small consolation after such an attack occurs. This is worse if what you are protecting is valuable or covered by statute and where you might have to excuse your policy before a judge or enquiry.
Just think about it, read about it and do it!