Tuesday, April 03, 2007

Guid vs Int

Rather interesting article on when tho use a guid or an int as a Primary Key:

http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html

Summary:

IDENTITY
PROS
  • small storage footprint;
  • optimal join / index performance (e.g. for time range queries, most rows recently inserted will be on a limited number of pages);
  • highly useful for data warehousing;
  • native data type of the OS, and easy to work with in all languages;
  • easy to debug (where userid = 457);
  • generated automatically (retrieved through SCOPE_IDENTITY() rather than assigned);
  • not updateable (though some consider this a disadvantage, strangely enough).
CONS
  • cannot be reliably "predicted" by applications รข€” can only be retrieved after the INSERT;
  • need a complex scheme in multi-server environments, since IDENTITY is not allowed in some forms of replication;
  • can be duplicated, if not explicitly set to PRIMARY KEY:
    CREATE TABLE splunge(
    splungeID INT IDENTITY(1,1),
    foo CHAR(1)
    )
    GO
    SET NOCOUNT ON
    INSERT splunge(foo) VALUES('a')
    INSERT splunge(foo) VALUES('b')
    INSERT splunge(foo) VALUES('c')
    GO
    DBCC CHECKIDENT('splunge', RESEED, 2)
    GO
    INSERT splunge(foo) VALUES('d')
    GO
    SELECT * FROM splunge
    GO
    DROP TABLE splunge
    GO
  • if part of the clustered index on the table, this can create an insert hot-spot;
  • proprietary and not directly portable;
  • only unique within a single table;
  • gaps can occur (e.g. with a rolled back transaction), and this can cause chicken little-style alarms.
GUID()
PROS
  • since they are {more or less} guaranteed to be unique, multiple tables/databases/instances/servers/networks/data centers can generate them independently, then merged without clashes;
  • required for some forms of replication;
  • can be generated outside the database (e.g. by an application);
  • distributed values prevent hot-spot (as long as you don't cluster this column, which can lead to abnormally high fragmentation).
CONS
  • the wider datatype leads to a drop in index performance (if clustered, each insert almost guaranteed to 'dirty' a different page), and an increase in storage requirements;
  • cumbersome to debug (where userid = {BAE7DF4-DDF-3RG-5TY3E3RF456AS10});
  • updateable (need to propogate changes, or prevent the activity altogether);
  • sensitive to time rollbacks in certain environments (e.g. daylight savings time rollbacks);
  • GROUP BY and other set operations often require CAST/CONVERT;
  • not all languages and environments directly support GUIDs;
  • there is no statement like SCOPE_GUID() to determine the value that was generated, e.g. by NEWID();
  • there are display issues to consider, such as response.write issues (see Article #2358), and also consider cases where you display a large amount of rows as checkboxes or other form elements with IDs - your page just got a whole lot bigger, and you have to handle the IDs in a custom way because JavaScript can't deal with objects that have dashes in their name/id.