Difference between revisions of "DBMigrationBestPractices"

Latest revision as of 16:41, 6 February 2014

Overview

Database migrations have historically been a source of considerable downtime when deploying new versions of Openstack services. DB migrations in the past have taken as long as an hour or more on some operators databases.

DB migrations are necessary, so the focus is on how to optimize the process to reduce downtime. The best way to reduce downtime is to try to make as much of the DB migration run "online" as possible. This means while the service is running (usually old code, but sometimes new code).

However, some DB migrations do things that are particularly difficult to execute online leading to downtime.

DB migrations typically fall into two categories: schema migrations and data migrations and the strategy to dealing with them often differs.

Schema Migrations

Schema migrations are any schema changes that create, drop or rename tables, columns, indexes and foreign keys.

MySQL Version Considerations

Even if services are tolerant of schema changes, the database engine needs to ensure it doesn't block reads/writes. Blocking reads/writes would cause downtime as well.

Not all versions of MySQL are made alike. Older versions (such as 5.1) will block reads/writes for pretty much any schema changes. Newer versions (5.5 and 5.6) are much better and in many cases much faster and/or online (without blocking reads/writes).

It's important that the operator is running a version of MySQL that allows schema changes to be performed online without blocking running services.

FIXME: Fill in matrix of versions and schema changes that block

Data Migrations

Data changes are any migrations that change data already in tables. For example, encrypting data in a column or moving data from one table to a newly created table.

Best Practices

As a general rule, strive to make DB migrations run as much as possible online. This means that the DB migrations can run while old (or sometimes new) code is already running. This minimizes the amount of time services are down. Less downtime makes operators and users happy.

Maintain Backwards Compatible Schemas If Possible

Running schema migrations online is the best case, but it can only be done if the new schema is compatible with the running code. That means we should strive to keep schemas backwards compatible.

Things like adding new columns or creating new tables is backwards compatible (mostly, be careful of 'SELECT *' or 'INSERT INTO foo VALUES ()') because old running code simply don't use the new columns or tables.

Avoid Renaming Columns

It's extremely difficult to rename columns safely with services running. Old code simply won't know the new column name and new code must go extremely far to transparently handle using either the old or new names. Renaming columns is usually not necessary for technical reasons, so avoiding it is recommended.

Avoid Changing the Type of Columns

For many of the same reasons avoiding renaming columns, we should avoid changing the type of columns. Creating a new column is recommended.

Try to Incrementally Migrate Data

The volume of data being changed or moved during data migrations can be a significant source of downtime. If possible, try to incrementally migrate data. This would require code to handle both the old and new locations of data and potentially migrate the data on a row by row basis. However, this should be significantly easier than handling renaming or changing the type of columns.

@@ Line 1: / Line 1: @@
 == Overview ==
-Database migrations have historically been a source of considerable downtime when deploying new versions of Openstack services.
+Database migrations have historically been a source of considerable downtime when deploying new versions of Openstack services. DB migrations in the past have taken as long as an hour or more on some operators databases.
+DB migrations are necessary, so the focus is on how to optimize the process to reduce downtime. The best way to reduce downtime is to try to make as much of the DB migration run "online" as possible. This means while the service is running (usually old code, but sometimes new code).
+However, some DB migrations do things that are particularly difficult to execute online leading to downtime.
+DB migrations typically fall into two categories: schema migrations and data migrations and the strategy to dealing with them often differs.
 === Schema Migrations ===
-Schema migrations are any schema changes that create, drop or rename tables, columns, indexes and foreign keys. In some cases, making a schema change will cause the database engine to block reads and writes to the tables affected. Depending on the size of the table and the operation being performed, this can be up to an hour or more.
+Schema migrations are any schema changes that create, drop or rename tables, columns, indexes and foreign keys.
-==== MySQL Considerations ====
+==== MySQL Version Considerations ====
+Even if services are tolerant of schema changes, the database engine needs to ensure it doesn't block reads/writes. Blocking reads/writes would cause downtime as well.
 Not all versions of MySQL are made alike. Older versions (such as 5.1) will block reads/writes for pretty much any schema changes. Newer versions (5.5 and 5.6) are much better and in many cases much faster and/or online (without blocking reads/writes).
+It's important that the operator is running a version of MySQL that allows schema changes to be performed online without blocking running services.
 FIXME: Fill in matrix of versions and schema changes that block
@@ Line 15: / Line 25: @@
 === Data Migrations ===
-Data changes are any migrations that alter data already in tables. For instance, moving data from one table to a newly created table would be a data migration.
+Data changes are any migrations that change data already in tables. For example, encrypting data in a column or moving data from one table to a newly created table.
 == Best Practices ==
+As a general rule, strive to make DB migrations run as much as possible online. This means that the DB migrations can run while old (or sometimes new) code is already running. This minimizes the amount of time services are down. Less downtime makes operators and users happy.
 === Maintain Backwards Compatible Schemas If Possible ===
-We want to do as much as possible while code is still running to minimize downtime during deploys. This means doing as much of the DB migrations as possible "online".
+Running schema migrations online is the best case, but it can only be done if the new schema is compatible with the running code. That means we should strive to keep schemas backwards compatible.
-One way to do this is to maintain backwards compatible schemas. Things like adding new columns or creating new tables is backwards compatible (mostly, be wary of SELECT * use) because old running code simply don't use the new columns or tables.
+Things like adding new columns or creating new tables is backwards compatible (mostly, be careful of 'SELECT *' or 'INSERT INTO foo VALUES ()') because old running code simply don't use the new columns or tables.
 === Avoid Renaming Columns ===
-This cannot be done online. Old code will expect the old column name and new code will expect the new column name. As a result, code cannot be running if a column is renamed.
+It's extremely difficult to rename columns safely with services running. Old code simply won't know the new column name and new code must go extremely far to transparently handle using either the old or new names. Renaming columns is usually not necessary for technical reasons, so avoiding it is recommended.
+=== Avoid Changing the Type of Columns ===
-It's theoretically possible to make code tolerant of both the old and new names, but this is extremely complicated because of the need to retry queries. Renaming columns is usually not necessary for technical reasons, so avoiding it is recommended.
+For many of the same reasons avoiding renaming columns, we should avoid changing the type of columns. Creating a new column is recommended.
-=== Avoid Changing the Type of Columns ===
+=== Try to Incrementally Migrate Data ===
-For many of the same reasons avoiding renaming columns, we should avoid changing the type of columns.
+The volume of data being changed or moved during data migrations can be a significant source of downtime. If possible, try to incrementally migrate data. This would require code to handle both the old and new locations of data and potentially migrate the data on a row by row basis. However, this should be significantly easier than handling renaming or changing the type of columns.