Unluckily it’s very common not to change the default charset of your MySQL server and, since the default is latin1, when someone wishes to store cyrillic or chinese character there are many problems.

The first step is to fix the MySQL installation in order to store internationalized information., so locate your my.cnf configuration file on Linux, or the my.ini on Windows boxes.

Search in the configuration file the [mysqld] section when there is the configuration of the MySQL server.

Insert the following lines and eventually remove any existing configuration option with the same name.

[mysqld]
character-set-server=utf8
default-collation=utf8_unicode_ci

The option character-set-server=utf8 tells to the server that, if not otherwise specified, the character set of the created databases, tables, column will be utf8.

utf8 columns will be able to store cyrillic or simplified chinese character, just to give you two examples.

The collation defines how alphabetical ordering will happen, in few words which is the order of the letters that we expect on ORDER BY columnName clauses.

The suffix _ci means that ordering and comparison will be case insensitive and this is the common behavior used in databases.

Be very careful, because usually programming languages (i.e. Java) have case sensitive .equals(String string) method on String class, so it’s quite common to have some mistakes caused by this incongruency.

Then look for the [client] section of your configuration file, and write this line below it.

[client]
default-character-set=utf8

This is very important because it defines the character set used by the MySQL command-line client, and that’s what will be used to migrate the data from latin1 to utf8.

Now everything is setup, restart MySQL to make sure it’s using the updated configuration, and shut-down any application that is using the database that’s going to be migrated.

First, mysqldump will create a .sql file containing all the data:

mysqldump --skip-set-charset --no-create-db –no-create-info -h hostname --protocol=TCP -P 3306 -u username -p old_database > dump.sql

The option --skip-set-charset prevents that in the dump file will be any reference to the old (and wrong) character sets. The options --no-create-db and --no-create-info are used because the new database name will be defined later.

Now the new database is going to be created: mysql -u username -p and the following SQL should be executed in the terminal:

create schema new_database;
quit

Finally the last step is to populate the brand new database with the dumped data:

mysql -u username -p new_database < dump.sql

In this way all the previous data from old_database is now stored in utf8 format in new_database.

I hope this tutorial can be useful, please ask any question or give your feedback.
Thank You.



6 Comments

  1. #
    CHGIS Blog » Blog Archive » Migrate from mysql latin1 default to mysql utf8 default
    February 12th, 2009 at 3:58 pm

    [...] tried following various posts on the topic, like this one, which clearly state the mysqldump commands needed to preserve UTF-8 encoded data.   However, [...]

    Reply to this comment
    • #
      admin
      February 13th, 2009 at 1:17 am

      Did you set the configuration in my.cnf to impose to the mysql client to use utf8 in its connection to the server? That is the most important thing.

      And of course in mysqldump you should specify --skip-set-charset to have a dump file without any references to the previous (wrong) charset.

      Let me now… thank you very much.

      Reply to this comment
  2. #
    Hendrik Holtmann
    May 28th, 2009 at 2:26 am

    Are you sure the
    –no-create-info
    is correct here. It omits all CREATE TABLE statements. So in your next step
    mysql -u username -p new_database < dump.sql
    no tables can’t be created.
    Best,

    Hendrik

    Reply to this comment
  3. #
    Kev
    December 1st, 2009 at 8:01 am

    Great stuff, it works great!
    Thanks a lot for the tip!

    Reply to this comment
  4. #
    Mike
    December 7th, 2009 at 2:41 am

    I tried this, but for me it didn’t work.

    Hendrik is right about the –no-create-info option: that should be left out or the tables cannot be reconstructed.

    But even after changing that, this method left me with garbled characters.

    When I query my (old) database from the command line, the special characters (like trademark, copyright, euro etc.) show up correct. When I export, they are represented by multiple bogus characters.

    On to the next attempt…

    Bye Mike

    Reply to this comment
  5. #
    rug doctor
    September 6th, 2011 at 10:32 am

    Hard to find relevantblogson my besttopic. I very much enjoyed reading your comments. I know this is not on topic but,does anyone know how to clean a wool rug with a oil stain? I have asked this question before but need an answer before I am beheadshouted at by my boyfriend. We will log a trackback link tomorrow.

    Reply to this comment

Leave a Comment

blank