Thursday, June 23, 2011

The SOUNDEX coding algorithm

The SOUNDEX code is a substitution code using the following rules:

The first letter of the surname is always retained.

The rest of the surname is compressed to a three-digit code using the following coding scheme:
A E I O U Y H W      not coded
B F P V              coded as 1
C G J K Q S X Z      coded as 2
D T                  coded as 3
L                    coded as 4
M N                  coded as 5
R                    coded as 6
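
The coding scheme can be written down as a simple lookup. The Python below is a minimal sketch of that idea; the names GROUPS, CODE and digit_for are illustrative choices, not part of any standard.

```python
# Hypothetical names; the groups themselves come from the coding scheme above.
GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the table to letter -> digit.
# Letters missing from CODE (A E I O U Y H W) are the ones that are not coded.
CODE = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def digit_for(letter):
    """Return the digit for a letter, or None if the letter is not coded."""
    return CODE.get(letter.upper())
```

Calling digit_for("L") gives "4", while digit_for("A") gives None because vowels are not coded.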

Consonants after the initial letter are coded in the order they occur:

HOLMES = H-452

ADOMOMI = A-355

The code always consists of the initial letter plus three digits. Further consonants in long names are ignored:

VONDERLEHR = V-536

Zeros are used to pad out shorter names:

BALL = B-400

SHAW = S-000
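
Truncating long names and padding short ones can be handled in one step. This is a small Python sketch, assuming the digits have already been worked out as a string; the function name pad_or_truncate is made up for illustration.

```python
def pad_or_truncate(digits):
    """Keep at most three digits, padding with zeros on the right."""
    return (digits + "000")[:3]
```

For VONDERLEHR the digits after the initial letter come out as 53646, which is cut to 536. For BALL the single digit 4 is padded to 400, and for SHAW the empty string becomes 000.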

Double consonants are treated as one letter:

BALL = B-400

As are adjacent consonants from the same code group:

JACKSON = J-250

A consonant that immediately follows the initial letter and belongs to the same code group is ignored:

SCANLON = S-545

Abbreviated prefixes should be spelt out in full:

ST JOHN = SAINTJOHN = S-532
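
Spelling out a prefix needs a lookup table of abbreviations. The Python sketch below assumes such a table exists; PREFIXES and expand_prefix are hypothetical names, and the only entry taken from the example above is ST = SAINT. Any further entries would have to come from whatever indexing rules are being followed.

```python
# Hypothetical lookup table; only ST -> SAINT comes from the example above.
PREFIXES = {"ST": "SAINT"}


def expand_prefix(surname):
    """Spell out an abbreviated leading prefix and close up the spaces."""
    # Drop full stops so that "St." is treated the same as "ST".
    words = surname.upper().replace(".", "").split()
    if words and words[0] in PREFIXES:
        words[0] = PREFIXES[words[0]]
    return "".join(words)
```

Only a separate leading word is expanded, so a name such as STONE is left alone.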

Apostrophes and hyphens are ignored:

KING-SMITH = KINGSMITH = K-525

Consonants from the same code group separated by W or H are treated as one:

BOOTH-DAVIS = BOOTHDAVIS = B-312
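
Putting the rules together, the whole scheme fits in a short function. The Python below is a sketch under a few assumptions rather than a reference implementation: the function name soundex and the table name CODE are illustrative, anything that is not a letter A to Z is dropped (which covers apostrophes, hyphens and spaces), and abbreviated prefixes are expected to be spelt out by the caller beforehand.

```python
# Letter -> digit table built from the coding scheme; uncoded letters are absent.
CODE = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODE[letter] = digit


def soundex(surname):
    """Return the code for a surname in the form X-NNN ('' if no letters)."""
    # Keep letters only, so apostrophes, hyphens and spaces are ignored.
    letters = [c for c in surname.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Start from the initial letter's group, so a following consonant
    # from the same group is ignored (SCANLON).
    prev = CODE.get(first)
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate consonants of the same group
        code = CODE.get(c)
        if code is not None and code != prev:
            digits.append(code)
        prev = code  # a vowel sets prev to None, which ends the run
    # Pad with zeros or cut down to exactly three digits.
    return first + "-" + ("".join(digits) + "000")[:3]


for name in ["HOLMES", "JACKSON", "SCANLON", "SAINTJOHN", "BOOTH-DAVIS"]:
    print(name, "=", soundex(name))
```

Run against the examples in this post, the sketch gives the same codes, including ADOMOMI = A-355, where the vowel between the two Ms keeps them separate.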


