Suite à la lecture de ce post, je me suis permis en salle d'evoquer la technologie dite "Soundex" qui à mon sens résoud en grande partie la complexité de la reconnaissance d'un mot.
Pour être plus clair en voici l'explication que j'ai reprise d'un site existant
iciJe joint de même les algo de programmation dans divers langage. Pas en Mirc bien sur car c'est a vous de bosser là
Introduction
Il n'est pas rare de voir un prof d'anglais poser la question suivante: "Comment prononce-t-on le mot 'GHOTI'?" Après que les étudiants aient tout essayé, le prof énonce fièrement: "Le mot se prononce 'FISH'!". Stupeur bien compréhensible des étudiants.
Explication:
GH est prononcé comme dans 'tough'
O comme dans 'women'
et TI comme dans 'dictionary'.
Tout ce cinéma pour montrer que les règles de prononciation sont complexes et pleines d'exceptions.
Le problème se pose de la même manière quand un client s'amène au guichet d'une banque, dit son nom et l'employé doit le retrouver dans son listing. Imaginons que le client s'appelle "Van Keersbilck". Il y a peu de chance que l'employé pense à cette orthographe et il risque donc de devoir lui demander d'épeler, dans le bruit, ...
L'algorithme
L'algorithme du Soundex tente d'apporter une solution à ce problème en traduisant le nom en code 'phonétique' sur lequel la recherche dans la table sera effectuée. C'est ainsi que les mots 'mer', mère', 'maire', 'maisre', ... auront le même code. Cet algorithme est d'ailleurs utilisé en généalogie pour retrouver les noms qui auraient subi une transformation due entre autres à une faute de recopie.
L'algorithme du Soundex a été développé au début du siècle par Margaret K. Odell et Robert C. Russel au bureau américain des archives.
Voici les idées sur lesquelles s'appuie l'algorithme:
* les voyelles et Y contribuent moins pour la consonnance d'un mot que les consonnes. Elles seront donc supprimées sauf celle en position initiale;
* les lettres H, W ont aussi une contribution minimale et seront donc supprimées sauf celle en position initiale;
* les consonnes redoublées comme NN, SS et MM ou les lettres qui ont la même prononciation peuvent être réduites à une seule occurence;
Pour savoir si des lettres ont la même consonnance, on s'appuie sur la table suivante:
pour l'anglais:
1 B, F, P, V
2 C, G, J, K, Q, S, X, Z
3 D, T
4 L
5 M, N
6 R
pour le français:
1 B, P
2 C, K, Q
3 D, T
4 L
5 M, N
6 R
7 G, J
8 X, Z, S
9 F, V
Voici un résumé des différentes étapes de l'algorithme:
* supprimer les éventuels 'espace' initiaux
* mettre le mot en majuscule
* garder la première lettre
* supprimer les lettres A, E, I, O, U, Y, H et W
* remplacer les lettres restantes par le chiffre associé dans la table
* supprimer les chiffres répétés (garder une occurence)
* si le code obtenu contient moins de 4 éléments, compléter à droite par des espaces
si le code obtenu contient plus de 4 éléments, conserver les 4 éléments les plus à gauche
--------------------------------------------------
CODE EN JAVASCRIPT
--------------------------------------------------
/*
* v 1.0c TESTED-OK 20060112
* -----------------------
*
* The following SoundEx function is:
*
* © Copyright 2002 - 2006, Creativyst, Inc.
* ALL RIGHTS RESERVED
*
* For more information go to:
*
http://www.Creativyst.com * or email:
* Support@Creativyst.com
*
* Redistribution and use in source and binary
* forms, with or without modification, are
* permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must
* retain the above copyright notice, this
* list of conditions and the following
* disclaimer.
*
* 2. Redistributions in binary form must
* reproduce the above copyright notice,
* this list of conditions and the
* following disclaimer in the
* documentation and/or other materials
* provided with the distribution.
*
* 3. All advertising materials mentioning
* features or use of this software must
* display the following acknowledgement:
* This product includes software developed
* by Creativyst, Inc.
*
* 4. The name of Creativyst, Inc. may not be
* used to endorse or promote products
* derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY CREATIVYST CORPORATION
* ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
* PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
* THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
* OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
* WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
* ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
*/
function SoundEx(WordString, LengthOption, CensusOption)
{
var TmpStr;
var WordStr = "";
var CurChar;
var LastChar;
var SoundExLen = 10;
var WSLen;
var FirstLetter;
if(CensusOption) {
LengthOption = 4;
}
if(LengthOption != undefined) {
SoundExLen = LengthOption;
}
if(SoundExLen > 10) {
SoundExLen = 10;
}
if(SoundExLen < 4) {
SoundExLen = 4;
}
if(!WordString) {
return("");
}
WordString = WordString.toUpperCase();
/* Clean and tidy
*/
WordStr = WordString;
WordStr = WordStr.replace(/[^A-Z]/gi, " "); // rpl non-chars w space
WordStr = WordStr.replace(/^\s*/g, ""); // remove leading space
WordStr = WordStr.replace(/\s*$/g, ""); // remove trailing space
if(!CensusOption) {
/* Some of our own improvements
*/
WordStr = WordStr.replace(/DG/g, "G"); // Change DG to G
WordStr = WordStr.replace(/GH/g, "H"); // Change GH to H
WordStr = WordStr.replace(/GN/g, "N"); // Change GN to N
WordStr = WordStr.replace(/KN/g, "N"); // Change KN to N
WordStr = WordStr.replace(/PH/g, "F"); // Change PH to F
WordStr =
WordStr.replace(/MP([STZ])/g, "M$1"); // MP if fllwd by ST|Z
WordStr = WordStr.replace(/^PS/g, "S"); // Chng leadng PS to S
WordStr = WordStr.replace(/^PF/g, "F"); // Chng leadng PF to F
WordStr = WordStr.replace(/MB/g, "M"); // Chng MB to M
WordStr = WordStr.replace(/TCH/g, "CH"); // Chng TCH to CH
}
/* The above improvements
* may change this first letter
*/
FirstLetter = WordStr.substr(0,1);
/* in case 1st letter is
* an H or W and we're in
* CensusOption = 1
*/
if(FirstLetter == "H" || FirstLetter == "W") {
TmpStr = WordStr.substr(1);
WordStr = "-";
WordStr += TmpStr;
}
/* In properly done census
* SoundEx the H and W will
* be squezed out before
* performing the test
* for adjacent digits
* (this differs from how
* 'real' vowels are handled)
*/
if(CensusOption == 1) {
WordStr = WordStr.replace(/[HW]/g, ".");
}
/* Begin Classic SoundEx
*/
WordStr = WordStr.replace(/[AEIOUYHW]/g, "0");
WordStr = WordStr.replace(/[BPFV]/g, "1");
WordStr = WordStr.replace(/[CSGJKQXZ]/g, "2");
WordStr = WordStr.replace(/[DT]/g, "3");
WordStr = WordStr.replace(/[L]/g, "4");
WordStr = WordStr.replace(/[MN]/g, "5");
WordStr = WordStr.replace(/[R]/g, "6");
/* Properly done census:
* squeze H and W out
* before doing adjacent
* digit removal.
*/
if(CensusOption == 1) {
WordStr = WordStr.replace(/\./g, "");
}
/* Remove extra equal adjacent digits
*/
WSLen = WordStr.length;
LastChar = "";
TmpStr = "";
// removed v10c djr: TmpStr = "-"; /* rplcng skipped first char */
for(i = 0; i < WSLen; i++) {
CurChar = WordStr.charAt(i);
if(CurChar == LastChar) {
TmpStr += " ";
}
else {
TmpStr += CurChar;
LastChar = CurChar;
}
}
WordStr = TmpStr;
WordStr = WordStr.substr(1); /* Drop first letter code */
WordStr = WordStr.replace(/\s/g, ""); /* remove spaces */
WordStr = WordStr.replace(/0/g, ""); /* remove zeros */
WordStr += "0000000000"; /* pad with zeros on right */
WordStr = FirstLetter + WordStr; /* Add first letter of word */
WordStr = WordStr.substr(0,SoundExLen); /* size to taste */
return(WordStr);
}
-----------------------------------------------------------------
CODE EN VB
-----------------------------------------------------------------
'
' v 1.0c TESTED-OK 20060308
' -----------------------
'
' The following SoundEx function is:
'
' © Copyright 2002 - 2006, Creativyst, Inc.
' ALL RIGHTS RESERVED
'
' For more information go to:
'
http://www.Creativyst.com' or email:
' Support@Creativyst.com
'
' Redistribution and use in source and binary
' forms, with or without modification, are
' permitted provided that the following conditions
' are met:
'
' 1. Redistributions of source code must
' retain the above copyright notice, this
' list of conditions and the following
' disclaimer.
'
' 2. Redistributions in binary form must
' reproduce the above copyright notice,
' this list of conditions and the
' following disclaimer in the
' documentation and/or other materials
' provided with the distribution.
'
' 3. All advertising materials mentioning
' features or use of this software must
' display the following acknowledgement:
' This product includes software developed
' by Creativyst, Inc.
'
' 4. The name of Creativyst, Inc. may not be
' used to endorse or promote products
' derived from this software without
' specific prior written permission.
'
' THIS SOFTWARE IS PROVIDED BY CREATIVYST CORPORATION
' ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
' INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
' WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
' PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
' THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
' INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
' DAMAGES (INCLUDING, BUT NOT LIMITED TO,
' PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
' OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
' HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
' WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
' (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
' WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
' ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
'
'
'
' ------------------
' FUNCTION NOTES:
'
Function SoundEx( _
ByVal WordString As String, _
ByVal LengthOption As Integer, _
ByVal CensusOption As Integer _
) As String
Dim WordStr As String
Dim b, b2, b3, SoundExLen, FirstLetter As String
Dim i As Integer
' Sanity
'
If (CensusOption > 0) Then
LengthOption = 4
End If
If (LengthOption > 0) Then
SoundExLen = LengthOption
End If
If (SoundExLen > 10) Then
SoundExLen = 10
End If
If (SoundExLen < 4) Then
SoundExLen = 4
End If
If (Len(WordString) < 1) Then
Exit Function
End If
' Copy to WordStr
' and UpperCase
'
WordStr = UCase(WordString)
' Convert all non-alpha
' chars to spaces.
'
For i = 1 To Len(WordStr)
b = Mid(WordStr, i, 1)
If (Not (b Like "[A-Z]")) Then
WordStr = Replace(WordString, b, " ")
End If
Next i
' Remove leading and
' trailing spaces
'
WordStr = Trim(WordStr)
' sanity
'
If (Len(WordStr) < 1) Then
Exit Function
End If
' Perform our own multi-letter
' improvements
'
' double letters will be effectively
' removed in a later step.
'
If (CensusOption < 1) Then
b = Mid(WordStr, 1, 1)
b2 = Mid(WordStr, 2, 1)
If (b = "P" And b2 = "S") Then
WordStr = Replace(WordStr, "PS", "S", 1, 1)
End If
If (b = "P" And b2 = "F") Then
WordStr = Replace(WordStr, "PF", "F", 1, 1)
End If
WordStr = Replace(WordStr, "DG", "_G")
WordStr = Replace(WordStr, "GH", "_H")
WordStr = Replace(WordStr, "KN", "_N")
WordStr = Replace(WordStr, "GN", "_N")
WordStr = Replace(WordStr, "MB", "M_")
WordStr = Replace(WordStr, "PH", "F_")
WordStr = Replace(WordStr, "TCH", "_CH")
WordStr = Replace(WordStr, "MPS", "M_S")
WordStr = Replace(WordStr, "MPT", "M_T")
WordStr = Replace(WordStr, "MPZ", "M_Z")
End If
' end if(Not CensusOption)
'
' Sqeeze out the extra _ letters
' from above (not strictly needed
' in VB but used in C code)
'
WordStr = Replace(WordStr, "_", "")
'
' This must be done AFTER our
' multi-letter replacements
' since they could change
' the first letter
'
FirstLetter = Mid(WordStr, 1, 1)
' in case first letter is
' an h, a w or vowel...
' we'll change it to something
' that doesn't match anything
'
if ( FirstLetter = "H" Or FirstLetter = "W") Then
b = Mid(WordStr, 2)
b = "-" + b
WordStr = b
End If
' In properly done census
' SoundEx, the H and W will
' be squezed out before
' performing the test
' for adjacent digits
' (this differs from how
' 'real' vowels are handled)
'
If (CensusOption = 1) Then
WordStr = Replace(WordStr, "H", ".")
WordStr = Replace(WordStr, "W", ".")
End If
' Perform classic SoundEx
' replacements
'
WordStr = Replace(WordStr, "A", "0")
WordStr = Replace(WordStr, "E", "0")
WordStr = Replace(WordStr, "I", "0")
WordStr = Replace(WordStr, "O", "0")
WordStr = Replace(WordStr, "U", "0")
WordStr = Replace(WordStr, "Y", "0")
WordStr = Replace(WordStr, "H", "0")
WordStr = Replace(WordStr, "W", "0")
WordStr = Replace(WordStr, "B", "1")
WordStr = Replace(WordStr, "P", "1")
WordStr = Replace(WordStr, "F", "1")
WordStr = Replace(WordStr, "V", "1")
WordStr = Replace(WordStr, "C", "2")
WordStr = Replace(WordStr, "S", "2")
WordStr = Replace(WordStr, "G", "2")
WordStr = Replace(WordStr, "J", "2")
WordStr = Replace(WordStr, "K", "2")
WordStr = Replace(WordStr, "Q", "2")
WordStr = Replace(WordStr, "X", "2")
WordStr = Replace(WordStr, "Z", "2")
WordStr = Replace(WordStr, "D", "3")
WordStr = Replace(WordStr, "T", "3")
WordStr = Replace(WordStr, "L", "4")
WordStr = Replace(WordStr, "M", "5")
WordStr = Replace(WordStr, "N", "5")
WordStr = Replace(WordStr, "R", "6")
' not sure which would be faster there
' For i = 1 To Len(WordsStr)
' Next i
'
' End Clasic SoundEx replacements
'
' In properly done census
' SoundEx, the H and W will
' be squezed out before
' performing the test
' for adjacent digits
' (this differs from how
' 'real' vowels are handled)
'
If (CensusOption = 1) Then
WordStr = Replace(WordStr, ".", "")
End If
' squeeze out extra equal adjacent digits
' (don't include first letter)
'
b = ""
b2 = ""
' remove from v1.0c djr: b3 = Mid(WordStr, 1, 1)
b3 = ""
For i = 1 To Len(WordStr) ' i=1 (not 2) in v1.0c
b = Mid(WordStr, i, 1)
b2 = Mid(WordStr, (i + 1), 1)
If (Not (b = b2)) Then
b3 = b3 + b
End If
Next i
WordStr = b3
If (Len(WordStr) < 1) Then
Exit Function
End If
' squeeze out spaces and zeros
' Leave the first letter code
' to be replaced below.
' (In case it made a zero)
'
WordStr = Replace(WordStr, " ", "")
WordStr = Replace(WordStr, "0", "")
' Right pad with zero characters
'
b = String(SoundExLen, "0")
WordStr = WordStr + b
' Size to taste
'
WordStr = Mid(WordStr, 1, SoundExLen)
' Replace first digit with
' first letter
'
WordStr = Mid(WordStr, 2)
WordStr = FirstLetter + WordStr
' Copy WordStr to SoundEx
'
SoundEx = WordStr
End Function