UTL_MATCH - String Similarity in Oracle

3423 단어 Oaclesql
Thursday, December 07, 2006
UTL_MATCH - String Similarity in Oracle
You can compare string using equality operator (=), or using similarity operator (LIKE). But there are cases where it is not enough. You Can use UTL_MATCH package to calculate string similarity index. This packages offers 4 functions that take two strings as parameters:
  • edit_distance - algorithm by Levenshtein - returns number of edits that must be done to change one string into second,
  • edit_distance_similarity - normalized results of edit_distance in percents - integer values,
  • jaro_winkler - returns similarity based on Jaro-Winkler distance algorithm ,
  • jaro_winkler_similarity - same as above but presented as integer in range 0-100.
  • Lets teke a look how it works. First we can create sample table and insert some data:
    create table countries(name varchar2(15) not null);
    insert into countries values ('Poland');
    insert into countries values ('Germany');
    insert into countries values ('United States');
    insert into countries values ('Portugal');
    insert into countries values ('Czech Republic');
    insert into countries values ('China');
    insert into countries values ('Slovakia');
    insert into countries values ('Slovenia');
    commit;
    
    Now we can take a look at results. Lets compare to miss-spelled country name: '
    Slovnia'
    select name
     ,to_char(utl_match.edit_distance(name, 'Slovnia'),'999') edit_dist
     ,to_char(utl_match.edit_distance_similarity(name, 'Slovnia'),'999') edit_dist_sim
     ,to_char(utl_match.jaro_winkler(name, 'Slovnia'),'999d9999') jaro_winkler
     ,to_char(utl_match.jaro_winkler_similarity(name, 'Slovnia'),'999') jaro_winkler_sim
    from countries
    order by jaro_winkler_sim desc;
    
    NAME                 EDIT EDIT JARO_WINK JARO
    -------------------- ---- ---- --------- ----
    Slovenia                1   88     .9750   97
    Slovakia                2   75     .8881   88
    China                   5   29     .5619   56
    United States          12    8     .5531   55
    Poland                  6   15     .5317   53
    Portugal                7   13     .5119   51
    Germany                 7    0     .3571   35
    Czech Republic         13    8     .0000    0
    Above we can observe differences in algorithms.
    Lets test it on NULLs:
    SQL> select to_char(utl_match.edit_distance('test', NULL),'999')
      2      edit_dist
      3   ,to_char(utl_match.edit_distance_similarity('test', NULL),'999')
      4      edit_dist_sim
      5   ,to_char(utl_match.jaro_winkler('test', NULL),'999d9999')
      6      jaro_winkler
      7   ,to_char(utl_match.jaro_winkler_similarity('test', NULL),'999')
      8      jaro_winkler_sim
      9  from dual;
    
    EDIT EDIT JARO_WINK JARO
    ---- ---- --------- ----
      -1  125     .0000    0
    We can see that using
    edit_distance on NULLs migth be dengerous.
    All samples were run on Oracle 10g XE

    좋은 웹페이지 즐겨찾기