Wednesday, June 22, 2011

Japanese Word Segmentation in PHP


Segmentation is done using a library called MeCab. MeCab is an open-source morphological analysis engine. It can be used with PHP for Japanese word processing, and also with other programming languages such as Java and Python.

On Linux, we have to install a C++ compiler, MeCab, a dictionary for language processing, the PHP development module for building PHP extensions, and the PHP extension for MeCab.


1. Install gcc
       > sudo yum install gcc-c++   (Fedora)  /  > sudo apt-get install g++  (Ubuntu)

2. Download and install MeCab
     Create a folder and download the libraries to it.

      > mkdir Download
      > cd Download

      > wget http://downloads.sourceforge.net/project/mecab/mecab/0.98/mecab-0.98.tar.gz
      > tar xvzf mecab-0.98.tar.gz
      > cd mecab-0.98
      > ./configure
      > make
      > sudo make install
      > cd ..


3. Download and install a MeCab dictionary (ipadic)

     > wget http://sourceforge.net/projects/mecab/files/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
     > tar xvfz mecab-ipadic-2.7.0-20070801.tar.gz
     > cd mecab-ipadic-2.7.0-20070801
     > ./configure --with-charset=utf8
     > make
     > sudo make install
     > cd ..


      Check the MeCab version to make sure it is installed, using the command below:
         > /usr/local/bin/mecab -v

   If it shows the version number, the installation is OK. If you get this error instead:
             mecab: error while loading shared libraries: libmecab.so.1: cannot open shared object file: No such file or directory
   the dynamic linker cannot find the MeCab library and ipadic is not installed properly. Add /usr/local/lib to /etc/ld.so.conf, run "sudo ldconfig", and re-install ipadic.
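
The libmecab.so.1 error usually means the dynamic linker is not searching the directory where the MeCab library was installed, which is the fix described in the comments at the end of this post. The following is a minimal sketch for checking that, assuming MeCab was built with the default prefix so the library is in /usr/local/lib. The helper name libdir_registered is made up for this example. It only reports what to do and does not change any system file.

```shell
# Assumption: MeCab was built with the default prefix, so libmecab.so.* is in /usr/local/lib.
LIBDIR=/usr/local/lib

# Hypothetical helper: succeeds if directory $1 appears on a line of its own
# in any of the linker config files given as the remaining arguments.
libdir_registered() {
    dir=$1
    shift
    grep -qsxF -- "$dir" "$@"
}

if libdir_registered "$LIBDIR" /etc/ld.so.conf /etc/ld.so.conf.d/*.conf; then
    echo "$LIBDIR is already listed; running 'sudo ldconfig' should be enough"
else
    echo "add a line '$LIBDIR' to /etc/ld.so.conf, then run 'sudo ldconfig'"
fi
```

After "sudo ldconfig", running "ldconfig -p | grep libmecab" should list the library if the linker can now see it.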


4. Install the PHP development module
 
     > sudo yum install php-devel (Fedora) /  > sudo apt-get install php5-dev (Ubuntu)


5. Download and install the PHP extension for MeCab

Download php-mecab from https://github.com/rsky/php-mecab/archives/master , unpack it and install it.
       > tar xfvz rsky-php-mecab-4193188.tar.gz
       > cd rsky-php-mecab-4193188/
       > phpize
       > ./configure --with-php-config=/usr/bin/php-config --with-mecab=/usr/local/bin/mecab-config
       > make
       > sudo make install


6. Enable the MeCab extension
      Add the following line to php.ini and restart Apache:
extension = mecab.so
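
A quick way to see whether PHP picks up the extension is to look for it in the module list printed by "php -m". This is only a sketch: the function name ext_status is made up for this example, and the command-line PHP may read a different php.ini than the one Apache uses, so a "missing" here does not always mean the web server is misconfigured.

```shell
# Hypothetical helper: prints "loaded" if the extension named in $1
# appears in the module list of the command-line PHP, "missing" otherwise.
ext_status() {
    if php -m 2>/dev/null | grep -qix -- "$1"; then
        echo "loaded"
    else
        echo "missing"
    fi
}

ext_status mecab
```

If it prints "missing", "php --ini" shows which php.ini files the command-line PHP actually reads.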


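Testing MeCab
<?php
error_reporting(-1);

// Check that the extension is loaded; the name must be a quoted string.
if (extension_loaded('mecab')) {
    echo "mecab loaded :)\n";
} else {
    echo "something is wrong :(\n";
}

// Split a Japanese sentence into an array of words.
$str = "また、Tagger は Stream をくるくる回すのではなく、一括で文字列を解析するようなので、一旦";
$result = mecab_split($str);
print_r($result);
?>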

Output

The available php-mecab extension functions are listed at http://mechsys.tec.u-ryukyu.ac.jp/~oshiro/php_mecab_apis.html
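
To compare the PHP result with MeCab itself, the same kind of segmentation can be run from the shell with the mecab command and its "-O wakati" output format, which prints the words of each input line separated by spaces. This is a rough sketch, assuming mecab is on the PATH and the ipadic dictionary was built for UTF-8 as above. The function name segment is made up for this example.

```shell
# Hypothetical helper: segments the text in $1 with the mecab command-line tool.
# -O wakati prints the words separated by spaces instead of the full analysis.
segment() {
    if command -v mecab >/dev/null 2>&1; then
        printf '%s\n' "$1" | mecab -O wakati
    else
        echo "mecab not found in PATH" >&2
        return 127
    fi
}

segment "一括で文字列を解析する" || true
```

If the words come out garbled, the terminal encoding and the dictionary charset probably do not match.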

References
http://www.programming-magic.com/20080808173652/
http://wiki.jdictionary.com/Building_Mecab_For_PHP
http://mechsys.tec.u-ryukyu.ac.jp/~oshiro/php_mecab_apis.html#mecab_split
https://github.com/rsky/php-mecab/tree/
http://tips.recatnap.info/about_mecab_extension_php/



2 comments:

  1. YES!!! i did it! after few hours of brain...
    whatever..
    i wanna say thanx to you, but, if you get an
    error during install ipadic

    mecab: error while loading shared libraries: libmecab.so.1: cannot open shared object file:

    you must do this

    go to
    /etc/d.so.conf
    edit it
    and add this string
    "/usr/local/lib"
    after first include
    than
    "sudo ldconfig"

    and try again to reinstall ipadic
    huh...

  2. The path above to edit should actually be /etc/ld.so.conf.

    More info can be found here: http://kooj.blog102.fc2.com/blog-entry-24.html
