Wednesday, June 22, 2011

Japanese Word Segmentation in PHP

Segmentation is done with a library called MeCab. MeCab is an open-source morphological analysis engine. It can be used from PHP for Japanese text processing, and it can also be used from other programming languages such as Java and Python.

On Linux, we have to install a C++ compiler, MeCab, a dictionary for language processing, the PHP development module (needed to build PHP extensions), and the PHP extension for MeCab.

1. Install gcc
       > sudo yum install gcc-c++   (Fedora)  / > sudo apt-get install g++  (Ubuntu)

2. Download and install MeCab
     Create a folder for the downloads, fetch mecab-0.98.tar.gz into it, then unpack and build it.

      > mkdir Download
      > cd Download

      > tar xvzf mecab-0.98.tar.gz
      > cd mecab-0.98
      > ./configure
      > make
      > sudo make install
      > cd ..

3. Download and install a MeCab dictionary (ipadic)

     > wget
     > tar xvfz mecab-ipadic-2.7.0-20070801.tar.gz
     > cd mecab-ipadic-2.7.0-20070801
     > ./configure --with-charset=utf8
     > make
     > sudo make install
     > cd ..

      Check the MeCab version to make sure it installed correctly, using the command below:
         > /usr/local/bin/mecab -v

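   If it prints the version number, the installation is OK. If you get this error instead:
             mecab: error while loading shared libraries: cannot open shared object file: No such file or directory
the dynamic linker typically cannot find the MeCab shared library. Running "sudo ldconfig" (with /usr/local/lib on the linker search path) often resolves it. If the error persists, MeCab or ipadic is not installed properly, so re-install them.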
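
As a further sanity check, you can feed a short sentence to MeCab and ask for space-separated output with the -O wakati option. The script below is only a sketch: MECAB_BIN and check_mecab are names made up for this example, the default path is the one used above, and the sample sentence is arbitrary.

```shell
#!/bin/sh
# Sketch: confirm the mecab binary exists and can segment a sentence.
# MECAB_BIN is a hypothetical variable; it defaults to the path used in this post.
MECAB_BIN="${MECAB_BIN:-/usr/local/bin/mecab}"

check_mecab() {
    if [ ! -x "$MECAB_BIN" ]; then
        echo "mecab not found at $MECAB_BIN"
        return 1
    fi
    # -O wakati prints the input with the segmented words separated by spaces
    echo "すもももももももものうち" | "$MECAB_BIN" -O wakati
}

check_mecab || echo "install check failed"
```

If the words come back separated by spaces and the Japanese text is not garbled, both MeCab and the UTF-8 dictionary are working.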

4. Install php development module
     > sudo yum install php-devel (Fedora) /  > sudo apt-get install php5-dev (Ubuntu)

5. Download and install the PHP extension for MeCab

Download php-mecab, then unpack and install it.
       > tar xfvz rsky-php-mecab-4193188.tar.gz
       > cd rsky-php-mecab-4193188/
       >  phpize
       >  ./configure --with-php-config=/usr/bin/php-config --with-mecab=/usr/local/bin/mecab-config
       > make
       > sudo make install
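
To see whether the build actually placed the extension where PHP looks for it, you can compare against the directory reported by php-config --extension-dir. This is a rough sketch: ext_installed is a helper name invented here, and the file name mecab.so is an assumption about what the build produces.

```shell
#!/bin/sh
# ext_installed DIR NAME: succeeds if the file NAME exists in DIR.
# (Hypothetical helper, not part of PHP or php-mecab.)
ext_installed() {
    [ -f "$1/$2" ]
}

if command -v php-config >/dev/null 2>&1; then
    # Ask PHP where it expects extension shared objects to live
    ext_dir=$(php-config --extension-dir)
    if ext_installed "$ext_dir" mecab.so; then   # mecab.so is an assumed file name
        echo "found $ext_dir/mecab.so"
    else
        echo "mecab.so not found in $ext_dir"
    fi
else
    echo "php-config not found; is the PHP development module installed?"
fi
```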

6. Enable the MeCab extension
      Add the following line to php.ini and restart Apache:
extension =
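
One way to confirm from the command line that PHP picked up the extension is to look for it in the output of php -m, which lists the loaded modules. The sketch below assumes the module is reported under the name mecab, and has_module is a helper name invented for this example. Note that the PHP command-line binary may read a different php.ini than Apache does, so a negative result here is a hint rather than proof.

```shell
#!/bin/sh
# has_module NAME: reads a module list (one name per line) on stdin and
# succeeds if NAME appears as a whole line, ignoring case.
# (Hypothetical helper for this example.)
has_module() {
    grep -qix -- "$1"
}

if command -v php >/dev/null 2>&1; then
    # php -m prints the modules loaded by the command-line PHP binary
    if php -m | has_module mecab; then
        echo "mecab extension is loaded"
    else
        echo "mecab extension is NOT loaded"
    fi
else
    echo "php command-line binary not found"
fi
```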

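Testing MeCab
<?php
// Check that the extension was loaded from php.ini
if (extension_loaded('mecab')) {
    echo "mecab loaded :)";
} else {
    echo "something is wrong :(";
}

$str = "また、Tagger は Stream をくるくる回すのではなく、一括で文字列を解析するようなので、一旦";
$result = mecab_split($str);  // segment the sentence into words
print_r($result);             // show the segmented words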


The available php-mecab extension functions are listed in