Word cloud

I will be using these two projects: https://github.com/paul-nechifor/word_cloud and https://github.com/amueller/word_cloud

These are just installation notes.

Using Ubuntu 13.10

First, install scikit-learn (http://scikit-learn.org/stable/install.html) and some Python packages:

sudo apt-get install build-essential python-dev python-numpy python-setuptools python-scipy libatlas-dev libatlas3-base
sudo apt-get install python-matplotlib
sudo apt-get install python-pip
pip install -U scikit-learn
sudo apt-get install gcc build-essential python-pip python-dev unzip

necessary libraries

sudo apt-get install libjpeg-dev libfreetype6 libfreetype6-dev zlib1g-dev
sudo ln -s /usr/lib/x86_64-linux-gnu/libjpeg.so /usr/lib
sudo ln -s /usr/lib/x86_64-linux-gnu/libfreetype.so /usr/lib
sudo ln -s /usr/lib/x86_64-linux-gnu/libz.so /usr/lib

now this: http://www.pythonware.com/products/pil/

wget http://effbot.org/downloads/Imaging-1.1.7.tar.gz
tar zxvf Imaging-1.1.7.tar.gz
cd Imaging-1.1.7
vim setup.py
change the relevant lines (JPEG_ROOT, ZLIB_ROOT and FREETYPE_ROOT) from None to "/usr/lib" so that the libraries can be found
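
If you would rather not edit the file by hand, the same change can be scripted. This is only a sketch: it assumes the lines in question are the JPEG_ROOT, ZLIB_ROOT and FREETYPE_ROOT assignments in PIL 1.1.7's setup.py, and point_roots_at is a made-up helper name, not part of PIL.

```python
import re

# Assumption: these are the setup.py variables that control where PIL looks
# for libjpeg, zlib and FreeType.
ROOTS = ("JPEG_ROOT", "ZLIB_ROOT", "FREETYPE_ROOT")


def point_roots_at(setup_source, lib_dir="/usr/lib"):
    """Return setup.py source with each `NAME = None` line set to lib_dir."""
    for name in ROOTS:
        # Match only a whole line of the form `NAME = None`.
        pattern = r"^%s[ \t]*=[ \t]*None[ \t]*$" % name
        replacement = '%s = "%s"' % (name, lib_dir)
        setup_source = re.sub(pattern, lambda m: replacement,
                              setup_source, flags=re.M)
    return setup_source


# Small demonstration on a two-line sample instead of the real file.
sample = "TCL_ROOT = None\nJPEG_ROOT = None\n"
print(point_roots_at(sample))
```

To apply it for real, read setup.py into a string, pass it through point_roots_at and write the result back before running python setup.py install.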

python setup.py install

python selftest.py

install some fonts

sudo mkdir -p /usr/share/fonts/truetype/
git clone https://github.com/grays/droid-fonts.git
sudo mv droid-fonts/* /usr/share/fonts/truetype/

last prerequisite

pip install Cython

finally

wget https://github.com/paul-nechifor/word_cloud/archive/master.zip
unzip master.zip
cd word_cloud-master
sudo python setup.py install
cd ..
sudo rm -r word_cloud-master master.zip


see the examples folder in the repository for usage


I converted my files of tweets (Twitter), which were in JSON format, to plain-text tweets using a bit of C# code:

            // Display a SaveFileDialog so the user can choose where to save
            // the combined text file.
            SaveFileDialog saveFileDialog1 = new SaveFileDialog();
            saveFileDialog1.Filter = "Text|*.txt";
            saveFileDialog1.Title = "Save text File";
            saveFileDialog1.ShowDialog();

            // If the file name is not an empty string open it for saving.
            if (saveFileDialog1.FileName != "")
            {
                // Open the output file via a FileStream created by the OpenFile method.
                System.IO.FileStream fs =
                   (System.IO.FileStream)saveFileDialog1.OpenFile();

                StreamWriter sw = new StreamWriter(fs);
                DialogResult dr = this.openFileDialog1.ShowDialog();
                all = "";
                if (dr == System.Windows.Forms.DialogResult.OK)
                {
                    // Read the files
                    foreach (String file in openFileDialog1.FileNames)
                    {
                        textBox1.Text += file + Environment.NewLine;
                        string text = System.IO.File.ReadAllText(file);

                        // Wrap the top-level JSON array in an object so that
                        // JObject.Parse (Json.NET) accepts it.
                        string rep = @"{ info: " + text + "}";
                        dynamic stuff = JObject.Parse(rep);

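                        // Write the text of every tweet in the array
                        // (the original bound of Count - 1 skipped the last one).
                        for (int i = 0; i < stuff.info.Count; i++)
                        {
                            all = stuff.info[i].text + '\n';
                            sw.WriteLine(all);
                        }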


                        textBox1.Text += "Done!" + Environment.NewLine;
                        //MessageBox.Show("" + (stuff.info[0].id));
                        //MessageBox.Show("" + (stuff.info.Count - 1));
                        //MessageBox.Show(""+(stuff.info[stuff.info.Count - 1].id));
                    }
                }


                sw.Close();

                fs.Close();
            }
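
For anyone without a C# setup, roughly the same conversion can be done in Python. This is a sketch under the same assumption the C# code makes: each input file holds a JSON array of tweet objects with a "text" field. The function names are my own, and unlike the C# version it collapses line breaks inside a tweet so that every tweet ends up on one line.

```python
import io
import json


def tweets_to_lines(json_source):
    """Return the "text" field of every tweet in a JSON array of tweets."""
    tweets = json.loads(json_source)
    # Collapse whitespace (including line breaks inside a tweet) so that
    # each tweet ends up on exactly one line of output.
    return [" ".join(tweet["text"].split()) for tweet in tweets]


def convert(json_paths, out_path):
    """Write the text of all tweets in json_paths to one plain-text file."""
    with io.open(out_path, "w", encoding="utf-8") as out:
        for path in json_paths:
            with io.open(path, encoding="utf-8") as f:
                for line in tweets_to_lines(f.read()):
                    out.write(line + u"\n")
```

Usage would be something like convert(["tweets1.json", "tweets2.json"], "billgates.txt"), where the two input names are placeholders for your own files.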

Making the word cloud

Once the text of all the tweets is combined into a single file, you can write Python code (based on the examples) to generate the word clouds.

simple.py

#!/usr/bin/env python2

import wordcloud

# Read the combined tweet text.
text = open('billgates2.txt').read()
# Count the words, lay them out, then render the image.
words, counts = wordcloud.process_text(text)
elements = wordcloud.fit_words(words, counts)
wordcloud.draw(elements, 'billgates_2_simple.png')

more.py

#!/usr/bin/env python2

import wordcloud

# Same pipeline as simple.py, but with explicit settings for the number of
# words kept (max_features) and the size and scale of the output.
text = open('billgates2.txt').read()
words, counts = wordcloud.process_text(text, max_features=2000)
elements = wordcloud.fit_words(words, counts, width=500, height=500)
wordcloud.draw(elements, 'billgates_2_more.png', width=500, height=500, scale=2)
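
Before drawing, it can help to look at which words dominate the input, since those will be the biggest in the cloud. The sketch below counts them with the standard library. It is a standalone approximation, not how wordcloud.process_text works internally, and top_words is a name I made up.

```python
import re
from collections import Counter


def top_words(text, n=20):
    """Return the n most frequent lowercase words in text with their counts."""
    # Treat runs of letters and apostrophes as words; everything else splits.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Tiny demonstration on an inline string.
print(top_words("RT a b a c a b RT", 3))
```

Running it on the tweet file, for example top_words(open('billgates.txt').read()), lists the most frequent tokens so that junk ones stand out.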

Unfortunately, the text first needs to be cleaned of the following:

  • t.co (URL shortener)
  • bit.ly (URL shortener)
  • RT (Twitter retweet marker)
  • s (presumably left over from 's)
  • amp (left over from the HTML entity &amp;)

So use Unix sed to help:

cp billgates.txt billgates2.txt
sed -i 's/\bt\.co\b//g' billgates2.txt
sed -i 's/\bbit\.ly\b//g' billgates2.txt
sed -i 's/\bRT\b//g' billgates2.txt
sed -i "s/’[A-Za-z]//g" billgates2.txt
sed -i 's/\bamp\b//g' billgates2.txt

Note that the fourth sed line removes an apostrophe together with the letter that follows it in a word. I also had some trouble with that one: the apostrophe in the text (’) differs slightly from the one you get by typing ('), which means you have to copy the apostrophe directly from the text.
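
The same cleanup can also be done in Python, which sidesteps the apostrophe copy-and-paste problem by writing the curly apostrophe as \u2019. This is a sketch rather than an exact port of the sed lines: it assumes the text is already decoded to Unicode, uses word boundaries so that words like "SMART" or "example" are left alone, and removes all letters after an apostrophe rather than just one. clean_tweet_text is a made-up name.

```python
import re

CLEANUP_PATTERNS = [
    r"\bt\.co\b",              # t.co short-link host
    r"\bbit\.ly\b",            # bit.ly short-link host
    r"\bRT\b",                 # retweet marker
    u"[\u2019'][A-Za-z]+",     # curly or straight apostrophe plus the letters after it
    r"\bamp\b",                # residue of the HTML entity &amp;
]


def clean_tweet_text(text):
    """Remove the junk tokens listed above from text."""
    for pattern in CLEANUP_PATTERNS:
        text = re.sub(pattern, "", text)
    return text


# Tiny demonstration on an inline string.
print(clean_tweet_text(u"RT it\u2019s a SMART example"))
```

Only the tokens themselves are removed, so the rest of a short link (http:// and the path) stays in the text, just as it does with the sed version.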

Now it looks like this: