Corpus Linguistics Tutorial: Using SocSciBot 4 Cyclist for Text Analysis/Basic Corpus Linguistics

Overview

SocSciBot 4 Cyclist is a program that produces word frequency statistics and a concordance/search engine interface for the web sites downloaded by SocSciBot 4 web site crawls. This tutorial goes through the steps needed to obtain a word frequency vocabulary from a set of web sites. The first few steps are similar to those in Tutorial 1, so if you have completed that tutorial, you may be able to skip through to Step 4.

Step 1: Installing SocSciBot 4

  1. Go to the SocSciBot web site http://socscibot.wlv.ac.uk/ and follow the link to download SocSciBot 4 if you agree with the conditions of use. Save SocSciBot 4 in a location with plenty of free storage space for the data it will collect. This will typically be your computer's hard drive, e.g. the C: drive.

Step 2: Crawling a first site with SocSciBot 4

  1. Start up SocSciBot 4 by double clicking on the file called either SocSciBot or SocSciBot.exe in the folder where you unzipped it on your computer. This should produce a dialog box similar to the one below.
  2. Confirm that the folder chosen by SocSciBot 4 to store your data is acceptable by clicking OK. Also enter your correct email address, if a box is provided for this. It will be used to email the webmasters of any sites that you crawl. This is ethical practice and may also save you from getting into trouble if a webmaster is unhappy with you crawling their site: they can email you directly instead of emailing your boss or network manager. You can also enter a message to be included in the email to give the purpose of the crawl. You may wish to include the URL of a page with additional information about your project. Also, answer any questions about the location of Microsoft Excel and Pajek. These are not needed for Cyclist, only for the link analysis carried out by SocSciBot Tools.
  3. Enter small test as the name of the project at the bottom of the next dialog box, Wizard Step 1, and then click on the start new project button. All crawls are grouped together into projects. This allows you to have different named groups of crawls which are analysed separately.
  4. In the Wizard Step 2 dialog box, enter http://linkanalysis.wlv.ac.uk/ as the starting URL of the site to crawl, and then click the Crawl Site with SocSciBot button.
  5. The crawl is ready to go. Click the Crawl Site button. After half a minute or so, the crawl should end.
  6. You can read information about the crawl in the title bar at the top during the crawl and also at the end of the crawl.

  7. Click OK to shut down SocSciBot when the crawl is complete. You have now crawled all pages on the http://linkanalysis.wlv.ac.uk site. Before carrying out some simple analyses, you will crawl two more sites in the next step.

Step 3: Crawling two more sites with SocSciBot 4

  1. Start up SocSciBot again by double clicking on the SocSciBot or SocSciBot.exe file in the folder where you unzipped it on your computer. This should take you straight through to Wizard Step 1. Click on small test in the project list to select it as the project for the next crawl.
  2. Enter http://cybermetrics.wlv.ac.uk/ as the URL of the second site to crawl in the Wizard Step 2 dialog box (not shown), and then click the Crawl Site with SocSciBot button to go to the main crawling screen.
  3. Click on the Crawl Site button on the next screen (not shown) and wait for the crawler to finish.
  4. Click OK to end the crawl.
  5. Repeat steps 1 to 4 for the URL http://socscibot.wlv.ac.uk/
  6. You have now successfully crawled three web sites and are ready to analyse them!

Step 4: Using Cyclist as a concordancer

  1. Start up SocSciBot 4 by double clicking on the file called either SocSciBot or SocSciBot.exe. Select your project and then start Cyclist by clicking on the Cyclist button in the SocSciBot Wizard Step 2. Cyclist is a text search engine, not a link analysis program.
  2. You will be asked whether you want to import a stoplist. Answer No to this question.
  3. After 20 seconds or so of calculations you will get a standard search engine type interface. Try searching for a common word, such as "link", and then click on the results on the right-hand side to see what information you are given. In the example below, 77 pages in the project contain the word link (or links) and the first 10 are listed, with some extra information about them.
  4. Navigation to other results pages is possible. Use the arrow buttons to go to the next or previous pages, or type the number of a result to skip to a different page.
  5. More information about each page can be found by clicking on the results in the right-hand side of the screen.
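
To illustrate what a concordancer does in general, here is a minimal Python sketch of a keyword-in-context (KWIC) search over a piece of text. It is not how Cyclist itself is implemented; the function name kwic, the width parameter and the sample sentence are all invented for this illustration. Unlike the Cyclist search described above, this sketch matches only the exact word, so a search for "link" does not also find "links".

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for each whole-word match of keyword."""
    lines = []
    # \b restricts matches to whole words; matching ignores case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{match.group()}]{right}")
    return lines

# Hypothetical sample text, standing in for the text of a crawled page.
sample = "A link connects two pages. Every Link on a page can be counted."
for line in kwic(sample, "link", width=15):
    print(line)
```

Each output line shows one occurrence of the search word with a little of the text on either side, which is the basic view a concordance gives of how a word is used.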

Step 5: Obtaining the corpus statistics

Cyclist calculates various word frequency statistics, as follows.

For a small site, a spreadsheet such as Excel is useful to sort the vocabulary.
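
If you prefer a script to a spreadsheet, the same kind of sorting can be sketched in a few lines of Python. This is only an illustration of counting and ranking words: it does not read Cyclist's output files, and the function name word_frequencies, the simple word pattern and the two sample page texts are assumptions made for this example.

```python
import re
from collections import Counter

def word_frequencies(pages):
    """Count how often each lowercased word occurs across a list of page texts."""
    counts = Counter()
    for text in pages:
        # A deliberately simple definition of a word: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Hypothetical page texts, standing in for the text of crawled pages.
pages = [
    "Link analysis counts links between web sites.",
    "A web crawler follows each link it finds.",
]
vocabulary = word_frequencies(pages)

# Sort by descending frequency, breaking ties alphabetically.
ranked = sorted(vocabulary.items(), key=lambda item: (-item[1], item[0]))
for word, count in ranked[:5]:
    print(word, count)
```

Sorting on the pair (negative count, word) puts the most frequent words first and lists words with equal counts in alphabetical order, much as a two-column sort would in a spreadsheet.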

Other information and options

Setting up your own project and crawling the sites

Start by setting up a new project at the initial SocSciBot 4 startup screen, giving it a name appropriate to your research. Do not add the crawls to an existing project because this will jumble up the link analysis results. To set up a new project, start SocSciBot and, in the first screen, enter the name in the box at the bottom and click on the Create New Project button instead of clicking on an existing project name. Once you have selected the new project, you are ready to start crawling the sites by entering their home page URLs in the second SocSciBot screen. For each new site in your data set, you will need to shut down SocSciBot at the end of the previous crawl and then start it up again. Consult the instructions in Tutorial 1 if you have difficulty with crawling your sites.

Excluding unwanted pages from the sites

You may find that your crawl of the sites has found too many words. For example, the crawl may have included both the Dutch language version of a site and the English language version of the site, but you only want the English language version. You can exclude pages from the crawling and/or vocabulary creation using the banned list feature. This excludes URLs from a crawl based upon patterns. You must register URLs as unwanted by specifying either the full URL of each page or the leftmost part of all URLs that need to be excluded. For example, if a web site had an English and a Dutch part, with all Dutch pages having URLs beginning with http://www.wlv.ac.uk/du/ then registering http://www.wlv.ac.uk/du/ would at a stroke exclude all of the Dutch pages.
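
The leftmost-part matching described above can be pictured with a short Python sketch. It only illustrates the idea of prefix matching and is not SocSciBot's own code; the function name is_banned and the individual page names are made up for the example.

```python
def is_banned(url, banned_prefixes):
    """Return True if the URL starts with any registered banned prefix."""
    return any(url.startswith(prefix) for prefix in banned_prefixes)

# The banned list holds full URLs or the leftmost part of URLs to exclude.
banned = ["http://www.wlv.ac.uk/du/"]

# Hypothetical page URLs; only the /du/ prefix comes from the example above.
urls = [
    "http://www.wlv.ac.uk/index.html",
    "http://www.wlv.ac.uk/du/index.html",
    "http://www.wlv.ac.uk/du/contact.html",
]

kept = [url for url in urls if not is_banned(url, banned)]
print(kept)
```

Because a full URL is simply the longest possible prefix of itself, the same test covers both ways of registering an unwanted page.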

Selecting different options for indexing the sites

If you need a different type of word stemming for the sites, several options are available, including the Porter Algorithm. To create a new vocabulary for a different word stemming option you will need to re-index the site. To do this, in Cyclist select Make Index from the GoTo menu. Select the options you want and click on the 1. Make Index button. The new vocabulary will be created, overwriting the old vocabulary. When this process is complete, select Search Engine Interface from the GoTo menu.
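
To see why the choice of stemming changes the vocabulary, consider the rough Python sketch below. The naive_stem function is a deliberately crude suffix stripper invented for this illustration; it is not the Porter Algorithm and not any of Cyclist's stemming options, and the word list is made up.

```python
from collections import Counter

def naive_stem(word):
    """Strip a few common English suffixes. Far cruder than the Porter Algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters of the word would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical words, standing in for the words found on crawled pages.
words = ["link", "links", "linked", "linking", "crawl", "crawls"]

unstemmed = Counter(words)
stemmed = Counter(naive_stem(word) for word in words)

print(len(unstemmed), "entries without stemming")
print(len(stemmed), "entries with stemming:", dict(stemmed))
```

With stemming, different forms of a word are merged into one vocabulary entry and their counts are added together, so a different stemming option can give a noticeably different word frequency list for the same pages.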