Corpus Linguistics Tutorial: Using SocSciBot 4 Cyclist for Text Analysis/Basic Corpus Linguistics
SocSciBot 4 Cyclist is a program that produces word frequency statistics and a concordance/search engine interface for web sites downloaded by SocSciBot 4 crawls. This tutorial goes through the steps needed to obtain a word frequency vocabulary from a set of web sites. The first few steps are similar to those in Tutorial 1, so if you have completed that tutorial, you may be able to skip through to Step 4.
Step 1: Installing SocSciBot 4
- Go to the SocSciBot web site http://socscibot.wlv.ac.uk/ and follow the link to download SocSciBot 4 if you agree with the conditions of use. Save SocSciBot 4 in a location with plenty of storage space for the data it will collect. This will typically be your computer's hard drive, e.g. the C: drive.
Step 2: Crawling a first site with SocSciBot 4
- Start up SocSciBot 4 by double clicking on the file called either SocSciBot or SocSciBot.exe in the folder where you unzipped it. This should produce a dialog box similar to the one below.
- Confirm that the folder chosen by SocSciBot 4 to store your data is acceptable by clicking OK. Also enter your correct email address, if a box is provided for this. It will be used to email the webmasters of any sites that you crawl. This is ethical practice and may also save you from getting into trouble if a webmaster is unhappy with you crawling their site - they can email you directly instead of emailing your boss or network manager. You can also enter a message to be included in the email to give the purpose of the crawl. You may wish to include the URL of a page with additional information about your project. Also, answer any questions about the location of Microsoft Excel and Pajek. These are not needed for Cyclist, only for the link analysis carried out by SocSciBot Tools.
- Enter small test as the name of the project at the bottom of the next dialog box, Wizard Step 1, and then click on the start new project button. All crawls are grouped together into projects. This allows you to have different named groups of crawls which are analysed separately.
- In the Wizard Step 2 dialog box, enter http://linkanalysis.wlv.ac.uk/ as the starting URL of the site to crawl, and then click the Crawl Site with SocSciBot button.
- The crawl is ready to go. Click the Crawl Site button. After half a minute or so, the crawl should end.
- Click OK to shut down SocSciBot when the crawl is complete. You have now crawled all pages on the http://linkanalysis.wlv.ac.uk site. Before doing some interesting simple analyses, the next stages will crawl two more sites.
You can read information about the crawl in the title bar at the top during the crawl and also at the end of the crawl.
Step 3: Crawling two more sites with SocSciBot 4
- Start up SocSciBot again by double clicking on the SocSciBot or SocSciBot.exe file in the folder where you unzipped it. This should take you straight through to Wizard Step 1. Click on small test in the project list to select it as the project for the next crawl.
- Enter http://cybermetrics.wlv.ac.uk/ as the URL of the second site to crawl in the Wizard Step 2 dialog box (not shown), and then click the Crawl Site with SocSciBot button to go to the main crawling screen.
- Click on the Crawl Site button on the next screen (not shown) and wait for the crawler to finish.
- Click OK to end the crawl.
- Repeat the four actions above for the URL http://socscibot.wlv.ac.uk/
- You have now successfully crawled three web sites and are ready to analyse them!
Step 4: Searching the crawled sites with Cyclist
- Start up SocSciBot 4 by double clicking on the file called either SocSciBot or SocSciBot.exe. Select your project and then start Cyclist by clicking on the Cyclist button in the SocSciBot Wizard Step 2. Cyclist is a text search engine, not a link analysis program.
- You will be asked whether you want to import a stoplist. Answer No to this question.
- After 20 seconds or so of calculations you will get a standard search engine type interface. Try searching for a common word, such as "link", and then clicking on the results in the right-hand side to see what information you are given. In the example below, 77 pages in the project contain the word link (or links) and the first 10 are listed, with some extra information about them.
- Navigation to other results pages is possible. Use the arrow buttons to go to the next or previous pages, or type the number of a result to skip to a different page.
- More information about each page can be found by clicking on the results in the right-hand side of the screen.
- Clicking on the top line of a result will load the original page into Internet Explorer (with some extra information about the page at the top, which you probably are not interested in).
- Clicking on the second line of a result will load the HTML source code of the original page into Notepad (with some extra information about the page at the top, which you probably are not interested in).
- Clicking on the third and last line of a result will show some additional words from the page.
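The kind of lookup a text search engine performs can be sketched in a few lines of Python. This is only an illustration of the general idea (an inverted index from words to pages), not Cyclist's actual implementation. The page URLs and texts are invented, and the function names `tokenize`, `build_index` and `search` are hypothetical. Unlike the search described above, this sketch does not treat "link" and "links" as the same word.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, word):
    """Return the sorted URLs of the pages containing the word."""
    return sorted(index.get(word.lower(), set()))

# Invented pages standing in for a crawled project (assumption).
pages = {
    "http://example.org/a.html": "Link analysis studies each link between pages.",
    "http://example.org/b.html": "A crawler downloads pages.",
    "http://example.org/c.html": "Follow the link to download the program.",
}

index = build_index(pages)
hits = search(index, "link")
print(len(hits), "pages contain 'link':", hits)
```

Building the index once and then looking words up in it is what makes repeated searches fast, which may be why a search tool needs some calculation time before its interface appears.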
Step 5: Obtaining the corpus statistics
Cyclist calculates various word frequency statistics, as follows.
- For a full list of words that occur in any web pages of any web site in the project, select the menu item Info | Save Word Frequency Summary of Whole Project. You will then be prompted for a file name and the entire vocabulary will be saved to that file, together with the number of occurrences of each word.
- For a full list of words that occur in the web pages of each site in the project, select the menu item Info | Save Word Frequency Summary of Each Individual Domain. You will then be prompted for a file name and the entire vocabulary of each site will be saved to a separate file (with names based upon the file name you enter), together with the number of occurrences of each word. The first number in each line is the ID of the word, which you will probably not need.
- For the total number of words in the project, or the total number of unique words, select Info | Word Count For all Sites. Note that the total number of words in each separate site is given in the files created by Info | Save Word Frequency Summary of Each Individual Domain as above.
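To make the three statistics above concrete, here is a minimal Python sketch that computes per-site word frequencies, whole-project word frequencies, and the total and unique word counts. It is not how Cyclist works internally and does not read Cyclist's output files. The site names and texts are invented, and the simple `tokenize` rule (lower-case alphabetic words only) is an assumption.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Invented text for two sites in a project (assumption).
site_texts = {
    "site-a.example": "Links connect pages. Pages contain links and text.",
    "site-b.example": "Text analysis counts words in text.",
}

# Word frequencies for each individual site.
per_site = {site: Counter(tokenize(text)) for site, text in site_texts.items()}

# Word frequencies for the whole project: the sum of the per-site counts.
whole_project = Counter()
for counts in per_site.values():
    whole_project.update(counts)

total_words = sum(whole_project.values())   # every occurrence counted
unique_words = len(whole_project)           # size of the vocabulary

print("Total words:", total_words)
print("Unique words:", unique_words)
for site, counts in per_site.items():
    print(site, "has", sum(counts.values()), "words")
```

The distinction between total words (every occurrence) and unique words (vocabulary size) is the same one the menu options above report.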
Other information and options
Setting up your own project and crawling the sites
Start by setting up a new project at the initial SocSciBot 4 startup screen, giving it a name appropriate to your research. Do not add the crawls to an existing project because this will jumble up the link analysis results. To set up a new project, start SocSciBot and, in the first screen, instead of clicking on an existing project name, enter the new name in the box at the bottom and click on the Create New Project button. Once you have selected the new project, you are ready to start crawling the sites by entering their home page URLs in the second SocSciBot screen. For each new site in your data set, you will need to shut down SocSciBot at the end of the previous crawl and then start it up again. Consult the instructions in Tutorial 1 if you have difficulty with crawling your sites.
Excluding unwanted pages from the sites
You may find that your crawl of the sites has found too many words. For example, the crawl may have included both the Dutch language version of a site and the English language version of the site, but you only want the English language version. You can exclude pages from the crawling and/or vocabulary creation using the banned list feature. This excludes URLs from a crawl based upon patterns. You must register URLs as unwanted by specifying either the full URL of each page or the leftmost part of all URLs that need to be excluded. For example, if a web site had an English and a Dutch part, with all Dutch pages having URLs beginning with http://www.wlv.ac.uk/du/ then registering http://www.wlv.ac.uk/du/ would at a stroke exclude all of the Dutch pages.
- Pages can be excluded before or during a SocSciBot 4 crawl by clicking on the Pause and Ban URLs button. Please read the instructions carefully as this is tricky to get right.
- Pages can be excluded after a SocSciBot 4 crawl, but only by using the SocSciBot Tools program. To exclude pages after a crawl, start SocSciBot Tools from the SocSciBot 4 Wizard Step 2 and select Add Extra Banned Pages from the Banned menu. It is a good idea to back up your crawl data (e.g. by taking a zip file copy of the project directory) before trying any data cleansing, in case it goes wrong. It is also a good idea to try it out on some small sites first.
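The leftmost-part matching described above can be illustrated with a short Python sketch. It shows only the general idea of excluding URLs by prefix and is not SocSciBot's own code. The function name `is_banned` is hypothetical, the individual page URLs are invented, and case-sensitive matching is an assumption.

```python
def is_banned(url, banned_prefixes):
    """Return True if the URL starts with any registered banned prefix."""
    return any(url.startswith(prefix) for prefix in banned_prefixes)

# The prefix from the example above; a full URL would work the same way.
banned = ["http://www.wlv.ac.uk/du/"]

# Invented page URLs for illustration (assumption).
urls = [
    "http://www.wlv.ac.uk/index.html",
    "http://www.wlv.ac.uk/du/index.html",
    "http://www.wlv.ac.uk/du/contact.html",
    "http://www.wlv.ac.uk/research/index.html",
]

kept = [u for u in urls if not is_banned(u, banned)]
for u in kept:
    print(u)
```

A single registered prefix removes every page beneath it, which is why it is worth checking the prefix carefully before applying it to real crawl data.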
Selecting different options for indexing the sites
If you need a different type of word stemming in the sites, there are some available options, including the Porter Algorithm. To create a new vocabulary for a different word stemming option you will need to re-index the site. To do this, in Cyclist select Make Index from the GoTo menu. Select the options you want and click on the 1. Make Index button. The new vocabulary will be created, overwriting the old vocabulary. When this process is complete, select Search Engine Interface from the GoTo menu.
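To see why a different stemming option produces a different vocabulary, consider the Python sketch below. It uses a deliberately crude suffix stripper, which is not the Porter Algorithm and not any of Cyclist's actual options. The function name `naive_stem` and the word list are invented for illustration.

```python
from collections import Counter

def naive_stem(word):
    """Strip a few common English suffixes.

    A crude stand-in for a real stemmer; far simpler than the Porter Algorithm.
    """
    for suffix in ("ing", "s"):
        # Only strip when at least three letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Invented sample of words from a crawl (assumption).
words = ["link", "links", "linking", "crawl", "crawls", "page", "pages"]

unstemmed = Counter(words)
stemmed = Counter(naive_stem(w) for w in words)

print("Vocabulary size without stemming:", len(unstemmed))
print("Vocabulary size with naive stemming:", len(stemmed))
print(dict(stemmed))
```

Because stemming merges word forms, the vocabulary shrinks and the counts for each remaining entry grow, so frequency lists produced under different stemming options are not directly comparable.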