Corpus Linguistics Tutorial: Using SocSciBot and Cyclist for Text Analysis/Basic Corpus Linguistics
Cyclist is a program that produces word frequency statistics and a concordance/search engine interface for the web sites downloaded by SocSciBot. This tutorial goes through the steps needed to obtain a word frequency vocabulary from a set of web sites. The first few steps are similar to those in Tutorial 1, so if you have completed that tutorial, you may be able to skip to Step 4.
Step 1: Installing SocSciBot, SocSciBot Tools and Cyclist
- Go to the SocSciBot web site http://socscibot.wlv.ac.uk/ and, if you agree with the conditions of use, follow the link to download the programs. When prompted by your computer, save the programs somewhere with plenty of storage space for data. This will typically be your computer's hard drive, e.g. the C: drive.
- Next, unzip the file SocSciBotAll.zip from the place where you saved it. This will create several new files, including the programs SocSciBot, SocSciBot Tools, and Cyclist.
Step 2: Crawling a first site with SocSciBot
- Start up SocSciBot by double clicking on the file called either SocSciBot or SocSciBot.exe where you unzipped it to on your computer. This should produce a dialog box similar to the one below.
- Confirm that the folder chosen by SocSciBot to store your data is acceptable by clicking OK. Also enter your correct email address, if a box is provided for this. It will be used to email the webmasters of any sites that you crawl. This is ethical practice and may also save you from getting into trouble if a webmaster is unhappy with you crawling their site: they can email you directly instead of emailing your boss or network manager. You can also enter a message to be included in the email to give the purpose of the crawl. You may wish to include the URL of a page with additional information about your project. Also, answer any questions about the location of Microsoft Excel and Pajek. These are not needed for Cyclist, only for the link analysis software in SocSciBot Tools, which is an integrated part of the software suite.
- Enter small test as the name of the project at the bottom of the next dialog box, Wizard Step 1, and then click on the start new project button. All crawls are grouped together into projects. This allows you to have different named groups of crawls which are analysed separately.
- Click No to answer the strange question that you are asked next. This is an advanced data cleansing facility that you are unlikely to need before you become an expert user.
- Click No to answer the second strange question that you are asked next. This is another advanced facility that you are unlikely to need before you become an expert user.
- In the wizard step 2 dialog box, enter http://linkanalysis.wlv.ac.uk/ as the starting URL of the site to crawl, and then click Start a new crawl of this site.
- The crawl is ready to go. Click the Crawl Site button. After half a minute or so, the crawl should end.
You can read information about the crawl in the title bar at the top during the crawl and also at the end of the crawl.
- Click Yes to shut down SocSciBot when the crawl is complete. You have now crawled all pages on the http://linkanalysis.wlv.ac.uk site. Before doing some interesting simple analyses, the next stages will crawl two more sites.
Step 3: Crawling two more sites with SocSciBot
- Start up SocSciBot again by double clicking on the SocSciBot or SocSciBot.exe file in the folder where you unzipped it. This should take you straight through to Wizard step 1. Click on small test to select it as the project for the next crawl.
- Enter http://cybermetrics.wlv.ac.uk/ as the URL of the second site to crawl, and click Start a new crawl of this site.
- Click on the Crawl site button on the next screen and wait for the crawler to finish.
- Click Yes to end the crawl.
- Repeat steps 1 to 4 above for the URL http://socscibot.wlv.ac.uk/
- You have now successfully crawled three web sites and are ready to analyse them!
Step 4: Using Cyclist as a concordancer
- Start up Cyclist by double clicking on the file called either Cyclist or Cyclist.exe where you unzipped it to on your computer. Cyclist serves the functions of a concordancer/search engine, and of a corpus word frequency analyser.
- Answer the questions and after 20 seconds or so of calculations you will get a standard concordancer/search engine type interface. Try searching for a common word, like "link". In the example below, 41 pages in the project contain the word link (or links) and the first 10 are listed, with some extra information about them.
- Navigation to other results pages is possible. Use the arrow buttons to go to the next or previous pages, or type the number of a result to skip to a different page.
- More information about each page can be found by clicking on the results in the right-hand side of the screen.
- Clicking on the top line of a result will load the original page into Internet Explorer (with some extra information about the page at the top, which you probably are not interested in).
- Clicking on the second line of a result will load the HTML source code of the original page into Notepad (with some extra information about the page at the top, which you probably are not interested in).
- Clicking on the third and last line of a result will show some additional words from the page.
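To make the idea of a concordancer concrete, here is a minimal Python sketch of a keyword-in-context search over page text. It is not how Cyclist is implemented: the function name concordance, the pages dictionary, and the example.org URLs are all hypothetical, and unlike the search described above it matches only the exact word, not variants such as "links".

```python
import re


def concordance(pages, term, width=30):
    """Return (url, context) pairs for every occurrence of `term`.

    `pages` is a hypothetical mapping from a URL to the plain text of
    that page; `width` is the number of characters of context kept on
    each side of the match.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for url, text in pages.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            hits.append((url, text[start:end]))
    return hits


# Invented sample data, for illustration only.
pages = {
    "http://example.org/a": "A link joins two pages. Each link has a source.",
    "http://example.org/b": "This page is about crawlers.",
}

for url, context in concordance(pages, "link", width=15):
    print(url, "|", context)
```

Each hit pairs the page URL with a short window of surrounding text, which is roughly the information a results list needs in order to show a word in context.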
Step 5: Obtaining the corpus statistics
Cyclist calculates various word frequency statistics, as follows.
For a small site, a spreadsheet such as Excel is useful to sort the vocabulary.
- For a full list of words that occur in any web pages of any web site in the project, select the menu item Info | Save Word Frequency Summary of Whole Project. You will then be prompted for a file name and the entire vocabulary will be saved to that file, together with the number of occurrences of each word.
- For a full list of words that occur in the web pages of each site in the project, select the menu item Info | Save Word Frequency Summary of Each Individual Domain. You will then be prompted for a file name and the entire vocabulary of each site will be saved to a separate file (names based upon the file name you enter), together with the number of occurrences of each word. The first number in each line is the ID of the word, which you will probably not need.
- For the total number of words in the project, or the total number of unique words, select Info | Word Count For all Sites. Note that the total number of words in each separate site is given in the files created by Info | Save Word Frequency Summary of Each Individual Domain as above.
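The statistics described above (occurrences of each word, total words, and total unique words) can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than Cyclist's own method: the tokenisation rule (lowercased runs of the letters a to z), the function name word_frequencies, and the sample texts are all invented here.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count word occurrences across an iterable of plain-text strings.

    Words are taken to be lowercased runs of the letters a-z; this
    tokenisation rule is an assumption made for this sketch.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


# Invented sample texts standing in for the pages of a project.
site_texts = ["Links between web sites", "Web links and web pages"]

counts = word_frequencies(site_texts)
total_words = sum(counts.values())   # all word occurrences
unique_words = len(counts)           # size of the vocabulary

# Print the vocabulary sorted by descending frequency.
for word, n in counts.most_common():
    print(word, n)
print("total:", total_words, "unique:", unique_words)
```

Sorting by frequency, as most_common does here, is the same operation you would otherwise perform on the saved vocabulary in a spreadsheet.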
Other information and options
Setting up your own project and crawling the sites
Start by setting up a new project at the initial SocSciBot startup screen, giving it a name appropriate to your research. Do not add the crawls to an existing project because this will jumble up the link analysis results. In order to set up a new project, when you start SocSciBot, in the first screen, instead of clicking on an existing project name, enter the name in the box at the bottom and click on the Create New Project button. Once you have selected the new project, you are ready to start crawling the sites by entering their home page URLs in the second SocSciBot screen. For each new site in your data set, you will need to shut down SocSciBot at the end of the previous crawl and then start it up again. Consult the instructions in Tutorial 1 if you have difficulty with crawling your sites.
Excluding unwanted pages from the sites
You may find that your crawl of the sites has found too many words. For example, the crawl may have included both the Dutch language version of a site and the English language version of the site, but you only want the English language version. You can exclude pages from the crawling and/or vocabulary creation using the banned list feature. This excludes URLs from a crawl based upon patterns. You must register URLs as unwanted by specifying either the full URL of each page or the leftmost part of all URLs that need to be excluded. For example, if a web site had an English and a Dutch part, with all Dutch pages having URLs beginning with http://www.wlv.ac.uk/du/ then registering http://www.wlv.ac.uk/du/ would at a stroke exclude all of the Dutch pages.
- Pages can be excluded before or during a SocSciBot crawl by clicking on the Pause and Ban URLs button. Please read the instructions carefully as this is tricky to get right.
- Pages can be excluded after a SocSciBot crawl, but only by using the SocSciBot Tools program. To exclude pages after a crawl, start SocSciBot Tools (which is in the same zip file as SocSciBot and Cyclist) and click on the Data Cleansing button at the bottom of the screen after you have selected your project. It is a good idea to back up your crawl data (e.g. by taking a zip file copy of the project directory) before trying any data cleansing, in case it goes wrong. It is also a good idea to try it out on some small sites first.
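The leftmost-part matching described above can be sketched in a few lines of Python. This only illustrates the idea of prefix-based exclusion: the function name is_banned and the sample page URLs are hypothetical, and SocSciBot's actual matching rules (for example, how it treats upper and lower case) may differ.

```python
def is_banned(url, banned_prefixes):
    """Return True if `url` starts with any registered banned prefix.

    A full page URL is simply a prefix that matches only that page
    (and anything beneath it), so both cases are handled the same way.
    """
    return any(url.startswith(prefix) for prefix in banned_prefixes)


# The prefix comes from the example in the text; the page URLs are invented.
banned = ["http://www.wlv.ac.uk/du/"]
urls = [
    "http://www.wlv.ac.uk/index.html",
    "http://www.wlv.ac.uk/du/index.html",
    "http://www.wlv.ac.uk/du/contact.html",
    "http://www.wlv.ac.uk/research/index.html",
]

kept = [u for u in urls if not is_banned(u, banned)]
for u in kept:
    print(u)
```

Note that a prefix without its trailing slash, such as http://www.wlv.ac.uk/du, would also match unrelated URLs like http://www.wlv.ac.uk/dutch.html under this simple rule, which is one reason to check a banned pattern carefully before relying on it.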
Selecting different options for indexing the sites
If you need a different type of word stemming for the sites, several options are available, including the Porter Algorithm. To create a new vocabulary with a different word stemming option you will need to re-index the site. To do this, in Cyclist select Make Index from the GoTo menu. Select the options you want and click on the 1. Make Index button. The new vocabulary will be created, overwriting the old vocabulary. When this process is complete, select Search Engine Interface from the GoTo menu.
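To see why the stemming option changes the vocabulary, the following Python sketch compares word counts before and after stemming. The stemmer here is a deliberately naive suffix stripper written for this illustration; it is not the Porter Algorithm and not any of Cyclist's options, and the function name naive_stem and the word list are invented.

```python
from collections import Counter


def naive_stem(word):
    """Strip a trailing "ing" or "s" if at least 3 letters remain.

    A toy rule for illustration only; real stemmers such as the
    Porter Algorithm apply many more rules and conditions.
    """
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Invented sample of words as they might appear in crawled pages.
words = ["link", "links", "linking", "page", "pages"]

unstemmed = Counter(words)
stemmed = Counter(naive_stem(w) for w in words)

print("without stemming:", dict(unstemmed))
print("with stemming:   ", dict(stemmed))
```

Without stemming each word form is a separate vocabulary entry; with stemming the forms are merged and their counts added together, so the vocabulary becomes smaller while the total number of words stays the same.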