
Tutorial 2: Mini Link Analysis Research Project Case Study


This tutorial goes through the stages of a very small pretend link analysis research project, from the initial crawling to analyzing the link data. The project is designed to give you an easy way to learn how to use SocSciBot and SocSciBot Tools for a standard research project. Tutorial 3 gives more general advice on research projects to assist you in your own project, especially if it has significantly different goals or information needs from this one.

Setting up a new project and crawling the sites

The project involves crawling and analysing a small number of web sites. These will be similar to the web sites that were crawled in the first project (Tutorial 1), because they need to be sites that I know will not disappear.

The sites are the following. Please set up a new project called Sample Research Project and crawl the sites below in this new project. In order to set up a new project, when you start SocSciBot, in the first screen, instead of clicking on an existing project name, enter the name in the box at the bottom and click on the Create New Project button. Consult the instructions in Tutorial 1 if you have difficulty with crawling the sites.

Data cleansing

When you have crawled all of the above sites, start SocSciBot Tools and select the new project, Sample Research Project. The first, and most time-consuming, step of data analysis is to identify and eliminate anomalies from the data set. Ideally, each page downloaded needs to be checked to make sure that it matches the criteria for your research project. For example, many research projects have required that all pages in a site were created by the owners of the site and are not copies of other people’s web pages. A very big copy of someone else’s web site with many links can make a big difference to the results. As an example of this, the Linux Gazette web site is copied to many university web servers, as is a lot of computer software documentation. These computing copies typically have a link to their official home page and these links can spoil any link analysis.

Since the problems occur if there are many links from a single source, the best way to look for copied pages is to start by identifying the most highly targeted pages. From the SocSciBot Tools main reports menu, there are two reports of the most highly targeted pages in the data set: Known external links with counts, which lists all the links between sites in the data set, and Unknown external links with counts, which lists links to sites outside of the data set. View both of these by clicking on them and then clicking on the view report button.
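The idea behind the "with counts" reports can be sketched in a few lines of Python. This is only an illustration of counting how often each target is linked to, not how SocSciBot Tools is implemented; the function name and the example (source, target) pairs are hypothetical.

```python
from collections import Counter

def most_targeted(links, top_n=10):
    """Return the top_n link targets with their counts, highest first.

    `links` is an iterable of (source_url, target_url) pairs. This input
    shape is an assumption made for the sketch.
    """
    counts = Counter(target for _source, target in links)
    return counts.most_common(top_n)

# Hypothetical example data: two pages link to the same target.
example_links = [
    ("a.ac.uk/page1.htm", "x.co.uk"),
    ("a.ac.uk/page2.htm", "x.co.uk"),
    ("b.ac.uk/index.htm", "y.co.uk"),
]
top = most_targeted(example_links, top_n=1)
```

A target with an unusually high count is a candidate for manual checking, since it may come from a copied site.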


In both lists you should see that no page is very highly targeted, but scanning one of the lists you will see a strange long list of links to .co.uk sites. Could this be an anomaly? If the analysis will include these links, it is important to find out, because so many links are involved. To decide, you will need to visit the pages and see why the links have been created. Finding the pages is time consuming: you will have to check the link structure files manually to find them.

Pick one of the strange link URLs, say .pma.co.uk. To view the link structure files, select Original link structure files for the whole project from the drop down menu in the middle of the screen.

Click on each file name in turn. A Notepad window will appear containing the link structure of the site. The file contains the URL of each page (with any initial http:// or http://www removed) and the URL of each link (tab-indented, and listed above the URL of the page that it came from). You should find the URL .pma.co.uk in one of these files. (Tip: use Notepad’s Find facility (Edit|Find) to search for the URL.)

Scrolling down from .pma.co.uk, you will find that the URL comes from the page .scit.wlv.ac.uk/~cm1993/penn/p.htm (i.e. http://www.scit.wlv.ac.uk/~cm1993/penn/p.htm). If you visit this page you will see that it is part of a larger site, known as the Penn UK Business Directory, which is simply an enormous listing of UK .co.uk web sites (part of an old research project). Suppose that you decide that this set of pages should not be included in your data set, because it is a copy of a site from other locations. The whole mini-site can be excluded using SocSciBot Tools’s data cleansing feature: the Banned List, as follows.
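If there are many link structure files to check, the manual search can be approximated with a short script. The sketch below assumes the layout described above (link URLs on tab-indented lines, followed by the unindented URL of the page they came from); the function name and sample lines are hypothetical, and real files may contain details not covered here.

```python
def find_source_pages(lines, target):
    """Return the page URLs whose tab-indented link lines include `target`.

    Assumed layout: each group of tab-indented link URLs is followed by
    the unindented URL of the page that contains those links.
    """
    sources = []
    seen_target = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("\t"):
            # A link line: remember whether the target is in this group.
            if line.strip() == target:
                seen_target = True
        else:
            # A page line: it closes the group of links listed above it.
            if seen_target:
                sources.append(line.strip())
            seen_target = False
    return sources

# Hypothetical excerpt in the assumed layout.
sample = [
    "\t.pma.co.uk\n",
    "\t.example.co.uk\n",
    ".scit.wlv.ac.uk/~cm1993/penn/p.htm\n",
    "\t.other.co.uk\n",
    ".scit.wlv.ac.uk/index.htm\n",
]
pages = find_source_pages(sample, ".pma.co.uk")
```

To run it on a real file, pass an open file object in place of `sample`, since the function only needs an iterable of lines.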

Click on the DATA CLEANSING button at the bottom of the screen. A new file will appear. To this we must add the following two lines, at the bottom.


The first line, in square brackets, identifies the domain name of the site that pages will be excluded from (note that only the domain name should be entered, and any initial www. should be chopped off). The second line is an instruction to exclude all URLs beginning with http://www.scit.wlv.ac.uk/~cm1993/penn so that the whole Penn UK site will be removed.
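The effect of a "beginning with" rule can be illustrated as a simple prefix test. This is a sketch of the general technique only: how SocSciBot Tools matches URLs internally is not documented here, and the function names and example URLs other than the Penn UK prefix are hypothetical.

```python
def is_banned(url, banned_prefixes):
    """True if `url` starts with any of the banned prefixes."""
    return any(url.startswith(prefix) for prefix in banned_prefixes)

def remove_banned(urls, banned_prefixes):
    """Return only the URLs that do not match a banned prefix."""
    return [u for u in urls if not is_banned(u, banned_prefixes)]

banned = ["http://www.scit.wlv.ac.uk/~cm1993/penn"]
urls = [
    "http://www.scit.wlv.ac.uk/~cm1993/penn/p.htm",  # inside the mini-site
    "http://www.scit.wlv.ac.uk/index.htm",           # hypothetical other page
]
kept = remove_banned(urls, banned)
```

Note that a plain prefix test also excludes any URL that merely starts with the same characters (for example a directory named pennsylvania under ~cm1993), so the prefix should be chosen with care.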

Now save the file and click on the Yes button. The Penn UK pages will be removed from the data set and the basic statistics will be recalculated. Check that the pages have really gone by viewing the link structure file and the Unknown external links with counts file.

Interpreting the link counts

In order to make inferences about link counts, such as that they represent scholarly communication, you need to take some steps to verify your hypothesis. See the book Link Analysis: An Information Science Approach or the journal article Interpreting social science link analysis research: A theoretical framework for a discussion of this complex issue. One standard step that can be taken, however, is to classify a random sample of links so that some general inferences can be made about appropriate interpretations of link counts. SocSciBot Tools provides a random sample of links for exactly this purpose: the file random links 20 per site, found in the Reports and/or Random Links for the whole project reports (with current options) section, which is selected with the drop-down selection box. There are 20 links per site crawled in this file, arranged at random within each site.

As a pilot study, we might want to check 40 links altogether (i.e. visit the source page and target page of each link) and classify the reason for the link creation according to an appropriate set of classification criteria. For example, a very simple scheme could be (a) relates to academic purposes, and (b) does not relate to academic activities. In contrast to this simple example, however, you need to give detailed criteria for each of your categories. Since there are 4 sites and we decided upon 40 links, we can take the first 10 random links for each site from the file random links 20 per site and classify them.
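The arithmetic of the pilot sample (4 sites, 10 links each, 40 in total) and the final tally of categories can be sketched as follows. The data structure, site names, and category labels are hypothetical; the links would in practice be copied by hand from the random links file, and the classification itself is a manual judgment.

```python
from collections import Counter

def pilot_sample(links_by_site, per_site=10):
    """Take the first `per_site` links listed for each site.

    `links_by_site` maps a site name to its list of (already randomly
    ordered) links. This shape is an assumption made for the sketch.
    """
    return {site: links[:per_site] for site, links in links_by_site.items()}

def tally(classifications):
    """Count how many links were assigned to each category."""
    return Counter(classifications)

# Hypothetical data: 4 sites with 20 placeholder links each.
links_by_site = {
    f"site{i}": [f"site{i}-link{j}" for j in range(20)] for i in range(1, 5)
}
sample = pilot_sample(links_by_site)
total = sum(len(links) for links in sample.values())

# Hypothetical manual classifications for a handful of links.
result = tally(["academic", "academic", "non-academic"])
```

Because the file is already in random order within each site, taking the first 10 entries per site is equivalent to drawing a random sample of 10 from the 20 listed.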

Obtaining the important link information for your research question

This is the easy part: try out the different reports available from the reports area and think about what kinds of research questions they could be used to explore.