Tutorial 4: Summary of how to use SocSciBot 4 for a link analysis research project

Overview

This tutorial summarises the key stages of a link analysis research project, from the initial crawl to the final analysis of the link data, and gives you the key information needed to use SocSciBot 4 for your own research project. Please work through Tutorials 1 and 2 before starting this one.

Setting up project goals

The first stage is deciding which web sites to crawl and collecting the URLs of the home pages of these sites. This stage depends entirely on your research question. However, you may wish to use search engine advanced searches to estimate how many pages there are in each site that you want to crawl so that you know if it is practical to crawl your sites. Each crawl will take a while if the sites contain over 100 pages, so please budget enough time to complete your crawling.

It is a good idea to undertake a small-scale pilot study before a full-scale study. The purpose of the pilot study is to assess how likely it is that a full-scale study will give you the information that you need for your research.

Setting up a new SocSciBot project and crawling the sites

Start by setting up a new project at the initial SocSciBot 4 startup screen, giving it a name appropriate to your research. Do not add the crawls to an existing project, because this will jumble up the link analysis results. To set up a new project, instead of clicking on an existing project name in the first screen, enter the new name in the box at the bottom and click on the Create New Project button. Once you have selected the new project, you are ready to start crawling the sites by entering their home page URLs in the second SocSciBot screen. For each new site in your data set, you will need to shut down SocSciBot at the end of the previous crawl and then start it up again. Consult the instructions in Tutorial 1 if you have difficulty crawling your sites.

Data cleansing

Recall (Tutorial 2) that, "The first, and most time-consuming step of data analysis is to identify and eliminate anomalies from the data set. Ideally, each page downloaded needs to be checked to make sure that it matches the criteria for your research project." The recommended method for large collections of pages is to

  1. construct a set of criteria for inclusion or exclusion of pages, and
  2. investigate links to the most highly targeted pages to see if these come from an undesired source.

This will not eliminate all unwanted pages, but it will eliminate the most influential sources of links. From the SocSciBot Tools main reports menu, the most highly targeted pages in the data set are listed in the standard reports: Selected external links with counts (all links between sites in the data set) and Unselected external links with counts (links to sites outside the data set). Use these to identify the most common link targets.
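If a counts report is large, you may find it quicker to extract the top targets with a short script rather than by eye. The sketch below assumes each report line holds a target URL followed by a whitespace-separated count; check this against your own report files, as the exact layout may vary between SocSciBot versions.

```python
from collections import Counter

def top_link_targets(lines, n=10):
    """Return the n most frequently targeted URLs from a
    'links with counts' style report.

    `lines` is any iterable of report lines (e.g. an open file).
    Assumes each line ends with a whitespace-separated count.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Keep only lines that really end in an integer count
        if len(parts) >= 2 and parts[-1].isdigit():
            counts[" ".join(parts[:-1])] += int(parts[-1])
    return counts.most_common(n)

# Usage: with open("Selected external links with counts.txt") as f:
#            print(top_link_targets(f))
```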

Finding some source pages for the link targets is more difficult. You could try constructing appropriate advanced search engine link searches or searching for the link targets in each of the link structure files for the project, found by selecting the Raw Data tab. Remember that you can use Notepad’s Find facility (Edit|Find) to search for the URL.
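If you have many link structure files, searching each one in Notepad becomes tedious. A small script can scan the whole raw data folder for a target URL instead. This is a sketch under the assumption that the link structure files are plain text, one link per line; adjust the folder path and glob pattern to match your own project's layout.

```python
import glob
import os

def find_link_sources(raw_data_dir, target_url):
    """Scan every .txt file in a folder for lines mentioning
    target_url, returning (file, line number, line text) tuples.
    """
    hits = []
    for path in glob.glob(os.path.join(raw_data_dir, "*.txt")):
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if target_url in line:
                    hits.append((path, lineno, line.strip()))
    return hits
```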

Collections of pages can be excluded using SocSciBot Tools' data cleansing feature: the Banned List. Before using this feature, which may take SocSciBot Tools a while to process, follow two steps:

  1. Take a backup copy of your data set (e.g. compress the project folder into a zip file) in case something goes wrong.
  2. Collect a list of all source URLs that you wish to exclude before starting up the banned list feature. This means you only have to use the banned list feature once.
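For step 1, any zip utility will do; a one-function sketch using Python's standard library is shown below. The project path passed in is whatever folder your SocSciBot project lives in on your machine.

```python
import shutil

def backup_project(project_dir, archive_name):
    """Compress a project folder into <archive_name>.zip so the raw
    crawl data can be restored if data cleansing goes wrong.
    Returns the path of the archive that was created."""
    return shutil.make_archive(archive_name, "zip", project_dir)

# Usage (hypothetical path -- substitute your own project folder):
# backup_project("C:/SocSciBot/projects/myproject", "myproject-backup")
```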

Select Add Extra Banned Pages from the Banned menu. A new file will appear. You must add the following lines at the bottom of this file to tell SocSciBot Tools which pages in your crawl should be excluded. Below is an example which is then explained.

[wlv.ac.uk]
http://www.scit.wlv.ac.uk/~cm1993/penn/
http://www.wlv.ac.uk/linuxgazette/

The first line, in square brackets, identifies the domain name of the site that pages will be excluded from (note that only the domain name should be entered, and any initial www. should be chopped off). All lines below it, until the next square-bracketed line, are interpreted by SocSciBot Tools as instructions to exclude all pages whose URLs begin with that text (i.e. http://www.scit.wlv.ac.uk/~cm1993/penn/ or http://www.wlv.ac.uk/linuxgazette/ in the above example) from crawls of sites containing wlv.ac.uk in their domain name. In the above example, the whole penn and linuxgazette directories, and any subdirectories of these, will be removed.
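The matching rule just described can be sketched in code, which may help you predict what a banned list entry will remove before running it. This is an illustrative reconstruction of the rule as documented above, not SocSciBot Tools' own implementation.

```python
def parse_banned_list(text):
    """Parse banned-list text into a mapping from site domain
    (the [bracketed] lines) to the URL prefixes listed under it."""
    banned, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            banned.setdefault(current, [])
        elif current is not None:
            banned[current].append(line)
    return banned

def is_excluded(banned, site_domain, url):
    """True if url begins with a banned prefix registered for any
    bracketed domain contained in the crawled site's domain name."""
    for domain, prefixes in banned.items():
        if domain in site_domain:
            if any(url.startswith(p) for p in prefixes):
                return True
    return False
```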

Check that the pages have really gone by viewing the link structure files, the Selected external links with counts file and the Unselected external links with counts file.

Interpreting the link counts

Recall (Tutorial 2) that, "In order to make inferences about link counts, such as that they represent scholarly communication, you need to take some steps to verify your hypothesis." See the book Link Analysis: An Information Science Approach or the journal article Interpreting social science link analysis research: A theoretical framework for a discussion of this complex issue. One standard step that can be taken, however, is to classify a random sample of links so that some general inferences can be made about appropriate interpretations of link counts. SocSciBot Tools provides a random sample of links for exactly this purpose: the file random links 20 per site in the Extra Reports from Link Options tab.

You need to construct an appropriate classification scheme for your links and classify a sample of them; for example, you may wish to classify 40 links for a pilot study and 160 for a full study. Give detailed criteria for each of your categories. The recommended approach is to start with an intuitive classification scheme based upon visiting a few links, and then extend the scheme as necessary when links are found that do not match the original categories.
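Once the sample is classified, you will usually want the proportion of links in each category, ideally with an indication of uncertainty. The sketch below uses a simple normal-approximation 95% confidence interval, which is a rough but common choice for sample sizes in the 40-160 range; your own analysis may call for a different method.

```python
from collections import Counter
import math

def classify_summary(labels):
    """Summarise a hand-classified sample of links.

    `labels` is a list of category names, one per classified link.
    Returns {category: (proportion, ci_low, ci_high)} using a
    normal-approximation 95% confidence interval.
    """
    n = len(labels)
    out = {}
    for cat, k in Counter(labels).items():
        p = k / n
        half = 1.96 * math.sqrt(p * (1 - p) / n)
        out[cat] = (p, max(0.0, p - half), min(1.0, p + half))
    return out
```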

Obtaining the important link information for your research question

The steps to take to obtain the link data that you require depend primarily upon the types of links that you are interested in. If you are interested in the links between the sites that you have crawled, then you can use the reports automatically provided by SocSciBot Tools. If you are interested in any other types of links, then you will need to register the type that you are interested in before requesting new reports. Some types of links are automatically catered for by SocSciBot Tools options, and other types can be accessed only by using the advanced options. The following list summarises the situation.

  1. Links between sites crawled. This is the default setting for SocSciBot Tools: use the reports provided.
  2. All links in the data set matching one or more of the following can be selected using the subproject selection wizard.
    1. Links within each site crawled (i.e. site self-links)
    2. Links to all sites outside of the crawled set
    3. NOT links to home pages
    4. Links counted using a special counting method (Alternative Document Model) such as links between sites or links between directories or domains.
  3. Links to specified domains, e.g. all links to the .edu domain, or all links to the .co.uk domain, can be selected using the advanced features of SocSciBot Tools, but reports from these will not be calculated automatically: you will have to learn the advanced features of SocSciBot. Alternatively, if there are not too many pages crawled, you may be able to manually edit the link structure files to remove the types of links or pages that you are not interested in.

Note that for 2 above, the subproject selection wizard is found via the Link Type Options menu, Select Types of Links to Include in Reports option.
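To make the Alternative Document Model (point 2.4 above) concrete: instead of counting every page-to-page link, links are aggregated so that each pair of documents at a coarser level (e.g. directory or domain) counts at most once, damping the effect of one page or site hosting thousands of repeated links. The sketch below illustrates this idea; it is not SocSciBot Tools' own counting code, and the exact counting rules used by SocSciBot may differ.

```python
from urllib.parse import urlsplit

def adm_link_count(links, level="domain"):
    """Count links under a simple Alternative Document Model.

    `links` is an iterable of (source_url, target_url) pairs;
    `level` is "page", "directory", or "domain". Each distinct
    (source unit, target unit) pair counts once; self-links within
    a unit are dropped.
    """
    def unit(url):
        parts = urlsplit(url)
        if level == "domain":
            return parts.netloc
        if level == "directory":
            # Everything up to and including the last '/' of the path
            return parts.netloc + parts.path.rsplit("/", 1)[0] + "/"
        return url  # page level: no aggregation
    pairs = {(unit(s), unit(t)) for s, t in links}
    return sum(1 for s, t in pairs if s != t)
```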