Making Large International Hyperlink Networks

These instructions describe how to make, display and analyse large hyperlink networks using the web crawler SocSciBot. They are suitable for networks of any practical size, such as up to 1000 web sites. Some of the information will be redundant and can be skipped for small collections of fewer than about 15 web sites.

Please try the quick network instructions first on a small network (<15 web sites) before trying this section.

Creating the set of sites to crawl

It is important to get a good list of web sites to crawl before starting. Creating this list will take time. For each web site, identify its domain name if it has one. If it shares a domain name then identify the key initial part of the URL that identifies the web site. This is normally a path, such as www.a.com/sportshop/. In some cases the web site is a single page. If this is the case, record the full URL instead. Use all this information to make a plain text file list of the URLs, paths and/or domain names of the sites, one per line. The file should be made in a program like Notepad (in the Accessories program group accessed via the Start menu).
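If the list is assembled from notes or a spreadsheet column, a short script can tidy it before crawling. The following is a minimal Python sketch, not part of SocSciBot: the function names and the file name sites.txt are hypothetical, and it only trims whitespace, drops blank lines, removes exact duplicates and writes one entry per line.

```python
def build_site_list(raw_entries):
    """Clean raw entries: trim whitespace, drop blanks, remove exact duplicates.

    The original order is kept so the file still matches your notes.
    """
    seen = set()
    cleaned = []
    for raw in raw_entries:
        entry = raw.strip()
        if entry and entry not in seen:
            seen.add(entry)
            cleaned.append(entry)
    return cleaned


def write_site_list(entries, filename):
    """Write the entries to a plain text file, one per line."""
    with open(filename, "w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(entry + "\n")


if __name__ == "__main__":
    # Example entries: a shared-domain path, a domain name and a single page.
    raw = ["www.a.com/sportshop/", "", "  www.b.org ", "www.a.com/sportshop/",
           "www.c.net/people/home.html"]
    write_site_list(build_site_list(raw), "sites.txt")  # hypothetical file name
```

The resulting file can be opened in Notepad to check it by eye before it is loaded into SocSciBot.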

Crawling the sites

This may take 1-2 weeks if you have hundreds of web sites, so make sure that you use a computer that will have an uninterrupted connection to the internet for the whole of this period.
Start SocSciBot (download from http://socscibot.wlv.ac.uk/), enter a name for a new project in the Wizard Step 1 screen, and select the Download Multiple Sites option in Wizard Step 2 before clicking the Crawl site with SocSciBot option. This takes you to one of the main crawling screens.
From the main crawl screen, select the Crawl Web Sites to a Maximum Depth option, click the Load List of URLs to Crawl button, and select the list of URLs of the home pages of your web sites. You will also be asked for the “Level for Web Site”, with the options “0=site, 1=domain, 2=full URL up to ?, 3=full URL”. This should match the type of URLs in your list: it tells SocSciBot how to decide whether each new page it finds while crawling is part of one of the web sites being crawled or part of a different web site.

Click the Crawl Above List of Sites/URLs button to start the crawling. When the crawl is finished SocSciBot will present a dialog box with crawl statistics and will shut down.

Creating the network diagram

This may take one hour for large networks.
Restart SocSciBot, select the project that you created for the crawl and then click the Analyse link structure button and answer Yes to all questions asked. The main link analysis may take an hour for a large network. When this finishes you will see the main link analysis screen.
Click the Show Site Network button. This may take ten minutes and then it will draw a network of the sites crawled, with one node (circle) per site and arrows between the sites corresponding to hyperlinks between them.

Working with the network diagram

The main features available to make the network easier to understand and to highlight the patterns are the following. These can be used to create a network as listed below their descriptions.

Assuming that you have a large network with over 50 nodes, and that these nodes come from many different countries, the following steps are recommended to create two attractive versions of the network.

Colour-coded individual nodes network diagram

This diagram consists of all the individual nodes coloured by their TLD and arranged to reveal patterns of interconnectivity between them. Follow steps 1 and 2 above for this.
If you are not happy with the TLDs of the nodes then follow steps 3 and 4 to override the colour schemes for nodes that you would like to change. Alternatively, you could try step 5 instead of steps 3 and 4, then return to step 2. See step 9 to print the network.
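Before deciding whether any colours need overriding, it can help to see which TLD each site in your list falls under. The following Python sketch is an illustration only and is not how SocSciBot assigns colours; the function names are hypothetical, and it takes the TLD to be simply the text after the last dot of the host name.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def top_level_domain(site):
    """Return the text after the last dot of the host name, e.g. 'uk' or 'com'."""
    entry = site.strip().lower()
    # urlsplit only recognises the host if the entry has a scheme or starts with '//'.
    if "://" not in entry:
        entry = "//" + entry
    host = urlsplit(entry).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def group_by_tld(sites):
    """Map each TLD to the list of sites that end with it."""
    groups = defaultdict(list)
    for site in sites:
        groups[top_level_domain(site)].append(site)
    return dict(groups)


if __name__ == "__main__":
    sites = ["www.a.com/sportshop/", "http://socscibot.wlv.ac.uk/", "www.b.de"]
    for tld, members in sorted(group_by_tld(sites).items()):
        print(tld, members)
```

Sites under generic TLDs such as com or org are the most likely candidates for a manual override, because the TLD says nothing about their country.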

Merged country and international nodes network diagram

It is possible to merge nodes together into a single node. This is useful to show country-based patterns, for example by merging all web sites from the same country into a single node.
To merge the nodes, you can either merge each group of nodes from the same country separately, by using step 8, or do the merging more automatically by using step 2 (and steps 3, 4, and/or 5 if necessary) then step 6 and step 7. Once this is complete, use step 1 to lay out the network again.
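The effect of merging can be pictured as adding up link counts. The Python sketch below is an illustration under stated assumptions, not SocSciBot's own procedure: it assumes the site-level links are held in a dictionary keyed by (source domain, target domain), uses the TLD as a stand-in for the country, and the names tld_of and merge_by_tld are hypothetical.

```python
from collections import Counter


def tld_of(domain):
    """Return the text after the last dot of a bare domain name."""
    return domain.strip().lower().rstrip(".").rsplit(".", 1)[-1]


def merge_by_tld(site_links, keep_internal=False):
    """Add up site-to-site link counts into TLD-to-TLD link counts.

    Links between two sites with the same TLD would become a loop on the
    merged node; they are dropped unless keep_internal is True.
    """
    merged = Counter()
    for (source, target), count in site_links.items():
        source_tld, target_tld = tld_of(source), tld_of(target)
        if source_tld == target_tld and not keep_internal:
            continue
        merged[(source_tld, target_tld)] += count
    return dict(merged)


if __name__ == "__main__":
    links = {
        ("www.a.ac.uk", "www.b.de"): 3,
        ("www.c.ac.uk", "www.b.de"): 2,
        ("www.a.ac.uk", "www.c.ac.uk"): 5,
        ("www.b.de", "www.a.ac.uk"): 1,
    }
    print(merge_by_tld(links))
```

In this example the two UK sites collapse into one node, so their separate links to the German site are combined into a single heavier arrow.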

Analysing the network

Basic social network analysis statistics can be calculated for the network by selecting Calculate social network analysis statistics for the loaded network from the Info menu in the network drawing program. This data is most easily seen when loaded into a spreadsheet program so that the columns align properly.
The network can be analysed in Pajek or the Social Network Analysis (SNA) program UCINET. Choose Save As from the File menu and it will be saved in Pajek format, which can then be loaded into Pajek or imported into UCINET.
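Because the Pajek format is plain text, the saved file can also be read with a short script. The following Python sketch handles only the basic layout of a Pajek .net file (a *Vertices section of numbered labels followed by *Arcs or *Edges lines giving source, target and an optional weight); the exact contents of a file saved by SocSciBot are an assumption here, and parse_pajek is a hypothetical helper name.

```python
import re

# A vertex line is a number followed by a quoted or unquoted label;
# anything after the label (such as coordinates) is ignored.
VERTEX_LINE = re.compile(r'^(\d+)\s+(?:"([^"]*)"|(\S+))')


def parse_pajek(text):
    """Return (labels, arcs) from the text of a basic Pajek .net file.

    labels maps vertex number to label; arcs is a list of
    (source, target, weight) tuples, with weight 1.0 when none is given.
    """
    labels = {}
    arcs = []
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("%"):
            continue  # skip blank lines and comments
        if stripped.startswith("*"):
            section = stripped.split()[0].lower()
            continue
        if section == "*vertices":
            match = VERTEX_LINE.match(stripped)
            if match:
                quoted, bare = match.group(2), match.group(3)
                labels[int(match.group(1))] = quoted if quoted is not None else bare
        elif section in ("*arcs", "*edges"):
            fields = stripped.split()
            if len(fields) >= 2:
                weight = float(fields[2]) if len(fields) > 2 else 1.0
                arcs.append((int(fields[0]), int(fields[1]), weight))
    return labels, arcs


if __name__ == "__main__":
    sample = '*Vertices 2\n1 "www.a.com"\n2 "www.b.de"\n*Arcs\n1 2 3\n'
    print(parse_pajek(sample))
```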
The network can be exported to a matrix that can be loaded into other software for analysis. For this, select Export Network to Text Matrix from the File menu. This is a matrix of the link counts in tab-separated file format and can be loaded into other software, such as SPSS, for further analysis.
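As a rough check before moving to a statistics package, the exported matrix can be summarised with a few lines of Python. This sketch assumes a layout that should be verified against your own file: a first row of target site names with an empty leading cell, then one row per source site whose first cell is the site name. The function names are hypothetical.

```python
import csv
import io


def read_link_matrix(text):
    """Read a tab-separated link matrix into {source: {target: count}}.

    Assumed layout: header row of target names (first cell empty),
    then one row per source site starting with its name.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    targets = rows[0][1:]
    matrix = {}
    for row in rows[1:]:
        counts = [int(cell) if cell.strip() else 0 for cell in row[1:]]
        matrix[row[0]] = dict(zip(targets, counts))
    return matrix


def link_totals(matrix):
    """Return (outlinks per source site, inlinks per target site)."""
    out_links = {source: sum(row.values()) for source, row in matrix.items()}
    in_links = {}
    for row in matrix.values():
        for target, count in row.items():
            in_links[target] = in_links.get(target, 0) + count
    return out_links, in_links


if __name__ == "__main__":
    sample = "\ta.com\tb.de\na.com\t0\t3\nb.de\t1\t0\n"
    print(link_totals(read_link_matrix(sample)))
```

If the totals do not match what the network diagram shows, the assumed layout is probably wrong for your export and the header handling should be adjusted.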

Glossary

Pajek .net file: The network created by SocSciBot is automatically saved in the format used by the network drawing program Pajek. This is a plain text format with file name extension .net.