Making Large International Hyperlink Networks
These instructions describe how to make, display and analyse large hyperlink networks using the web crawler SocSciBot. They are suitable for networks of any practical size, such as up to 1000 web sites. Some of the information will be redundant and can be skipped for small collections of web sites (fewer than about 15).
Please try the quick network instructions first on a small network (<15 web sites) before trying this section.
Creating the set of sites to crawl
It is important to get a good list of web sites to crawl before starting. Creating this list will take time. For each web site, identify its domain name if it has one. If it shares a domain name then identify the key initial part of the URL that identifies the web site. This is normally a path, such as www.a.com/sportshop/. In some cases the web site is a single page. If this is the case, record the full URL instead. Use all this information to make a plain text file list of the URLs, paths and/or domain names of the sites, one per line. The file should be made in a program like Notepad (in the Accessories program group accessed via the Start menu).
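If the candidate sites have been collected in a spreadsheet or script rather than typed by hand, the list file can be tidied automatically. The following is a minimal sketch, not part of SocSciBot: the function names and the output file name sites.txt are hypothetical. It only trims whitespace, drops blank lines and exact duplicates, and writes one entry per line as plain text.

```python
from pathlib import Path


def build_site_list(raw_entries):
    """Trim entries, drop blanks and exact duplicates, keep the original order."""
    seen = set()
    cleaned = []
    for raw in raw_entries:
        entry = raw.strip()
        if entry and entry not in seen:
            seen.add(entry)
            cleaned.append(entry)
    return cleaned


def write_site_list(path, entries):
    """Write the entries to a plain text file, one per line."""
    Path(path).write_text("\n".join(entries) + "\n", encoding="utf-8")
```

For example, write_site_list("sites.txt", build_site_list(candidates)) would save a cleaned list, where candidates is your own list of domain names, paths and full URLs. The entries themselves are left exactly as typed.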
Crawling the sites
Crawling may take 1-2 weeks if you have hundreds of web sites, so make sure that you are using a computer that will have an uninterrupted connection to the internet for the whole of this period.
Start SocSciBot (download from http://socscibot.wlv.ac.uk/), enter a name for a new project in the Wizard Step 1 screen, and select the Download Multiple Sites option in Wizard Step 2 before clicking the Crawl site with SocSciBot option. This takes you to one of the main crawling screens.
From the main crawl screen, select the Crawl Web Sites to a Maximum Depth option and click the Load List of URLs to Crawl button and select the list of URLs of the home pages of your web sites. You will also be asked “Level for Web Site”, with the options, “0=site, 1=domain, 2=full URL up to ?, 3=full URL”. This relates to the type of URLs in your list and it tells SocSciBot how to decide whether new pages it finds when it crawls are part of the web sites crawled or whether they are part of different web sites.
- If ALL the URLs in your list are large sites that each have their own registered domain name (e.g., wlv.ac.uk, bbc.co.uk, cnn.com) then choose 0.
- Otherwise, if all web sites have their own domain names (e.g., www.wlv.ac.uk, news.blogger.com, nice.small.web.ca) then choose 1.
- Otherwise, if all web sites can be identified by their URLs and no web site URLs are defined by text after any “?” in the URL then choose 2.
- Otherwise, choose 3.
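To see how the four levels differ, it can help to look at which part of a URL each one treats as the identity of a web site. The sketch below is only an illustration under stated assumptions and is not SocSciBot's own logic: the function site_key and the short SECOND_LEVEL_LABELS list are hypothetical, and the level 0 rule is a rough heuristic for addresses such as wlv.ac.uk.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately short list of labels that sit just below a
# two-letter country code (as in .ac.uk or .co.uk). A real list is much longer.
SECOND_LEVEL_LABELS = {"ac", "co", "com", "edu", "gov", "net", "org"}


def site_key(url, level):
    """Return the part of the URL that identifies the web site at a given level."""
    # urlsplit only recognises the host name when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if level == 3:  # full URL
        key = host + parts.path
        return key + "?" + parts.query if parts.query else key
    if level == 2:  # full URL up to "?"
        return host + parts.path
    if level == 1:  # domain name
        return host
    if level == 0:  # site: the domain name ending (heuristic, see above)
        labels = host.split(".")
        country_style = (
            len(labels) >= 3
            and len(labels[-1]) == 2
            and labels[-2] in SECOND_LEVEL_LABELS
        )
        return ".".join(labels[-(3 if country_style else 2):])
    raise ValueError("level must be 0, 1, 2 or 3")
```

Under these assumptions, www.wlv.ac.uk at level 0 is reduced to wlv.ac.uk, whereas at level 1 it stays as www.wlv.ac.uk, so a list containing several sites that share a domain name ending needs level 1 or higher.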
Click the Crawl Above List of Sites/URLs button to start the crawling. When the crawl is finished SocSciBot will present a dialog box with crawl statistics and will shut down.
Creating the network diagram
This may take one hour for large networks.
Restart SocSciBot, select the project that you created for the crawl and then click the Analyse link structure button and answer Yes to all questions asked. The main link analysis may take an hour for a large network. When this finishes you will see the main link analysis screen.
Click the Show Site Network button. This may take ten minutes and then it will draw a network of the sites crawled, with one node (circle) per site and arrows between the sites corresponding to hyperlinks between them.
Working with the network diagram
The main features available to make the network easier to understand and to highlight its patterns are listed below. They are numbered as steps so that the recommended combinations described after the list can refer to them.
1. Lay out the nodes to reveal patterns of interconnectivity: Select Automatic from the Layout menu. The layout algorithm has a parameter that alters the appearance of the network. Sometimes a better diagram can be obtained by altering this parameter and re-running the algorithm. To do this, click the F-R tab on the right of the screen, then move the slider near the top of the tab and select Automatic from the Layout menu again to see if the network is improved. Try different values to find the one that fits best.
2. Colour nodes by Top-Level Domain (TLD): Select Colour Nodes by URL TLD from the Appearance menu. This also gives the option to create a key for the colour codes, which is a list showing the TLD associated with each colour. The key can be converted to use country names instead of TLDs with the Replace TLDs with country names in key option from the Appearance menu.
3. Override the colour scheme of individual nodes: Left-click the exact centre of the node whose appearance you want to change, then right-click to access the menu for changing the properties of the selected node. From the right-click menu select Change colour of selected nodes, choosing an unused colour, if possible. Then, while the node is still selected (highlighted in green), right-click and select Change border colour of selected nodes from the right-click menu, selecting a colour so that the colour/border colour combination will be unique to this node.
4. Override the colour scheme of groups of nodes (e.g., before automatic merging, so the nodes merge into a group rather than into separate country nodes): Do this AFTER colouring the nodes by TLD automatically. The idea is to select all the nodes in a group and then give them all the same colour and border colour, using a combination that has not been given before. This can be done in two ways.
   - List the node URLs in a plain text file, one URL per line. Select all the nodes in the list by selecting Select all nodes with description matching any line in file from the Edit menu and then selecting the list of URLs. While these nodes are selected (highlighted in green), right-click and select Change colour of selected nodes from the right-click menu, selecting an unused colour, if possible. Then, while the nodes are still selected, right-click and select Change border colour of selected nodes from the right-click menu, selecting a colour so that the colour/border colour combination will be unique to this group of nodes.
   - If all the nodes have a distinguishing unique piece of text in them, such as earthmarkets.net, then instead of selecting them via their names in a file, select them with Select all nodes with description matching text from the Edit menu and then enter the distinguishing text. After this, follow the above instructions for when the nodes are selected.
5. [advanced] Override the Top-Level Domains of individual nodes before colouring or merging by country: This is useful for web sites with a generic Top-Level Domain, such as .com, .net or .org, that would otherwise be given the wrong colour by the automatic colouring program. The easiest way to override the TLD of a web site is to edit its URL in the network by adding www.NEW-TLD// at the start. For example, if the node is bbc.com then this could be changed to www.uk//bbc.com. The new fake URL has the UK TLD. There are two ways to change the URL of a node:
   - Click on the node, edit its URL in the panel on the right-hand side of the screen and then click the Change node properties button. OR
   - [advanced] Edit the URL in the Pajek .net text file defining the network using Windows Notepad, being careful not to edit outside the quotation marks around the URL.
6. Merge nodes from the same TLD into single nodes (possibly with exceptions, such as important international sites or networks of sites): First colour the nodes so that nodes with the same TLD have the same internal colour and border colour (this can be done automatically with step 2, adjusted with steps 3 to 5 if necessary), then select Merge all groups of nodes with the same colour and border colour from the Edit menu.
7. Replace node URLs with country names based upon TLD: Select Replace URLs with TLDs, then select Replace TLDs with country names (both from the Appearance menu). Note that the replacement of URLs with country names is not complete, so you may have to edit some node names manually. In both cases, answer “No” to any question asked about using names rather than URLs.
8. Merge a group of nodes into a single node: First select a group of nodes to be merged by clicking and dragging across the nodes or by using the Edit menu options for selecting nodes. Then right-click while the nodes are selected (coloured green) and select Merge selected nodes from the right-click menu.
9. Print the network: For a standard quality picture, press the Prt Scrn (Print Screen) key, then open Paint from the Accessories program group and press Control-V to paste the screen into Paint. Edit the image to show just the network and not the rest of the computer screen. For a high quality picture, select Print from the File menu and select a high quality printer driver (e.g., your printer, or a scan/fax application). If you need a high quality 600 dpi copy then a specialist printer driver, such as TIFF Image Printer, should be bought. If the full network is not printed on the image and you are using TIFF Image Printer, select the properties tab after clicking the print button, then select the Advanced tab and change the print dimensions, using information in the network status bar.
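When many nodes need a TLD override, the [advanced] option of editing the Pajek .net file described above can be scripted rather than done by hand in Notepad. The following is a minimal sketch, assuming the file uses the usual Pajek layout in which each line of the *Vertices section holds a number followed by the URL in quotation marks. The function name override_tlds and the overrides mapping are hypothetical. Only the text inside the quotation marks is changed.

```python
import re

# A vertex line: number, opening quote, label, closing quote, anything else.
VERTEX_LINE = re.compile(r'^(\s*\d+\s+")([^"]*)(".*)$')


def override_tlds(net_lines, overrides):
    """Prefix selected node URLs with www.NEW-TLD// inside their quotation marks.

    net_lines: lines of a Pajek .net file (trailing newlines are removed).
    overrides: dict mapping a node URL to the TLD it should appear to have.
    Returns the new list of lines, without trailing newlines.
    """
    result = []
    in_vertices = False
    for raw in net_lines:
        line = raw.rstrip("\n")
        if line.lstrip().startswith("*"):
            # Section headers such as *Vertices or *Arcs are copied unchanged.
            in_vertices = line.lstrip().lower().startswith("*vertices")
            result.append(line)
            continue
        match = VERTEX_LINE.match(line) if in_vertices else None
        if match and match.group(2) in overrides:
            prefix, url, rest = match.groups()
            line = f"{prefix}www.{overrides[url]}//{url}{rest}"
        result.append(line)
    return result
```

With overrides set to {"bbc.com": "uk"}, a vertex line containing "bbc.com" would be rewritten to contain "www.uk//bbc.com", matching the manual edit described above. Work on a copy of the .net file so that the original network can be recovered.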
Assuming that you have a large network with over 50 nodes, and that these nodes come from many different countries, the following steps are recommended to create two attractive versions of the network.
Colour-coded individual nodes network diagram
This diagram consists of all the individual nodes coloured by their TLD and arranged to reveal patterns of interconnectivity between them. Follow steps 1 and 2 above for this.
If you are not happy with the TLDs of the nodes then follow steps 3 and 4 to override the colour schemes for nodes that you would like to change. Alternatively, you could try step 5 instead of steps 3 and 4, then return to step 2. See step 9 to print the network.
Merged country and international nodes network diagram
It is possible to merge nodes together into a single node. This is useful to show country-based patterns, for example by merging all web sites from the same country into a single node.
To merge the nodes, you can either merge each group of nodes from the same country separately, by using step 8, or do the merging more automatically by using step 2 (and steps 3, 4, and/or 5 if necessary) then step 6 and step 7. Once this is complete, use step 1 to lay out the network again.
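The effect of merging by country on the link counts can be sketched in a few lines. This is an illustration only, under stated assumptions: node names are host names without http://, the TLD is taken as the last label of the part before the first slash (so a fake URL such as www.uk//bbc.com counts as uk), and links between two sites in the same merged group are dropped. The function names and the keep_separate parameter are hypothetical and SocSciBot may treat these cases differently.

```python
from collections import Counter


def tld_of(site):
    """Return the last label of the host name, e.g. 'uk' for 'wlv.ac.uk'."""
    if "://" in site:
        site = site.split("://", 1)[1]  # drop a scheme such as http://
    host = site.split("/", 1)[0]  # 'www.uk//bbc.com' gives 'www.uk'
    return host.rsplit(".", 1)[-1].lower()


def merge_by_tld(links, keep_separate=()):
    """Collapse site-to-site link counts into TLD-to-TLD link counts.

    links: iterable of (source site, target site, link count).
    keep_separate: sites that stay as individual nodes (the exceptions).
    """
    merged = Counter()
    for source, target, count in links:
        src = source if source in keep_separate else tld_of(source)
        tgt = target if target in keep_separate else tld_of(target)
        if src != tgt:  # assumption: links inside a merged node are ignored
            merged[(src, tgt)] += count
    return merged
```

For example, links from wlv.ac.uk and bbc.co.uk to cnn.com would be added together into a single uk to com link, unless cnn.com is listed in keep_separate, in which case it stays as its own node.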
Analysing the network
Basic social network analysis statistics can be calculated for the network by selecting Calculate social network analysis statistics for the loaded network from the Info menu in the network drawing program. This data is most easily seen when loaded into a spreadsheet program so that the columns align properly.
The network can be analysed in Pajek or the Social Network Analysis (SNA) program UCINET. Choose Save As from the File menu and the network will be saved in Pajek format, which can then be loaded into Pajek or imported into UCINET.
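Because the saved Pajek file is plain text, it can also be read by a short script for simple checks. The sketch below assumes the common Pajek layout: a *Vertices section with a number and a quoted label on each line, followed by an *Arcs section with source number, target number and an optional weight. The function names are hypothetical, and labels containing quotation marks or apostrophes are not handled.

```python
import shlex
from collections import defaultdict


def read_pajek(lines):
    """Return (labels, arcs) from the lines of a Pajek .net file."""
    labels = {}  # vertex number -> label (the site URL)
    arcs = []    # (source number, target number, weight)
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            section = line.split()[0].lower()
            continue
        fields = shlex.split(line)  # keeps a quoted label as one field
        if section == "*vertices":
            labels[int(fields[0])] = fields[1]
        elif section == "*arcs":
            weight = float(fields[2]) if len(fields) > 2 else 1.0
            arcs.append((int(fields[0]), int(fields[1]), weight))
    return labels, arcs


def degree_table(labels, arcs):
    """Return {label: (total inlink weight, total outlink weight)}."""
    in_links = defaultdict(float)
    out_links = defaultdict(float)
    for source, target, weight in arcs:
        out_links[labels[source]] += weight
        in_links[labels[target]] += weight
    return {name: (in_links[name], out_links[name]) for name in labels.values()}
```

The resulting table gives a quick view of which sites receive and send the most links, which can be compared with the statistics produced from the Info menu.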
The network can be exported to a matrix that can be loaded into other software for analysis. For this, select Export Network to Text Matrix from the File menu. This is a matrix of the link counts in tab-separated file format and can be loaded into other software, such as SPSS, for further analysis.
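As a rough illustration of reusing the exported matrix, the sketch below turns a tab-separated matrix into a list of links. The exact layout of the exported file is an assumption here: the first row is taken to hold the target site names after an empty first cell, and each later row to start with a source site name followed by its link counts. Check the exported file and adjust before relying on it. The function name matrix_to_links is hypothetical.

```python
import csv
import io


def matrix_to_links(text):
    """Convert a tab-separated link count matrix into (source, target, count) tuples."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return []
    targets = rows[0][1:]  # assumed header row: column (target) names
    links = []
    for row in rows[1:]:
        if not row:
            continue
        source = row[0]  # assumed first cell: row (source) name
        for target, cell in zip(targets, row[1:]):
            count = int(float(cell)) if cell.strip() else 0
            if count:  # keep only cells that record at least one link
                links.append((source, target, count))
    return links
```

The text argument would be the content of the exported file, read with the encoding it was saved in.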