Tutorial 2: Introduction to SocSciBot 4
This tutorial goes through all the stages of a very small SocSciBot project, from the initial crawl to analysing the link data. Working through this project is the easiest way to learn what SocSciBot can do.
Step 1: Installing SocSciBot 4
- Go to the SocSciBot web site http://socscibot.wlv.ac.uk/ and, if you agree with the conditions of use, follow the link to download SocSciBot 4. Save SocSciBot 4 somewhere with plenty of storage space for data. This will typically be your computer's hard drive, e.g. the C: drive.
Step 2: Installing Pajek
If you want to produce network diagrams with SocSciBot data, it is recommended that you install Pajek. Do this before starting SocSciBot for the first time: SocSciBot looks for Pajek when it is first started, and will not find Pajek if it is installed after SocSciBot is first run.
- Go to the Pajek home page http://vlado.fmf.uni-lj.si/pub/networks/pajek/ and download and install the latest version of Pajek.
Step 3: Crawling a first site with SocSciBot
- Start SocSciBot 4 by double clicking on the file called either SocSciBot4 or SocSciBot4.exe where you saved it on your computer. This should produce a dialog box similar to the one below. Note - this only happens the first time you start SocSciBot.
- Confirm that the folder chosen by SocSciBot 4 to store your data is acceptable by clicking OK. Also enter your correct email address. It will be used to email the webmasters of any sites that you crawl. This is ethical practice and may also save you from getting into trouble if a webmaster is unhappy with you crawling their site: they can email you directly instead of emailing your boss or network manager. You can also enter a message to be included in the email to give the purpose of the crawl. You may wish to include the URL of a page with additional information about your project. Finally, answer any questions about the location of Microsoft Excel and Pajek.
- Enter small test as the name of the project at the bottom of the next dialog box, Wizard Step 1, and then click on the start new project button. All crawls are grouped together into projects. This allows you to have different named groups of crawls which are analysed separately.
- In the Wizard Step 2 dialog box, enter http://linkanalysis.wlv.ac.uk/ as the starting URL of the site to crawl, and then click the Crawl Site with SocSciBot button.
- The crawl is ready to go. Click the Crawl Site button. After half a minute or so, the crawl should end.
- Click OK to shut down SocSciBot when the crawl is complete. You have now crawled all pages on the http://linkanalysis.wlv.ac.uk site. Before doing some interesting simple analyses, the next stages will crawl two more sites.
You can read information about the crawl in the title bar at the top during the crawl and also at the end of the crawl.
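To give a feel for what a crawler does with each page it downloads, here is a minimal Python sketch of the link-extraction step. It is not SocSciBot's own code. The names LinkCollector, extract_links and split_by_site are hypothetical, and the sketch assumes that a "site" is simply everything sharing the host name of the starting URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html):
    """Return absolute URLs for all links found in the page."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.hrefs]


def split_by_site(start_url, links):
    """Separate links that stay on the start URL's host from the rest."""
    site = urlparse(start_url).netloc.lower()
    internal, external = [], []
    for link in links:
        host = urlparse(link).netloc.lower()
        (internal if host == site else external).append(link)
    return internal, external


# A made-up page fragment, used only to exercise the functions.
page = ('<a href="/about.html">About</a> '
        '<a href="http://cybermetrics.wlv.ac.uk/">Cybermetrics</a>')
links = extract_links("http://linkanalysis.wlv.ac.uk/", page)
internal, external = split_by_site("http://linkanalysis.wlv.ac.uk/", links)
```

In a crawl of a single site, the internal links would be queued for downloading, while the external links would only be recorded for the later link analysis.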
Step 4: Crawling two more sites with SocSciBot 4
- Start up SocSciBot again by double clicking on the SocSciBot4 or SocSciBot4.exe file where you saved it on your computer. This should take you straight through to Wizard Step 1. Click on small test in the project list to select this as the project to add another crawl to.
- Enter http://cybermetrics.wlv.ac.uk/ as the URL of the second site to crawl in the Wizard Step 2 dialog box (not shown), and then click the Crawl Site with SocSciBot button to go to the main crawling screen.
- Click on the Crawl Site button on the next screen (not shown) and wait for the crawler to finish.
- Click OK to end the crawl.
- Repeat steps 1 to 4 for the URL http://socscibot.wlv.ac.uk/
- You have now successfully crawled three web sites and are ready to analyse them!
Step 5: Viewing link analysis reports about the project of three sites with SocSciBot Tools
- Start up SocSciBot Tools by double clicking on the SocSciBot4 or SocSciBot4.exe file again. This should take you straight through to Wizard Step 1. Click on small test to select this project to analyse, exactly as you have done twice before.
- Select Analyse LINKS in Project with SocSciBot Tools from the Wizard Step 2 to start the link analysis process.
- You will be asked if you want to calculate the link analysis reports for the project (the three web sites crawled). Answer Yes to this question.
- Next you will be asked if you want to standardise home page file names in your data. This improves the results by treating different versions of a web site home page as the same for the analysis. Click Yes to standardise home page file names and then wait a few seconds for the reports to be calculated.
- After a few seconds, the reports will have been calculated and you can view them using the tabbed sections in the lower half of the screen.
- Click on All external links (near the top of the bottom left list). More information about it will be displayed on the right of the screen. Then click on View report to see a list of URLs targeting pages outside of each site (site outlinks). Try the same with all of the reports and try to work out what they contain. Notice that full URLs are not normally given: the initial http:// and www are chopped off to save space. Try reading the same report in your web browser by clicking View in Internet Explorer. This is the same information, but tidied up a bit. If you have Excel on your computer, you will sometimes get extra buttons that allow you to view the reports in Excel. These reports should contain the link information needed for most link analysis investigations. [note: some of the report names are wrong in the list below]
- A key report is ADM count summary, so click on this one and then click on the View in Excel button if you have it (otherwise click on the View report button). This shows the count of links to each site from all the other sites in the project, and the count of links from this site to the other sites in the project. These numbers are reported for each of four ADMs. Most people will only need the file ADM (i.e. standard link counting), which is the Page inlinks column (f-to in the old version of the program) and the Page outlinks column (f-from in the old version of the program). For example, reading these two columns for the linkanalysis.wlv.ac.uk row, there are two links to linkanalysis.wlv.ac.uk from the other two sites, but five links from linkanalysis.wlv.ac.uk to the other two sites.
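The inlink and outlink counts in this step can be illustrated with a short Python sketch. It is not the SocSciBot Tools implementation. It assumes that standard link counting means counting distinct page-to-page links between different sites, and that home page standardisation maps a small set of file names (the hypothetical HOME_PAGE_NAMES list below) onto the bare directory URL. The function names and the sample links are made up for illustration and are not results from the tutorial crawl.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed home page file names; the names SocSciBot Tools treats as
# equivalent may differ.
HOME_PAGE_NAMES = {"index.html", "index.htm", "default.htm", "default.html"}


def normalise(url):
    """Return (site, page) with the scheme dropped and home pages standardised."""
    parsed = urlparse(url)
    site = parsed.netloc.lower()
    path = parsed.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in HOME_PAGE_NAMES:
        path = head + "/"
    return site, site + path


def count_site_links(links):
    """Count distinct page-to-page links between different sites."""
    inlinks, outlinks = Counter(), Counter()
    seen = set()
    for source, target in links:
        src_site, src_page = normalise(source)
        tgt_site, tgt_page = normalise(target)
        if src_site == tgt_site or (src_page, tgt_page) in seen:
            continue  # skip site self-links and repeated page pairs
        seen.add((src_page, tgt_page))
        outlinks[src_site] += 1
        inlinks[tgt_site] += 1
    return inlinks, outlinks


# Illustrative links only.
links = [
    ("http://linkanalysis.wlv.ac.uk/index.html", "http://cybermetrics.wlv.ac.uk/"),
    # Same page pair as above once home page names are standardised:
    ("http://linkanalysis.wlv.ac.uk/", "http://cybermetrics.wlv.ac.uk/index.html"),
    ("http://linkanalysis.wlv.ac.uk/", "http://socscibot.wlv.ac.uk/"),
    ("http://socscibot.wlv.ac.uk/", "http://linkanalysis.wlv.ac.uk/"),
    # Site self-link, ignored when counting links between sites:
    ("http://socscibot.wlv.ac.uk/", "http://socscibot.wlv.ac.uk/help.html"),
]
inlinks, outlinks = count_site_links(links)
```

With these sample links, the first two entries collapse into a single link after standardisation, which is the effect the standardise home page file names option is meant to achieve.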
Step 6: Viewing a network diagram
Network diagrams for the links between sites crawled can be viewed with SocSciBot 4 Network or Pajek, if it is on your system.
- Select the tab: Network diagram for the whole project. If no file is listed on the left of the screen, click the Re/Calculate Network button. When the name single.combined.full appears click on this to see it drawn as a network in SocSciBot 4 Network.
- The initial network is arranged at random. To see a better network arrangement, select Fruchterman Reingold (a network arrangement program) from the Layout menu.
- You can rearrange the network by clicking on nodes and dragging them around. Try this, and also try the options in the tab on the right hand side of the screen to make the nodes and arrows bigger and smaller. Also, select some nodes by clicking and dragging across them and then right click to activate a menu of properties that can be changed. Change the colour of the selected nodes to yellow and try out some other changes.
- See also the online documentation for SocSciBot Network.
- [Skip to Step 7 if you are not interested in using Pajek] To view the site network in Pajek (if Pajek is on your system), select the View Network in Pajek radio button below the single.combined.full file and then click on single.combined.full to view it in Pajek.
- Data for the network should now be loaded into Pajek. To view the network, select Draw from the Draw menu in Pajek.
- If the network does not have labels (site domain names), select the Options menu, Mark Vertices using, and Labels (or just Control-L). This should give a network of the inter-site links, excluding the internal site links.
- To get an improved layout of the network diagram, try the Kamada-Kawai positioning algorithm: select Layout, Energy, Kamada-Kawai, Free and then view the result.
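Pajek reads networks from plain-text .net files, in which a numbered vertex list is followed by a list of arcs. As a rough illustration of the kind of data passed to Pajek, the Python sketch below builds such a file from inter-site link counts. The function name to_pajek_net and the counts are hypothetical, and the files SocSciBot itself writes may contain additional fields.

```python
def to_pajek_net(arc_counts):
    """Build Pajek .net text from a {(source_site, target_site): count} mapping."""
    sites = sorted({site for pair in arc_counts for site in pair})
    # Pajek numbers vertices from 1.
    number = {site: i for i, site in enumerate(sites, start=1)}
    lines = [f"*Vertices {len(sites)}"]
    lines += [f'{number[site]} "{site}"' for site in sites]
    lines.append("*Arcs")
    for (source, target), count in sorted(arc_counts.items()):
        # Each arc line is: source vertex, target vertex, weight.
        lines.append(f"{number[source]} {number[target]} {count}")
    return "\n".join(lines) + "\n"


# Illustrative counts only, not results from the tutorial crawl.
example = {
    ("linkanalysis.wlv.ac.uk", "cybermetrics.wlv.ac.uk"): 3,
    ("socscibot.wlv.ac.uk", "linkanalysis.wlv.ac.uk"): 1,
}
net_text = to_pajek_net(example)
```

Saving net_text to a file with a .net extension should give something that can be opened from Pajek's Networks panel and drawn in the same way as the SocSciBot networks.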
Step 7: Viewing site networks
- If you would like to see a network of each individual site, rather than the inter-site connections, select the Network Diagrams for Individual Sites tab. This does not immediately give networks for individual sites, because the default setting is to ignore all internal site links. If you would like a diagram of the internal structure of a web site, you need to tell SocSciBot Tools that you want the internal site links and not any other type of link. To do this, select Select Types of Links to Include in Reports from the Link Type Options menu, ensure that of the 4 square check boxes only Site Self-links is selected, and then click OK.
- Now click the Re-calculate Networks button in the Network Diagrams for Individual Sites tab.
- When the files have been created and listed on the screen, view them in SocSciBot 4 Network by clicking on them. If you prefer to see them in Pajek (and have Pajek on your system), check the Pajek radio button before clicking on a site. Below are two of the networks in Pajek, redrawn with the Kamada-Kawai algorithm. The second one has so many lines that it is hard to interpret.
Step 8: Using Cyclist as a Search Engine
- Start up SocSciBot 4 by double clicking on the file called either SocSciBot4 or SocSciBot4.exe. Select your project and then start Cyclist by clicking on the Cyclist button in the SocSciBot Wizard Step 2. Cyclist is a text search engine, not a link analysis program.
- You will be asked whether you want to import a stoplist. Answer No to this question.
- After 20 seconds or so of calculations you will get a standard search engine type interface. Try searching for a common word, like "link" and then clicking on the results in the right hand side to see what information you are given. In the example below, 77 pages in the project contain the word link (or links) and the first 10 are listed, with some extra information about them.
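The idea behind a text search engine of this kind can be sketched in a few lines of Python: index each page's words once, skip any stoplist words, and answer queries from the index. This is not Cyclist's implementation. The stoplist, the crude plural rule (so that links matches link), the function names and the sample pages are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stoplist; Cyclist lets you import your own, and this
# tutorial skips that step.
STOPLIST = {"the", "a", "of", "and", "to"}


def terms(text):
    """Yield lowercase words, minus stoplist words, with a crude plural rule."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOPLIST:
            continue
        # Strip a trailing "s" from longer words so "links" matches "link".
        yield word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_index(pages):
    """Map each term to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in terms(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages containing every query term."""
    wanted = list(terms(query))
    if not wanted:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in wanted))
    return sorted(hits)


# Made-up pages, used only to exercise the functions.
pages = {
    "example.org/a": "Link analysis for the social sciences",
    "example.org/b": "Useful links and data sets",
    "example.org/c": "A web crawler",
}
index = build_index(pages)
results = search(index, "link")
```

Building the index up front is what accounts for the short calculation delay before a search interface like this becomes available; each search afterwards is a quick lookup.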
The steps of this tutorial apply equally to small and large projects. The only difference is that, for a large project, the site crawls may take a significant time, as may the data processing in SocSciBot Tools and Cyclist.
When collecting data for a real project, please complete your crawls before starting Cyclist or SocSciBot Tools and, if possible, back up your data before starting SocSciBot Tools or Cyclist, just in case anything goes wrong.