Tutorial 1: Introduction to SocSciBot, SocSciBot Tools and Cyclist
This introductory tutorial walks through every stage of a very small SocSciBot project, from the initial crawl to analysing the link data. Working through this project is the easiest way to learn what SocSciBot can do.
Step 1: Installing SocSciBot, SocSciBot Tools and Cyclist
- Go to the SocSciBot web site http://socscibot.wlv.ac.uk/ and, if you agree with the conditions of use, follow the link to download the programs. When prompted by your computer, choose a location with plenty of free space for storing data. This will typically be your computer's hard drive, e.g. the C: drive. If given the choice, please save all the programs to the same folder (this matters especially on Windows Vista).
- Next, unzip the file SocSciBotAll.zip from the place where you saved it. This will create several new files, including the programs SocSciBot, SocSciBot Tools, and Cyclist.
Step 2: Installing Pajek
If you want to produce network diagrams with SocSciBot data, it is recommended that you install Pajek. You need to do this before starting SocSciBot for the first time, because SocSciBot looks for Pajek only when it is first started and will not find Pajek if it is installed afterwards.
- Go to the Pajek home page http://vlado.fmf.uni-lj.si/pub/networks/pajek/ and download and install the latest version of Pajek.
Step 3: Crawling a first site with SocSciBot
- Start SocSciBot by double-clicking on the file called either SocSciBot or SocSciBot.exe in the folder where you unzipped it. This should produce a dialog box similar to the one below. Note: this only happens the first time you start SocSciBot.
- Confirm that the folder chosen by SocSciBot to store your data is acceptable by clicking OK. Also enter your correct email address, if a box is provided for this. It will be used to email the webmasters of any sites that you crawl. This is ethical practice and may also save you from getting into trouble if a webmaster is unhappy with you crawling their site: they can email you directly instead of emailing your boss or network manager. You can also enter a message to be included in the email to give the purpose of the crawl. You may wish to include the URL of a page with additional information about your project. Also, answer any questions about the location of Microsoft Excel and Pajek.
- Enter small test as the name of the project at the bottom of the next dialog box, Wizard Step 1, and then click on the start new project button. All crawls are grouped together into projects. This allows you to have different named groups of crawls which are analysed separately.
- Click No to answer the strange question that you are asked next. This is an advanced data cleansing facility that you are unlikely to need before you become an expert user.
- Click No to answer the second strange question that you are asked next. This is another advanced facility that you are unlikely to need before you become an expert user.
- In the wizard step 2 dialog box, enter http://linkanalysis.wlv.ac.uk/ as the starting URL of the site to crawl, and then click Start a new crawl of this site.
- The crawl is ready to go. Click the Crawl Site button. After half a minute or so, the crawl should end.
You can read information about the crawl in the title bar at the top during the crawl and also at the end of the crawl.
- Click Yes to shut down SocSciBot when the crawl is complete. You have now crawled all pages on the http://linkanalysis.wlv.ac.uk site. Before running some simple analyses, you will crawl two more sites in the next step.
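For readers curious about what "crawling all pages on a site" involves, here is a minimal sketch of a breadth-first crawl that only follows links staying on the starting host. It is an illustration under stated assumptions, not SocSciBot's actual algorithm, which this tutorial does not describe. The names crawl_site, fetch and TOY_PAGES are hypothetical, and the example.org pages are made up so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_site(start_url, fetch):
    """Breadth-first crawl of the pages sharing start_url's host name.

    fetch is a hypothetical callable: URL -> HTML text, or None if missing.
    Returns (pages visited in order, list of (source, target) links found).
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited, links = [], []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            target, _fragment = urldefrag(urljoin(url, href))
            links.append((url, target))
            # Only follow links that stay on the starting site.
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited, links


# Made-up in-memory "site" standing in for real HTTP requests.
TOY_PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="http://other.example/">X</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="b.html#top">B</a>',
    "http://example.org/b.html": "<p>No links here.</p>",
}

visited, links = crawl_site("http://example.org/", TOY_PAGES.get)
print(visited)     # the three example.org pages, in crawl order
print(len(links))  # every link found, including the external one
```

A real crawler would also need polite request delays, robots.txt handling and error handling, all of which are left out here.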
Step 4: Crawling two more sites with SocSciBot
- Start SocSciBot again by double-clicking on the SocSciBot or SocSciBot.exe file in the folder where you unzipped it. This should take you straight through to Wizard step 1. Click on small test to select it as the project to which the next crawl will be added.
- Enter http://cybermetrics.wlv.ac.uk/ as the URL of the second site to crawl, and click Start a new crawl of this site.
- Click on the Crawl site button on the next screen and wait for the crawler to finish.
- Click Yes to end the crawl.
- Repeat the four steps above for the URL http://socscibot.wlv.ac.uk/
- You have now successfully crawled three web sites and are ready to analyse them!
Step 5: Viewing basic reports about the project of three sites with SocSciBot Tools
- Start SocSciBot Tools by double-clicking on the SocSciBot Tools or SocSciBot Tools.exe file in the folder where you unzipped it. This should take you straight through to Wizard step 1. Click on small test to select this project to analyse. Note: SocSciBot Tools works best when it is in the same folder as SocSciBot. If it does not show SocSciBot data then it may be in a different folder. See the Windows Vista fix for solutions to this.
- Select Use this project in the following dialog box.
- Answer Yes to the question about whether you would like a set of basic reports.
- After a few seconds, the reports will have been calculated and you can view them using the drop-down menu in the middle of the screen. Click on All external links (at the top of the list). More information about it will be displayed on the right of the screen. Then click on View report to see a list of URLs targeting pages outside of each site (site outlinks). Try the same with all of the reports and try to work out what they contain. Notice that full URLs are not normally given: the initial http:// and www are chopped off to save space. If you have Excel on your computer, you will sometimes get extra buttons that allow you to view the reports in Excel. These reports should contain the link information needed for most link analysis investigations.
- A key report is ADM count summary, so click on this one and then click on the View in Excel button if you have it (otherwise click on the View report button). This shows the count of links to each site from all the other sites in the project, and the count of links from each site to the other sites in the project. These numbers are reported for each of four ADMs (alternative document models). Most people will only need the file ADM (i.e. standard link counting), which is the Page inlinks column (f-to in the old version of the program) and the Page outlinks column (f-from in the old version of the program). For example, reading these two columns for the linkanalysis.wlv.ac.uk row, there are two links to linkanalysis.wlv.ac.uk from the other two sites, but five links from linkanalysis.wlv.ac.uk to the other two sites.
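The Page inlinks and Page outlinks columns can be thought of as simple tallies over a list of links. The sketch below is a simplified illustration of that idea, assuming each link is a (source URL, target URL) pair and that a "site" is identified by its host name with any leading www. removed. The function names and the link list are made up for the example, and the details of how SocSciBot Tools itself counts links may differ.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Return the host name of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def count_intersite_links(links, sites):
    """Count page-level links between the project's sites.

    links is an iterable of (source URL, target URL) pairs.
    Returns (inlinks, outlinks) Counters keyed by site name.
    """
    inlinks, outlinks = Counter(), Counter()
    for source, target in links:
        src, dst = site_of(source), site_of(target)
        # Skip site self-links and links that leave the project.
        if src == dst or src not in sites or dst not in sites:
            continue
        outlinks[src] += 1
        inlinks[dst] += 1
    return inlinks, outlinks


# Made-up link data; these are NOT the real links between the tutorial sites.
SITES = {"linkanalysis.wlv.ac.uk", "cybermetrics.wlv.ac.uk", "socscibot.wlv.ac.uk"}
example_links = [
    ("http://linkanalysis.wlv.ac.uk/index.html", "http://cybermetrics.wlv.ac.uk/"),
    ("http://linkanalysis.wlv.ac.uk/index.html", "http://www.socscibot.wlv.ac.uk/"),
    ("http://cybermetrics.wlv.ac.uk/", "http://linkanalysis.wlv.ac.uk/"),
    ("http://linkanalysis.wlv.ac.uk/index.html", "http://linkanalysis.wlv.ac.uk/1.html"),
    ("http://socscibot.wlv.ac.uk/", "http://example.org/"),
]

inlinks, outlinks = count_intersite_links(example_links, SITES)
for site in sorted(SITES):
    print(site, "inlinks:", inlinks[site], "outlinks:", outlinks[site])
```

With this made-up data, the self-link and the link to example.org are ignored, so only the three inter-site links contribute to the totals.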
Step 6: Viewing a network diagram with Pajek
If you have installed Pajek on your system, you can view network diagrams that it has created.
- Use the drop-down box in the middle of the screen to select the option Pajek matrix for the whole project (with current options). If you are asked if you want to calculate the file, click Yes.
- Click on the report single.combined.full to view it in Pajek.
- Data for the network should now be loaded into Pajek. To view the network, select Draw from the Draw menu in Pajek.
- If the network does not have labels (site domain names), select the Options menu, Mark Vertices using, and Labels (or just Control-L). This should give a network of the inter-site links, excluding the internal site links.
- To improve the layout of the network diagram, try the Kamada-Kawai positioning algorithm: select Layout, Energy, Kamada-Kawai, Free and then view the result.
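If you are curious about the kind of data Pajek reads, the sketch below writes a small network of inter-site links in Pajek's plain-text .net format, with a *Vertices section followed by an *Arcs section. This is only an illustration: the function name to_pajek_net and the link counts are made up, and the files that SocSciBot Tools generates may be laid out differently (for example as a matrix).

```python
def to_pajek_net(arcs):
    """Build Pajek .net text from {(source site, target site): link count}."""
    # Pajek numbers vertices from 1; sort the names for a stable order.
    names = sorted({name for pair in arcs for name in pair})
    index = {name: i for i, name in enumerate(names, start=1)}
    lines = [f"*Vertices {len(names)}"]
    lines += [f'{i} "{name}"' for name, i in index.items()]
    lines.append("*Arcs")
    for (source, target), weight in sorted(arcs.items()):
        # Each arc line is: source vertex, target vertex, weight.
        lines.append(f"{index[source]} {index[target]} {weight}")
    return "\n".join(lines) + "\n"


# Made-up link counts, for illustration only.
example_arcs = {
    ("linkanalysis.wlv.ac.uk", "cybermetrics.wlv.ac.uk"): 3,
    ("cybermetrics.wlv.ac.uk", "linkanalysis.wlv.ac.uk"): 1,
}
net_text = to_pajek_net(example_arcs)
print(net_text)
```

Saving the returned text to a file with a .net extension should give something Pajek can open through its File menu.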
Step 7: Viewing site diagrams with Pajek
- If you would like to see a diagram of each individual site, rather than the inter-site connections, this is also possible. If you select Pajek matrices for each individual site (with current options) from the drop-down menu, you will not quite get this, because the default for SocSciBot Tools is to ignore all internal site links. Can you tell what diagrams you get instead? If you would like a diagram of the internal structure of a web site, you need to tell SocSciBot Tools that you want the internal site links and not any other type of link. To do this in SocSciBot Tools, select Options and Subproject and ADM selection wizard from the File menu, and select just the site self-links options.
- Now select Pajek matrices for each individual site (with current options) from the drop-down menu and view the files by clicking on them. You should get individual site networks in Pajek. Below are two of the networks, redrawn with the Kamada-Kawai algorithm. The second one has so many lines that it is hard to interpret.
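The distinction between site self-links and other links can be sketched in a few lines. The example below assumes, as a simplification, that a link is a site self-link when its source and target URLs share the same host name, and groups such links by site so that each group could be drawn as one site network. The function name and the example.org links are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def internal_links_by_site(links):
    """Group site self-links (source and target on the same host) by host.

    Returns {host: [(source path, target path), ...]}.
    """
    by_site = defaultdict(list)
    for source, target in links:
        src, dst = urlparse(source), urlparse(target)
        if src.netloc.lower() == dst.netloc.lower():
            by_site[src.netloc.lower()].append((src.path or "/", dst.path or "/"))
    return dict(by_site)


# Made-up links, for illustration only.
example_links = [
    ("http://example.org/", "http://example.org/a.html"),
    ("http://example.org/a.html", "http://example.org/"),
    ("http://example.org/a.html", "http://other.example/"),
    ("http://other.example/x.html", "http://other.example/y.html"),
]
site_networks = internal_links_by_site(example_links)
for host, edges in site_networks.items():
    print(host, edges)
```

The link from example.org to other.example is an inter-site link, so it appears in neither group.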
Step 8: Using Cyclist as a Search Engine
- Start Cyclist by double-clicking on the file called either Cyclist or Cyclist.exe in the folder where you unzipped it. Cyclist is a text search engine, not a link analysis program. Note: Cyclist works best when it is in the same folder as SocSciBot. If it does not show SocSciBot data then it may be in a different folder. See the Windows Vista fix for solutions to this.
- Answer the questions and, after 20 seconds or so of calculations, you will get a standard search-engine-style interface. Try searching for a common word, like "link", and then click on the results on the right-hand side to see what information you are given. In the example below, 41 pages in the project contain the word link (or links) and the first 10 are listed, with some extra information about them.
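To give a feel for how a text search engine can answer a query such as "link" quickly, here is a toy inverted index that maps each word to the pages containing it. It is a sketch only: how Cyclist actually indexes pages and matches plurals is not described in this tutorial, and the plural folding below (dropping a trailing "s") is a deliberately crude assumption. All names and page texts are made up.

```python
import re
from collections import defaultdict


def normalise(word):
    """Lower-case a word and crudely fold plurals: 'Links' -> 'link'.

    This also mangles words such as 'analysis'; a real engine would
    use a proper stemmer.
    """
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_index(pages):
    """Map each normalised word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[A-Za-z]+", text):
            index[normalise(word)].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages containing the query word."""
    return sorted(index.get(normalise(query), set()))


# Made-up page texts, for illustration only.
example_pages = {
    "a.html": "Link analysis tutorial",
    "b.html": "Useful links and reports",
    "c.html": "About this site",
}
index = build_index(example_pages)
print(search(index, "link"))     # pages containing "link" or "links"
print(search(index, "reports"))
```

Because the index is built once, each search is a single dictionary lookup rather than a scan of every page.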
The steps of this tutorial apply equally to small and large projects. The only difference is that, for a large project, the site crawls and the data processing in SocSciBot Tools and Cyclist may take a significant amount of time.
When collecting data for a real project, please complete your crawls before starting Cyclist or SocSciBot Tools and, if possible, back up your data before starting SocSciBot Tools or Cyclist, just in case anything goes wrong.
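One simple way to back up the data is to copy the whole data folder to a new, time-stamped folder. The sketch below does this with Python's standard library. The function name back_up_folder is hypothetical, and the folder paths in the commented usage line are placeholders: use the data folder that SocSciBot showed you when it was first started.

```python
import shutil
import time
from pathlib import Path


def back_up_folder(data_folder, backup_root):
    """Copy data_folder into a new time-stamped folder under backup_root."""
    source = Path(data_folder)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    destination = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree raises an error if the destination already exists,
    # so an earlier backup is never silently overwritten.
    shutil.copytree(source, destination)
    return destination


# Example usage with placeholder paths (assumptions, adjust to your system):
# back_up_folder(r"C:\path\to\socscibot data", r"D:\backups")
```

Run the backup only while SocSciBot, SocSciBot Tools and Cyclist are closed, so that no file is copied while it is being written.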