Frequently Asked Questions for SocSciBot 4

See also the SocSciBot 4 Blog and the list of reports generated by SocSciBot 4.

SocSciBot 4 - crawling the web sites

SocSciBot 4 Tools/ SocSciBot Network- analysing the links after crawling

Cyclist - analysing the text after crawling

Can SocSciBot 4 be used for Blog link analysis?

Yes, but there are some issues. Studies of blog links have found that they are quite rare: most blogs have no links to other sites apart from their blogrolls. This is not true for some genres of blogs, however - e.g., news filtering blogs.

There is a problem with automatically extracting accurate and complete data from blogs. Blogs are quite repetitive: the same content can appear on several pages in different formats and with different URLs (e.g., a page with a single post, a monthly archive containing that post alongside others, and the home page, which may also contain the post), and there are different kinds of links on pages. The current recommendations are as follows:

Crawl all the blogs that you are interested in, keeping all the crawls together in the same SocSciBot 4 project.

Once the crawls are finished, analyse the links using the "domain" links rather than the normal full URLs. In SocSciBot 4 this is in the File Type options menu. This stops multiple repeated links from the same site being counted, but reports only the domains (i.e., web sites) targeted by the links.

If you want to exclude some but not all of the links from your analysis (e.g., you are only interested in links to blogspace.com and livejournal.com blogs), then you can use the splitting technique described in the next question to separate the links into two groups, one of which contains the links you are interested in.

Can I analyse a subset of the links, other than the ones given (i.e., links between crawled sites, links within crawled sites, links to all uncrawled sites)?

This is possible in SocSciBot 4, using the tools option. First create a text file containing partial URLs to match the links that you would like to include. For example, if you are only interested in links to blogspace.com and livejournal.com then your text file should contain only the following two lines:

blogspace.com
livejournal.com

Do not include the http:// or http://www at the start of any URL.

Now use the Banned|Split project menu option to split your project into two new parts: one will contain all links matching any of the lines in your file; the other will contain the remainder of the links. The new project containing the matching links can now be analysed with SocSciBot 4.
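The splitting behaviour can be pictured as a simple substring match of each link URL against the lines of the file. The sketch below is illustrative Python, not SocSciBot's actual code:

```python
# Illustrative sketch of the Banned|Split project behaviour described above:
# a link goes into the "matching" group if its URL contains any of the
# partial URLs read from the text file.

def load_patterns(path):
    """Read one partial URL per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def split_links(links, patterns):
    """Split links into (matching, remainder)."""
    matching, remainder = [], []
    for url in links:
        if any(p in url for p in patterns):
            matching.append(url)
        else:
            remainder.append(url)
    return matching, remainder

patterns = ["blogspace.com", "livejournal.com"]
links = [
    "http://alice.livejournal.com/2007/01/",
    "http://www.bbc.co.uk/news/",
    "http://example.blogspace.com/post1",
]
matching, remainder = split_links(links, patterns)
print(matching)    # the two blog links
print(remainder)   # the BBC link
```

Note that the matching is deliberately loose: a pattern such as blogspace.com also matches any subdomain of blogspace.com.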

How do I delete a crawl from a project?

Click on the name of the crawl in the second SocSciBot 4 Wizard window, and it will prompt you for options to delete it.

What type of pages are automatically banned during a SocSciBot 4 crawl?

SocSciBot 4 bans pages if the site itself requests that they are banned, through the use of the robots.txt protocol. It also bans URLs containing any of the following - all of which are commonly found in mirror sites or large collections of dynamic pages:
/cgi-bin/
.cgi
.dll
archive
/calendar/
/ftp/
ftp.
/handbook/
hypermail
javadoc
java/doc
/JDK1.
/JDK/
/JDK2.
/manual/
/manuals/
mirror
/parser.pl/
pipermail
/record=
/roombooking/
sashtml
/search/
sessionid
timetable
twiki
unixhelp
wwwstats
webstats

and if the ban bulletin boards option is selected then it also bans
bbs.
wwwboard
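The banning rule can be pictured as a simple substring test against this list. The following is an illustrative Python sketch, not SocSciBot's actual implementation (which may, for example, handle letter case differently):

```python
# Illustrative sketch of substring-based URL banning as described above.

BANNED_SUBSTRINGS = [
    "/cgi-bin/", ".cgi", ".dll", "archive", "/calendar/", "/ftp/", "ftp.",
    "/handbook/", "hypermail", "javadoc", "java/doc", "/JDK1.", "/JDK/",
    "/JDK2.", "/manual/", "/manuals/", "mirror", "/parser.pl/", "pipermail",
    "/record=", "/roombooking/", "sashtml", "/search/", "sessionid",
    "timetable", "twiki", "unixhelp", "wwwstats", "webstats",
]
BULLETIN_BOARD_SUBSTRINGS = ["bbs.", "wwwboard"]

def is_banned(url, ban_bulletin_boards=False):
    """Return True if the URL contains any banned substring."""
    patterns = list(BANNED_SUBSTRINGS)
    if ban_bulletin_boards:
        patterns += BULLETIN_BOARD_SUBSTRINGS
    return any(p in url for p in patterns)

print(is_banned("http://example.com/cgi-bin/form"))   # True
print(is_banned("http://example.com/about.html"))     # False
print(is_banned("http://bbs.example.com/", True))     # True
```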

How does SocSciBot 4 perform duplicate elimination and why has it not identified all of the duplicates?

Duplicate elimination in SocSciBot is done through comparison of the full HTML of the two pages. This can be 'fooled' if the pages contain changeable elements, such as a text counter, rotating adverts or other dynamic elements.
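A minimal sketch of this style of duplicate elimination, hashing the full HTML of each page, is shown below. This is illustrative only, not SocSciBot's code:

```python
# Hashing each page's raw HTML finds exact duplicates, but (as noted above)
# a changing counter or rotating advert makes two copies of the same page
# hash differently, so they escape detection.

import hashlib

def find_duplicates(pages):
    """pages: dict of URL -> raw HTML. Returns groups of URLs with identical HTML."""
    by_hash = {}
    for url, html in pages.items():
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        by_hash.setdefault(digest, []).append(url)
    return [urls for urls in by_hash.values() if len(urls) > 1]

pages = {
    "http://example.com/a": "<html>same</html>",
    "http://example.com/b": "<html>same</html>",
    "http://example.com/c": "<html>Visitor 1042</html>",  # counter differs
    "http://example.com/d": "<html>Visitor 1043</html>",
}
print(find_duplicates(pages))  # only pages a and b are grouped
```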

How do I carry out topological analyses of my links?

This is not possible with SocSciBot 4 but is possible with SocSciBot Tools, which is part of SocSciBot 3 and is fully compatible with SocSciBot 4. First download SocSciBot Tools. Switch on the advanced options in SocSciBot Tools, using the Advanced menu. Then explore the new menu options available - some brief explanations are given in the programs themselves, but there are no proper instructions yet. Please experiment with a small data set first as some of the programs take a long time on large data sets.

How do I register a new location for Pajek, Excel or the data files after initially starting the program and having completed these options?

In SocSciBot 4, after selecting a project and opening it with SocSciBot Tools, select Pajek and Excel Program Options from the File menu.

Why does SocSciBot Tools take so long to process large files the first time?

Partly because its author is not an efficient programmer, and partly because the text processing involved requires a lot of CPU time.

The data from the crawls is taking up too much space on my computer! or I am getting an error message telling me that I have less than 10Mb left!

If you intend to use Cyclist or want to keep all the web pages, then you must move all the data to another computer.

If you only want to do a link analysis and do not want to use Cyclist or look at the web pages crawled by the program, then you can save a lot of space by deleting these web pages. To do this, use Windows Explorer to navigate to the project folder and delete all the folders named after the domain names of completed crawls. For example, if you have completed a crawl of http://www.wlv.ac.uk then the folder www.wlv.ac.uk can be deleted. Do not delete any folders that are not domain names, and do not delete any folders from partially completed crawls. In particular, do not delete the folders "link results" and "info".

I want to pause a SocSciBot 4 crawl and continue crawling the next day

You can do this, but you must not close the program down. Click on the Exit/End button; a dialog box will appear asking whether you want to finish the crawl. Leave the dialog open without answering. When you want to continue the crawl, click "No" to signify that you do not want to finish it, and SocSciBot will then continue.

I want to shut down SocSciBot 4 and resume the same crawl the next day.

This is not possible, sorry.

I get an error about Proxy Servers! What can I do?

It is possible to work through some proxy servers with SocSciBot. Open the socscibot.ini file in Notepad and add three lines at the bottom of the file: the line [Proxy cache URL and port], then directly below it the URL of your proxy cache, and directly below that the proxy port number (your IT help desk should know both values). For example, this might be as follows:

[Proxy cache URL and port]
http://proxy.wlv.ac.uk/
39

If you do this then it is possible that SocSciBot will start to work again by going through the proxy server.

I want to crawl a list of URLs rather than a Web site. Is this possible?

Yes. In the SocSciBot 4 Wizard Step 2, check the Download multiple sites/URLs in one combined crawl option and then follow the instructions on the new crawling screen. An alternative method is below, if you prefer.

  1. First, create a plain text file in Windows Notepad, containing the list of URLs. It must have one URL per line and must not contain any blank lines at the beginning or end of the file. There must not be any spaces or tabs after any of the URLs. The file must be called start.txt.
  2. Set up but do not start a "dummy" crawl of the imaginary web site www.fake.fz in a project folder in SocSciBot. Continue until you get the crawl window with the Start Crawl button but do not start the crawl yet.
  3. Click on the "Preload start list start.txt" option near the middle of the SocSciBot screen.
  4. Copy the file start.txt into the new folder created by SocSciBot for your www.fake.fz crawl. To find out where this is, let your mouse hover over the "Preload start list start.txt" text in SocSciBot.
  5. Click on the Crawl Site button and wait for it to finish.
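The formatting rules in step 1 are easy to get wrong. This small helper is illustrative Python (not part of SocSciBot) that checks a start list for the problems described: blank lines at the start or end of the file, and trailing spaces or tabs after a URL:

```python
# Checks the start.txt formatting rules described above (illustrative).

def check_start_list(lines):
    """Return a list of problems found in the start-list lines."""
    problems = []
    if lines and lines[0].strip() == "":
        problems.append("blank line at start of file")
    if lines and lines[-1].strip() == "":
        problems.append("blank line at end of file")
    for i, line in enumerate(lines, 1):
        if line != line.rstrip(" \t"):
            problems.append("trailing space/tab on line %d" % i)
    return problems

good = ["http://www.wlv.ac.uk/", "http://www.bbc.co.uk/"]
bad = ["http://www.wlv.ac.uk/ ", ""]
print(check_start_list(good))  # []
print(check_start_list(bad))   # two problems reported
```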

I want a different type of index - e.g., keeping plurals and/or words containing numbers. Is this possible?

Yes. When asked if you want to create an index for the site, answer "no". Select Index from the main screen menu and then select all the options that you want for the indexing and click the "Build Index" button. This may take a long time.

If I change the domain from .domain.com to just domain.com what does this do?

Without the dot at the start there is a chance that a wrong domain name will be matched. For example, a.com would also match ba.com, since the matching is done via a simple string comparison.
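A tiny illustration of why the leading dot matters, assuming the simple substring comparison described above (illustrative Python, not SocSciBot's code):

```python
# Substring matching: "a.com" occurs inside "ba.com", ".a.com" does not.

def matches(domain_pattern, hostname):
    """Simple substring match, as described above (illustrative)."""
    return domain_pattern in hostname

print(matches("a.com", "ba.com"))       # True  - the unwanted match
print(matches(".a.com", "ba.com"))      # False
print(matches(".a.com", "www.a.com"))   # True  - still matches subdomains
```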

What does the augmented crawl do?

The augmented crawl adds all "sub-paths" to the list of URLs to be crawled. For example if the URL
http://news.bbc.co.uk/2/hi/middle_east/6279864.stm
is added to the crawl list then the following URLs will also automatically be added:
http://news.bbc.co.uk/2/hi/middle_east/
http://news.bbc.co.uk/2/hi/
http://news.bbc.co.uk/2/
http://news.bbc.co.uk/
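The expansion can be sketched as follows (illustrative Python, not SocSciBot's code): every parent directory of the URL's path is added, up to the site root.

```python
# Generate the "sub-path" URLs of a page, as in the augmented crawl example.

from urllib.parse import urlparse

def sub_paths(url):
    """Return all parent-directory URLs of the given URL."""
    parsed = urlparse(url)
    root = parsed.scheme + "://" + parsed.netloc
    parts = parsed.path.strip("/").split("/")
    urls = []
    # Drop the final path component (the page itself), then shorten step by step.
    for n in range(len(parts) - 1, 0, -1):
        urls.append(root + "/" + "/".join(parts[:n]) + "/")
    urls.append(root + "/")
    return urls

print(sub_paths("http://news.bbc.co.uk/2/hi/middle_east/6279864.stm"))
# prints the four URLs listed above
```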

How does the list of project names, imported at the start, impact on the process?

The list of domain names doesn't have any impact on the process unless there are web sites in your list with multiple domain names (e.g., wlv.ac.uk = wolverhampton.ac.uk), in which case you can specify this in a domain name file by including the line
wlv.ac.uk <tab> wolverhampton.ac.uk
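A domain name file of this kind is just tab-separated alternative names on each line. An illustrative parser (not SocSciBot's code) might map every alias to the first name on its line:

```python
# Parse tab-separated domain-name alias lines into a canonical-name map.

def load_aliases(lines):
    """Map every alias to the first (canonical) name on its line."""
    canonical = {}
    for line in lines:
        names = [n for n in line.strip().split("\t") if n]
        if not names:
            continue
        for name in names:
            canonical[name] = names[0]
    return canonical

aliases = load_aliases(["wlv.ac.uk\twolverhampton.ac.uk"])
print(aliases["wolverhampton.ac.uk"])  # wlv.ac.uk
```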

If I crawl on different machines, how do I combine the results?

If you are just doing a link analysis then you need to (a) combine the domain_names.txt files from the info folders, putting the result (with the same name) in the info folder on the second computer, and (b) copy the files in the "link results" folder from the first computer to the second. Do this BEFORE running SocSciBot Tools for the first time on the data.

I'm using the banned list. Should I crawl twice, the second time after building the banned list?

No, once is enough. You can add to the banned list during and even after completing the crawl. The banned pages will then be retrospectively removed by SocSciBot Tools, provided you add to the banned list before the first use of SocSciBot Tools on the data. If you do this, you must also create a blank text file, in the same folder as the modified banned.txt, called exactly "banned list updated.txt". The existence of a file with this name is the 'flag' that tells SocSciBot Tools to look for and remove the extra banned pages.

SocSciBot 4 misses a lot of pages on the sites crawled - can this problem be fixed?

The problem might be due to JavaScript, Java or Flash links on the home page or at key points within the site. To try to get round this, give the crawler alternative entry points by adding a list of URLs of extra pages in the site. To do this, browse the site yourself and make a text file listing plenty of URLs (including the home page URL). Name this file start.txt and put it in the folder created by SocSciBot 4 just before crawling (which has the same name as the domain name of your web site), then check the Preload start list start.txt option just before clicking the crawl button. This ensures that all the pages you have found are added, and, especially if you have found a page such as a site map that links to many pages, the crawler should then be able to find more pages.

How can I find the information about the crawl shown in the dialog box at the end?

The dialog box information can be found by going into the SocSciBot Tools link analysis part and selecting Crawl log from the Help menu.

Should I use SocSciBot or Webometric Analyst to analyse the links between a set of large web sites?

SocSciBot crawls web sites and extracts the hyperlinks from them. You can use it to get comprehensive information about the links between large web sites. SocSciBot will take a long time to crawl these web sites if there are many (more than 10) that you are interested in.

Webometric Analyst gets data on links from Yahoo! and can also give information about the links between large web sites. For example, you can use the Network Diagram wizard option to get a network diagram of the links. This is much faster than SocSciBot but gives less complete lists of links between the web sites.

It is recommended to start with Webometric Analyst and to switch to SocSciBot only if you really need the extra information that it gives. This will save a lot of time.

Error message something like "cannot ... it is being used by another person or program"

Please switch the computer off and on again (it must be a full system restart) and try again, making sure that no other programs are running at the same time as SocSciBot.

My reports in SocSciBot Tools have different names!

Please see this list for equivalent report names for old and new versions of SocSciBot.

How can I get a network of the links between my crawled web sites where the arrow thicknesses are proportional to the number of links between the web sites?

If you draw a network diagram in SocSciBot of the links between the web sites, the default is that all arrows have the same width. To get round this, you can draw a network of the individual pages in all the web sites (if there are not too many) and then have the network drawing program merge together the nodes from the same web site. To do this, first crawl your sites, restart SocSciBot and click the Analyse Links button in the SocSciBot Wizard. This will take you to the main link results screen (there may be a few buttons to click first). Now click the "Network Diagrams for Whole Project" tab and then the Re/Calculate network button. This will create a network diagram showing the individual pages crawled as well as the links between web sites. To see this diagram, click "single.combined.full" in the text box. The network of links between pages should now appear.

To aggregate the links between sites, select Aggregate Nodes by Domain Names, Site or TLD from the Appearance menu. Enter 1 in the colouring dialog box if all your web sites use only one domain name, or 2 if they use multiple domain names. This will colour the nodes by web site. Now the nodes of the same colour can be merged by selecting Merge all groups of nodes with the same colour from the Edit menu.

If the lines are too thick, select the option from the Network menu to multiply arrow widths by a constant, and enter a number less than 1 to reduce the widths.

SocSciBot keeps attempting to download huge non-text files (e.g., pdf, xml) and crashes

Before crawling, go to the Experimental tab at the top of the screen and check the option "Only download pages with text/html MIME type". This stops SocSciBot from working on some sites, but guards against downloading non-HTML pages.