
Frequently Asked Questions for SocSciBot, SocSciBot Tools and Cyclist

See also the Korean SocSciBot 3 help manual by a student of Han Woo Park and a Korean SocSciBot 3 presentation by Han Woo Park.

Can SocSciBot be used for Blog link analysis?

It can be, but there are some issues. People who have studied blog links in the past have found that they are quite rare: most blogs do not link to any other sites apart from their blogrolls. This is not true of some genres of blog, however, such as news-filtering blogs.

There is a problem with automatically extracting accurate and complete data from blogs. Blogs are quite repetitive: the same content can appear on several pages in different formats and with different URLs (e.g., a page containing a single post, a monthly archive containing the same post alongside other posts, and the home page, which may also contain the same post), and pages carry different kinds of links. The current recommendations are as follows:

Use SocSciBot 4 rather than SocSciBot 3

Crawl all the blogs that you are interested in, keeping all the crawls together in the same SocSciBot project.

Once the crawls are finished, analyse the links using the "domain" links rather than the normal full URLs (in SocSciBot 4 this is in the File Type options menu). This stops multiple repeated links from the same site being counted, but only reports the domains (i.e., web sites) targeted by the links (see the sketch after this list).

If you want to exclude some of the links from your analysis but not all (e.g., you are only interested in links to blogspace.com and livejournal.com blogs), then you can use the project-splitting technique described in the next answer to separate the links into two groups, one of which contains the links you are interested in.
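
To illustrate what domain-level analysis means in practice, here is a minimal Python sketch (not SocSciBot's own code): every outlink is reduced to the domain of its target page, so repeated links from one blog to different pages of the same target site collapse into a single source-domain/target-domain pair. The example URLs are invented.

from urllib.parse import urlparse

def domain_level_links(links):
    """Reduce (source URL, target URL) pairs to unique (source domain, target domain) pairs."""
    pairs = set()
    for source, target in links:
        source_domain = urlparse(source).netloc.lower()
        target_domain = urlparse(target).netloc.lower()
        if source_domain and target_domain:
            pairs.add((source_domain, target_domain))
    return sorted(pairs)

links = [
    ("http://a-blog.example.net/post1", "http://someone.livejournal.com/1234.html"),
    ("http://a-blog.example.net/2007/01/", "http://someone.livejournal.com/5678.html"),
]
print(domain_level_links(links))   # one pair, despite two raw links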

Can I analyse a subset of the links, other than the ones given (i.e., links between crawled sites, links within crawled sites, links to all uncrawled sites)?

This is possible in SocSciBot 4, using the tools option. First create a text file containing partial URLs to match the links that you would like to include. For example, if you are only interested in links to blogspace.com and livejournal.com then your text file should contain only the following two lines:

blogspace.com
livejournal.com

Do not include the http:// or http://www at the start of any URL.

Now use the Banned|Split project menu option to split your project into two new parts: one will contain all links matching any of the lines in your file; the other will contain the remainder of the links. The new project containing the matching links can now be analysed with SocSciBot Tools.
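
The split is a substring match between each link target and the lines of your text file. The Python sketch below is only an illustration of that idea; SocSciBot 4 does the real work internally, and the example links and the pattern list here are invented.

patterns = ["blogspace.com", "livejournal.com"]   # the lines of the partial URL file

def split_links(links, patterns):
    """Return (matching, remainder): links whose target contains any pattern, and the rest."""
    matching, remainder = [], []
    for source, target in links:
        if any(pattern in target for pattern in patterns):
            matching.append((source, target))
        else:
            remainder.append((source, target))
    return matching, remainder

links = [
    ("http://a-blog.example.net/", "http://someone.livejournal.com/2007/01/"),
    ("http://a-blog.example.net/", "http://www.bbc.co.uk/news/"),
]
matching, remainder = split_links(links, patterns)
print(len(matching), "matching link(s);", len(remainder), "other link(s)")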

How do I delete a crawl from a project?

Click on the name of the crawl in the second SocSciBot Wizard window, and it will prompt you for options to delete it.

What type of pages are automatically banned during a SocSciBot crawl?

SocSciBot bans pages if the site itself requests that they are banned, through the use of the robots.txt protocol. It also bans URLs containing any of the following - all of which are commonly found in mirror sites or large collections of dynamic pages:
/cgi-bin/
.cgi
.dll
archive
/calendar/
/ftp/
ftp.
/handbook/
hypermail
javadoc
java/doc
/JDK1.
/JDK/
/JDK2.
/manual/
/manuals/
mirror
/parser.pl/
pipermail
/record=
/roombooking/
sashtml
/search/
sessionid
timetable
twiki
unixhelp
wwwstats
webstats

and if the ban bulletin boards option is selected then it also bans
bbs.
wwwboard
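
As a rough illustration of this kind of filtering, the Python sketch below checks a URL against the fragments listed above. How SocSciBot handles letter case internally is an assumption here (the sketch lowercases everything), so treat this as a sketch rather than a description of the exact implementation.

BANNED_FRAGMENTS = [
    "/cgi-bin/", ".cgi", ".dll", "archive", "/calendar/", "/ftp/", "ftp.",
    "/handbook/", "hypermail", "javadoc", "java/doc", "/jdk1.", "/jdk/",
    "/jdk2.", "/manual/", "/manuals/", "mirror", "/parser.pl/", "pipermail",
    "/record=", "/roombooking/", "sashtml", "/search/", "sessionid",
    "timetable", "twiki", "unixhelp", "wwwstats", "webstats",
]
BULLETIN_BOARD_FRAGMENTS = ["bbs.", "wwwboard"]

def is_banned(url, ban_bulletin_boards=False):
    """True if the URL contains any banned fragment (case-insensitive in this sketch)."""
    fragments = BANNED_FRAGMENTS[:]
    if ban_bulletin_boards:
        fragments += BULLETIN_BOARD_FRAGMENTS
    url = url.lower()
    return any(fragment in url for fragment in fragments)

print(is_banned("http://www.wlv.ac.uk/cgi-bin/search"))   # True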

How does SocSciBot perform duplicate elimination and why has it not identified all of the duplicates?

Duplicate elimination in SocSciBot is done through comparison of the full HTML of the two pages. This can be 'fooled' if the pages contain changeable elements, such as a text counter, rotating adverts or other dynamic elements.
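
The following Python sketch shows why comparing the full HTML can be fooled: here the pages are fingerprinted with a hash for convenience, and a single changing element, such as a visitor counter, makes otherwise identical pages look different. The fingerprinting approach is only an illustration, not necessarily how SocSciBot compares pages internally.

import hashlib

def html_fingerprint(html):
    """Hash of the raw HTML; identical pages give identical fingerprints."""
    return hashlib.sha1(html.encode("utf-8")).hexdigest()

page_a = "<html><body>Same post. Visitors: 1041</body></html>"
page_b = "<html><body>Same post. Visitors: 1042</body></html>"
print(html_fingerprint(page_a) == html_fingerprint(page_b))   # False: only the counter differs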

How do I carry out topological analyses of my links?

Switch on the advanced options in SocSciBot Tools, using the Advanced menu. Then explore the new menu options available - some brief explanations are given in the programs themselves, but there are no proper instructions yet. Please experiment with a small data set first as some of the programs take a long time on large data sets.

How do I register a new location for Pajek, Excel or the data files after initially starting the program and having completed these options?

In SocSciBot Tools, select Program Options from the File menu.

Why does SocSciBot Tools take so long to process large files the first time?

Partly because its author is not an efficient programmer, and partly because the text processing involved requires a lot of CPU time.

The data from the crawls is taking up too much space on my computer, or I am getting an error message telling me that I have less than 10 MB left!

If you intend to use Cyclist or want to keep all the web pages then you must move all the data to another computer.

If you only want to do a link analysis, do not want to use Cyclist and do not want to look at the web pages crawled by the program, then you can save a lot of space by deleting these web pages. To do this, do either (a) or (b) below (a Python sketch of option (a) follows this list).
  (a) Use Windows Explorer to navigate to the project folder and delete all the folders with the domain names of completed crawls. For example, if you have completed a crawl of http://www.wlv.ac.uk then the folder www.wlv.ac.uk can be deleted. Do not delete any folders that are not domain names, and do not delete the folders of partially completed crawls. The folders "link results" and "info" are particularly important and must be kept.
  (b) Start SocSciBot Tools and select the project, but do not allow it to process the data. Select "Delete web pages" from the File menu and allow it to delete the folders of all completed crawls. Do not delete the folders of partially completed crawls.
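
For those comfortable with a script, here is a cautious Python sketch of option (a). The project path and the set of completed crawl names are placeholders that you must fill in yourself, and the sketch only prints what it would delete; nothing is removed until you uncomment the deletion line.

import os
import shutil   # needed only if you uncomment the deletion line below

PROJECT_FOLDER = r"C:\SocSciBot\MyProject"    # placeholder: your project folder
COMPLETED_CRAWLS = {"www.wlv.ac.uk"}          # fill in: fully completed crawls only
KEEP = {"link results", "info"}               # these folders must never be deleted

for name in os.listdir(PROJECT_FOLDER):
    path = os.path.join(PROJECT_FOLDER, name)
    if os.path.isdir(path) and name in COMPLETED_CRAWLS and name not in KEEP:
        print("Would delete:", path)
        # shutil.rmtree(path)    # uncomment only once you are sure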

I want to pause a SocSciBot crawl and continue crawling the next day

You can do this, but you must not close the program down. Click on the Exit/End button and leave the question in the dialog box that appears unanswered. When you want to continue the crawl, click "No" to signify that you do not want to finish the crawl, and SocSciBot will then continue.

I want to shut down SocSciBot and resume the same crawl the next day.

This is not possible, sorry.

I get an error about Proxy Servers! What can I do?

It is possible to work through some proxy servers with SocSciBot. Open the socscibot.ini file in Notepad and add the line [Proxy cache URL and port] at the bottom of the file. Directly below this line, add the URL of your proxy cache, and directly below that, add the proxy port number (your IT help desk should know both of these). For example, the end of the file might look as follows:

[Proxy cache URL and port]
http://proxy.wlv.ac.uk/
39

If you do this then it is possible that SocSciBot will start to work again by going through the proxy server.

I want to crawl a list of URLs rather than a Web site. Is this possible?

Yes.

  1. First, create a plain text file in Windows Notepad containing the list of URLs. It must have one URL per line and must not contain any blank lines at the beginning or end of the file, and there must not be any spaces or tabs after any of the URLs. The file must be called start.txt (a small sketch for checking this format follows this list).
  2. Set up but do not start a "dummy" crawl of the imaginary web site www.fake.fz in a project folder in SocSciBot. Continue until you get the crawl window with the Start Crawl button but do not start the crawl yet.
  3. Click on the "Preload start list start.txt" option near the middle of the SocSciBot screen.
  4. Copy the file start.txt into the new folder created by SocSciBot for your www.fake.fz crawl. To find out where this is, let your mouse hover over the "Preload start list start.txt" text in SocSciBot.
  5. Click on the Crawl Site button and wait for it to finish.
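
If you want to check the start.txt format described in step 1, the following Python sketch flags blank lines and trailing spaces or tabs. Only the file name start.txt comes from the answer above; everything else is an illustration.

def check_start_file(path="start.txt"):
    """Return a list of formatting problems found in the start list file."""
    with open(path) as f:
        lines = f.read().splitlines()
    problems = []
    for number, line in enumerate(lines, start=1):
        if line != line.rstrip(" \t"):
            problems.append("line %d: space or tab after the URL" % number)
        if line.strip() == "":
            problems.append("line %d: blank line" % number)
    return problems

print(check_start_file())   # an empty list means the file looks well formed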

I want a different type of index - e.g., keeping plurals and/or words containing numbers. Is this possible?

Yes. When asked if you want to create an index for the site, answer "no". Select Index from the main screen menu and then select all the options that you want for the indexing and click the "Build Index" button. This may take a long time.

If I change the domain from .domain.com to just domain.com what does this do?

Without the dot at the start there is a chance that the wrong domain name will be matched. For example, a.com would also match ba.com, since the matching is done via a simple string comparison (see the sketch below).
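
Here is a small Python illustration of that point, assuming a plain substring comparison as described above:

def matches(url, domain_pattern):
    """Plain substring comparison, as described in the answer above."""
    return domain_pattern in url

print(matches("http://www.ba.com/flights", "a.com"))    # True  (false positive)
print(matches("http://www.ba.com/flights", ".a.com"))   # False (the dot prevents it)
print(matches("http://www.a.com/home", ".a.com"))       # True  (still matches the real site)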

What does the augmented crawl do?

The augmented crawl adds all "sub-paths" to the list of URLs to be crawled. For example if the URL
http://news.bbc.co.uk/2/hi/middle_east/6279864.stm
is added to the crawl list then the following URLs will also automatically be added.
http://news.bbc.co.uk/2/hi/middle_east/
http://news.bbc.co.uk/2/hi/
http://news.bbc.co.uk/2/
http://news.bbc.co.uk/
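
The expansion can be expressed as a short Python sketch that works purely on the URL text; this is an illustration of the behaviour described above rather than SocSciBot's own code.

from urllib.parse import urlparse

def sub_paths(url):
    """Return the parent 'directory' URLs of the given URL, deepest first."""
    parsed = urlparse(url)
    base = parsed.scheme + "://" + parsed.netloc
    segments = [s for s in parsed.path.split("/") if s]
    urls = []
    # drop the final segment (the page itself), then peel off one level at a time
    for depth in range(len(segments) - 1, -1, -1):
        urls.append(base + "/" + "/".join(segments[:depth]) + ("/" if depth else ""))
    return urls

for u in sub_paths("http://news.bbc.co.uk/2/hi/middle_east/6279864.stm"):
    print(u)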

How does the list of project names, imported at the start, affect the process?

The list of domain names does not have any impact on the process unless there are web sites in your list with multiple domain names (e.g., wlv.ac.uk = wolverhampton.ac.uk), in which case you can specify this in a domain name file by including the line
wlv.ac.uk <tab> wolverhampton.ac.uk
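
A file of such tab-separated lines can be read into an alias map, as in the hedged Python sketch below; the file name domain_aliases.txt is just a placeholder for whatever your domain name file is called.

def load_aliases(path="domain_aliases.txt"):
    """Map every alternative domain name to the first name on its line."""
    canonical = {}
    with open(path) as f:
        for line in f:
            names = [name.strip() for name in line.split("\t") if name.strip()]
            for other in names[1:]:
                canonical[other] = names[0]
    return canonical

aliases = load_aliases()
print(aliases.get("wolverhampton.ac.uk"))   # -> "wlv.ac.uk" with the example line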

If I crawl on different machines, how do I combine the results?

If you are just doing a link analysis then you need to (a) combine the domain_names.txt files in the info folders, putting the result in the info folder on the second computer (with the same file name), and (b) copy the files in the "link results" folder from the first computer to the second. Do this BEFORE running SocSciBot Tools for the first time on the data.
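
Step (a) can be done by hand in a text editor, or with something like the Python sketch below, which keeps each line only once and writes the merged result over the second computer's copy. The paths are placeholders, and treating domain_names.txt as a simple one-entry-per-line text file is an assumption here.

def merge_domain_name_files(first, second, output):
    """Combine two domain_names.txt files, keeping each line only once."""
    seen, merged = set(), []
    for path in (first, second):
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line and line not in seen:
                    seen.add(line)
                    merged.append(line)
    with open(output, "w") as f:
        f.write("\n".join(merged) + "\n")

merge_domain_name_files(
    r"C:\copied_from_first_pc\info\domain_names.txt",    # placeholder paths
    r"C:\SocSciBot\MyProject\info\domain_names.txt",
    r"C:\SocSciBot\MyProject\info\domain_names.txt",     # overwrite the second copy
)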

I'm using the banned list. Should I crawl twice, the second time after building the banned list?

No, once is enough. You can add to the banned list during and even after completing the crawl. The banned pages will then be removed retrospectively by SocSciBot Tools, provided that you add to the banned list before the first use of SocSciBot Tools on the data. If you do this then you also need to create a blank text file, in the same folder as the modified banned.txt, called exactly "banned list updated.txt". The existence of a file with this name is the 'flag' that tells SocSciBot Tools to look for and remove the extra banned pages.
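
Creating the flag file is trivial; the Python lines below simply make an empty file with exactly the right name in the relevant folder (the folder path is a placeholder).

import os

project_folder = r"C:\SocSciBot\MyProject"   # placeholder: wherever your modified banned.txt lives
flag_path = os.path.join(project_folder, "banned list updated.txt")
open(flag_path, "w").close()                 # an empty file is all that is needed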

I've crawled with SocSciBot but SocSciBot Tools/Cyclist can't find my data!

SocSciBot Tools and Cyclist need to be in the same folder as SocSciBot to work best. If either of them does not show the SocSciBot data then it is probably installed in a different folder. See the Windows Vista fix for solutions to this. If you are not using Windows Vista then either move the other two programs to the same folder as SocSciBot, or copy the SocSciBot.ini file from the folder containing SocSciBot to the folders containing the other two programs.

SocSciBot misses a lot of pages on the sites crawled - can this problem be fixed?

The problem might be due to JavaScript, Java or Flash links on the home page or at key points within the site. To try to get round this, give the crawler alternative starting points by supplying a list of URLs of extra pages in the site. Browse the site yourself and make a text file listing plenty of URLs (including the home page URL), name it start.txt, and put it in the folder created by SocSciBot just before crawling (which has the same name as the domain name of your web site). Then check the "Preload start list start.txt" option just before clicking the crawl button. This ensures that all the pages you have found are added and, especially if you have found a page such as a site map that links to many pages, the crawler should then be able to find more pages.