See also the Korean SocSciBot 3 help manual by a student of Han Woo Park and a Korean SocSciBot 3 presentation by Han Woo Park.
It can be, but there are some issues. Previous studies of blog links have found that they are quite rare: most blogs do not link to other sites apart from those in their blogrolls. This is not true for some genres of blog, however, e.g. news filtering blogs.
There is a problem with automatically extracting accurate and complete data from blogs. Blogs are quite repetitive: the same content can appear on several pages in different formats and with different URLs (e.g. a page with a single post, a monthly archive containing that post alongside others, and the home page, which may also contain the post), and pages carry several different kinds of links. The current recommendations are as follows:
Use SocSciBot 4 rather than SocSciBot 3.
Crawl all the blogs that you are interested in, keeping all the crawls together in the same SocSciBot project.
Once the crawls are finished, analyse the links using the "domain" links rather than normal full URLs (in SocSciBot 4 this is in the File Type options menu). This prevents repeated links from the same site being counted many times, but it only reports the domains (i.e. web sites) targeted by the links.
If you want to include only some of the links in your analysis (e.g. you are only interested in links to blogspace.com and livejournal.com blogs), then you can use the project-splitting technique described below to separate the links into two groups, one of which contains just the links that you are interested in.
This is possible in SocSciBot 4, using the tools option. First create a text file containing partial URLs to match the links that you would like to include. For example, if you are only interested in links to blogspace.com and livejournal.com then your text file should contain only the following two lines:
blogspace.com
livejournal.com
Do not include http:// or http://www. at the start of any URL.
Now use the Banned|Split project menu option to split your project into two new parts: one will contain all links matching any of the lines in your file; the other will contain the remainder of the links. The new project containing the matching links can now be analysed with SocSciBot Tools.
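As a rough illustration of what the split does, the following Python sketch partitions a list of links using a simple substring test against the lines of the include file (the file names here are hypothetical, and the exact matching rules used by SocSciBot may differ):

# Sketch of the partial-URL split; assumes a plain substring match.
# "include_list.txt" and "all_links.txt" are hypothetical file names.
with open("include_list.txt") as f:
    patterns = [line.strip() for line in f if line.strip()]

matching, remainder = [], []
with open("all_links.txt") as f:
    for link in (line.strip() for line in f):
        if any(p in link for p in patterns):
            matching.append(link)   # e.g. links to blogspace.com or livejournal.com
        else:
            remainder.append(link)  # everything else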
Click on the name of the crawl in the second SocSciBot Wizard window and you will be prompted with the option to delete it.
SocSciBot bans pages if the site itself requests that they be excluded, via the robots.txt protocol. It also bans URLs containing any of the following strings, all of which are commonly found in mirror sites or large collections of dynamic pages:
/cgi-bin/
.cgi
.dll
archive
/calendar/
/ftp/
ftp.
/handbook/
hypermail
javadoc
java/doc
/JDK1.
/JDK/
/JDK2.
/manual/
/manuals/
mirror
/parser.pl/
pipermail
/record=
/roombooking/
sashtml
/search/
sessionid
timetable
twiki
unixhelp
wwwstats
webstats
If the ban bulletin boards option is selected then it also bans URLs containing
bbs.
wwwboard
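In other words, a URL is rejected if any of the strings above occurs anywhere within it. The Python sketch below shows the idea with an abbreviated list (whether the comparison is case sensitive is an assumption here, not documented behaviour):

# Sketch of the ban test: reject a URL if it contains any banned substring.
BANNED = ["/cgi-bin/", ".cgi", ".dll", "archive", "/calendar/", "mirror",
          "sessionid", "timetable", "twiki", "wwwstats", "webstats"]  # abbreviated
BANNED_BBS = ["bbs.", "wwwboard"]  # added only if ban bulletin boards is selected

def is_banned(url, ban_bulletin_boards=False):
    url = url.lower()  # assumption: case-insensitive comparison
    checks = BANNED + (BANNED_BBS if ban_bulletin_boards else [])
    return any(s in url for s in checks)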
Duplicate elimination in SocSciBot is done through comparison of the full HTML of the two pages. This can be 'fooled' if the pages contain changeable elements, such as a text counter, rotating adverts or other dynamic elements.
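A rough Python sketch of this kind of check is below; it hashes the complete HTML source so that two pages count as duplicates only when their HTML is identical, which is why any changing element on the page defeats it (this is an illustration, not SocSciBot's actual code):

# Sketch of full-HTML duplicate detection.
import hashlib

seen = {}  # maps a hash of the page HTML to the first URL seen with that HTML

def is_duplicate(url, html):
    digest = hashlib.sha1(html.encode("utf-8")).hexdigest()
    if digest in seen:
        return True          # identical HTML already crawled under another URL
    seen[digest] = url
    return False             # a counter or rotating advert changes the hash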
Switch on the advanced options in SocSciBot Tools, using the Advanced menu. Then explore the new menu options available - some brief explanations are given in the programs themselves, but there are no proper instructions yet. Please experiment with a small data set first as some of the programs take a long time on large data sets.
In SocSciBot Tools, select Program Options from the File menu.
Partly because its author is not an efficient programmer, and partly because the text processing involved requires a lot of CPU time.
If you intend to use Cyclist or want to keep all the web pages, then you must move all the data to another computer.
You can do this, but you must not close the program down. Click on the Exit/End button and then leave the question in the dialog box that appears unanswered. When you want to continue the crawl, click "no" to signify that you do not want to finish the crawl, and SocSciBot will then continue.
This is not possible, sorry.
It is possible to work through some proxy servers with SocSciBot. Open the socscibot.ini file in Notepad and, at the bottom of the file, add this line: [Proxy cache URL and port] Directly below this line, add the URL of your proxy cache (your IT support should know this). Directly below that, add the proxy port number (again, your IT support should know this). For example, this might be as follows:
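A hypothetical example with placeholder values (substitute the proxy URL and port number given by your IT support):

[Proxy cache URL and port]
proxy.example.ac.uk
8080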
If you do this then SocSciBot may start to work again, by sending its requests through the proxy server.
Yes.
Yes. When asked if you want to create an index for the site, answer "no". Select Index from the main screen menu, choose all the options that you want for the indexing, and click the "Build Index" button. This may take a long time.
Without the dot at the start there is a chance that the wrong domain name will be matched. For example, a.com would also match ba.com, since the matching is done via a simple string comparison.
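A quick illustration of the difference in Python, using made-up URLs:

# With a plain substring comparison, "a.com" also catches ba.com;
# the leading dot rules that out.
links = ["http://www.ba.com/flights", "http://news.a.com/story"]

print([u for u in links if "a.com" in u])    # matches both URLs - ba.com is caught too
print([u for u in links if ".a.com" in u])   # matches only the news.a.com URL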
The augmented crawl adds all "sub-paths" to the list of URLs to be crawled. For example, if the URL
http://news.bbc.co.uk/2/hi/middle_east/6279864.stm
is added to the crawl list then the following URLs will also automatically be added.
http://news.bbc.co.uk/2/hi/middle_east/
http://news.bbc.co.uk/2/hi/
http://news.bbc.co.uk/2/
http://news.bbc.co.uk/
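A minimal Python sketch of how such sub-paths can be generated from a URL (illustrative only, not SocSciBot's actual code):

# Generate every "sub-path" of a URL by repeatedly trimming the last path segment.
def sub_paths(url):
    paths = []
    while url.count("/") > 3:                  # stop once only http://host/ is left
        url = url[:url.rstrip("/").rfind("/") + 1]
        paths.append(url)
    return paths

print(sub_paths("http://news.bbc.co.uk/2/hi/middle_east/6279864.stm"))
# ['http://news.bbc.co.uk/2/hi/middle_east/', 'http://news.bbc.co.uk/2/hi/',
#  'http://news.bbc.co.uk/2/', 'http://news.bbc.co.uk/']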
The list of domain names doesn't have any impact on the process unless there are web sites in your list with multiple domain names (e.g. wlv.ac.uk = wolverhampton.ac.uk), in which case you can specify this in a domain name file by including the line
wlv.ac.uk <tab> wolverhampton.ac.uk
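The effect of such a line is that the two names are treated as the same site. A small Python sketch of the idea follows; the file name, the one-tab-separated-pair-per-line format, and which of the two names is treated as the primary one are all assumptions here:

# Read alias pairs and map each alternative name onto a single canonical name.
aliases = {}
with open("domain_names.txt") as f:            # assumed name of the domain name file
    for line in f:
        parts = line.strip().split("\t")
        if len(parts) == 2:
            alias, canonical = parts           # direction of the mapping is assumed
            aliases[alias] = canonical

def canonical_domain(domain):
    return aliases.get(domain, domain)

print(canonical_domain("wlv.ac.uk"))           # -> wolverhampton.ac.uk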
If you are just doing a link analysis then you need to (a) combine the domain_names.txt files in the info folders, putting the result in the info folder on the second computer (with the same file name), and (b) copy the files in the "link results" folder from the first computer to the second. Do this BEFORE running SocSciBot Tools for the first time on the data.
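A minimal Python sketch of step (a), assuming the domain_names.txt files are plain text with one entry per line (the folder paths are placeholders for the two info folders):

# Combine the two domain_names.txt files, dropping duplicate lines but keeping order.
paths = ["computer1/info/domain_names.txt", "computer2/info/domain_names.txt"]

seen, combined = set(), []
for path in paths:
    with open(path) as f:
        for line in f:
            entry = line.rstrip("\n")
            if entry and entry not in seen:
                seen.add(entry)
                combined.append(entry)

# Write the combined list back as domain_names.txt in the second info folder.
with open("computer2/info/domain_names.txt", "w") as f:
    f.write("\n".join(combined) + "\n")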
No, once is enough. You can add to the banned list during, and even after completing, the crawl. The banned pages will then be retrospectively removed by SocSciBot Tools, provided that you add to the banned list before the first use of SocSciBot Tools on the data. If you do this then you also need to create a blank text file, in the same folder as the modified banned.txt, called exactly "banned list updated.txt". The existence of a file with this name is the 'flag' that tells SocSciBot Tools to look for and remove the extra banned pages.
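If it helps, the empty flag file can be created with a single line of Python (the folder path below is a placeholder for wherever your modified banned.txt is stored):

# Create the empty flag file next to the modified banned.txt (path is a placeholder).
open("C:/SocSciBot data/myproject/banned list updated.txt", "w").close()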
SocSciBot Tools and Cyclist need to be in the same folder as SocSciBot to work best. If either of them does not show SocSciBot data then it is probably installed in a different folder from SocSciBot. See the Windows Vista fix for solutions to this. If not using Windows Vista then either move the other two programs to the same folder as SocSciBot, or copy the SocSciBot.ini file from the folder containing SocSciBot to the folders containing the other two programs.
The problem might be due to JavaScript, Java or Flash links on the home page or at key points within the site. To try to get round this, give the crawler a list of extra URLs within the site as alternative starting points. To do this, browse the site yourself and make a text file listing plenty of URLs (including the one above). Rename this file start.txt, put it in the folder created by SocSciBot just before crawling (it has the same name as the domain name of your web site), and check the Preload start list start.txt option just before clicking the crawl button. This ensures that all the pages you have found are added, and, especially if you have found a page such as a site map that links to many other pages, the crawler should then be able to find more pages.
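For example, the start list file might contain lines like the following (one full URL per line is an assumption, and these URLs are purely illustrative placeholders):

http://www.example.ac.uk/
http://www.example.ac.uk/sitemap.htm
http://www.example.ac.uk/research/projects.htm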