Re: Corpora: Qs. reg. collection of hypertext corpus

Einat Amitay (einat@mpce.mq.edu.au)
Fri, 18 Sep 1998 14:41:46 +1000

Hi,

I've done similar work (a linguistic analysis of home pages); you can find it in my
thesis, which analyses a corpus of 1,000 web pages:

Amitay, E. (1997). Hypertext - The Importance of Being Different. MSc dissertation,
Centre for Cognitive Science, Edinburgh University, Scotland. Also available as
Technical Report HCRC/RP-94.
http://www.hcrc.ed.ac.uk/publications/rp-94.ps.gz

I've found that the best way to deal with the real HTML people use (which is very
different from the DTD the W3C supplies, and which would be considered full of
errors) is to filter it with a tool you write yourself - short scripts are usually
the best choice.
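
A bare-bones version of such a filter might look like this (just a sketch that
strips anything tag-like and keeps the text; a '>' inside a comment or attribute
value will fool it, but that rarely matters for corpus work):

    import java.io.*;

    public class StripTags {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            StringBuilder text = new StringBuilder();
            boolean inTag = false;
            int c;
            while ((c = in.read()) != -1) {
                if (c == '<') inTag = true;        // start of anything tag-like
                else if (c == '>') inTag = false;  // end of it
                else if (!inTag) text.append((char) c);
            }
            System.out.print(text);
        }
    }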

There's also a short Java program you could use to retrieve each URL; you can then
parse the source and proceed from there (I have the Java code if you need it - 20
lines or so).
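
The gist of it (a sketch of the same idea, not that exact code) is just:

    import java.io.*;
    import java.net.URL;

    public class FetchPage {
        public static void main(String[] args) throws IOException {
            URL url = new URL(args[0]);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));
            String line;
            while ((line = in.readLine()) != null)
                System.out.println(line);   // the raw HTML source, ready for parsing
            in.close();
        }
    }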

Another way is to use web robots - programs that traverse the Web automatically:
http://info.webcrawler.com/mak/projects/robots/robots.html
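
For the specific problem described below - a file of start URLs, fetching each site
while staying within the start URL's directory - the two ideas combine into
something like the following sketch. The class name is mine, the link extraction is
a crude regular expression rather than a proper parser, and it assumes each start
URL ends in a page name or a trailing slash:

    import java.io.*;
    import java.net.*;
    import java.util.*;
    import java.util.regex.*;

    public class SiteFetcher {
        // crude link extraction - a regular expression, not a real HTML parser
        static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws IOException {
            BufferedReader urls = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = urls.readLine()) != null) {
                String s = line.trim();
                if (!s.isEmpty()) crawl(new URL(s));   // one start URL per line
            }
            urls.close();
        }

        // breadth-first fetch that never leaves the start URL's directory
        static void crawl(URL start) {
            String s = start.toString();
            String prefix = s.substring(0, s.lastIndexOf('/') + 1);
            Set<String> seen = new HashSet<>();
            Deque<URL> queue = new ArrayDeque<>();
            queue.add(start);
            while (!queue.isEmpty()) {
                URL page = queue.poll();
                if (!seen.add(page.toString())) continue;   // already fetched
                String html;
                try {
                    html = fetch(page);
                } catch (IOException e) {
                    continue;                               // dead link, skip it
                }
                System.out.println("fetched " + page);      // save it to a file here
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    try {
                        URL link = new URL(page, m.group(1));   // resolve relative links
                        if (link.toString().startsWith(prefix)) queue.add(link);
                    } catch (MalformedURLException e) { /* mailto:, javascript:, etc. */ }
                }
            }
        }

        static String fetch(URL url) throws IOException {
            StringBuilder sb = new StringBuilder();
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
            in.close();
            return sb.toString();
        }
    }

As for proxies: Java's URL classes honour the http.proxyHost and http.proxyPort
system properties, so setting those on the command line will usually get you
through a firewall.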

Good Luck!
~:o)
einat

rauchc@gmx.de wrote:

> I'm currently working on a paper on linguistic features of private home
> pages & face the following problem - how to collect the data, i.e. the
> pages/websites??? I'm aware of quite a few automatic downloaders/off-line browsers,
> but none of those I've reviewed so far offers the following (in a convenient
> way, that is):
>
> Instead of manually entering the URL to be used as the starting point I'd
> like the tool to use a file that contains the URLs I want to download, then
> browse 'em one by one. While browsing, the prog is to stay within the
> initial site (its directory on the respective web server, that is).
>
> TeleportPro, for instance, sort of allows for this - however, it regards
> the file itself as the initial URL and treats all URLs contained therein as
> links (which is not what I want, since it defeats the 'scan current
> directory/domain only' feature).
>
> Anybody out there who knows of / has written a prog that would do the
> above trick, and perchance offers support for proxy servers & firewalls as well
> (yes, I know that I'm more than a tad optimistic :)?
>
> Thanks in advance,
> Christoph Rauch
>

--
Einat Amitay
einat@mri.mq.edu.au
http://www.mri.mq.edu.au/~einat