CBF: help

Search and display
The tag set
About CBF

1. Search and display

The CBF uses the IMS Open Corpus Workbench (CWB) as its search engine and CQP as the query processor. A very rudimentary web interface has been built on top of CQP to enable searching for words, lemmas and part-of-speech (POS) tags. At present, there are no plans to implement the full CQPweb interface. The result of a search is by default displayed as a key-word-in-context (KWIC) list. A query can be filtered/thinned by gender (female / male), genre and/or decade (1900, 1910, etc.). NB! The classification into genres should be taken with a sizeable pinch of salt.

If a search returns more than a specific number of hits, the number of hits is reduced to a fixed amount. In addition to the search result, the display contains a table showing various kinds of information about the individual texts.

The following list shows some standard (simple) searches:

sober (search for the word 'sober')
SOBER (search for the lemma 'sober')
_MD (search for the POS tag MD (modal); not recommended as it will yield too many hits)
poor WOMAN (search for the word 'poor' followed by the lemma 'woman', i.e. 'woman' and 'women')
current_NN (search for 'current' as a noun)
recogni(s|z)e (search for 'recognise' and 'recognize')
colou?r (search for 'color' and 'colour')
poor * WOMAN (search for 'poor' followed by an arbitrary word or symbol followed by the lemma 'woman')
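
The simple searches above correspond to CQP token expressions. The following Python sketch shows one way such a translation could work. It is an illustration only: the function name simple_to_cqp and the exact rules are assumptions based on the list above, not the code used by the CBF interface. The string actually sent to CQP is shown at the bottom of the result page.

```python
def simple_to_cqp(query: str) -> str:
    """Translate a simple query into a CQP token sequence (hypothetical rules)."""
    parts = []
    for token in query.split():
        if token == "*":
            # An arbitrary word or symbol: an empty token expression.
            parts.append("[]")
        elif token.startswith("_"):
            # _TAG: match on the POS tag only.
            parts.append(f'[pos = "{token[1:]}"]')
        elif "_" in token:
            # word_TAG: match on word form and POS tag together.
            word, tag = token.rsplit("_", 1)
            parts.append(f'[word = "{word}" & pos = "{tag}"]')
        elif token.isupper():
            # All capitals: treat as a lemma search (assumes lower-case lemmas).
            parts.append(f'[lemma = "{token.lower()}"]')
        else:
            # Anything else: a word form, possibly a regular expression.
            parts.append(f'[word = "{token}"]')
    return " ".join(parts)


if __name__ == "__main__":
    for example in ["sober", "SOBER", "_MD", "poor WOMAN", "current_NN", "poor * WOMAN"]:
        print(f"{example!r:18} -> {simple_to_cqp(example)}")
```

Under these assumed rules, 'poor * WOMAN' becomes a word constraint, an empty token expression and a lemma constraint, in that order.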

You may also use the internal CQP search syntax directly for more advanced searches, e.g.

[lemma = 'poor'] []? [pos = 'NN' | pos = 'NP' | pos = 'NNS']

(search for the lemma 'poor', optionally followed by one arbitrary word, followed by a word tagged NN, NP or NNS)

Searching for symbols, e.g. ' (single quotation mark), is a bit tricky. Use the corresponding POS tag if possible.
More info on CQP search syntax can be found on the web, e.g. here: CQPTutorial

At the bottom of the result page, the search string, as sent to the CQP query processor, is shown.

Below this there are a couple of links to Shiny apps, which can be used to visualise the search result. NB! Experimental.

2. The tag set

The corpus has been tagged with CLAWS and the tag set can be found here: http://ucrel.lancs.ac.uk/claws7tags.html. Since the version of CLAWS we use does not lemmatise words, TreeTagger has been used for this (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). Some post-processing of the tagger output has been performed, e.g. where the tagger was not able to recognise an unconventional spelling of a word.

3. Corpus of British Fiction (CBF)

What is it?

The CBF consists entirely of (extracts of) novels and short story collections published between 1900 and 2019 by writers born and/or educated in the United Kingdom, i.e. England, Scotland, Wales or Northern Ireland. When in doubt about the Britishness of a writer, Wikipedia has been consulted. If a writer is said to have been born in the UK and/or spent most of his/her formative years there, texts produced by that writer have been included.

As far as possible, only adult fiction has been included. Science fiction and fantasy have also been left out. In the end, a compiler of a corpus of mainly 20th-century fiction is largely dependent on the tastes and preferences of the many volunteers out there who have taken the time and effort to scan, proofread and make books available to the public at large. The CBF would not have materialised without these people, and we owe them a big thanks. Their names, if available, have been included in the header accompanying each text in the corpus.

At the time of writing (January 2022), the corpus consists of approx. 1290 (extracts of) texts, totalling more than 100 million words. Note that the corpus is not balanced in the sense of having an equal number of books and words in each decade, nor can it be said to be representative of all (sub-)genres of fiction, as it contains many more crime novels and novels labelled general fiction than e.g. romance or adventure.

Spelling variation

Since many of the texts have been harvested from Project Gutenberg and similar sources, the spelling is not consistently British English. This means that if one is interested, for instance, in words or multiword expressions where the spelling differs between US and UK English, one must make sure to include both alternatives when searching, e.g. color vs. colour ("colou?r"). Moreover, it is unclear to what extent certain publishers, e.g. in the US, have changed the text/spelling according to in-house styles when publishing the American version.
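
To check that a pattern covers both spellings before running it on the corpus, it can be tried out on a few word forms. The sketch below uses Python's re module with full-string matching, since a CQP regular expression has to match the whole token. Python's regex engine is not the one CQP uses, so this is only a rough check that should hold for simple patterns like these. The helper name covers is made up for the example.

```python
import re


def covers(pattern: str, words: list[str]) -> dict[str, bool]:
    """Report, for each word, whether the pattern matches the entire word."""
    compiled = re.compile(pattern)
    return {word: compiled.fullmatch(word) is not None for word in words}


if __name__ == "__main__":
    # Patterns taken from the search examples in section 1.
    print(covers("colou?r", ["color", "colour", "colours"]))
    print(covers("recogni(s|z)e", ["recognise", "recognize", "recognised"]))
```

Note that inflected forms such as 'colours' or 'recognised' are not matched by these patterns. They need either a longer pattern or a lemma search.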

Last updated 26 January 2022