1. Search and display

The CBF uses the IMS Open Corpus Workbench (CWB) as its search engine and CQP as the query processor. A rudimentary web interface has been built on top of CQP to enable searching by word, lemma and part of speech (POS). At present there are no plans to implement the full CQPweb interface. By default, the result of a search is displayed as a key-word-in-context (KWIC) list. A query can be filtered by gender (female / male) and/or decade (1900, 1910, etc.). If a search returns more than a specific number of hits, the number of hits is reduced to a fixed amount. In addition to the search result, the display contains a table showing the distribution of hits per text, the number of hits per 100,000 words, and the decade and gender of each text.
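The figure "hits per 100,000 words" in the table is a normalised frequency, which makes texts of different lengths comparable. The following is a minimal sketch of that calculation in Python. It is not the code used by the interface, and the text identifiers, counts and text sizes are made up for illustration.

```python
def hits_per_100k(hits: int, text_size: int) -> float:
    """Return the number of hits per 100,000 words of text."""
    if text_size <= 0:
        raise ValueError("text_size must be positive")
    return hits * 100_000 / text_size


# Hypothetical rows: (text id, decade, gender, hits, number of words in the text)
rows = [
    ("text_a", 1900, "female", 12, 80_000),
    ("text_b", 1950, "male", 3, 60_000),
]

for text_id, decade, gender, hits, size in rows:
    rate = hits_per_100k(hits, size)
    print(f"{text_id}\t{decade}\t{gender}\t{hits}\t{rate:.1f}")
```

A text with 12 hits in 80,000 words thus gets a higher normalised frequency than one with 3 hits in 60,000 words, independently of the raw counts.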

The following list shows some standard (simple) searches:

  • sober (search for the word 'sober')
  • SOBER (search for the lemma 'sober')
  • _MD (search for the POS tag MD (modal); not recommended as it will yield too many hits)
  • poor WOMAN (search for the word 'poor' followed by the lemma 'woman', i.e. 'woman' and 'women')
  • current_NN (search for 'current' as a noun)
  • recogni(s|z)e (search for 'recognise' and 'recognize')
  • colou?r (search for 'color' and 'colour')
  • poor * WOMAN (search for 'poor' followed by an arbitrary word or symbol, followed by the lemma 'woman')
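To make the conventions in the list above concrete, here is a rough Python sketch of how such simple searches could be rewritten as CQP token expressions. It is an illustration only, not the translation actually performed by the web interface. The function names are hypothetical, and the sketch assumes that the corpus attributes are called word, lemma and pos, and that lemmas are stored in lower case.

```python
def token_to_cqp(token: str) -> str:
    """Translate one token of the simple syntax into a CQP token expression."""
    if token == "*":
        return "[]"  # any single word or symbol
    word, sep, tag = token.rpartition("_")
    if not sep:  # no underscore: a plain word or lemma without a POS tag
        word, tag = token, ""
    parts = []
    if word:
        # Assumption: an all-uppercase form denotes a lemma search.
        if word.isupper():
            parts.append(f'lemma="{word.lower()}"')
        else:
            parts.append(f'word="{word}"')
    if tag:
        parts.append(f'pos="{tag}"')
    return "[" + " & ".join(parts) + "]"


def simple_to_cqp(query: str) -> str:
    """Translate a whitespace-separated simple search into a CQP query."""
    return " ".join(token_to_cqp(token) for token in query.split())


for example in ["sober", "SOBER", "_MD", "poor WOMAN", "current_NN", "poor * WOMAN"]:
    print(f"{example:15} {simple_to_cqp(example)}")
```

Regular expressions such as recogni(s|z)e and colou?r need no special treatment in this sketch, since CQP interprets attribute values as regular expressions.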

You may also use the internal CQP search syntax directly for more advanced searches, e.g.:

[lemma = 'poor'] []? [pos = 'NN' | pos = 'NP' | pos = 'NNS']
(search for the lemma 'poor', optionally followed by one arbitrary word, followed by a word tagged NN, NP or NNS)
More information on the CQP search syntax can be found on the web, e.g. here: CQPTutorial

At the bottom of the result page, the search string, as sent to the CQP query processor, is shown.

Below this there are a couple of links to Shiny apps which can be used to visualise the search result. NB! These apps are experimental.

2. The tag set

The corpus has been tagged with CLAWS (see the CLAWS documentation for the tag set). Since the version of CLAWS that we use does not lemmatise words, TreeTagger has been used for lemmatisation. Some post-processing of the tagger output has been done, e.g. where the tagger was not able to recognise an unconventional spelling of a word.

3. Corpus of British Fiction (CBF)

What is it?

The CBF consists entirely of (extracts of) novels and short story collections published between 1900 and the present by writers born and/or educated in the United Kingdom, i.e. England, Scotland, Wales or Northern Ireland. When in doubt about the Britishness of a writer, Wikipedia has been consulted. If a writer is said to have been born in the UK and/or spent most of his/her formative years there, texts produced by that writer have been included.

As far as possible, only adult fiction has been included. Science fiction and fantasy have also been left out. In the end, a corpus compiler of mainly 20th century fiction is, to a large extent, dependent on the tastes and preferences of the many volunteers out there who have taken the time and effort to scan, proofread and make books available to the public at large. The CBF would not have materialised without these people, and we owe them a big thanks. Their names, if available, have been included in the header accompanying each text in the corpus.

At the time of writing (September 2018), the corpus consists of 562 (extracts of) texts, totalling more than 40 million words.

Spelling variation

Since many of the texts have been harvested from Project Gutenberg and similar sources, the spelling is not consistently British English. This means that if you are interested in words or multiword expressions whose spelling differs between US and UK English, you must make sure to include both alternatives when searching, e.g. color vs. colour ("colou?r"). Moreover, it is unclear to what extent certain publishers, e.g. in the US, have changed the text or spelling to fit in-house styles when publishing the American version.
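Before running a search, it can be useful to check which spellings a pattern covers. The sketch below uses Python's re module as a stand-in. It assumes that a pattern has to match the whole word form, as it does in CQP, and for the simple patterns shown here Python and CQP regular expressions behave the same way. The helper name and the word lists are made up for illustration.

```python
import re


def matching_forms(pattern: str, forms: list[str]) -> list[str]:
    """Return the forms that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [form for form in forms if compiled.fullmatch(form)]


# Both the US and the UK spelling are covered; the plural is not.
print(matching_forms("colou?r", ["color", "colour", "colours"]))

# The alternation covers -ise and -ize, but not inflected forms.
print(matching_forms("recogni(s|z)e", ["recognise", "recognize", "recognised"]))
```

Inflected forms such as 'colours' or 'recognised' are not matched by these patterns, so a lemma search or a broader pattern is needed to include them.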

Last updated 18 September 2018