What is the Reading Academic Text corpus?

The Reading Academic Text (RAT) corpus is a collection of academic texts, written by academic staff or students at the University of Reading and now stored in machine-readable form, that has been developed in the Department of Applied Linguistics. The aim of the project, which started in 1995, is to develop a corpus of academic text for linguistic analysis by research students and staff at the University, and so to contribute to the understanding of text construction practices in academic settings. The insights derived from such analyses should then feed into the development of teaching materials for English for Academic Purposes courses, and into teacher training courses.

The initial corpus, running to nearly a million words of text, was composed of twenty research articles written by Reading University academic staff, and a small set of PhD theses, written and contributed by successful doctoral candidates in the Faculty of Agriculture. A single faculty was chosen in order to constrain the variables of disciplinary convention and preferred rhetorical forms. The Faculty of Agriculture was chosen because it has a high ratio of international students to home students, and many of these students attend Presessional and Insessional language support courses at the Centre. The theses in the corpus are all written by native speakers, which further limits the variables.

Recent developments

Since the corpus was originally established in the academic year 1995-6, the number of theses has increased from 8 to 38. Of the 38 theses, 21 are from the Faculty of Agriculture, broken down as follows:

  •  4 theses from the Department of Agriculture
  •  9 theses from the Department of Agricultural Botany
  •  8 theses from the Department of Agricultural and Food Economics

The other theses are: 7 from the Department of Psychology, 6 from the Department of Food Science and Technology, 1 from the Department of Meteorology, and 3 from the Department of History. The aim is to expand the corpus further, in the coming years, to represent the discourses of a greater range of disciplines.

Research work conducted on texts in the corpus has investigated the organization of theses in different disciplines, the uses of citations (a report on this study can be viewed online), and the means by which student writers position themselves within their texts.

Form of the corpus

The theses and research articles are all under copyright. The writers have given permission for researchers to use the texts for linguistic analysis and language teaching.

The texts are kept in three formats: the original files, the files converted to HTML format, and the ASCII files that form the source for the HTML files. The HTML versions of the texts allow the full text to be viewed, while the ASCII files are used for linguistic analysis and for coding of the corpus.

In the preparation of the HTML files, all figures, plates and tables (containing numerical data) are removed and a tag inserted to show where the feature originally appeared in the text. Equations and other mathematical expressions are also removed. Special characters are coded by HTML conventions, as far as possible. Information about the texts and the tag types used is given in the <head> section of each document. The HTML files are 'read-only'.

The ASCII files are used for additional tagging, concordancing, and the creation of wordlists. They contain all the HTML tags (indicating paragraphing, headings, type appearance, list formatting, special characters, block quotes) and they also contain non-standard tagging of short quotations. Researchers can work on copies of these files, provided they have signed a restricted use agreement, and they can insert their own tags.
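
As an illustration of the kind of processing the ASCII files support, the following Python sketch builds a word list and a simple keyword-in-context concordance from tagged text. It is not the project's own tooling: the function names, the tokenization rule, and the sample passage are all assumptions made for the example, and the corpus's non-standard quotation tags are not modelled beyond being stripped like any other tag.

```python
import html
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")  # any HTML-style tag, e.g. <p> or </h2>
# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Strip tags, decode character entities, and return lowercase words."""
    without_tags = TAG_RE.sub(" ", text)      # tags become spaces
    decoded = html.unescape(without_tags)     # &eacute; becomes é
    return WORD_RE.findall(decoded.lower())


def wordlist(text):
    """Count how often each word form occurs in the text."""
    return Counter(tokenize(text))


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = tokenize(text)
    key = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == key:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample passage; it is not taken from the corpus.
sample = (
    "<h2>Introduction</h2>"
    "<p>Yield was measured in each plot. "
    "The <i>mean</i> yield differed between plots.</p>"
)

for word, count in wordlist(sample).most_common(3):
    print(word, count)

for left, key, right in concordance(sample, "yield", width=2):
    print(f"{left:>20}  {key}  {right}")
```

A researcher working on a copy of a file could pass its contents to wordlist or concordance in the same way. Removing the tags before counting is a design choice for this sketch; an analysis that depends on the tags (for example, restricting a search to headings) would need to parse them instead.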

Who can use the corpus?

Use of the corpus is restricted at present to staff and researchers at the University of Reading, and it is only available 'on-site'. It is possible for people outside the University to make use of the corpus on a Research Attachment arrangement.

Future developments

The corpus will be expanded in the coming years so that it will include theses from at least ten different departments, covering both the natural and social sciences. We are also planning to expand the range of text types in the corpus to include dissertations, projects, laboratory reports, and samples of textbook readings for Master's courses in a range of disciplines. This corpus would then be available to materials developers at the University as a source of authentic academic text data that can be used to build up an academic vocabulary list, and to provide examples of authentic academic language use for analysis.

Another planned development is to create a special section of the corpus which will contain a wide range of texts relating to research in agriculture, to augment the small collection of theses already gathered.

Would you like to know more about the corpus?

For technical information about the RAT corpus or for information regarding access to the corpus, contact:
Paul Thompson, Department of Applied Linguistics, University of Reading, P.O. Box 241, Reading RG6 6AA.

