Collocaid: combining learner needs, lexicographic data and text editors to help learners write more idiomatically

Lead Research Organisation: University of Surrey
Department Name: English


Over the past decades, the UK has produced a series of world-leading corpus-based pedagogical dictionaries that provide users not just with the definitions of words, but also with a wealth of information on how words are actually used in context. There have also been considerable advances with regard to dictionary format. Nowadays, all major English language dictionaries have digital interfaces. Yet research on dictionary use shows that the spectacular developments in terms of dictionary content and format that have taken place over the past decades have not had a dramatic influence on actual dictionary-user behaviour. Dictionaries - both paper-based and digital - remain by and large underused, and it is widely acknowledged that more needs to be done with regard to teaching people how to use dictionaries to their full potential. This proposal stems from the realization that an arguably better solution would be to develop alternative, dictionary-like tools that do not require much in the way of training or instructions.

This project aims to research how information to help writers produce more accurate and idiomatic texts can be migrated from dictionaries and corpora to digital writing environments in an optimal, minimally intrusive way, without disrupting writing processes. Rather than attempting to cover every possible aspect of writing, we will focus on supporting non-native speakers of English with information to help them deal with collocation. Violating collocation conventions can result in errors (e.g. *They trust in us) or awkward, non-idiomatic text (e.g. *a large difference). Additionally, writers who are unable to retrieve idiomatic collocates (e.g. a narrow/daring/lucky escape) often make do with bland, less interesting alternatives (e.g. a fantastic escape). Although there are dictionaries that focus precisely on collocation, writers are often unaware of them or simply cannot be bothered to use them. Moreover, the simple fact that learners have to stop writing to look up a collocate can disrupt the flow of their writing. It is in this context that we propose to research how writers can retrieve information on collocation directly from within digital writing environments in an intuitive and minimally intrusive way so that (1) writers do not need to be trained to look up this information and (2) the flow of writing is not disrupted in the process.

The research will begin with a needs analysis to identify which collocation difficulties to focus on. We will then carry out lexicographic work to address those needs, using, among other resources, computerized language corpora and state-of-the-art lexicographic tools. Next, we will research how to integrate information on collocation with text editors in an easy, helpful and minimally disruptive way. Different models of human-computer interaction and data visualization will be developed, and the team will evaluate them in usability studies with a sample of the target population.

The investigators responsible for this project are three well-known academics with many years of teaching and research experience in the fields of second language writing, lexicography, corpus linguistics and human-computer interaction. The team's advisory board includes Michael Rundell (editor-in-chief of Macmillan Dictionaries), Pete Whitelock (principal language engineer at Oxford University Press dictionary division) and Milos Jakubicek (CEO of Lexical Computing Ltd).

This research will help to further the UK's reputation for world-leading developments in the field of pedagogical lexicography. The project has tangible impacts on society, culture and the economy, as its outputs include data and software that can help writers using English as a medium of communication. We will be exploiting the potential of digital technologies to enhance the creation of knowledge through writing, enabling people of different backgrounds to better express themselves in written English.

Planned Impact

In addition to the academic beneficiaries, the present project will generate tangible outputs with the potential to impact society, culture and the economy. There are a number of non-academic stakeholders at a national and international level who can benefit from this. In the first instance, these include but are not limited to the following:

a. Writers using English as a medium of communication, especially non-native writers of English (e.g. undergraduate and postgraduate students as well as researchers and lecturers in the UK and abroad, in addition to wider audiences including politicians, journalists and other professionals who need to communicate in written English), will benefit from the development of a user-friendly digital writing environment that can help them produce more grammatical and idiomatic texts.

b. Native English speakers wishing to develop their writing skills further (this could include children, students and professionals less fluent in writing) could benefit in ways similar to the beneficiaries in (a).

c. English as a Foreign Language (EFL) and English for Academic Purposes (EAP) tutors in the UK and abroad will have new resources to draw on. They will be welcome to use the information collected on collocation difficulties and collocation solutions in their day-to-day teaching practice. While the primary data generated by the project will be made easily accessible to them through the project website, this group can also benefit from the edited tools and resources developed by group (d) below.

d. The collocation data generated by this project can be commercially valuable to academic publishers that produce EAP materials, such as Oxford University Press, Cambridge University Press and Pearson ELT, and to English language testing services like Cambridge Language Assessment, IELTS and TOEFL. This data can be used to develop books, interactive online exercises and tests. The edited materials and resources they produce using our data will further benefit groups (a), (b) and (c) above.

e. Software developers will benefit from novel visualization methods that focus on personal data. Personal visualization is a fast-growing area, and as yet there are few techniques for displaying personal textual data dynamically and interactively.

f. The linguistic tools and resources created for English in this project can have an indirect impact on other languages, fostering the development of similar projects for languages other than English.

In short, the outputs of the present proposal can have a strong societal, economic and cultural impact, with benefits not only to special professional and practitioner groups but also to the wider public. By using technology to foster improved writing and by enabling people of different cultural and language backgrounds to better express themselves in written language, we hope to enhance the creation of knowledge and promote greater understanding and communication among different communities.


