Searching in the Corpus

Information can be retrieved from the Corpus by means of three search engines based on:

  • Key words;
  • Words and parts thereof;
  • Romanian glosses.

Users may choose the scope of their search: all dialect texts, the texts recorded in one or more localities, or the texts representing one or more Transdanubian dialects. The results returned by the first two engines are limited to data included in utterances made by informants. The search by Romanian glosses is the exception because, for the sake of intelligibility, all Romanian elements are glossed regardless of who uses them.

1. Searching by key words

The constituent dialect texts of the Corpus have been divided into fragments according to the topic of conversation. These fragments, called themes, have been numbered and provided with headings that give a general idea of what the fragment is about. The themes have been indexed with key words of different rank: types of themes and tags. The search by key words returns a list of the relevant fragment headings, accompanied by theme ID numbers, links to these themes and names of the localities where the recordings were made.
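As an illustration only, the look-up described above can be pictured with the following minimal Python sketch. It is not the Corpus's actual implementation: the `Theme` record, the `search_by_keyword` function and all sample headings and locality names are hypothetical, and the assumption that a search may be narrowed to chosen localities follows the scope options mentioned earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Theme:
    """One text fragment (theme) with its index entries. Hypothetical layout."""
    theme_id: int
    heading: str
    locality: str
    theme_types: set = field(default_factory=set)  # higher-rank key words
    tags: set = field(default_factory=set)         # lower-rank key words


def search_by_keyword(themes, keyword, localities=None):
    """Return (heading, theme ID, locality) for every theme indexed with keyword.

    If localities is given, only themes recorded there are considered.
    """
    key = keyword.casefold()
    hits = []
    for theme in themes:
        if localities is not None and theme.locality not in localities:
            continue
        indexed = {k.casefold() for k in theme.theme_types | theme.tags}
        if key in indexed:
            hits.append((theme.heading, theme.theme_id, theme.locality))
    return hits


# Invented sample data, for demonstration only.
SAMPLE_THEMES = [
    Theme(1, "Wedding customs", "Locality A", {"ethnography"}, {"wedding", "bride"}),
    Theme(2, "Baking bread", "Locality B", {"ethnography"}, {"bread", "flour"}),
]

if __name__ == "__main__":
    for heading, theme_id, locality in search_by_keyword(SAMPLE_THEMES, "bread"):
        print(theme_id, heading, locality)
```

Keeping theme types and tags as separate sets preserves the two ranks of key words while still allowing a single query to match either.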

2. Searching by strings of letters

The search by word does not require the entire word to be typed. Users may search by strings of letters, which may be arbitrary word segments or correspond to roots, affixes or endings. Since the search engine does not take accents into account, the results include all matching forms regardless of stress. The results appear as a list of the forms that match the search criteria, followed by the number of hits. If desired, the results may be viewed in context as a concordance with six, twelve or twenty-four words on either side of the search item. When users hold the cursor over a line of the concordance, a floating window shows the name of the locality, the theme ID number and the number of the line in the text where the form is located. The same information is also available at the beginning of each line of the concordance and as a link to the respective theme. When necessary, users may select the concordance and copy and paste it into their own file.
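The behaviour described above (stress-insensitive matching of letter strings, a list of forms with hit counts, and a concordance window) might be sketched as follows. This is a minimal Python illustration under stated assumptions, not the engine used by the Corpus: the function names are hypothetical, the text is assumed to be already split into word tokens, and stress is assumed to be written with the combining grave or acute accent.

```python
import unicodedata
from collections import Counter

# Assumption: stress is marked with U+0300 (grave) or U+0301 (acute).
STRESS_MARKS = {"\u0300", "\u0301"}


def strip_stress(text):
    """Remove stress marks only; other diacritics (e.g. the breve of й) survive."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in STRESS_MARKS)
    return unicodedata.normalize("NFC", kept)


def _matches(token, query):
    return strip_stress(query).casefold() in strip_stress(token).casefold()


def find_forms(tokens, query):
    """Count every attested form that contains the query string."""
    return Counter(tok for tok in tokens if _matches(tok, query))


def concordance(tokens, query, window=6):
    """Return (left context, hit, right context) with `window` words per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if _matches(tok, query):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    sample = ["бу̀лкъта", "бра̀шно", "бу̀лката"]
    for form, hits in find_forms(sample, "булк").items():
        print(form, hits)
```

Decomposing, dropping only the two stress marks and recomposing is a deliberate choice here: removing every combining mark would also turn й into и and ў into у.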

Among the greatest obstacles for users searching non-standard texts are the variations inherent in them. The existing index-card dialect databases and published dialect glossaries rely on the ability of their users to identify the range of variations a given form may have and search for them alphabetically. Cross-referencing of alternative variants is inconsistent if available at all. Some users are naturally more experienced and better equipped to carry out such searches. One should also take into account the conventions of Bulgarian dialectological transcription, which replaces the letters я, ю, щ and ь with йа (йъ, ’а, ’ъ), йу (’у), шт and an apostrophe, and the digraphs дж and дз with џ and s.

Some of the pervasive dialect phenomena contributing to the variability of forms are the following:

  • Reduction of unstressed vowels; that is, the replacement of unstressed а, о and е with ъ, у and и: бу̀лкъта cf. бу̀лката ‘the bride’, бра̀шну cf. бра̀шно ‘flour’, да̀дуфми cf. да̀дохме pf. ‘we gave’;
  • Devoicing of voiced consonants in certain positions: вра̀пкъ cf. вра̀бка ‘sparrow’, глу̀паф cf. глу̀пав m. ‘silly’, флѐзна cf. влѐзна pf. ‘I come in’;
  • Voicing of voiceless consonants in certain positions: збѝраме cf. сбѝраме impf. ‘we gather’, одвъ̀ржа cf. отвъ̀ржа pf. ‘I untie’;
  • Omission of vowels and consonants in certain positions: бла̀тту cf. бла̀тото ‘the marsh’, ва̀ште cf. ва̀шите pl. definite ‘your’, ра̀птъ cf. ра̀бота ‘work’, тва̀ cf. това̀ n. ‘this’, вѝкаа cf. вѝкаха impf. ‘they shouted’, исабѝш cf. исхабѝш pf. ‘you waste’, у̀баф cf. ху̀бав m. ‘nice’, г’а̀ол cf. г’а̀вол ‘devil’, ку̀чеата cf. ку̀четата ‘the dogs’, ма̀к’а cf. ма̀йк’а ‘mother’;
  • Replacement of palatalized consonants with their non-palatalized counterparts and vice versa: га̀зът cf. га̀зят impf. ‘they wade’, дѝган’е cf. дѝгане ‘raising’, ма̀йк’а cf. ма̀йка ‘mother’;
  • Alternative pronunciation of certain consonants: въ̀lна cf. въ̀лна ‘wool’;
  • Replacement of consonants with other consonants or approximants in certain positions: дрѐйата cf. дрѐхата ‘the garment’, ѝмаўа cf. ѝмаха impf. ‘they had’, наsа̀т cf. наза̀т ‘back’;
  • Alternative continuants of syllabic л and р: връвѐа cf. вървя̀ха impf. ‘they walked’, къ̀рсницата cf. кръ̀сницата ‘the godmother’;
  • Alternative continuants of Old Bulgarian ѣ: голѐма cf. гул’а̀мъ f. ‘big’.

In its current state, the search engine does not automatically retrieve all alternative variants of a given form. When the Corpus is finalized and all the forms are known, we will compile a complete list of the possible variations and create an updated version of the search engine which will ideally retrieve all variants of a form at once.
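To give an idea of what retrieving variants at once could involve, here is a minimal Python sketch. It is an assumption-laden illustration, not the planned engine: the names `ALTERNATIONS` and `variant_pattern` are hypothetical, only two of the phenomena listed above are covered (vowel reduction and voicing or devoicing), forms are assumed to be already stripped of stress marks, and position and stress are ignored, so the pattern matches more forms than actually occur.

```python
import re

# Letter pairs that may stand for one another (assumed, simplified):
# reduced vowels а/ъ, о/у, е/и and voiced/voiceless consonant pairs.
ALTERNATIONS = ["аъ", "оу", "еи", "бп", "вф", "гк", "дт", "жш", "зс"]

# Map each letter to the group it belongs to.
GROUP_OF = {letter: group for group in ALTERNATIONS for letter in group}


def variant_pattern(form):
    """Build a regular expression matching the form and its simple variants."""
    parts = []
    for letter in form:
        group = GROUP_OF.get(letter)
        parts.append(f"[{group}]" if group else re.escape(letter))
    return re.compile("".join(parts))


if __name__ == "__main__":
    pattern = variant_pattern("врабка")
    for candidate in ["врапкъ", "врабка", "брашно"]:
        print(candidate, bool(pattern.fullmatch(candidate)))
```

Phenomena such as omitted sounds or the different continuants of ѣ cannot be expressed as one-to-one letter classes and would need a list of attested variants instead, which is why a complete inventory of forms is a prerequisite.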

3. Searching by Romanian glosses

Speakers of Transdanubian Bulgarian dialects are bilingual and include many Romanian elements in their Bulgarian speech. All such elements will gradually be glossed in Standard Bulgarian according to the meaning that Romanian words, phrases or sentences have in the context in which they were pronounced. The body of Romanian glosses will be searchable by the Romanian element spelled according to contemporary Standard Romanian orthography or by its Bulgarian translation. The results will show up as a list of the glosses corresponding to the search criteria, accompanied by the names of localities, headings and theme ID numbers, links to these themes, and the respective line numbers where they can be found in the text.

For users’ convenience, brief instructions accompany each of the three search engines.

Authors of the text: Olga Mladenova and Vesselin Stoykov

