
Natural Language Processing (NLP) is a broad umbrella of technologies used for computationally studying large amounts of text and extracting meaning, both syntactic and semantic. Software using NLP technologies, if engineered for that purpose, generally has the advantage of being able to process large amounts of text faster than humans can. A large number of the functions of a government today revolve around vast amounts of text data, from interactions with citizens to examining archives to passing orders, acts, and bylaws. Under ideal conditions, NLP technologies can assist in the processing of these texts, potentially providing significant improvements in speed and efficiency to various departments of government. Many proposals and examples illustrate how this can be done across multiple domains, from registering public complaints, to conversing with citizens, to tracking policy changes across bills and Acts. This whitepaper seeks both to examine the current state of the art of NLP and to illustrate government-oriented use cases that are feasible among resource-rich languages.

This paper evaluates the impact of different types of data sources in developing a domain-specific statistical machine translation (SMT) system for the domain of official government letters, between the low-resourced language pair Sinhala and Tamil. The baseline was built with a small in-domain parallel dataset containing official government letters. The translation system was evaluated with two different test datasets. Test data from the same sources as the training and tuning data gave a higher score due to over-fitting, while test data from a different source resulted in a considerably lower score. To improve translation, more data was collected from (a) government sources other than official letters (pseudo in-domain), and (b) online sources such as blogs, news, and wiki dumps (out-domain). Pseudo in-domain data improved results on both test sets, as its language is formal and its context is similar to that of the in-domain data, though the writing style varies. Out-domain data, however, had no positive impact in either filtered or unfiltered form, as its writing style was different and its context was much more general than that of the official government documents.
