A new project to develop an open source interface and engine for corpus sharing, learning, analysis, and retrieval.

A framework for exploring large collections of text documents. It supports searching for custom patterns and finding examples that verify or falsify hypotheses provided by a user, which leads to insights into general and domain-specific linguistic properties and distributions. Unlike most IR services, the framework builds search indices dynamically from incoming requests, so a request can be an arbitrary expression composed from a set of basic blocks. It also brings machine learning and analysis computations to the data on the server side, which significantly improves performance and protects collections from undesired crawling. The framework supports interactive communication with the client side, suitable for both human users and machine learning systems.
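The idea of building indices from incoming requests can be illustrated with a minimal sketch. It is not the framework's actual implementation: the class name LazyIndex, its methods, and the choice of single tokens as the "basic blocks" are all assumptions made for illustration. A posting list is built only when a request first needs it, and later requests reuse it.

```python
class LazyIndex:
    """Hypothetical on-demand index: posting lists are built on first use."""

    def __init__(self, documents):
        self.documents = documents  # list of token lists
        self.postings = {}          # token -> list of (doc_id, position)
        self.builds = 0             # number of posting lists built so far

    def lookup(self, token):
        """Return all (doc_id, position) pairs for a token, indexing it if needed."""
        if token not in self.postings:
            self.builds += 1
            self.postings[token] = [
                (doc_id, pos)
                for doc_id, tokens in enumerate(self.documents)
                for pos, tok in enumerate(tokens)
                if tok == token
            ]
        return self.postings[token]

    def sequence(self, tokens):
        """Match consecutive tokens by combining the posting lists of each token."""
        if not tokens:
            return []
        later = [set(self.lookup(t)) for t in tokens[1:]]
        return [
            (doc_id, pos)
            for doc_id, pos in self.lookup(tokens[0])
            if all((doc_id, pos + i + 1) in s for i, s in enumerate(later))
        ]


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
index = LazyIndex(docs)
hits = index.sequence(["sat", "on"])  # start positions of the phrase "sat on"
```

In this sketch only the tokens that actually appear in requests are ever indexed, so the index grows with the queries rather than with the vocabulary of the collection.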

Desiderata:

a safe, fast, universal open source tool for sharing text collections and corpora
privacy protection: access to text collections without violating IP or data sharing policies (i.e. an interface for running computations and analyses on the data, as well as access to data snippets). Limited downloads, unlimited usage [for statistics and analysis]
a unified feature annotation and retrieval [search] model based on text spans
fast search for examples and counterexamples to claims and hypotheses
storage of pre-computed analysis results
user queries expressed in a formal syntax that allows nested expressions over text spans
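As a rough illustration of the span-based model and nested query expressions listed above, the following sketch treats every annotation as a labeled text span and every query as a function from spans to spans. The names Span, has_label, either, and inside are hypothetical and do not reflect the project's query syntax; they only show how nested expressions over spans could compose.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Span:
    start: int  # inclusive token offset
    end: int    # exclusive token offset
    label: str  # annotation attached to this span


Query = Callable[[List[Span]], List[Span]]


def has_label(name: str) -> Query:
    """Basic block: select spans carrying a given label."""
    return lambda spans: [s for s in spans if s.label == name]


def either(a: Query, b: Query) -> Query:
    """Union of two queries, keeping the order of first appearance."""
    def run(spans: List[Span]) -> List[Span]:
        first = a(spans)
        return first + [s for s in b(spans) if s not in first]
    return run


def inside(inner: Query, outer: Query) -> Query:
    """Spans matched by `inner` that lie within some span matched by `outer`."""
    def run(spans: List[Span]) -> List[Span]:
        outers = outer(spans)
        return [
            s for s in inner(spans)
            if any(o.start <= s.start and s.end <= o.end for o in outers)
        ]
    return run


# Hypothetical annotations over the tokens: Alice met Bob in Paris yesterday
annotations = [
    Span(0, 1, "PERSON"),
    Span(2, 3, "PERSON"),
    Span(4, 5, "LOCATION"),
    Span(3, 5, "PP"),
    Span(0, 6, "SENTENCE"),
]

# Nested expression: a PERSON or LOCATION span inside a PP span.
query = inside(either(has_label("PERSON"), has_label("LOCATION")), has_label("PP"))
result = query(annotations)
```

Because annotation and retrieval share the same span representation, a query result is itself a list of spans and can be fed into further queries or stored as a pre-computed annotation layer.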

A simplified demo version based on a 4-gram index. Next update planned for 21 Dec.
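For readers unfamiliar with n-gram indices, here is a minimal sketch of what a 4-gram index could look like. It assumes word-level 4-grams and whitespace tokenization, neither of which is confirmed for the demo, and the function name build_ngram_index is hypothetical.

```python
from collections import defaultdict


def build_ngram_index(documents, n=4):
    """Map each n-gram (tuple of n tokens) to the (doc_id, position) pairs where it starts."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(documents):
        for pos in range(len(tokens) - n + 1):
            index[tuple(tokens[pos:pos + n])].append((doc_id, pos))
    return index


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the mat today".split(),
]
index = build_ngram_index(docs)

# Look up every occurrence of one 4-gram across the collection.
hits = index.get(("sat", "on", "the", "mat"), [])
```

A lookup in such an index is a single dictionary access, at the cost of fixing the pattern length in advance.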