Meta-Headers

Working with several different tools can be a complete pain if the formats don't match up. To remedy this, we are designing a small, human-readable header to be included in as many file types as possible, describing the file contents in a cross-platform, cross-tool way!

Take me to the (Alpha) Open Standard document (sources)!
The standard is still in a pre-1.0 stage, but is available for comments and additions.

On-Line Validation Tool

Take me to the on-line validation tool!

Check whether your corpus files are Meta-Header compliant with our on-line validation tool. It runs locally in your browser, so you don't have to upload anything!

Paper: Corpus File Meta-Headers - An Open Standard For Corpus Meta Data

With the increasingly large variety of tools available to the modern corpus linguist, there is a pressing need to be able to efficiently and effectively move data from one tool to another with no (or minimal) loss. Though other standards exist for the representation of corpora at some level (such as TEI, or more general standards such as XML), these often presume a limited set of use-cases, offering a top-down restriction on the information within in order to impose reliable structure. These systems serve to restrict the data that can be stored in such formats, and ultimately this limits their usefulness. Our approach focuses on documenting common features, allowing for variation in document format. Such a method permits innovation whilst maintaining maximal compatibility between tools. Herein we demonstrate a method for representing, parsing, and using such documentation in practice.

Take me to the paper! Take me to the files!

About the Authors

John Vidler

PhD Candidate, Computer Science
Department of Computing and Communications
Lancaster University

John is working on operating systems and networking research, with corpus linguistics on the side.

Homepage

Stephen Wattam

PhD Computer Science
Department of Computing and Communications
Lancaster University

Stephen is a lecturer at Lancaster University in the UK, and works with text and social data, automating manual methods and examining data at scale.

Homepage