When to accept or reject

Documents are rejected so that annotation effort is spent on full texts. Reject a document when

  • the text consists of short list items (e.g. product listings on a web shop page, lists of links)
  • the main body of text consists of short photo captions
  • the sentences don’t form a coherent text
  • there are only individual lines of actual text
  • the amount of coherent text is very small compared to the ‘junk’ text (otherwise, annotate the actual text and ignore the junk)
  • there are fewer than two complete sentences (e.g. lists of short news introductions)
  • the text is not in the target language
  • the text is poorly extracted (not representative of the web page)
  • the document consists of special characters or numbers only
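
Two of the criteria above (special characters or numbers only, and fewer than two complete sentences) are mechanical enough that they could be pre-checked before a document reaches an annotator. The following is a minimal sketch of such a check, not part of the guidelines themselves. The function names, the thresholds, and the rule that a "complete sentence" is a run of at least four words ending in '.', '!' or '?' are all assumptions made for illustration. The remaining criteria (coherence, extraction quality, target language) still need human judgment.

```python
import re
from typing import Optional

# Hypothetical thresholds, chosen for illustration only.
MIN_SENTENCES = 2
MIN_WORDS_PER_SENTENCE = 4


def count_complete_sentences(text: str) -> int:
    """Count chunks that end in '.', '!' or '?' and contain enough words.

    This is a rough heuristic: abbreviations and decimal numbers will
    split chunks early, so the count can be too low.
    """
    chunks = re.findall(r"[^.!?\n]+[.!?]", text)
    return sum(1 for c in chunks if len(c.split()) >= MIN_WORDS_PER_SENTENCE)


def obvious_reject_reason(text: str) -> Optional[str]:
    """Return a reason if the text clearly fails a mechanical criterion.

    Returns None when no mechanical criterion fires; the document then
    still needs to be judged by an annotator.
    """
    # No letters at all: only special characters or numbers.
    if not any(ch.isalpha() for ch in text):
        return "special characters or numbers only"
    # Too few sentence-like chunks: list items, captions, link lists.
    if count_complete_sentences(text) < MIN_SENTENCES:
        return "fewer than two complete sentences"
    return None
```

A return value of None only means that neither mechanical check fired. It does not mean the document should be accepted.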