How to spell check function/variable in Emacs

2018-06-17 03:16 | Source

emacs
en
flyspell

CREATED: <2018-06-17 Sun>

UPDATED: <2019-04-04 Thu>

This article explains how to spell check camel cased names of functions and variables in Emacs. It uses options --run-together from GNU Aspell.

But this solution is not perfect. It wrongly identifies two character sub-word as typo. For example, "onChange" is regarded as typo because the sub-word "on" is identified as typo. Another issue is namespace of function name. For example, "MS" from "MSToggleButton" is alias of "Microsoft". If "MS" is identified as typo, every word containing "MS" is regarded as typo.

In this article,

I will explain how Emacs spell checker works
Then we study the algorithm of aspell
Finally, a complete solution which works with both aspell and hunspell is provided

Emacs built in plugin Fly Spell does spell checking. It passes the options and plain text to command line tool aspell. Aspell sends back the typos into Fly Spell. Fly Spell then picks up typos to display. For example, flyspell-prog-mode only displays typos in comments and strings.

Aspell doesn't understand syntax of any programming language. It simply reports typos in plain text.

Aspell has two options:

--run-together-limit is "Maximum number of words can be strung together"
--run-together-min is "Minimal length of sub-words"

Aspell C++ code has to be studied in order to understand above two options. Let's start from Working::check_word in modules/speller/default/suggest.cpp.

Here is code,

class Working : public Score {
  unsigned check_word(char * word, char * word_end, CheckInfo * ci, unsigned pos = 1);
};
unsigned Working::check_word(char * word, char * word_end,  CheckInfo * ci,
                             /* it WILL modify word */
                             unsigned pos)
{
  // check the whole word before go into run-together mode
  unsigned res = check_word_s(word, ci);
  // if `res` is true, it's a valid word, don't bother run-together
  if (res) return pos + 1;
  // it's typo because number of sub-words is greater than "--run-together-limit"
  if (pos + 1 >= sp->run_together_limit_) return 0;

  // `i` is the `end` of sub-word, the poition AFTER last character of sub-word
  for (char * i = word + sp->run_together_min_; 
       // already checked the whole word; besides, any sub-word whose size is less 
       // than "--run-together-min" is regarded as invalid
       i <= word_end - sp->run_together_min_;
       ++i)
  {
    char t = *i;

    // read the sub-word by set the character at `end` position to '\0'
    *i = '\0';
    res = check_word_s(word, ci);
    // restore original character at `end` position
    *i = t;

    // Current sub-word is invalid, we need append the character at current
    //  `end` position to creata new sub-word.
    //  Inncrement `i` because `i` always points to the `end` of sub-word
    if (!res) continue;

    // Current sub-word is valid, strip it from the whole word to create a totally
    // new word for `check_word`, `check_word` is a recursive function
    res = check_word(i, word_end, ci + 1, pos + 1);
    if (res) return res;
  }
  memset(ci, 0, sizeof(CheckInfo));
  return 0;
}

Say first parameter of check_word is "hisHelle",

word points to string "hisHelle" (in C/C++, string is character array. The last character of array is character '\0')
sp->run_together_min_ is 3, so i initially points to the character "H", at the end of sub-word "his"
check_word_s return true for sub-word "his"
So we strip "his" from "hisHelle" and recursively call check_word to check new word "Helle"
In the new context of check_word, we extract "Hel" from "Helle" initially
"Hel" is invalid. So "Hell" is extracted from "Helle". The remaining charcter "e" is a treated as as new word with check_word recursively applying on it
"e" is not valid It's the last word could be extracted. So recursion is over. It's concluded that "hisHelle" is a typo

Key points:

--run-together-limit could be bigger with enough memory. It's default value is 8. I prefer 16.
--run-together-min can't be 2 because too many typos are combination of "correct" two character sub-words ("hehe", "isme", …)
--run-together-min can't be greater than 3, or else, too many "correct" three character sub-words are regarded as invalid ("his", "her", "one", "two")
So --run-together-min should always be 3

If --run-together-min is 3, the word "onChange" is typo because the first two characters "on" is typo by aspell. This is obviously wrong.

The solution is a Emacs Lisp predicate which supports both aspell and hunspell.

A predicate could be attached to specific major-mode. The predicate file all the typos reported by CLI program. If predicate returns t, the feed in typo is finally approved to be typo,

A example of predicate for js2-mode,

(defun js-flyspell-verify ()
  (let* ((font-face (get-text-property (- (point) 1) 'face))
         (word (thing-at-point 'word)))
    (message "font-face=%s word=%s" font-face word)
    t))
(put 'js2-mode 'flyspell-mode-predicate 'js-flyspell-verify)

Typo candidates are filtered by js-flyspell-verify. So predicate is where to fix typos wrongly reported by CLI program.

Here is complete setup you can paste into .emacs (I setup for js2-mode and rjsx-mode but code is generic enough).

Please note function split-camel-case splits a camel case word into a list of sub-words. Sub-word with less one or two characters is not typo.

(defun split-camel-case (word)
  "Split camel case WORD into a list of strings.
Ported from 'https://github.com/fatih/camelcase/blob/master/camelcase.go'."
  (let* ((case-fold-search nil)
         (len (length word))
         ;; ten sub-words is enough
         (runes [nil nil nil nil nil nil nil nil nil nil])
         (runes-length 0)
         (i 0)
         ch
         (last-class 0)
         (class 0)
         rlt)

    ;; split into fields based on class of character
    (while (< i len)
      (setq ch (elt word i))
      (cond
       ;; lower case
       ((and (>= ch ?a) (<= ch ?z))
        (setq class 1))
       ;; upper case
       ((and (>= ch ?A) (<= ch ?Z))
        (setq class 2))
       ((and (>= ch ?0) (<= ch ?9))
        (setq class 3))
       (t
        (setq class 4)))

      (cond
       ((= class last-class)
        (aset runes
              (1- runes-length)
              (concat (aref runes (1- runes-length)) (char-to-string ch))))
       (t
        (aset runes runes-length (char-to-string ch))
        (setq runes-length (1+ runes-length))))
      (setq last-class class)
      ;; end of while
      (setq i (1+ i)))

    ;; handle upper case -> lower case sequences, e.g.
    ;;     "PDFL", "oader" -> "PDF", "Loader"
    (setq i 0)
    (while (< i (1- runes-length))
      (let* ((ch-first (aref (aref runes i) 0))
             (ch-second (aref (aref runes (1+ i)) 0)))
        (when (and (and (>= ch-first ?A) (<= ch-first ?Z))
                   (and (>= ch-second ?a) (<= ch-second ?z)))
          (aset runes (1+ i) (concat (substring (aref runes i) -1) (aref runes (1+ i))))
          (aset runes i (substring (aref runes i) 0 -1))))
      (setq i (1+ i)))

    ;; construct final result
    (setq i 0)
    (while (< i runes-length)
      (when (> (length (aref runes i)) 0)
        (setq rlt (add-to-list 'rlt (aref runes i) t)))
      (setq i (1+ i)))
     rlt))

(defun flyspell-detect-ispell-args (&optional run-together)
  "If RUN-TOGETHER is true, spell check the CamelCase words.
Please note RUN-TOGETHER will make aspell less capable. So it should only be used in prog-mode-hook."
  ;; force the English dictionary, support Camel Case spelling check (tested with aspell 0.6)
  (let* ((args (list "--sug-mode=ultra" "--lang=en_US"))args)
    (if run-together
        (setq args (append args '("--run-together" "--run-together-limit=16"))))
    args))

;; {{ for aspell only, hunspell does not need setup `ispell-extra-args'
(setq ispell-program-name "aspell")
(setq-default ispell-extra-args (flyspell-detect-ispell-args t))
;; }}

;; ;; {{ hunspell setup, please note we use dictionary "en_US" here
;; (setq ispell-program-name "hunspell")
;; (setq ispell-local-dictionary "en_US")
;; (setq ispell-local-dictionary-alist
;;       '(("en_US" "[[:alpha:]]" "[^[:alpha:]]" "[']" nil ("-d" "en_US") nil utf-8)))
;; ;; }}

(defvar extra-flyspell-predicate '(lambda (word) t)
  "A callback to check WORD.  Return t if WORD is typo.")

(defun my-flyspell-predicate (word)
  "Use aspell to check WORD.  If it's typo return t."
  (let* ((cmd (cond
               ;; aspell: `echo "helle world" | aspell pipe`
               ((string-match-p "aspell$" ispell-program-name)
                (format "echo \"%s\" | %s pipe"
                        word
                        ispell-program-name))
               ;; hunspell: `echo "helle world" | hunspell -a -d en_US`
               (t
                (format "echo \"%s\" | %s -a -d en_US"
                        word
                        ispell-program-name))))
         (cmd-output (shell-command-to-string cmd))
         rlt)
    ;; (message "word=%s cmd=%s" word cmd)
    ;; (message "cmd-output=%s" cmd-output)
    (cond
     ((string-match-p "^&" cmd-output)
      ;; it's a typo because at least one sub-word is typo
      (setq rlt t))
     (t
      ;; not a typo
      (setq rlt nil)))
    rlt))

(defun js-flyspell-verify ()
  (let* ((case-fold-search nil)
         (font-matched (memq (get-text-property (- (point) 1) 'face)
                             '(js2-function-call
                               js2-function-param
                               js2-object-property
                               js2-object-property-access
                               font-lock-variable-name-face
                               font-lock-string-face
                               font-lock-function-name-face
                               font-lock-builtin-face
                               rjsx-text
                               rjsx-tag
                               rjsx-attr)))
         subwords
         word
         (rlt t))
    (cond
     ((not font-matched)
      (setq rlt nil))
     ;; ignore two character word
     ((< (length (setq word (thing-at-point 'word))) 2)
      (setq rlt nil))
     ;; handle camel case word
     ((and (setq subwords (split-camel-case word)) (> (length subwords) 1))
      (let* ((s (mapconcat (lambda (w)
                             (cond
                              ;; sub-word wholse length is less than three
                              ((< (length w) 3)
                               "")
                               ;; special characters
                              ((not (string-match-p "^[a-zA-Z]*$" w))
                               "")
                              (t
                               w))) subwords " ")))
        (setq rlt (my-flyspell-predicate s))))
     (t
      (setq rlt (funcall extra-flyspell-predicate word))))
    rlt))

(put 'js2-mode 'flyspell-mode-predicate 'js-flyspell-verify)
(put 'rjsx-mode 'flyspell-mode-predicate 'js-flyspell-verify)

UPDATE: Now you can use wucuo. It's an out of box solution supporting both aspell and hunspell.