How to spell check function/variable in Emacs

CREATED: <2018-06-17 Sun>

UPDATED: <2018-07-06 Fri>

This article explains how developers can spell check function and variable names while programming in Emacs.

It uses the --run-together option of GNU Aspell to check camel-cased words.

But this solution is not perfect. It wrongly flags two-character interior words as typos. For example, "onChange" is regarded as a typo because of the interior word "on". Another issue is the namespace prefix of a function name. For example, "MS" in "MSToggleButton" is an abbreviation of "Microsoft". If "MS" is identified as a typo, every word containing the namespace "MS" is regarded as a typo.

In this article,

  • I will explain how the Emacs spell checker works
  • Then we study the run-together algorithm of aspell
  • Finally, I will show you a complete setup

In Emacs, the built-in package Fly Spell is in charge of spell checking. It passes the options and the plain text to the command line tool aspell. Aspell sends the typos it finds back to Fly Spell, which then selects the typos to display. For example, when flyspell-prog-mode is on, only typos in comments and strings are visible.

So aspell doesn't understand the syntax of any programming language. It scans plain text and reports all typos to Fly Spell.

In aspell, there are two extra "run-together" word options:

  • --run-together-limit is "Maximum number of words can be strung together"
  • --run-together-min is "Minimal length of interior words"

Let's study the code of aspell to understand these two options. The "run-together" algorithm is implemented in the function Working::check_word in the file "modules/speller/default/suggest.cpp".

To help you understand this function, I documented the code line by line:

class Working : public Score {
  unsigned check_word(char * word, char * word_end, CheckInfo * ci, unsigned pos = 1);
};
unsigned Working::check_word(char * word, char * word_end,  CheckInfo * ci,
                             /* it WILL modify word */
                             unsigned pos)
{
  // check the whole word before going into run-together mode
  unsigned res = check_word_s(word, ci);
  // if `res` is true, it's a valid word, don't bother with run-together
  if (res) return pos + 1;
  // it's a typo because the number of interior words would exceed "--run-together-limit"
  if (pos + 1 >= sp->run_together_limit_) return 0;

  // `i` is the `end` of the interior word, the position AFTER its last character
  for (char * i = word + sp->run_together_min_;
       // the whole word is already checked; besides, any interior word shorter
       // than "--run-together-min" is regarded as invalid
       i <= word_end - sp->run_together_min_;
       ++i)
  {
    char t = *i;

    // read the interior word by setting the character at the `end` position to '\0'
    *i = '\0';
    res = check_word_s(word, ci);
    // restore the original character at the `end` position
    *i = t;

    // The current interior word is invalid, so append the character at the current
    // `end` position to create a new interior word.
    // Increment `i` because `i` always points to the `end` of the interior word
    if (!res) continue;

    // The current interior word is valid, strip it from the whole word to create a
    // new word for `check_word`; `check_word` is a recursive function
    res = check_word(i, word_end, ci + 1, pos + 1);
    if (res) return res;
  }
  memset(ci, 0, sizeof(CheckInfo));
  return 0;
}

Let's use "hisHelle" to demonstrate how check_word runs:

  • "word" points to the string "hisHelle" (in C/C++, a string is a character array whose last character is '\0')
  • "sp->run_together_min_" is 3, so "i" initially points to the character "H", at the end of the interior word "his"
  • "check_word_s" returns "true" for the interior word "his"
  • So we strip "his" from "hisHelle" and recursively call "check_word" to check the new word "Helle"
  • In the new context of "check_word", the whole word "Helle" is checked first and is invalid
  • "Helle" has only five characters, so the loop condition rules out every split: "Hel" would leave "le" and "Hell" would leave "e", both shorter than "sp->run_together_min_". The recursive call returns 0
  • Back in the outer call, the longer interior words "hisH" and "hisHe" are invalid too. So "hisHelle" is a typo
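
To make the walkthrough concrete, here is a minimal, self-contained C++ sketch of the same recursion. It is not aspell's code: the toy dictionary kToyDict and the helpers in_toy_dict and run_together_ok are names made up for illustration, and in_toy_dict merely stands in for check_word_s. The sketch returns a boolean instead of a position and copies substrings instead of patching the buffer in place.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical toy dictionary standing in for aspell's real word list.
static const std::set<std::string> kToyDict = {"his", "hell", "hello", "her"};

// Stand-in for check_word_s: case-insensitive lookup of one whole word.
static bool in_toy_dict(const std::string& w) {
  std::string lower;
  for (unsigned char c : w) lower += static_cast<char>(std::tolower(c));
  return kToyDict.count(lower) > 0;
}

// Simplified run-together check: true if `word` is a dictionary word or a
// concatenation of dictionary words that are each at least `min_len` long.
// `limit` mirrors the `pos + 1 >= run_together_limit_` guard of check_word.
static bool run_together_ok(const std::string& word, std::size_t min_len,
                            unsigned limit, unsigned pos = 1) {
  if (in_toy_dict(word)) return true;   // the whole word is valid
  if (pos + 1 >= limit) return false;   // too many interior words
  // `i` is the length of the candidate interior word; the interior word and
  // the remainder must both be at least `min_len` characters long.
  for (std::size_t i = min_len; i + min_len <= word.size(); ++i) {
    if (!in_toy_dict(word.substr(0, i))) continue;  // try a longer interior word
    if (run_together_ok(word.substr(i), min_len, limit, pos + 1)) return true;
  }
  return false;  // no valid split exists
}
```

With this toy dictionary, run_together_ok("hisHello", 3, 8) is true because "his" and "Hello" are both words, while run_together_ok("hisHelle", 3, 8) is false.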

Here is our conclusion after studying the code:

  • --run-together-limit can be bigger if your computer has enough memory. Its default value is 8. I prefer 16.
  • --run-together-min can't be 2 because too many typos are combinations of "correct" two-character interior words ("hehe", "isme", …)
  • --run-together-min can't be greater than 3, or else too many "correct" three-character interior words are regarded as invalid ("his", "her", "one", "two")
  • --run-together-min should always be 3, which is its default value. Actually, it should never have been tweakable by the user in the first place

Since --run-together-min is 3, the word "onChange" is always regarded as a typo because of the two-character interior word "on". Since there is nothing we can do on the aspell side, we have to turn to Emacs to fix this problem.
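
The trade-off behind --run-together-min can be reproduced with a small C++ sketch. It assumes a toy word list and a simplified splitter that follows the loop bounds of check_word; kWords, is_word and accepts are hypothetical names for illustration, not part of aspell.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical toy word list; a real aspell dictionary is far larger.
static const std::set<std::string> kWords = {"on", "change", "he", "is", "me"};

// Case-insensitive lookup of a single word in the toy list.
static bool is_word(std::string w) {
  for (char& c : w) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return kWords.count(w) > 0;
}

// True if `word` is a toy-list word or can be split into toy-list words of
// at least `min_len` characters each. `limit` plays the role of
// --run-together-limit and `min_len` the role of --run-together-min.
static bool accepts(const std::string& word, std::size_t min_len,
                    unsigned limit = 8, unsigned pos = 1) {
  if (is_word(word)) return true;
  if (pos + 1 >= limit) return false;
  for (std::size_t i = min_len; i + min_len <= word.size(); ++i) {
    if (is_word(word.substr(0, i)) &&
        accepts(word.substr(i), min_len, limit, pos + 1)) {
      return true;
    }
  }
  return false;
}
```

With this toy list, accepts("onChange", 3) is false while accepts("onChange", 2) is true. But accepts("hehe", 2) and accepts("isme", 2) are true as well, which is why lowering the minimum to 2 hides real typos.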

When Emacs receives a potential typo from aspell, we can strip all the two-character interior words from the original word and spell check the new word again.

We can attach a predicate to a specific major mode. The predicate returns t if the word at the cursor should be reported as a typo and nil to suppress it:

(defun js-flyspell-verify ()
  (let* ((font-face (get-text-property (- (point) 1) 'face))
         (word (thing-at-point 'word)))
    (message "font-face=%s word=%s" font-face word)
    t))
(put 'js2-mode 'flyspell-mode-predicate 'js-flyspell-verify)

As you can see from the code above, js-flyspell-verify gives us full control over which typos are displayed. You can regard the predicate as the last chance to fix wrongly identified typos.

Here is the complete setup you can paste into .emacs (I set it up for js2-mode and rjsx-mode, but the code is generic enough).

Please note that the function split-camel-case splits a camel case word into a list of sub-words. I simply assume any sub-word shorter than three characters is not a typo.

(defun split-camel-case (word)
  "Split camel case WORD into a list of strings.
Ported from 'https://github.com/fatih/camelcase/blob/master/camelcase.go'."
  (let* ((case-fold-search nil)
         (len (length word))
         ;; ten sub-words is enough
         (runes [nil nil nil nil nil nil nil nil nil nil])
         (runes-length 0)
         (i 0)
         ch
         (last-class 0)
         (class 0)
         rlt)

    ;; split into fields based on class of character
    (while (< i len)
      (setq ch (elt word i))
      (cond
       ;; lower case
       ((and (>= ch ?a) (<= ch ?z))
        (setq class 1))
       ;; upper case
       ((and (>= ch ?A) (<= ch ?Z))
        (setq class 2))
       ((and (>= ch ?0) (<= ch ?9))
        (setq class 3))
       (t
        (setq class 4)))

      (cond
       ((= class last-class)
        (aset runes
              (1- runes-length)
              (concat (aref runes (1- runes-length)) (char-to-string ch))))
       (t
        (aset runes runes-length (char-to-string ch))
        (setq runes-length (1+ runes-length))))
      (setq last-class class)
      ;; end of while
      (setq i (1+ i)))

    ;; handle upper case -> lower case sequences, e.g.
    ;;     "PDFL", "oader" -> "PDF", "Loader"
    (setq i 0)
    (while (< i (1- runes-length))
      (let* ((ch-first (aref (aref runes i) 0))
             (ch-second (aref (aref runes (1+ i)) 0)))
        (when (and (and (>= ch-first ?A) (<= ch-first ?Z))
                   (and (>= ch-second ?a) (<= ch-second ?z)))
          (aset runes (1+ i) (concat (substring (aref runes i) -1) (aref runes (1+ i))))
          (aset runes i (substring (aref runes i) 0 -1))))
      (setq i (1+ i)))

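    ;; construct final result, skipping empty fields
    (setq i 0)
    (while (< i runes-length)
      (when (> (length (aref runes i)) 0)
        ;; `push' works under both dynamic and lexical binding,
        ;; unlike `add-to-list' on a let-bound variable
        (push (aref runes i) rlt))
      (setq i (1+ i)))
    (nreverse rlt)))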

(defun flyspell-detect-ispell-args (&optional run-together)
  "If RUN-TOGETHER is true, spell check the CamelCase words.
Please note RUN-TOGETHER will make aspell less capable. So it should only be used in prog-mode-hook."
  ;; force the English dictionary, support Camel Case spell checking (tested with aspell 0.6)
  (let* ((args (list "--sug-mode=ultra" "--lang=en_US")))
    (if run-together
        (setq args (append args '("--run-together" "--run-together-limit=16"))))
    args))

(setq ispell-program-name "aspell")
(setq-default ispell-extra-args (flyspell-detect-ispell-args t))

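(defvar extra-flyspell-predicate (lambda (word) t)
  "A callback to check WORD.  Return t if WORD is a typo.")

(defun my-flyspell-predicate (word)
  "Use aspell to check WORD.  Return t if it is a typo."
  ;; In aspell's pipe mode, a line starting with "&" reports a misspelled
  ;; word with suggestions and "#" reports one without suggestions.
  (if (string-match-p "^[&#]"
                      (shell-command-to-string (format "echo %s | %s %s pipe"
                                                       ;; quote WORD so the shell can't interpret it
                                                       (shell-quote-argument word)
                                                       ispell-program-name
                                                       (mapconcat 'identity
                                                                  (flyspell-detect-ispell-args t)
                                                                  " "))))
      t))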

(defun js-flyspell-verify ()
  (let* ((case-fold-search nil)
         (font-matched (memq (get-text-property (- (point) 1) 'face)
                             '(js2-function-call
                               js2-function-param
                               js2-object-property
                               js2-object-property-access
                               font-lock-variable-name-face
                               font-lock-string-face
                               font-lock-function-name-face
                               font-lock-builtin-face
                               rjsx-text
                               rjsx-tag
                               rjsx-attr)))
         subwords
         word
         (rlt t))
    (cond
     ((not font-matched)
      (setq rlt nil))
     ((not (string-match-p "aspell$" ispell-program-name))
      ;; Only override aspell's result
      (setq rlt t))
     ;; ignore words shorter than three characters
     ((< (length (setq word (thing-at-point 'word))) 3)
      (setq rlt nil))
     ;; handle camel case word
     ((and (setq subwords (split-camel-case word)) (> (length subwords) 1))
      (let* ((s (mapconcat (lambda (w)
                             (cond
                              ;; sub-word whose length is less than three
                              ((< (length w) 3)
                               "")
                              ;; special characters
                              ((not (string-match-p "^[a-zA-Z]*$" w))
                               "")
                              (t
                               w))) subwords " ")))
        (setq rlt (my-flyspell-predicate s))))
     (t
      (setq rlt (funcall extra-flyspell-predicate word))))
    rlt))

(put 'js2-mode 'flyspell-mode-predicate 'js-flyspell-verify)
(put 'rjsx-mode 'flyspell-mode-predicate 'js-flyspell-verify)

Optionally, see https://github.com/redguardtoo/emacs.d/blob/master/lisp/init-spelling.el for my real-world setup.
