XPoseImageCaptioner

Authors: Christopher Millsap
download development version

This addon can create a descriptive caption of any JPG or PNG image in File Explorer, Microsoft Edge, Google Chrome, and Firefox. It does this using machine learning and presents the caption in a window so that the text of the caption can be examined as well as speaking the text.

Usage

First, select an image file in Windows File Explorer, or in a web browser. Chrome, Edge, and Firefox are supported. With the image selected, presss NVDA+x. the addon will respond with "Captioning, please wait..." as the image is analyzed by the machine learning module and captioned. Depending on your machine's CPU speed, captioning will take between two and five seconds. After captioning is complete, a window will open with the caption of the image and the caption will be read. When you are finished browsing the caption, you can press escape to close the caption window.

Getting the most out of XPoseImageCaptioner

There are several things to be aware of when using the XPoseImageCaptioner to get the best results:

XPoseImageCaptioner captioning works best for photographs and cartoons or other artwork. It also can work fairly well for memes and ads. It does not work well for charts and is not a replacement for OCR. If you have an image of a text document, use an OCR addon rather than XPoseImageCaptioner.
AI captioning can tell you what is in an image but can't tell you why it is there. ALT text should still be used to find out about an image's context. For example, on a news site you may see an image with the ALT text "a general gives testimony in a congressional hearing about the millitary budget", the AI caption could be something like "a man in a formal millitary uniform speaks into a microphone while seated in a wood paneled room". The AI caption tells you what is in the image, but the ALT text ideally should tell you why its there.
The BLIP neural network, on which the XPoseImageCaptioner addon is based, can only output English text. Retraining the model to support languages other than English is not feasable at this time.
While the captions produced are currently very close to the state of the art for AI captioning of images, they are not always 100% accurate. Please use with caution and common sense and never in place of OCR. Also, do not rely on the output for dangerous or high-risk situations.
Currently, XPoseImageCaptioner works for websites that don't require a login. For example, the public pages of organizations such as Guiding Eyes for the Blind or CNN. Pages that require a login, such as Facebook or Twitter, are not yet supported because the addon must download an image from the web site independantly to caption it and can't do so if a login is required. As a workaround, any image from sites requiring a login could be downloaded to the local machine and captioned using the addon in File Explorer.
XPoseImageCaptioner only works on FireFox when an image does not have ALT text. FireFox does not provide a direct link to an image file to a screen reader if an image has ALT text. Without this information, the addon can't download the image for the AI to caption. Chrome and Microsoft Edge do not have this limitation, and work regardless of whether an image has ALT text or not.

Copyright:

I learned a great deal from the NAO OCR addon in terms of how it deals with Windows File Explorer in NVDA. Thanks to Alessandro Albano, Davide De Carne, and Simone Dal Maso for their work on that addon. Also, XPoseImageCaptioner uses the BLIP neural network weights and code by Salesforce.com, but is not affiliated with or endorsed by Salesforce.com in any way.

Licence

Licensed under the BSD 3 Clause License. This addon is not endorsed in any way by Salesforce.com.