Document Accessibility
Conversion wrap-up and Other Formats
Volker Sorge
Overview
- SVG formulas: Show & Tell
- Loose ends
- What's missing from our converted documents
- Other documents
- Other sources
- Other formats
Show & Tell
What have you done with the SVG formulas?
- Tabbing?
- Navigation?
- Styling?
- Anything else?
Tying Up Loose Ends
Let's look at our example document again
Still Missing
- Bibliography
- Referencing
- Figures (broken, badly displayed)
- Footnotes
- ...
Tweaking Converters: pandoc
- pandoc "filters" (extensions)
pandoc --filter=PROGRAM ...
pandoc --lua-filter=SCRIPTPATH ...
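A minimal sketch of what a `--filter` program can look like. pandoc sends the document to the filter as a JSON AST on stdin and reads the (possibly modified) AST back from stdout. The example only reports images with empty alt text and passes the document through unchanged; the function names and the file name `check_alt.py` are assumptions made for illustration, and the AST layout should be checked against your pandoc version.

```python
#!/usr/bin/env python3
"""Sketch of a pandoc JSON filter that reports images without alt text."""
import json
import sys


def iter_elements(node):
    """Yield every AST element (a dict carrying a "t" tag) below `node`."""
    if isinstance(node, list):
        for item in node:
            yield from iter_elements(item)
    elif isinstance(node, dict):
        if "t" in node:
            yield node
        for value in node.values():
            yield from iter_elements(value)


def images_without_alt(doc):
    """Return the target URLs of all Image elements whose alt text is empty."""
    missing = []
    for element in iter_elements(doc):
        if element["t"] == "Image":
            # Image content is [attributes, alt-text inlines, [url, title]].
            _attr, alt_inlines, target = element["c"]
            if not alt_inlines:
                missing.append(target[0])
    return missing


def main():
    doc = json.load(sys.stdin)  # pandoc writes the AST to our stdin
    for url in images_without_alt(doc):
        print(f"warning: image without alt text: {url}", file=sys.stderr)
    json.dump(doc, sys.stdout)  # hand the document back unchanged


# pandoc passes the output format as the first argument to a filter;
# without it we assume the file is merely being imported or tested.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Saved as an executable script, this could be run as `pandoc --filter=./check_alt.py ...`. A filter that changes the document would modify the elements before writing the AST back.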
Tweaking Converters: make4ht
- make4ht can use multiple backends
- tex4ht is not that easy to tweak
- lua4ht makes use of luaTeX
- luaTeX offers the possibility of integrating (post-)processing scripts
Bibliographies and Citations
- Support for basic bibliographies is usually good: \begin{thebibliography}, bibtex generally ok
- Other tooling has less support (natbib, biblatex, biber)
- citations should match cross-references and bibliography
- Backlinks are good, also for a11y
- but not well supported, e.g. tex4ht
- advanced markup (DPUB ARIA, schema.org) not supported by conversion
Cross-references
- cross reference whenever, wherever
- links are at the core of the world wide web
- provides usability internally and externally (deep links)
- use hyperref's \autoref or cleveref
- better links (more hit area, better link text, simpler maintenance)
- avoid manual ref ranges such as \ref{1} -- \ref{8}
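Manual ranges are easy to miss in a large source. A rough sketch of a check that finds them, assuming ranges are written on a single line as two `\ref` or `\eqref` commands joined by hyphens or `\textendash`; the function name is made up for this example.

```python
import re

# Two \ref (or \eqref) commands joined by 1-3 hyphens or \textendash.
MANUAL_RANGE = re.compile(
    r"\\(?:eq)?ref\{([^}]*)\}\s*(?:-{1,3}|\\textendash)\s*\\(?:eq)?ref\{([^}]*)\}"
)


def find_manual_ranges(tex_source):
    """Return (line_number, first_label, last_label) for each manual range."""
    hits = []
    for number, line in enumerate(tex_source.splitlines(), start=1):
        for match in MANUAL_RANGE.finditer(line):
            hits.append((number, match.group(1), match.group(2)))
    return hits


sample = r"See Equations \ref{eq:1} -- \ref{eq:8} and Figure \ref{fig:a}."
for line_number, first, last in find_manual_ranges(sample):
    print(f"line {line_number}: manual range {first} to {last}")
```

Each hit is a candidate for cleveref's `\crefrange{first}{last}`, which lets the converter see the range as a proper link.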
Footnotes
- Avoid footnotes.
- But they are actually well supported.
- back navigation is a must.
- appropriate ARIA markup might be supported
Indices, Glossaries, Appendices
- Indices are rarely seen in web documents - search and links are preferred (and often easier)
- Glossaries are useful but often "outsourced" (e.g., Wikipedia, Scholarpedia)
- Appendices generally work like any other sectioning but counters might vary.
- DPUB ARIA markup for appendices is usually not supported by converters (but this has little impact in practice).
Theorem environments
- TeX converters don't treat them well (just creating wrapping divs); post-processing might be feasible
- what are they really? Landmarks with identifiable titles
- we lack a dedicated native solution but can make do
- e.g., section+heading, figure+figcaption, possibly with a roledescription
- don't nest, don't abuse
- amsthm advice for proofs: short = theorem, long = section
- e.g., don't do subclaims in proofs as theorems just because you like the styling
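A sketch of the post-processing idea: turn a wrapping div into section+heading with a roledescription. It assumes well-formed XHTML without namespaces, and the class names `theorem` and `head` are hypothetical; real converter output will use different markup and needs adjusting.

```python
import xml.etree.ElementTree as ET


def upgrade_theorems(xhtml, css_class="theorem", heading_tag="h3"):
    """Rewrite <div class="theorem"> wrappers as labelled <section> elements."""
    root = ET.fromstring(xhtml)
    count = 0
    for div in list(root.iter("div")):
        if css_class not in div.get("class", "").split():
            continue
        head = div.find("span[@class='head']")  # direct child holding the title
        if head is None:
            continue  # no identifiable title: leave the div untouched
        count += 1
        heading_id = head.get("id") or f"{css_class}-{count}"
        head.tag = heading_tag
        head.set("id", heading_id)
        div.tag = "section"
        div.set("aria-roledescription", css_class)
        div.set("aria-labelledby", heading_id)  # gives the landmark its title
    return ET.tostring(root, encoding="unicode")


sample = (
    '<body><div class="theorem"><span class="head">Theorem 1</span>'
    "<p>Every bounded sequence has a convergent subsequence.</p></div></body>"
)
print(upgrade_theorems(sample))
```

The heading level is fixed here for simplicity; in a real document it should follow the surrounding section depth.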
Algorithm layout
- Support in converters is quite limited
- semantics is usually bad
- raw source code sensible if kept separate (e.g., then raw in HTML plus prism.js)
- algorithmicx, minted ok
- algorithm2e not great
- e.g., with tex4ht (and manually adding white-space: pre) it's barely passable
Diagram authoring
We'll start with some of those tomorrow!
Considerations for LaTeX Authoring
TeX tips: what to do
Do what good LaTeX authors would do
TeX tips: what to avoid 1
- plain TeX - LaTeX only please!
- TeX programming (changing core macros, active characters, loops etc)
- LaTeX taboos
- box hackery, e.g., parbox, minipage, pbox, fbox, raisebox, rotatebox
- using page dimensions (\textwidth etc)
- using real-world dimensions (in, mm, cm etc)
- manual spacing (\!, hspace, vspace, hfill)
- manual positioning (cover pages, pgf drawing across page)
- custom fonts
TeX tips: what to avoid 2
- color
- rules-based constructs
- generated pictures, pstricks, tikz
- but standalone conversion can work
- custom counters (e.g., lists)
- custom items (\item[...])
- hidden crossrefs (\ref{1}--\ref{4} etc.)
- hardcoded linebreaks
- hardcoded linelengths
- hardcoded whitespace
- avoid hacking layout using TeX (a little CSS goes a long way)
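Some of the items on the two "what to avoid" lists can be spotted mechanically. A heuristic sketch, assuming a small, non-exhaustive selection of patterns chosen for this example; it will produce both false positives and misses, so treat the output as hints only.

```python
import re

# Hypothetical, non-exhaustive patterns for a few items from the lists above.
DISCOURAGED = {
    "box hackery": r"\\(?:parbox|pbox|fbox|raisebox|rotatebox)\b|\\begin\{minipage\}",
    "page dimension": r"\\(?:textwidth|textheight|linewidth|paperwidth)\b",
    "real-world dimension": r"\d(?:mm|cm|in)\b",
    "manual spacing": r"\\!|\\(?:hspace|vspace|hfill|vfill)\b",
}


def lint_tex(tex_source):
    """Return (line_number, label) for each discouraged construct found."""
    findings = []
    for number, line in enumerate(tex_source.splitlines(), start=1):
        code = re.sub(r"(?<!\\)%.*", "", line)  # ignore TeX comments
        for label, pattern in DISCOURAGED.items():
            if re.search(pattern, code):
                findings.append((number, label))
    return findings


sample = "\n".join([
    r"\section{Intro}",
    r"\hspace{2cm}Some text",
    r"\includegraphics[width=0.5\textwidth]{plot}",
    r"% \vspace{1em} commented out",
])
for line_number, label in lint_tex(sample):
    print(f"line {line_number}: {label}")
```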
Tips for math mode
- custom macros: usually passed through, i.e., building a MathJax configuration or extension is necessary
- avoid mixing text and math mode, e.g.,
- subequations, intertext etc
- (complex) text mode inside math mode
- parboxes etc inside math mode
- avoid large tables in math mode (consider tabular)
- avoid long inline equations (no linebreaks)
- auto aligned environments can surprise
- e.g., multline (without max-width on container)
- auto-numbering support varies
- punctuation near math
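For the custom macros point above, a sketch of how a MathJax macro configuration could be generated from a preamble. It assumes simple `\newcommand` or `\renewcommand` definitions without optional default arguments (those are skipped), and the function names are made up for this example.

```python
import json
import re

# \newcommand{\name}[n]{  or  \newcommand\name{  (also \renewcommand, starred)
NEWCOMMAND = re.compile(
    r"\\(?:re)?newcommand\*?\s*\{?\\([A-Za-z]+)\}?\s*(?:\[(\d)\])?\s*\{"
)


def read_braced(text, start):
    """Return the content of the brace group whose opening brace ends at `start`."""
    depth, pos = 1, start
    while pos < len(text) and depth:
        char = text[pos]
        if char == "\\":
            pos += 1  # skip the character after a backslash (e.g. \{ or \})
        elif char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
        pos += 1
    return text[start:pos - 1]


def extract_macros(preamble):
    """Map macro names to MathJax entries: a body, or [body, argument count]."""
    macros = {}
    for match in NEWCOMMAND.finditer(preamble):
        name, arg_count = match.group(1), match.group(2)
        body = read_braced(preamble, match.end())
        macros[name] = [body, int(arg_count)] if arg_count else body
    return macros


def mathjax_config(macros):
    """Serialize the macros as a MathJax v3 style configuration object."""
    return "window.MathJax = " + json.dumps({"tex": {"macros": macros}}, indent=2) + ";"


preamble = r"""
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
"""
print(mathjax_config(extract_macros(preamble)))
```

The printed snippet would go into a script element loaded before MathJax itself. Macros that do TeX programming cannot be passed through this way and need rewriting or a proper extension.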
Authoring for the Web
- Stop thinking "print only"
- You are no longer bound to a paper size
- Spacing should be flexible, colors can change
- Reflow vs Pagination
- Do not assume pages for referencing, positioning, etc.
- Graphics, tables, images might be in different positions
- Size really matters
- Content can be viewed on various form factors
- Make sure it displays fine even on extreme zoom
- Remember: There might be more than one output format
Note: That does not mean we want you to change your authoring workflow!
Other Formats
Source to Web is easy
What about other formats?
- Hard Sources
- Different Targets
Taxonomy of Sources
- Retro-digitized
- Printed content that is put into digital format
- Scanned, images, electronic documents with scanned images
- Born Digital
- Document has been generated from some electronic source
- However the sources are not available to us
- Born Accessible
- Documents that have been generated with accessibility in mind
- Or at least accessibility is not precluded
- E.g., we can get to the sources
Retro-digitized
- Documents which originally are only available in print
- Retro-digitization includes
Retro-digitized Documents
Scanned versions of
- historical manuscripts
- old books and runs of journals
- Sources: libraries, JSTOR, publishers
Optical Character Recognition (OCR)
- From images to print
- Simple programs often come with the scanner
Improving results
- From automation to manual transcription
- Correction and proofreading
- Crowdsourcing with projects like Zooniverse
What about Math
Math OCR is notoriously difficult
- Math detection
- Layout Analysis
- No dictionary analysis
- Multiple Fonts
- Print/Font Variation in particular for legacy documents
- ...
OCR Systems
- General OCR systems are usually poor at Math
- Specialist OCR systems like Infty
Snapshotting
Snapshotting as in "Making an isolated observation"
- Crop a formula from an image
- OCR often works more reliably
- it's known to be math
- no noisy context
- little need for layout analysis
- Examples:
Born Digital
Documents where source and target are electronic
- compiled for print only
- sources are no longer available
Document types
- Now primarily PDF documents
- Other formats: Rich Text Format (.rtf), PostScript (.ps)
- They can generally be losslessly converted to PDF
PDF as only source
Source: Nothing but inaccessible PDF
- Text only is relatively accessible
- Read-aloud in Acrobat Reader
- Works with multiple screen readers
- Many tools for pure text extraction
- Common problems:
- Poor reading order
- Missing alt texts for images
- Headers, footers, etc. can interfere
Math in Born Digital PDF
- Surprisingly difficult problem
- Requires: pattern recognition or OCR
- Consider our example: original PDF
- Problems with fonts
- Incomplete position and box information for layout analysis
- Some math is in images
Some solutions
- Maxtract system (2008)
- Current Ravi Project at IIT:
- Full extraction for client side rendering
- Comprehensive approach
- Reuses the good part of Maxtract
- Infty system has implemented some of the Maxtract ideas
- Akio Fujiyoshi's Lab has some nice solutions for the bounding box problem
Outlook
- Born Accessible Documents
- Diagrams
- Sonification
- Advanced content