cross reference whenever, wherever
- links are at the core of the world wide web
- provides usability internally and externally (deep links)
use hyperref's \autoref or cleveref
- better links (more hit area, better link text, simpler maintenance)
avoid manual ref ranges \ref{1} -- \ref{8}
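
A minimal sketch of the advice above, assuming hyperref and cleveref are both available and loaded in that order (cleveref expects to come after hyperref and amsmath). The label names are placeholders.

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{hyperref}  % load before cleveref
\usepackage{cleveref}

\begin{document}

\section{Results}\label{sec:results}

\begin{equation}\label{eq:first}
  a = b
\end{equation}
\begin{equation}\label{eq:second}
  b = c
\end{equation}
\begin{equation}\label{eq:last}
  a = c
\end{equation}

% The type name and the number are generated together and both are linked,
% so there is no hand-written "Section~" to keep in sync.
See \autoref{sec:results} or, with cleveref, \cref{sec:results}.

% One command carries the whole range instead of \ref{...} -- \ref{...}.
\Cref{eq:first,eq:second} imply \cref{eq:last}; see \crefrange{eq:first}{eq:last}.

\end{document}
```

Because the range is a single command, a converter sees both endpoints and can link each of them, which a hand-typed dash between two \ref calls does not guarantee.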


Footnotes

Avoid footnotes.
But they are actually well supported.
back navigation is a must.
appropriate ARIA markup might be supported


Indices, Glossaries, Appendices

Indices are rarely seen in web documents - search and links are preferred (and often easier)
Glossaries are useful but often "outsourced" (e.g., Wikipedia, Scholarpedia)
Appendices generally work like any other sectioning but counters might vary.

DPUB ARIA markup for appendices is usually not supported by converters (but this has little impact in practice).


Theorem environments

TeX converters don't treat them well (just creating wrapping divs); post-processing might be feasible
what are they really? Landmarks with identifiable titles
- we lack a dedicated native solution but can make do
- e.g., section+heading, figure+figcaption, possibly with a roledescription
don't nest, don't abuse
- amsthm advice for proofs: short = theorem, long = section
- e.g., don't do subclaims in proofs as theorems just because you like the styling
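
A minimal sketch of the "don't nest, don't abuse" advice, assuming amsthm. The environment names and labels are illustrative, and the short and long proof are shown side by side only to contrast the two options.

```latex
\documentclass{article}
\usepackage{amsthm}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}

\begin{document}

\begin{theorem}[Optional title]\label{thm:main}
  Statement of the theorem.
\end{theorem}

% Option 1, short proof: a proof environment directly after the statement.
\begin{proof}
  We first claim that \dots\ (the subclaim stays ordinary prose,
  not a nested theorem-like environment).
  The rest of the argument follows.
\end{proof}

% Option 2, long proof: its own section, using standard sectioning.
\section{Proof of the main theorem}\label{sec:proof-main}
We prove Theorem~\ref{thm:main} in several steps.

\end{document}
```

Keeping the environments flat means each one can map to a single landmark with one identifiable title in the converted output.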


Algorithm layout

Support in converters is quite limited
semantics are usually poor
raw source code is sensible if kept separate (e.g., included raw in the HTML and highlighted with prism.js)
algorithmicx, minted ok
algorithm2e not great
- e.g., with tex4ht (and manually adding white-space: pre) it's barely passable
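
For reference, a minimal algorithmicx sketch using its algpseudocode layout. The algorithm package for the float wrapper is an assumption here, and how well a given converter handles either package still needs checking.

```latex
\documentclass{article}
\usepackage{algorithm}      % floating wrapper with caption and label (assumed)
\usepackage{algpseudocode}  % pseudocode layout from algorithmicx

\begin{document}

\begin{algorithm}
  \caption{Euclid's algorithm}\label{alg:euclid}
  \begin{algorithmic}[1]  % [1] numbers every line
    \Procedure{Euclid}{$a, b$}
      \While{$b \neq 0$}
        \State $r \gets a \bmod b$
        \State $a \gets b$
        \State $b \gets r$
      \EndWhile
      \State \Return $a$
    \EndProcedure
  \end{algorithmic}
\end{algorithm}

\end{document}
```

The block structure is expressed through commands (\While, \EndWhile) rather than manual indentation, so it does not depend on preserved whitespace.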


Diagram authoring

We'll start with some of those tomorrow!


Considerations for LaTeX Authoring


TeX tips: what to do

Do what good LaTeX authors would do

pick a style guide and run with it, e.g., Wikibook LaTeX, LaTeX 2e Unofficial Manual, AMS Style Guide
standard document setup
standard sectioning
standard lists, figures, tables
use labels and cross-reference them (\ref, \cite etc)
- \eqref can help with pass-through
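
A minimal sketch of what "standard" can look like in practice. The title, labels, and content are placeholders, and only amsmath is assumed beyond the base article class.

```latex
\documentclass{article}
\usepackage{amsmath}

\title{A Standard Document}
\author{A. Author}

\begin{document}
\maketitle

\section{Introduction}\label{sec:intro}

% Standard list: default counters and default item labels.
\begin{enumerate}
  \item First item
  \item Second item
\end{enumerate}

% Standard table float with a caption and a label.
\begin{table}
  \caption{A simple table}\label{tab:simple}
  \begin{tabular}{lr}
    Name & Value \\
    $x$  & 1 \\
  \end{tabular}
\end{table}

\begin{equation}\label{eq:sum}
  1 + 1 = 2
\end{equation}

% Every reference goes through a label, never a hard-coded number.
Table~\ref{tab:simple} and equation~\eqref{eq:sum} are discussed
in Section~\ref{sec:intro}.

\end{document}
```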


TeX tips: what to avoid 1

plain TeX - LaTeX only please!
TeX programming (changing core macros, active characters, loops etc)
LaTeX taboos
box hackery, e.g., parbox, minipage, pbox, fbox, raisebox, rotatebox
using page dimensions (\textwidth etc)
using real-world dimensions (in, mm, cm etc)
manual spacing (\!, hspace, vspace, hfill)
manual positioning (cover pages, pgf drawing across page)
custom fonts


TeX tips: what to avoid 2

color
rules-based constructs
generated pictures, pstricks, tikz
- but standalone conversion can work
custom counters (e.g., lists)
custom items (\item[...])
hidden crossrefs (\ref{1}--\ref{4} etc.)
hardcoded linebreaks
hardcoded line lengths
hardcoded whitespace
avoid hacking layout using TeX (a little CSS goes a long way)
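
On the standalone route for generated pictures, a minimal sketch, assuming the standalone class with its tikz option. The file name diagram.tex is hypothetical, and the conversion of the compiled result to a web image format such as SVG is left to a tool of your choice.

```latex
% diagram.tex (hypothetical file name): compiled on its own,
% separately from the main document.
\documentclass[tikz]{standalone}  % the tikz option loads TikZ and crops to the picture

\begin{document}
\begin{tikzpicture}
  \draw (0,0) -- (2,0) -- (1,1.5) -- cycle;
  \node at (1,0.5) {$T$};
\end{tikzpicture}
\end{document}
```

The main document then only needs to include the resulting image, so the converter never has to interpret the drawing commands.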


Tips for math mode

custom macros: usually passed through, i.e., a MathJax configuration or extension is necessary
- example: sty file with mathjax extension
avoid mixing text and math mode, e.g.,
- subequations, intertext etc
- (complex) text mode inside math mode
- parboxes etc inside math mode
avoid large tables in math mode (consider tabular)
avoid long inline equations (no linebreaks)
auto aligned environments can surprise
- e.g., multline (without max-width on container)
auto-numbering support varies
punctuation near math
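
A minimal sketch of math that tends to pass through cleanly, assuming amsmath and amssymb. The macro names \R and \norm are illustrative; simple \newcommand definitions like these are also understood by MathJax's TeX input, which makes them straightforward to mirror in a configuration.

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{amssymb}  % provides \mathbb

% Simple macros without TeX programming, so they are easy to replicate
% on the MathJax side.
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\lVert #1 \rVert}

\begin{document}

% Explanatory text stays in text mode around the display,
% instead of \intertext or complex \text{...} inside the math.
For every $x \in \R^n$ we have
\begin{equation}\label{eq:norm}
  \norm{x} \ge 0
\end{equation}
with equality only for $x = 0$.

\end{document}
```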


Authoring for the Web

Stop thinking "print only"
- You are no longer bound to a paper size
- Spacing should be flexible, colors can change
Reflow vs Pagination
- Do not assume pages for referencing, positioning, etc.
- Graphics, tables, images might be in different positions
Size really matters
- Content can be viewed on various form factors
- Make sure it displays fine even on extreme zoom
Remember: There might be more than one output format

Note: That does not mean we want you to change your authoring workflow!


Intermission


Other Formats

Source to Web is easy

What about other formats?

Hard Sources
Different Targets


Taxonomy of Sources

Retro-digitized
- Printed content that is put into digital format
- Scanned, images, electronic documents with scanned images
Born Digital
- Document has been generated from some electronic source
- However the sources are not available to us
Born Accessible
- Documents that have been generated with accessibility in mind
- Or at least accessibility is not precluded
- E.g., we can get to the sources


Retro-digitized

Documents which originally are only available in print
Retro-digitization includes
- Scanning
- OCR
- Correcting


Retro-digitized Documents

Scanned version of

historical manuscripts
old books and runs of journals
Sources: libraries, JSTOR, publishers

Optical Character Recognition (OCR)

From images to text
Simple programs often come with scanner

Improving results

From automation to manual transcription
Correction and proofreading
Crowdsourcing with projects like Zooniverse


What about Math

Math OCR is notoriously difficult

Math detection
Layout Analysis
No dictionary analysis
Multiple Fonts
Print/Font Variation in particular for legacy documents
...


OCR Systems

General OCR systems are usually poor at Math
- ABBYY FineReader: Proprietary, has some special fonts
- Tesseract: Open source. Example, Result
- GOCR: Open source. Example, Result
Specialist OCR systems like Infty
- Full document analysis
- Proprietary
- Windows only


Snapshotting

Snapshotting as in "Making an isolated observation"

Crop a formula from an image
OCR often works more reliably
- it's known to be math
- no noisy context
- little need for layout analysis
Examples:
- Mathpix Snip
- EquatIO


Born Digital

Documents where source and target are electronic

compiled for print only
sources are no longer available

Document types

Now primarily PDF documents
Other formats: Rich Text Format (.rtf), PostScript (.ps)
They can generally be losslessly converted to PDF


PDF as only source

Source: Nothing but inaccessible PDF

Text only is relatively accessible
Read-aloud in Acrobat Reader
Works with multiple screen readers
Many tools for pure text extraction
- Even just copying works
Common problems:
- Poor reading order
- Missing alt texts for images
- Headers, footers, etc. can interfere


Math in Born Digital PDF

Surprisingly difficult problem
Requires: pattern recognition or OCR
Consider our example: original PDF
- Problems with fonts
- Incomplete position and box information for layout analysis
- Some math is in images


Some solutions

Maxtract system (2008)
- Full content extraction
- Grammatical approach to PDF reconstruction
- Very limited in terms of PDF versions it could handle
- Single column, result, two columns, result
Current Ravi Project at IIT:
- Full extraction for client side rendering
- Comprehensive approach
- Reuses the good part of Maxtract
Infty system has implemented some of the Maxtract ideas
Akio Fujiyoshi's Lab has some nice solutions for the bounding box problem


Outlook

Born Accessible Documents
Diagrams
Sonification
Advanced content


Document Accessibility Conversion wrap-up and Other Formats
