Text Encoding and Decoding

This module provides utilities to encode and decode text in various formats.

Base64

These provide text conversions according to RFC XXX.

Base64 coding support notes

Support for urlsafe coding

In this mode, #\- is used instead of #\+ and #\_ is used instead of #\/. For this coding to work, padding should be turned off during encoding (so that there are no #\= characters in the output). This is also known as the "Base64 with URL and Filename Safe Alphabet", or base64url, of RFC 4648.

Support for presence and absence of end padding

Both the older padded encoding and the slightly smaller unpadded form are supported. The unpadded form is gaining popularity and is accepted by decoders such as those of ECMAScript 5 and the Apache Commons Java library. We imitate this behavior, while also offering the option of not emitting padding on encoding and the option of requiring padding on decoding.

The requirement for padding in Base64 was motivated by the scarcity of computing resources in the 1980s. Today there is a move toward not requiring it, as reflected for instance in the ECMAScript 5 spec, which produces padding on encode but does not require it on decode.

To use the bindings from this module:

(import :std/text/base64)

base64-string->u8vector

(base64-string->u8vector str [nopadding-ok?: #t] [urlsafe?: #f]) -> u8vector

  str           := string to convert
  nopadding-ok? := boolean; when #t, input without end padding is accepted
  urlsafe?      := boolean to enable urlsafe coding

Returns a newly allocated u8vector containing the bytes decoded from the Base64 string str. Optional keyword arguments control how the conversion is done. If nopadding-ok? is #t (default), input without end padding is accepted. If urlsafe? is #t, the input is decoded using the URL-safe alphabet as specified in RFC XXX ...
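As a minimal sketch of decoding, the snippet below decodes the same payload in its padded and unpadded forms. It assumes the default arguments shown in the signature above (so unpadded input is accepted); the variable names are only illustrative.

```scheme
(import :std/text/base64)

;; "aGVsbG8=" is the padded Base64 form of the ASCII bytes of "hello".
(def padded (base64-string->u8vector "aGVsbG8="))

;; The same payload without the trailing "=". This relies on
;; nopadding-ok? defaulting to #t, as stated in the signature above.
(def unpadded (base64-string->u8vector "aGVsbG8"))

(displayln padded)                    ; the bytes 104 101 108 108 111
(displayln (equal? padded unpadded))  ; both forms decode to the same bytes
```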

base64-substring->u8vector

(base64-substring->u8vector str start end [nopadding-ok?: #t] [urlsafe?: #f]) -> u8vector

  str           := string of base64 data
  start         := exact integer starting index
  end           := exact integer ending index
  nopadding-ok? := boolean; when #t, input without end padding is accepted
  urlsafe?      := boolean to enable urlsafe coding

Returns a newly allocated u8vector containing the bytes decoded from the Base64 data in str between start and end. Otherwise behaves like base64-string->u8vector.

u8vector->base64-string

(u8vector->base64-string u8vect [width: 0] [padding?: #t] [urlsafe?: #f]) -> string

  u8vect   := u8vector to convert
  width    := exact integer for padding width
  padding? := boolean to enable padding
  urlsafe? := boolean to enable urlsafe coding

Returns a newly allocated string containing the bytes of u8vect, Base64 encoded in left-to-right order.
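The following sketch shows how the padding and urlsafe options change the output. It assumes the keyword arguments are passed exactly as written in the signature above (padding?: and urlsafe?:); the byte values were chosen so that the standard alphabet produces both #\+ and #\/.

```scheme
(import :std/text/base64)

;; ASCII bytes of "hello"
(def hello (u8vector 104 101 108 108 111))

(displayln (u8vector->base64-string hello))               ; aGVsbG8=
(displayln (u8vector->base64-string hello padding?: #f))  ; aGVsbG8

;; #xfb #xff encodes to the characters + and / in the standard alphabet,
;; which become - and _ in the URL-safe alphabet.
(def tricky (u8vector #xfb #xff))

(displayln (u8vector->base64-string tricky))                           ; +/8=
(displayln (u8vector->base64-string tricky padding?: #f urlsafe?: #t)) ; -_8
```

Padding is turned off together with urlsafe?, following the note in the urlsafe section that padding should be disabled for that coding.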

subu8vector->base64-string

(subu8vector->base64-string u8vect start end [width: 0] [padding?: #t] [urlsafe?: #f]) -> string

  u8vect   := u8vector to encode data from
  start    := exact integer starting index
  end      := exact integer end index
  width    := exact integer for padding width
  padding? := boolean to enable padding
  urlsafe? := boolean to enable urlsafe coding

Returns a newly allocated string containing the bytes of u8vect from start to end, Base64 encoded in left-to-right order.

base64-decode

(base64-decode str [nopadding-ok?: #t] [urlsafe?: #f]) -> u8vector

This is an alias for base64-string->u8vector. See its definition for details.

base64-decode-substring

(base64-decode-substring str start end [nopadding-ok?: #t] [urlsafe?: #f]) -> u8vector

This is an alias for base64-substring->u8vector. See its definition for details.

base64-encode

(base64-encode u8vect [width: 0] [padding?: #t] [urlsafe?: #f]) -> string

This is an alias for u8vector->base64-string. See its definition for details.

base64-encode-subu8vector

(base64-encode-subu8vector u8vect start end [width: 0] [padding?: #t] [urlsafe?: #f]) -> string

This is an alias for subu8vector->base64-string. See its definition for details.

Base58

The :std/text/base58 library provides encoding and decoding to base58.

To use the bindings from this module:

(import :std/text/base58)

base58-encode

(base58-encode bytes [alphabet = base58-btc-alphabet]) -> string

  bytes    := u8vector
  alphabet := optional encoding alphabet

Base58 encodes a u8vector, using the given alphabet.

base58-decode

(base58-decode str [alphabet = base58-btc-alphabet]) -> u8vector | error

  str      := string; base58 encoded
  alphabet := optional decoding alphabet

Base58 decodes a string, using the given alphabet. Signals an error on invalid characters.
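A small round-trip sketch, assuming the optional alphabet is passed as a second positional argument as the signatures above suggest. The payload bytes are arbitrary illustration values.

```scheme
(import :std/text/base58)

;; Arbitrary example payload
(def payload (u8vector 1 2 253 254 255))

;; Encode with the default (Bitcoin) alphabet, then decode again.
(def encoded (base58-encode payload))
(def decoded (base58-decode encoded))

(displayln encoded)
(displayln (equal? decoded payload)) ; the round trip should preserve the bytes

;; The same round trip with the Flickr alphabet; both directions
;; must use the same alphabet.
(def flickr (base58-encode payload base58-flickr-alphabet))
(displayln (equal? (base58-decode flickr base58-flickr-alphabet) payload))
```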

base58-btc-alphabet

(def base58-btc-alphabet)

The base58 encoding alphabet used by Bitcoin.

base58-flickr-alphabet

(def base58-flickr-alphabet)

The base58 encoding alphabet used by Flickr.

CSV

CSV parser and unparser.

To use the bindings from this module:

(import :std/text/csv)

Overview

The parser is configurable through parameters to fit whichever CSV options your files use, defaulting to the "standard" from the creativyst specification. The RFC4180 settings are just a call-with- function call away.

The parameters are: csv-separator, csv-quote, csv-unquoted-quotequote?, csv-loose-quote?, csv-eol, csv-line-endings, csv-skip-whitespace?, csv-allow-binary?.

The functions that locally set the parameters to known values are call-with-creativyst-csv-syntax, call-with-rfc4180-csv-syntax, and call-with-strict-rfc4180-csv-syntax.

The parsing and unparsing functions are read-csv-line, read-csv-lines, read-csv-file, write-csv-line, and write-csv-lines.

read-csv-line

(read-csv-line port) -> list | error

  port := input port

Read one line from port in CSV format, using the current syntax parameters. Return a list of strings, one for each field in the line. Entries are read as strings; it is up to you to interpret the strings as whatever you want. Signals an error on malformed CSV entries.

read-csv-lines

(read-csv-lines port) -> list | error

  port := input port

Read lines from port in CSV format, using the current syntax parameters. Return a list of lists of strings, one entry for each line, each containing one entry for each field. Entries are read as strings; it is up to you to interpret the strings as whatever you want. Signals an error on malformed CSV entries.

Examples

> (let ((csv-data "id;city;population\n01;Foobar;1200000\n02;Barville;250001\n03;Baz;21\n"))
    (read-csv-lines (open-input-string csv-data)))
(("id;city;population") ("01;Foobar;1200000") ("02;Barville;250001") ("03;Baz;21"))
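With the default separator (a comma), semicolon-delimited data is read as a single field per line. The sketch below rebinds csv-separator for the duration of the call so that the same data is split into fields. It assumes csv-separator is an ordinary parameter object that accepts a character and can be rebound with parameterize.

```scheme
(import :std/text/csv)

(def csv-data
  "id;city;population\n01;Foobar;1200000\n02;Barville;250001\n03;Baz;21\n")

;; Rebind the separator only while parsing this string.
(def rows
  (parameterize ((csv-separator #\;))
    (read-csv-lines (open-input-string csv-data))))

;; Each row is now a list of field strings,
;; e.g. the first row holds "id", "city" and "population".
(for-each (lambda (row) (write row) (newline)) rows)
```

Fields are still strings; converting "1200000" to a number is left to the caller, for example with string->number.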

read-csv-file

(read-csv-file path [settings]) -> list | error

  path := file path to csv file as string
  settings := optional settings configuring csv parsing

Open the file designated by the path, using the provided settings if any, and call read-csv-lines on it.

write-csv-line

(write-csv-line fields port) -> void

  fields := list of strings
  port   := output port

Format one line of fields to port in CSV format, using the current syntax parameters.

write-csv-lines

(write-csv-lines lines port) -> void

  lines := list of lists of strings
  port  := output port

Given a list of lines, each of them a list of fields, and a port, format those lines as CSV according to the current syntax parameters.
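A short sketch of writing rows to a string port and reading them back, using only the default syntax parameters. The row data is made up for illustration; one field contains a comma to show that it survives the round trip.

```scheme
(import :std/text/csv)

;; Illustrative rows; the last field contains the separator character.
(def rows '(("id" "city") ("01" "Foobar") ("02" "Bar, ville")))

;; Collect the CSV text in a string port.
(def out (open-output-string))
(write-csv-lines rows out)
(def text (get-output-string out))

(display text)

;; Reading the text back with the same parameters should give the rows again.
(def reread (read-csv-lines (open-input-string text)))
(displayln (equal? reread rows))
```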

call-with-creativyst-csv-syntax

(call-with-creativyst-csv-syntax thunk) -> any

  thunk := procedure without parameters

Sets the CSV syntax parameters to match the creativyst rules, calls the procedure thunk, and returns its value.

call-with-rfc4180-csv-syntax

(call-with-rfc4180-csv-syntax thunk) -> any

  thunk := procedure without parameters

Sets the CSV syntax parameters to match the RFC4180 rules, calls the procedure thunk, and returns its value.

call-with-strict-rfc4180-csv-syntax

(call-with-strict-rfc4180-csv-syntax thunk) -> any

  thunk := procedure without parameters

Sets the CSV syntax parameters to match the strict RFC4180 rules, calls the procedure thunk, and returns its value.
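The call-with- functions are used by wrapping the parsing or unparsing call in a thunk. Here is a minimal sketch with a hypothetical helper, read-rfc4180-string, which is not part of the library:

```scheme
(import :std/text/csv)

;; Hypothetical helper: parse a string under the RFC4180 settings.
(def (read-rfc4180-string str)
  (call-with-rfc4180-csv-syntax
    (lambda ()
      (read-csv-lines (open-input-string str)))))

;; RFC4180 uses comma separators and CRLF line endings.
(def parsed (read-rfc4180-string "a,b\r\n1,2\r\n"))

;; Each row is a list of field strings.
(for-each (lambda (row) (write row) (newline)) parsed)
```

The settings only apply while the thunk runs; outside of it the previous parameter values are in effect again.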

csv-separator

csv-separator

Separator used between CSV fields, defaults to ,.

csv-quote

csv-quote

Delimiter of string data. As in Pascal, a quote inside a quoted string is written by doubling it. Defaults to ".

csv-unquoted-quotequote?

csv-unquoted-quotequote?

Boolean controlling whether a pair of quotes outside of a quoted field represents a single quote. Microsoft and the RFC say #f; csv.3tcl says #t. Defaults to #f.

csv-loose-quote?

csv-loose-quote?

Boolean controlling whether quotes can appear anywhere in a field. Defaults to #f.

csv-eol

csv-eol

Line ending when exporting CSV. Defaults to [+crlf+ +lf+].

csv-line-endings

csv-line-endings

Acceptable line endings when importing CSV. Defaults to [+cr+ +lf+ +crlf+].

csv-skip-whitespace?

csv-skip-whitespace?

Boolean controlling whether unquoted whitespace around separators is skipped. Defaults to #t.

csv-allow-binary?

csv-allow-binary?

Boolean controlling whether non-ASCII data is accepted. Defaults to #t.

Hex

To use the bindings from this module:

(import :std/text/hex)

hex-encode

(hex-encode bytes [start = 0] [end = #f]) -> string

  bytes := u8vector to convert
  start := exact integer as start index
  end   := exact integer as end index | #f

Returns a newly allocated string containing the hex encoding of the given bytes. The optional start gives the index at which encoding starts, and end gives the index at which it stops. Passing #f as end means encoding to the end of the byte vector.

hexlify

(defalias hexlify hex-encode)

Short for hex-encode.

hex-decode

(hex-decode str) -> u8vector | error

  str := string to decode

Returns newly allocated u8vector with contents set to hex decoded str.
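A brief round-trip sketch for hex-encode and hex-decode. The optional start and end are assumed to be positional, as in the hex-encode signature above; the bytes are arbitrary.

```scheme
(import :std/text/hex)

;; The bytes #xde #xad #xbe #xef
(def bytes (u8vector 222 173 190 239))

(displayln (hex-encode bytes))       ; deadbeef
(displayln (hex-encode bytes 1 3))   ; adbe (bytes at index 1 and 2 only)

;; Decoding the full string gives the original bytes back.
(def decoded (hex-decode "deadbeef"))
(displayln (equal? decoded bytes))
```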

unhexlify

(defalias unhexlify hex-decode)

Short for hex-decode.

hex

(hex u4) -> character | error

Returns the hex character for the given u4 integer value. Signals an error if the integer value can't be converted to hex.

Examples

> (hex 15)
#\f
> (hex 1)
#\1
> (hex 99)
*** ERROR IN (console)@20.1 -- (Argument 2) Out of range
(string-ref "0123456789abcdef" 99)
1>

unhex

(unhex char) -> exact integer | error

Returns the integer value of the given hex character char. Signals an error if char isn't a valid hex character.

Examples

> (unhex #\f)
15
> (unhex #\3)
3
> (unhex #\i)
*** ERROR IN (console)@12.1 -- Unbound table key
(table-ref '#<table #4> #\i)
1>

unhex*

(unhex* char) -> exact integer | #f

Returns the integer value of the given hex character char. Returns #f if char isn't a valid hex character.

Examples

> (unhex* #\f)
15
> (unhex* #\3)
3
> (unhex* #\i)
#f
>

JSON

To use the bindings from this module:

(import :std/text/json)

read-json

(read-json [port = (current-input-port)]) -> json | error

  port := input port to read JSON data from

Reads and returns a JSON object from the given port. Signals an error if the JSON cannot be parsed.

write-json

(write-json obj [port = (current-output-port)]) -> void | error

  obj  := JSON object
  port := output port to write JSON data to

Writes the JSON object obj to port, which defaults to (current-output-port). Signals an error on failed write.

string->json-object

(string->json-object str) -> json | error

  str := a string of JSON data

Parses the given str and returns a JSON object, or signals an error if parsing fails.

json-object->string

(json-object->string obj) -> string | error

  obj := JSON object

Returns a newly allocated string containing the JSON representation of obj. Signals an error if obj cannot be converted to JSON.

json-symbolic-keys

json-symbolic-keys

Boolean controlling whether decoded hashes have symbols as keys. Defaults to #t.
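The effect of json-symbolic-keys can be sketched as follows. This assumes that JSON objects decode to hash tables and that json-symbolic-keys is a parameter object that can be rebound with parameterize; the JSON text is made up for illustration.

```scheme
(import :std/text/json)

(def text "{\"name\": \"gerbil\", \"stars\": 5}")

;; With the default setting, keys are symbols.
(def by-symbol (string->json-object text))
(displayln (hash-get by-symbol 'name))   ; gerbil

;; With json-symbolic-keys rebound to #f, keys are strings instead.
(def by-string
  (parameterize ((json-symbolic-keys #f))
    (string->json-object text)))
(displayln (hash-get by-string "name"))  ; gerbil

;; Serialize back to a string and parse again.
(def again (string->json-object (json-object->string by-symbol)))
(displayln (hash-get again 'stars))      ; 5
```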

UTF8

Faster UTF8 encoding and decoding.

To use the bindings from this module:

(import :std/text/utf8)

string->utf8

(string->utf8 str [start = 0] [end = (string-length str)]) -> u8vector | error

  str   := string to encode
  start := exact integer for starting index
  end   := exact integer for end index

Returns a newly allocated u8vector containing the UTF-8 encoding of str. The optional start and end limit the operation to a substring of str.

utf8->string

(utf8->string u8v [start = 0] [end = (u8vector-length u8v)]) -> string | error

  u8v   := u8vector of data to convert
  start := exact integer for starting index
  end   := exact integer for ending index

Returns a newly allocated string decoded from the UTF-8 contents of u8v. The optional start and end parameters limit the operation to the sub-vector between the given indexes. Signals an error on decoding issues.
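A minimal round-trip sketch for string->utf8 and utf8->string. The string is built with integer->char so that the example does not depend on how the source file or terminal is decoded; 252 is the code point of "ü".

```scheme
(import :std/text/utf8)

;; The string "über", built explicitly from code points.
(def s (string (integer->char 252) #\b #\e #\r))

;; "ü" takes two bytes in UTF-8 (195 188); the ASCII letters take one each.
(def bytes (string->utf8 s))
(displayln bytes)                         ; the bytes 195 188 98 101 114

;; Decoding gives back an equal string.
(displayln (equal? (utf8->string bytes) s))

;; start and end are character indexes for string->utf8
;; and byte indexes for utf8->string.
(displayln (string->utf8 s 1))            ; the bytes 98 101 114
(displayln (utf8->string bytes 2))        ; ber
```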

utf8-encode

(utf8-encode str start end) -> u8vector

  str   := string to encode
  start := exact integer for starting index
  end   := exact integer for ending index

Returns a newly allocated u8vector containing the UTF-8 encoding of the string str. Optional start and end limit the operation to a substring of str.

utf8-decode

(utf8-decode u8v start end) -> string | error

  u8v   := u8vector of input data
  start := exact integer for starting index
  end   := exact integer for ending index

Decodes the bytes in the byte vector u8v from the start index until the end index and returns the result as a string. Signals an error if the bytes are not valid UTF-8.

string-utf8-length

(string-utf8-length str [start = 0] [end = (string-length str)]) -> integer | error

Returns the number of bytes needed to encode the string str as UTF-8. The optional start and end indexes limit the operation to a substring. Signals an error if str isn't a string.

Examples

> (import :std/format)
> (import :std/text/utf8)

> (let ((s  "uber")
        (us "über"))
    (printf "s length: ~a\n" (string-length s))
    (printf "u length: ~a\n" (string-length us))
    (newline)
    (printf "s utf8-length: ~a\n" (string-utf8-length s))
    (printf "u utf8-length: ~a\n" (string-utf8-length us)))
s length: 4
u length: 4

s utf8-length: 4
u utf8-length: 5

YAML

YAML parsing and dumping. This module requires that Gerbil Scheme is compiled against libyaml.

To use the bindings from this module:

(import :std/text/yaml)

yaml-load

(yaml-load filename) -> any | error

  filename := string

Loads YAML data from the given filename. Signals an error if the YAML cannot be parsed.

yaml-load-string

(yaml-load-string str) -> any | error

  str := string of YAML data

Parses YAML data from str. Signals an error if the YAML cannot be parsed.

yaml-dump

(yaml-dump filename . args)

  filename := string

Dumps the arguments to a YAML file.

Zlib

Compression and decompression with zlib.

To use the bindings from this module:

(import :std/text/zlib)

compress

(compress data [level = 6]) -> u8vector | string | port | error

  data  := u8vector, string or input-port
  level := optional integer value, from 0 to 9

Compresses the given data using zlib. The return value varies by the type of data. If given a u8vector, the return value is a newly allocated u8vector with the compressed contents of data. Signals an error on the wrong type of data.

The optional level parameter sets the compression level used. Value of 1 gives best speed, 9 gives best compression, 0 gives no compression at all (the input data is simply copied a block at a time).

compress-gz

(compress-gz data [level = 6]) -> u8vector | string | input-port | error

  data  := u8vector, string, or input-port
  level := optional integer value, from 0 to 9

Compresses data like the compress procedure, but in addition gzip encodes it. Signals an error on the wrong type of data.

uncompress

(uncompress data) -> u8vector | error

  data := u8vector or input-port

Returns the uncompressed bytes from data. Signals an error on the wrong type of data.
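A small round-trip sketch for compress and uncompress on a u8vector, assuming the optional level is passed as a second positional argument as in the signature above. The input is a deliberately repetitive buffer so that it compresses well.

```scheme
(import :std/text/zlib)

;; 1000 copies of the byte 97 (ASCII "a")
(def data (make-u8vector 1000 97))

;; Default compression level, then the best-compression level.
(def packed (compress data))
(def packed-max (compress data 9))

(displayln (u8vector-length packed))  ; far fewer than 1000 bytes for this input

;; Uncompressing restores the original bytes in both cases.
(displayln (equal? (uncompress packed) data))
(displayln (equal? (uncompress packed-max) data))
```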