art with code

2008-09-30

I/O in programming languages: open and read

This post is a part of a series where your intrepid host looks at I/O in different programming languages in search of understanding and interesting abstractions.

Part 1: open and read -- you are here
Part 2: writing
Part 3: basic piping
Part 4: piping to processes
Part 5: structs and serialization

Different programming languages have different takes on handling I/O. The simplest way is to wrap kernel syscalls, like this open() call (in assembler):

# define some globals
.equ SYS_OPEN, 5
.equ LINUX_SYSCALL, 0x80
.section .data
filename:
.ascii "my_file\0"

# and a main that calls SYS_OPEN
.section .text
.globl _start
_start:
movq $SYS_OPEN, %rax
movq $filename, %rbx
movq $0, %rcx # open as read-only
movq $0666, %rdx # permissions for the file
int $LINUX_SYSCALL


C has syntax sugar for doing function calls:

#include <fcntl.h>  /* open, O_RDONLY */

int main(int argc, char *argv[])
{
    int fd;
    fd = open("my_file", O_RDONLY);
    return fd < 0; /* non-zero exit status if the open failed */
}


In Python, yet more boilerplate is eliminated:

fd = open("my_file")
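
Note that Python's built-in open() returns a file object rather than a raw descriptor. For comparison with the assembler and C versions, here is a minimal sketch of the same call through the os module, which wraps the syscalls more directly. The scratch file in a temporary directory is my own addition so that the snippet runs anywhere; it stands in for "my_file".

```python
import os
import tempfile

# Scratch file standing in for "my_file" (an assumption for the demo).
path = os.path.join(tempfile.mkdtemp(), "my_file")
with open(path, "wb") as f:
    f.write(b"hello\n")

# os.open wraps open(2) and returns a plain integer file descriptor.
fd = os.open(path, os.O_RDONLY)
print(fd)
os.close(fd)

# The file object from the built-in open() wraps a descriptor as well.
f = open(path)
wrapped = f.fileno()
print(wrapped)
f.close()
```

The rest of the Python examples stick with the file object, since that is what most Python code uses.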


To read from a file, you need a file descriptor and a buffer to read into. The assembler goes:

.equ STDIN, 0
.equ SYS_READ, 3
.equ BUFSIZE, 4096
.section .bss
.lcomm buffer, BUFSIZE
.section .text
.globl _start
_start:
movq $SYS_READ, %rax
movq $STDIN, %rbx # file descriptor
movq $buffer, %rcx # buffer
movq $BUFSIZE, %rdx # max bytes to read
int $0x80


In C, we call read():

#include <unistd.h>  /* read, STDIN_FILENO */

const int bufsize = 4096;

int main(int argc, char *argv[])
{
    char buffer[bufsize];
    read(STDIN_FILENO, buffer, bufsize);
    return 0;
}


In Python, you read by calling the read-method of the file object. The buffer is allocated automatically:

fd = open("my_file")
buffer = fd.read(4096)


In Ruby, the procedure is much the same, except that File.open can also take a block, working like a higher-order function that closes the file when the block exits:

buffer = File.open("my_file"){|f| f.read(4096) }


Liam Clarke notified me that Python 2.6 has a with-keyword for doing the equivalent:

with open("my_file") as f:
    buffer = f.read(4096)

You can use with in Python 2.5 by doing from __future__ import with_statement.

The function passing technique for dealing with files probably comes from Common Lisp's WITH-OPEN-FILE.
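
To make the function passing idea concrete, here is a rough sketch of a WITH-OPEN-FILE-style helper written as a plain higher-order function in Python. The names with_open_file and read_some are made up for this illustration, and the scratch file is only there so the snippet runs as-is.

```python
import os
import tempfile

def with_open_file(path, func, mode="r"):
    """Open path, pass the file object to func, and always close the file."""
    f = open(path, mode)
    try:
        return func(f)
    finally:
        f.close()  # runs whether func returned or raised

# Scratch file standing in for "my_file" (an assumption for the demo).
path = os.path.join(tempfile.mkdtemp(), "my_file")
with_open_file(path, lambda f: f.write("some data"), "w")

seen = []
def read_some(f):
    seen.append(f)  # keep a reference so we can inspect the file afterwards
    return f.read(4096)

buffer = with_open_file(path, read_some)
print(buffer, seen[0].closed)
```

The try/finally is the whole trick: the caller never sees an open file outside the function it passed in.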

In ASM and C, you need to close files explicitly. The with-open-file style in Python and Ruby takes care of closing the file for you. Both Python and Ruby also close file descriptors on GC (along with Perl and probably most other garbage-collected languages), so there's no real need to close a file if you know that you won't run into problems. There are two possible problems: 1) running into the process open file limit, and 2) unflushed writes.
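
The unflushed writes problem is easy to see for yourself. The following is a small sketch, assuming CPython with its default buffering; the scratch path is my own addition. A small write stays in the userspace buffer until the file is flushed or closed, so the size on disk only catches up after close.

```python
import os
import tempfile

# Scratch file standing in for "my_file" (an assumption for the demo).
path = os.path.join(tempfile.mkdtemp(), "my_file")

f = open(path, "w")
f.write("not on disk yet")                 # small write, held in the buffer
size_before_close = os.path.getsize(path)  # typically still 0 at this point
f.close()                                  # close flushes the buffer
size_after_close = os.path.getsize(path)

print(size_before_close, size_after_close)
```

If the process dies between the write and the close, whatever was still in the buffer is lost, which is why relying on the GC to close files you write to is riskier than for files you only read.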

Closing a file in ASM:

.equ SYS_CLOSE, 6
movq $SYS_CLOSE, %rax
movq $my_fd, %rbx
int $0x80


The C version is shorter:

close(fd);


Python is object-oriented here:

fd.close()


As is Ruby:

fd.close


OCaml's standard library can't make up its mind on how to deal with files: it has IO channels in Pervasives and C-style file descriptors in the Unix library. The Unix way to read from a file is:

let buffer_size = 4096

let read_data =
  let fd = Unix.openfile "my_file" [Unix.O_RDONLY] 0o666 in
  let buffer = String.create buffer_size in
  let read_count = Unix.read fd buffer 0 buffer_size in
  Unix.close fd;
  String.sub buffer 0 read_count

Or with the quirky input channels:

let data =
  let rec my_read ic buf count total =
    if count = total then String.sub buf 0 count
    else
      let rb = input ic buf count (total-count) in
      match rb with
      | 0 -> String.sub buf 0 count
      | n -> my_read ic buf (n+count) total in
  let ic = open_in "my_file" in
  let buffer = String.create buffer_size in
  let data = my_read ic buffer 0 buffer_size in
  close_in ic;
  data

Prelude.ml uses a WITH-OPEN-FILE variant:

let data = withFile "my_file" (read 4096)


SML's IMPERATIVE_IO interface makes things straightforward:

val data =
let
val input = BinIO.openIn "my_file"
val data = BinIO.inputN (input, 4096)
val () = BinIO.closeIn input
in
data
end

By replacing BinIO by TextIO, you can read in Strings and Chars instead of Word8Vectors and Word8s. You could also write a with-open-file without much trouble, see Part 2 for an example.

SML also has a STREAM_IO interface for doing functional reading from input streams, sort of like a lazy list. Conceptually it lies somewhere inside the IMPERATIVE_IO - Clean - Haskell -triangle. I'll try to add an example to Part 3 when I manage.

Haskell actually does treat file contents as lazy lists, and wraps them inside the IO monad to isolate file access from side-effect-free code:

import System.IO

main = do
  fd <- openFile "my_file" ReadMode
  contents <- hGetContents fd
  let buffer = take 4096 contents
  putStr buffer -- a do block has to end in an expression; this also forces the read


Manually closing files in Haskell is error-prone because hClose takes effect immediately while reads through hGetContents are lazy, so it's possible to close a file before a preceding read is evaluated. For example, the following code prints only an empty line:

import System.IO

main = do
  fd <- openFile "my_file" ReadMode
  contents <- hGetContents fd
  let buffer = take 4096 contents -- read up to 4096 bytes into buffer when buffer is first used
  hClose fd -- close fd
  putStrLn buffer -- oops

The buffer is empty because fd is closed before buffer is needed by putStrLn, hence nothing has been yet read into buffer, resulting in putStrLn trying to read a closed file. We can fix this either by strictness annotations, a WITH-OPEN-FILE or by using readFile.

Strictness annotations:

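let buffer = take 4096 contents
let !n = length buffer -- needs {-# LANGUAGE BangPatterns #-}; taking the length forces all of buffer, a bare !buffer would only force its first cons cell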

WITH-OPEN-FILE:

import Control.Exception (bracket)
import System.IO

main =
  bracket (openFile "my_file" ReadMode) hClose
    (\fd -> do contents <- hGetContents fd
               let buffer = take 4096 contents
               putStrLn buffer)

readFile:

main = do
  contents <- readFile "my_file"
  let buffer = take 4096 contents
  putStrLn buffer
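
The close-before-read pitfall above is not unique to Haskell; any deferred read can hit it. As a loose analogy (not a translation of the Haskell code), here is a sketch with a Python generator, which also does no I/O until someone iterates it. The helper name lazy_lines and the scratch file are assumptions for the demo. The difference is that Python raises an error where Haskell quietly hands back an empty string.

```python
import os
import tempfile

# Scratch file standing in for "my_file" (an assumption for the demo).
path = os.path.join(tempfile.mkdtemp(), "my_file")
with open(path, "w") as f:
    f.write("line one\nline two\n")

def lazy_lines(f):
    """Generator: nothing is read until the caller starts iterating."""
    for line in f:
        yield line

f = open(path)
lines = lazy_lines(f)  # no I/O has happened yet
f.close()              # closed before the deferred read runs

try:
    result = list(lines)   # the read is finally attempted here
except ValueError as e:    # I/O operation on closed file
    result = "error: %s" % e

print(result)
```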


Using readFile as a segue, let's see how to read the whole file into a memory buffer in the other languages. A simple C version would stat the file to get its size, allocate a buffer and call read a single time. Let's use the FILE functions this time for the heck of it:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;
    char *contents;
    FILE *fd;
    size_t read_sz;

    if (0 != stat("my_file", &st))
        exit(1);

    contents = (char*)malloc(st.st_size);
    if (NULL == contents) exit(3);

    fd = fopen("my_file", "r");
    if (NULL == fd) exit(2);

    /* read st.st_size 1-byte elements into contents from fd */
    read_sz = fread(contents, 1, st.st_size, fd);
    fclose(fd);

    if (read_sz != (size_t)st.st_size)
        exit(4);

    /* do stuff */
    free(contents);
    return 0;
}


Python gets rid of a good deal of the code:

fd = open("my_file")
contents = fd.read()
fd.close()


Ruby has a convenience function for this:

contents = File.read("my_file")


With the OCaml Unix library, it's close to the C version:

let () =
  let sz = (Unix.stat "my_file").Unix.st_size in
  let contents = String.create sz in
  let fd = Unix.openfile "my_file" [Unix.O_RDONLY] 0o666 in
  let rb = Unix.read fd contents 0 sz in
  if rb <> sz then failwith "read wrong amount of bytes";
  Unix.close fd
  (* do stuff *)


The OCaml input channels version is like the Unix version but reads into a Buffer and calls Buffer.contents once input returns 0, which signals the end of the file. In principle, the Unix version and the C version should loop like this as well, since a single read can return fewer bytes than requested; the C version would call feof to see if it is at the end of the file.

let buf_sz = 256

let () =
  let rec my_read_all ic buf res =
    let br = input ic buf 0 buf_sz in
    if br = 0 then Buffer.contents res
    else
      let () = Buffer.add_substring res buf 0 br in
      my_read_all ic buf res in
  let ic = open_in "my_file" in
  let buf = String.create buf_sz in
  let res = Buffer.create (in_channel_length ic) in
  let contents = my_read_all ic buf res in
  close_in ic;
  ignore contents (* do stuff *)
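
The same read-until-empty loop can be sketched in Python on top of the raw os.read wrapper, which like read(2) may hand back fewer bytes than requested. The function name read_all, the chunk size and the scratch file are assumptions for the illustration.

```python
import os
import tempfile

def read_all(path, chunk_size=256):
    """Read a whole file in fixed-size chunks, stopping at end of file."""
    chunks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:  # an empty result means end of file
                break
            chunks.append(chunk)
    finally:
        os.close(fd)
    return b"".join(chunks)

# Scratch file standing in for "my_file" (an assumption for the demo).
path = os.path.join(tempfile.mkdtemp(), "my_file")
payload = b"x" * 1000
with open(path, "wb") as f:
    f.write(payload)

contents = read_all(path)
print(len(contents))
```

Collecting the chunks in a list and joining once at the end plays the role of the OCaml Buffer.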


Prelude.ml has a convenience function:

let contents = readFile "my_file"


Which is an eagerly evaluated version of Haskell's readFile:

contents <- readFile "my_file"


SML has a relatively sane IO library, so there are things like inputAll:

val contents = let
val input = TextIO.openIn "my_file"
val data = TextIO.inputAll input
val () = TextIO.closeIn input
in data end


The ASM version works much like the OCaml input channel version: it reads into a buffer, calls brk to increase the data segment size, and copies the buffer to the newly allocated space.

.equ EXIT, 1
.equ READ, 3
.equ WRITE, 4
.equ OPEN, 5
.equ BRK, 45

.equ STDOUT, 1

.equ BUF_SZ, 4096

.section .data
filename:
.ascii "my_file\0"

.section .bss
.lcomm buffer, BUF_SZ

.section .text
.globl _start
_start:
movq $buffer + BUF_SZ, %r12 # save bss end address to r12
movq %r12, %r14 # and a copy to r14

movq $OPEN, %rax
movq $filename, %rbx
movq $0, %rcx
movq $0666, %rdx
int $0x80

movq %rax, %r13 # save fd to r13

read_loop:
movq $READ, %rax
movq %r13, %rbx
movq $buffer, %rcx
movq $BUF_SZ, %rdx
int $0x80

cmpq $0, %rax # end of file
je read_loop_end

movq %r12, %rbp # copy old end to rbp
addq %rax, %r12 # add read sz to r12

movq $BRK, %rax # allocate to new end
movq %r12, %rbx
int $0x80

cmpq %rax, %r12 # if rax != r12, alloc failed
jne error_exit

# copy buf to newly allocated space
# rcx has the buffer address
# rbp has the target address
copy_loop:
cmpq %r12, %rbp
je end_copy
movb (%rcx), %al # load byte from rcx to al
movb %al, (%rbp) # store byte from al to rbp
incq %rcx
incq %rbp
jmp copy_loop

end_copy:

jmp read_loop

read_loop_end: # ok exit
movq $WRITE, %rax # print the file to stdout
movq $STDOUT, %rbx
movq %r14, %rcx
movq %r12, %rdx
subq %r14, %rdx # rdx = r12 - r14 = length of the read data
int $0x80

movq $EXIT, %rax
movq $0, %rbx
int $0x80

error_exit: # error exit
movq $EXIT, %rax
movq $1, %rbx
int $0x80


All of the high-level languages (read: not ASM) we've looked at thus far have convenience functions for processing files a line at a time. Python uses a syntax-level iterator:

f = open("my_file")
for line in f:
    print(line[-2::-1]) # reverses string, strips out linefeed
f.close()

Ruby uses a higher-order function:

File.open("my_file"){|f|
f.each_line{|line|
puts line.chomp.reverse
}
}

Haskell doesn't need to use an iterator to save memory since it does lazy IO:

main =
  mapM_ putStrLn . map reverse . lines =<< readFile "my_file"

C doesn't have a string reverse function, so I'm using a reverse print function:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void rprintln(const char *buf, int len)
{
    int i;
    for (i = len - 1; i >= 0; i--)
        putchar(buf[i]);
    putchar('\n');
}

#define BUF_SIZE 4096

int main(int argc, char *argv[])
{
    FILE *fd;
    char buf[BUF_SIZE];
    size_t len;

    fd = fopen("my_file", "r");
    if (NULL == fd) exit(1);

    /* fgets returns NULL at end of file, so no separate feof check is needed */
    while (NULL != fgets(buf, BUF_SIZE, fd)) {
        len = strlen(buf);
        if (len > 0 && buf[len-1] == '\n') len--; /* drop the linefeed, if any */
        rprintln(buf, (int)len);
    }
    fclose(fd);
    return 0;
}

With input channels, OCaml has input_line (but sadly no string reverse):

let reverse_string s =
  let len = String.length s - 1 in
  let d = String.create (len + 1) in
  for i=0 to len do
    String.unsafe_set d (len-i) (String.unsafe_get s i)
  done;
  d

let () =
  let rec iter_lines ic f =
    let res = try Some (f (input_line ic))
              with End_of_file -> None in
    if res = None then ()
    else iter_lines ic f in
  let ic = open_in "my_file" in
  iter_lines ic (fun line -> print_endline (reverse_string line));
  close_in ic

The Unix library version is much the same, using in_channel_of_descr to convert the Unix.file_descr into an in_channel:

...
let fd = Unix.openfile "my_file" [Unix.O_RDONLY] 0o666 in
let ic = Unix.in_channel_of_descr fd in
iter_lines ic (fun line -> print_endline (reverse_string line));
Unix.close fd

Prelude.ml has convenience functions:

eachLine (puts @. srev) "my_file"


SML has TextIO.inputLine, which we loop 'til the end:

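fun iter_lines f ic =
  case TextIO.inputLine ic of
    NONE => ()
  | SOME l => (f l; iter_lines f ic)

fun srev s = String.implode (List.rev (String.explode s))

(* inputLine keeps the trailing newline, so strip it before reversing *)
fun chomp s = if String.isSuffix "\n" s then String.substring (s, 0, size s - 1) else s

val () = let
  val input = TextIO.openIn "my_file"
  val () = iter_lines (fn l => TextIO.print (srev (chomp l) ^ "\n")) input
  val () = TextIO.closeIn input
in () end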


Not really feeling like writing the ASM version, but it'd read into a reverse buffer a byte at a time, exiting on buffer end or newline, return the number of bytes read, and write the buffer to stdout. Easy to describe, 50 lines to write :P
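
For what it's worth, here is the same plan sketched in Python on top of the raw os.read and os.write wrappers, which keeps it close to what the ASM would do. It is only an illustration of the description above: the function names, the -1 return value for end of file and the scratch files are my own assumptions, and a line longer than the buffer simply gets reversed in buffer-sized pieces.

```python
import os
import tempfile

BUF_SZ = 4096

def read_reversed_line(fd, buf):
    """Fill buf from the back, one byte per read, until newline, EOF or a full buffer.

    Returns the number of bytes stored, or -1 if fd was already at end of file.
    """
    pos = len(buf)
    got_any = False
    while pos > 0:
        byte = os.read(fd, 1)
        if not byte:           # end of file
            break
        got_any = True
        if byte == b"\n":      # end of line, newline itself is not stored
            break
        pos -= 1
        buf[pos:pos + 1] = byte
    return len(buf) - pos if got_any else -1

def reverse_lines(in_fd, out_fd):
    buf = bytearray(BUF_SZ)
    while True:
        n = read_reversed_line(in_fd, buf)
        if n < 0:
            break
        # the reversed line sits in the last n bytes of the buffer
        os.write(out_fd, bytes(buf[len(buf) - n:]) + b"\n")

# Scratch input and output files (assumptions for the demo).
path = os.path.join(tempfile.mkdtemp(), "my_file")
with open(path, "wb") as f:
    f.write(b"abc\nhello\n")
out_path = path + ".out"

in_fd = os.open(path, os.O_RDONLY)
out_fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
reverse_lines(in_fd, out_fd)
os.close(in_fd)
os.close(out_fd)

with open(out_path, "rb") as f:
    output = f.read()
print(output)
```

Passing 1 as out_fd would write to stdout instead, as in the description.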

I am thinking of doing a more diverse comparison, looking at different ways to do piping, sockets and writing files. I'm contemplating dropping Python (too much like C and Ruby), ASM (too verbose, I'm no expert) and OCaml stdlib (sane people use a stdlib replacement) from the languages covered, while adding Clean (uniqueness types), SML (streams) and Bash (getting things done.) Suggestions are welcome!

[edit #1] Added Python's with-block, thanks to Liam Clarke
[edit #2] Fixed the input-using OCaml functions, thanks to mfp.
[edit #3] Added SML examples per request.

6 comments:

Philip Taylor said...

I think Perl is interestingly different, since it has several features and bits of syntax specifically for IO.

Opening files is boring: `open my $fh, 'my_file';`

Reading a fixed amount is boring: `read $fh, my $buf, 4096;`

Explicit closing is usually unnecessary: all values are reference-counted and so the file can be closed immediately when $fh goes out of scope. But `close $fh;` works too.

To read the whole file: `open my $fh, 'my_file'; my $data = do { local $/; <$fh> };`. $/ is the input line terminator, and 'local' sets it to undef for the current dynamic scope; then <$fh> reads a 'line' from $fh, but since there's no line terminator it reads the entire file. Or `use File::Slurp; my $data = read_file('my_file');`

To iterate over lines: `open my $fh, 'my_file'; while (<$fh>) { chomp; print scalar reverse; print "\n"; }`. The line is read into $_, which is implicitly used by 'chomp' and 'reverse', though you can tell all these things to use a different variable if you want. There's some special DWIM magic so that the 'while' loop doesn't terminate too early if the last line of the file is a "0" (not followed by a "\n"), which would usually be considered false and would exit the loop.

But in the TIMTOWTDI spirit, you can also treat $fh as an IO::Handle object, and call `$fh->read(...)` and `$fh->getline` and so on, which is less peculiar.

Ilmari Heikkinen said...

Thank you for the Perl examples, <$fh> certainly is interesting. The implicit argument to the procedures also tickles my mind, it reminds me of a stack language of some kind.

Anonymous said...

Python as of 2.6 has a with keyword which functions like with-open-file:

    
with file("foo", "r") as f:
    for line in f:
        #do stuff

Ilmari Heikkinen said...

Thanks for the heads-up, added the Python version next to the Ruby version.

Anonymous said...

Please do add SML! :)

Ilmari Heikkinen said...

Added!
