
June Recap

June was a busy month, if not a particularly vocal one — I apologize for the radio silence. I started my position at Walmart Labs as a Node core contributor on the 5th of the month, and spent most of June heads-down familiarizing myself with the current state of the C/C++ side of the Node project: running through open issues on the repo trying to find good "starter" tickets, building small projects with the "building blocks" of Node — libuv, v8, and gyp — and immersing myself in C++ development by building a small JS tokenizer + parser. I've also been brushing up on syscalls, in particular how mmap is used in building JITs. It was an educational, if a bit scatterbrained, month.

Running through the issues was initially pretty difficult; there's a lot to run through. Luckily, TJ Fontaine has been a great resource: he explained how the issues are triaged into Github milestones based on the oldest version they affect. That is to say, for the most part, issues open against the v0.10 milestone are also part of the v0.11 milestone. Issues are tagged with the subsystem that they affect. Node core contributors "own" subsystems -- for example, Fedor Indutny owns the tls and crypto subsystems. That means that all pull requests and issues related to a subsystem go through its owner. One of the first decisions TJ presented me with on starting was which subsystem I was interested in taking over. I initially deferred, hoping to tackle bugs and expand my knowledge of the codebase; but I'm happy to say that as of Nodeconf I've decided which subsystem I'm going to work towards owning: streams.

After Nodeconf (which was amazing, and I'll write up my feels on it more fully in a later post), I had the opportunity to visit the Joyent offices, where TJ gave me a rundown of the current state of streams. We spent a few hours talking with Dave Pacheco and Josh Clulow about how they've been using streams at Joyent, their pain points, and possible solutions. Dave and Josh are big users of objectMode streams, a use case that's near and dear to my heart since, at Urban Airship, I used objectMode streams heavily on the frontend — and many of my parsers, git tools, and other sundry packages are object-mode streams. There's a weird sort of pleasure in hearing about problems with something that you're going to be working on — an excitement for the chance to make things better, and I have definitely been riding that particular high since the visit.

Here are the things I'm considering for streams moving forward. It should be noted that these things are not guaranteed to happen, but represent my current thinking about the subsystem and where it should go. If you see something that bothers you or otherwise elicits deep feelings, please don't hesitate to let me know. (Thanks to Josh, TJ, and Dave — many of the ideas listed below are theirs!)

Re-evaluating objectMode default high watermark

There's really no way to know for sure what size individual elements of an object mode stream represent, so buffering them intelligently is difficult. Potentially, the default could be as low as 0 — which would have the effect of every outer read triggering an internal _read. Alternatively (and perhaps more grandiosely), it might be desirable to expose an implementable API for userland streams to communicate byte size to the stream machinery — though that might be too much lemon squeezing for too little delicious lemon juice.
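
For a sense of why that's tricky, here's a quick sketch against the streams2 API as I understand it: in object mode the high watermark counts buffered objects rather than bytes, so push() starts reporting back-pressure after a fixed number of objects no matter how big or small they are.

var Readable = require('stream').Readable

// In object mode the high watermark counts buffered objects, not bytes,
// so the stream machinery has no idea how much memory each item costs.
var r = new Readable({objectMode: true, highWaterMark: 2})
r._read = function() {}

console.log(r.push({tiny: true}))                       // true: still under the watermark
console.log(r.push({huge: new Array(1e6).join('x')}))   // false: two objects buffered,
                                                         // regardless of their actual size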

Represent EOF as a sentinel object instead of null

This would have to be done carefully, over the course of a few versions -- probably through an additional option flag. The net effect would be that null and undefined would be usable in object mode streams — which, from my experience, would come in very handy in a number of circumstances. The API might eventually look something like this:

var assert = require('assert')
var stream = require('stream')
var Readable = stream.Readable
var concat = require('concat-stream')
var r = new Readable({objectMode: true, eofOnNull: false})

r._read = function(n) {
  r.push({gary: 'busey'})
  r.push(null)              // this doesn't end the stream anymore!
  r.push()                  // neither does pushing undefined
  r.push({ok: 1})
  r.push(stream.EOF)        // *this* is what ends the stream
}

r.pipe(concat(function(xs) {
  assert.deepEqual(xs, [
    {gary: 'busey'},
    null,
    undefined,
    {ok: 1}
  ])
}))

The option would likely only be set for a single stream at a time.

Better error reporting on runaway pushes

There should be better error communication around highWaterMark and push(). Currently it is possible to push indefinitely past the watermark, ending up with vast amounts of memory reserved by one unassuming part of the pipeline. There should be messaging about this — it's almost certainly an error if it happens.
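
For illustration, a runaway producer looks something like this today: push() reports that you're over the watermark by returning false, but nothing stops you from ignoring it, and nothing ever complains.

var Readable = require('stream').Readable

var r = new Readable({objectMode: true, highWaterMark: 16})
r._read = function() {}

// push() starts returning false once the watermark is exceeded, but a
// producer that ignores the return value just keeps buffering, silently.
for (var i = 0; i < 1e6; ++i) {
  r.push({idx: i})
}

console.log('buffered objects:', r._readableState.length)   // 1000000, and not a peep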

Documentation

I want to revisit the documentation and split it into "explanation", "tutorial", and "reference" sections, with better illustrations of the state transitions a stream goes through based on your interaction with it. In particular, I'd like to standardize a nomenclature for talking about streams and pipelines, especially with respect to their nature as time-agnostic functions.

Multi-topic streams

In the future, I'd like streams to be able to communicate information about "filtered" data, non-fatal errors, or the provenance (thanks for the term, Josh!) of the outgoing data. This is more of an outstanding problem than a stated solution; I personally lean towards letting a source stream pipe a particular event topic to a destination stream — i.e., src.pipe(dst, {topic: 'filter'}), so that all src.emit('filter', info) events would be forwarded on to the destination stream. This might be overkill, of course! If you have any ideas about how to solve the problem, please get in touch; I'd love to hear what you have to say.
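
None of this exists yet, but a rough userland sketch of the semantics I have in mind might look like the following (the pipeWithTopic helper is entirely made up):

// Entirely hypothetical helper sketching the proposed semantics: forward a
// named "topic" event from the source stream to its piped destination.
function pipeWithTopic(src, dst, opts) {
  opts = opts || {}
  if (opts.topic) {
    src.on(opts.topic, function(info) {
      // downstream consumers can react to filtered data, non-fatal errors,
      // provenance info, and so on, emitted by the source.
      dst.emit(opts.topic, info)
    })
  }
  return src.pipe(dst)
}

// usage sketch: pipeWithTopic(parser, output, {topic: 'filter'})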

Standardize on a "fail" method

Providing a stream.fail(Error) that destroys underlying resources in a standard order, in a way that leaves them available for error handling and for generating useful core dump information, would be ideal. For instance, in the current state of the world, errors can happen on Socket instances, but by the time your code sees the error, valuable debugging information may have been lost because the underlying file descriptor has already been closed.

Solving this might involve adding a new implementable API (_cleanup, or _fail) that would allow for user cleanup of underlying resources. I'm pretty foggy on this one thus far, and would love your thoughts on a good direction.
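
To make that a little more concrete, here's one hypothetical shape it could take; nothing below exists in core, and the _fail hook in particular is just a strawman:

var Readable = require('stream').Readable
var util = require('util')
var fs = require('fs')

// Strawman: a readable stream wrapped around a file descriptor.
function FDStream(fd) {
  Readable.call(this)
  this.fd = fd
}
util.inherits(FDStream, Readable)

FDStream.prototype._read = function(n) {
  // ... read from this.fd and push ...
}

// Hypothetical _fail hook: core would call this from stream.fail(err),
// giving the implementer a chance to capture state (or abort for a core
// dump) while the underlying resource is still alive.
FDStream.prototype._fail = function(err, done) {
  console.error('stream failing, fd still open for inspection:', this.fd)
  fs.close(this.fd, done)
}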

Expose accept / close on pending socket connections

For my first feature, I'm going to move the decision of whether to accept or close a pending socket connection into JavaScript. In the process, net.Server will become an object mode stream of streams. This should give userland servers greater flexibility when it comes to handling lots of incoming requests.
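
The consuming side might end up looking something like the sketch below; this is purely speculative, and the accept / reject API shown doesn't exist today.

var net = require('net')

var server = net.createServer()
server.listen(8124)

// Speculative: net.Server as an object mode stream of pending connections,
// each of which can be accepted or rejected from JavaScript.
server.on('readable', function() {
  var pending
  while (null !== (pending = server.read())) {
    if (overloaded()) {
      pending.reject()                  // hypothetical: close without ever accepting
    } else {
      var socket = pending.accept()     // hypothetical: accept, yielding a net.Socket
      socket.end('hello!\n')
    }
  }
})

function overloaded() {
  // stand-in load check, purely for the sake of the example
  return process.memoryUsage().heapUsed > 200 * 1024 * 1024
}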


There's a lot going on right now — I'm excited to work on Node's artisanal, shade-grown flow control system. It affords me a lot of opportunity to interact with the other subsystems, and to be an evangelist for streams in the package ecosystem.