moocow: A Portrait of the Artist as a Young Robot

Daniel Oore (IICSI, MUN), “moocow: A Portrait of the Artist as a Young Robot,” (original musical composition using machine learning AI), NeurIPS Workshop on Machine Learning for Creativity and Design 2020

Jason d’Eon (Dalhousie, Vector), Sri Harsha Dumpala (Dalhousie, Vector), Chandramouli Shama Sastry (Dalhousie, Vector), Daniel Oore (IICSI, MUN), Sageev Oore (Dalhousie, Vector). “Musical Speech: A Transformer-based Composition Tool.” Proceedings of Machine Learning Research (PMLR), August 2021.

MUSIC VIDEO examples:

All raw MIDI files used (to trigger instrument/synth sounds in this piece) were generated by a machine learning system based on the transformer architecture, which was fed the recording of the text reading (also included throughout the piece).



moocow: A Portrait of the Artist as a Young Robot
If a robot contributed a portrait of himself as a young man to an online gallery of AI art, what aspects of the robot’s sonic künstlerroman might elicit his pride or shame? Would it be the (lack of) emotional and spiritual impact of his statement? Or the (lack of) seamless integration of his inherited knowledge, demonstrated across both his statement’s small fragmentary scale and large arcing scale?





Using this new transformer-based music composition tool to generate the building-block materials for “moocow” and other pieces of music, I find that it encourages creative interaction and cultivates a personal aesthetic voice.

The compositions and demos I have so far created all use MIDI generated exclusively by this transformer-based tool. As a form of data, MIDI can be translated and rendered into sound (or light and other mediums) in different ways. The MIDI data is like a Lego block from which I can build something. Like the sculptor who liberates a figure from within the marble, it is fun to figure out what kind of thing a given transformer-generated MIDI output wants to be built into. The MIDI generated by this transformer system offers a satisfying quality of marble to work with, inspiring me to liberate figures in the blocks of MIDI and help them to flow musically.

The use of the original speech audio recording in the composition might be analogized as collaging photographs of a human model onto the sculpture of that model. Incorporating the speech audio into the musical composition helps ground the listener with an immediately recognizable human element. In one music-video piece, entitled “Singularity,” I gradually fade out the original audio of human speech to evoke the sense that the human speaker is being overtaken by the AI-generated MIDI (the person being overtaken in this video is one of the AI designers concluding a presentation of the system):
In the “moocow” composition (linked at the top of this webpage and also below), my first piece made with the transformer-based tool, I also added a choir of myself singing to help tie things together with a strong human element. In subsequent music-video pieces I explore other (e.g. sonic, theatric) ways to unite the MIDI ‘bricks’ into something that feels both by and for humans.

A significant part of my process with the transformer tool has been what in music is termed “orchestration”: choosing which instruments play when, i.e. which digital instruments to trigger with the MIDI files generated by the tool. The same raw MIDI clip can sound radically different when played by two different digital instruments. This difference is caused not only by timbre but also by each instrument’s distinct sonic “envelope”: the rates and degrees of change (e.g. in amplitude, timbre, pitch) across the initial attack, the sustain, and the decay of every MIDI-triggered note. Reverb and delay effects can further extend the sustain and decay times, thereby obscuring the attack onsets of subsequent MIDI notes, while other delay and echo effects can generate new notes and attacks not present in the original MIDI file. So the same raw MIDI file, triggering two different instruments, can sound different in many ways.

In some moments I use the raw MIDI files to trigger drum-sample fragments (where the note information may be arbitrarily assigned to a particular sound); at other moments I use an “arpeggiator” to elaborate on the pitches generated by the transformer tool. In addition to finding the right instrumentation and envelope shapes, I shift or keep the octave register of the original MIDI file, and sculpt further changes in volume levels and timbral (e.g. EQ) filters across larger frames of time than the envelope of a given MIDI note. I also choose how to arrange the fragments of the different MIDI clips together, blending and contrasting their sounds, perhaps even inverting or reversing the MIDI data, and developing the tension and release of the layered events into a musical statement that reflects the interaction, and perhaps resonance, of the bodies, technologies, and materials involved.
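The envelope behavior described above can be sketched as a standard ADSR (attack–decay–sustain–release) amplitude curve. This is a minimal illustration of the concept, not the actual instrument code used in the piece; all parameter values are hypothetical:

```python
def adsr_amplitude(t, attack=0.01, decay=0.1, sustain_level=0.7,
                   note_length=0.5, release=0.3):
    """Amplitude (0..1) of a MIDI-triggered note at time t (seconds).

    attack:        time to ramp from silence to full amplitude
    decay:         time to fall from full amplitude to the sustain level
    sustain_level: amplitude held until the note ends (note_length)
    release:       time to fade from the sustain level back to silence
    """
    if t < 0:
        return 0.0
    if t < attack:                        # attack: linear ramp up
        return t / attack
    if t < attack + decay:                # decay: ramp down to sustain level
        frac = (t - attack) / decay
        return 1.0 - frac * (1.0 - sustain_level)
    if t < note_length:                   # sustain: hold steady
        return sustain_level
    if t < note_length + release:         # release: fade to silence
        frac = (t - note_length) / release
        return sustain_level * (1.0 - frac)
    return 0.0                            # note is over
```

Two instruments triggered by the same MIDI note can be made to sound very different simply by swapping these parameters: a short attack and release for a plucked sound, or a long, slow attack and release for a bowed tremolo pad.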

Like clay or marble to a sculptor, the raw MIDI files generated by the system are a starting point for the composer. While there’s no creative limit on what you can do, the generated material also has latent desires and tendencies that can be harnessed with more or less success in creating a musical flow of tension and release. The use of my own speech as the input for the transformer creates another level of interactivity between my initial spontaneous creation, the AI-generated MIDI, and the creative solutions this invites and inspires me to generate. The interactive process has no imposed ending: I can take the audio of my composition and feed it back into the transformer system to generate further raw MIDI output, which I can then use in the same or a new composition, and continue the interactive process!



Below is a screenshot of the Ableton Live software window, showing all the raw MIDI and audio files as they are sequenced (left to right) in “tracks” (horizontal rows, each containing either digital software instruments that play MIDI files or audio channels that play audio files) from the beginning to the end of the “moocow” audio piece:

Track numbers are indicated in small yellowish boxes along the right-hand margin, e.g.:

To the left of the yellow track-number boxes are labels with the track names (in yellow, blue, and magenta), e.g.:

Each track contains fragments of audio or MIDI files…

From top to bottom of the full screen screenshot:

  • Track 1 (yellow) named “TEXT READING AUDIO” contains (repeating fragments of) the original audio file (yellow and labeled “OnceUponATime”) of the James Joyce text read by Sageev Oore (i.e. “Once upon a time and a very good time it was there was a moocow coming down along the road”), e.g.:


  • Tracks 2-11 (blue) contain MIDI (synthesized or sampled) instruments, each triggered by the raw MIDI files (all of which were generated by the machine learning system from the original “OnceUponATime” audio file seen in track 1, above), e.g.:

      • A label on each MIDI track indicates the type of (synthesized/sampled) instrument being triggered by the raw MIDI files (e.g. TRUMPET, FLUTE, ORGAN, TREMOLO [strings], etc.) and, in some cases, whether the MIDI file is transposed up by one or two octaves (indicated in the track name with “+12 st” [st = semitone] or “+24 st”, respectively).
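The octave transposition noted in those track names (“+12 st”, “+24 st”) amounts to an additive shift of MIDI note numbers, since MIDI encodes pitch in semitone steps and 12 semitones make one octave. A minimal sketch (the function name is illustrative, not from the actual project):

```python
def transpose(notes, semitones):
    """Shift a list of MIDI note numbers by some semitones, clamped to 0-127.

    +12 semitones raises the notes by one octave; +24 by two octaves,
    matching the "+12 st" / "+24 st" track-name convention.
    """
    return [max(0, min(127, n + semitones)) for n in notes]

# Middle C (60) and the G above it (67), raised one octave:
transpose([60, 67], 12)  # → [72, 79]
```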

All together, the different triggered MIDI instruments (without the playback of the text reading and choir singing audio files) sound like this, e.g.:

      • The MIDI files (green- and blue-tabbed files containing tiny black rectangles of different heights and lengths, representing pitch and duration information respectively, which trigger the MIDI instruments in the tracks they populate) are each labeled along their green or blue tabs according to the manner in which the given raw MIDI file was generated, e.g.:
      • i.e.:
          • A MIDI file labeled “formant-extraction” (green) triggers a TRUMPET (track 8), e.g.:

      • MIDI files labeled “gap-fill-unconditioned” (blue) trigger an ORGAN (track 7) and electronic drum BEATS (track 6)

      • MIDI files labeled “gap-fill” (green) trigger a TREMOLO cello ensemble (track 9)

      • MIDI files labeled “overwrite-formant” (blue) trigger a MUTED TPT [trumpet] (track 3) and a FLUTE (track 4), both altered with an echo-delay effect

      • MIDI files labeled “overwrite-gap-fill-unconditioned” (green) trigger a CELLO (track 10) and BASS (track 11)

      • MIDI files labeled “overwrite-gap-fill” (green) trigger a moon guitar [aka yueqin] (labeled “MOONTAR”)

  • Tracks 12-14 (magenta), named “MOO CHOIR (17 voices)” and “MOOCOW CHOIR (9 voices)”, contain audio files of the choirs composed and sung by Daniel Oore (singing “Moocow a commin’” and “Moo, moo, moo… [etc.]”, respectively), e.g.:


  • Red lines running vertically and diagonally along each track (in all but the top two tracks) are volume envelopes (high/low volume), e.g. (image below):
  • Transparent boxes (labeled in the middle of the outline with “Clip Deactivated”) in some tracks indicate audio or MIDI files that were explored but then deactivated for musical/compositional reasons, e.g.: