How to perform note tracking in librosa

In music theory, a note represents a musical sound and conveys both its pitch (the perceived highness or lowness of the sound) and its duration. Each note belongs to a pitch class, a concept that groups together all notes with the same name; for example, C3 and C4 both belong to the pitch class C. Librosa is a Python library for audio and music analysis. It provides functions for loading audio, extracting features, and detecting events such as note onsets, which makes it a useful tool for audio preprocessing and musical analysis.

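How it works

Note tracking here relies on chroma features, which librosa computes from the short-time Fourier transform (STFT) of the signal:

$$STFT(t, f) = \int_{-\infty}^{\infty} x(τ) \, w(t - τ) \, e^{-j2πfτ} \, dτ$$

In this equation: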

$STFT(t, f)$ represents the short-time Fourier transform at time $t$ and frequency $f$.
$x(τ)$ is the input signal.
$w(t - τ)$ is the window function, which selects a short segment of the signal centered at time $t$.
$e^{-j2πfτ}$ is the complex exponential used to analyze the content at frequency $f$.
$∫$ denotes integration over $τ$: the STFT is the integral of the product of the signal, the window function, and the complex exponential. Because the window is nonzero only near $t$, only a short stretch of the signal contributes to each value.
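To make these terms concrete, here is a minimal sketch of a single discrete STFT frame in plain Python. It is not how librosa implements the transform (librosa uses an FFT); the sample rate, frame length, and test tone are illustrative assumptions. The sum replaces the integral, a Hann window plays the role of $w(t - τ)$, and the sample index is counted from the start of the frame, which changes only the phase and not the magnitudes.

```python
import cmath
import math

def stft_frame(x, start, window):
    """One STFT frame: sum over n of x[start + n] * w[n] * e^{-j 2 pi k n / N}."""
    n_fft = len(window)
    segment = [x[start + n] * window[n] for n in range(n_fft)]
    # For a real signal, bins 0 .. N/2 carry all the information
    return [
        sum(segment[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
            for n in range(n_fft))
        for k in range(n_fft // 2 + 1)
    ]

sr = 8000         # sample rate in Hz (illustrative)
n_fft = 256       # frame length in samples (illustrative)
tone_hz = 1000.0  # frequency of the synthetic test tone

# One frame of a pure sine wave as the input signal x
x = [math.sin(2 * math.pi * tone_hz * n / sr) for n in range(n_fft)]
# Hann window: tapers the frame edges to reduce spectral leakage
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

spectrum = stft_frame(x, 0, hann)
peak_bin = max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))
peak_hz = peak_bin * sr / n_fft  # convert the bin index back to Hz
print(peak_bin, peak_hz)  # 32 1000.0
```

The strongest bin lands on the frequency of the test tone, which is the information that chroma features later fold into pitch classes.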

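The chroma features fold the energy of the STFT into 12 bins, one per pitch class. Once we have the chroma values, we take the frames returned by the onset_detect() function and, at each of these frames, pick the pitch class with the maximum chroma value. We then use the frames_to_time() function to convert the onset frames to seconds; the time between two consecutive onsets gives the duration of a note. This way, we obtain each note's pitch class, onset, and duration.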
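The timing step can be sketched without librosa. The helper names below are hypothetical, and the onset frames and clip length are made-up values; the conversion assumes the defaults of librosa.frames_to_time() (a hop length of 512 samples) and the default sample rate of librosa.load() (22,050 Hz).

```python
def frames_to_seconds(frames, sr=22050, hop_length=512):
    """Convert frame indices to seconds: time = frame * hop_length / sr."""
    return [frame * hop_length / sr for frame in frames]

def note_durations(onset_times, end_time):
    """Each note lasts until the next onset; the last one lasts until end_time."""
    boundaries = list(onset_times) + [end_time]
    return [boundaries[i + 1] - boundaries[i] for i in range(len(onset_times))]

onset_frames = [0, 43, 86, 172]  # hypothetical onset frames
onset_times = frames_to_seconds(onset_frames)
durations = note_durations(onset_times, end_time=5.0)  # assumed 5-second clip

for frame, start, duration in zip(onset_frames, onset_times, durations):
    print(f"frame {frame}: starts at {start:.3f} s, lasts {duration:.3f} s")
```

Pairing each onset with the boundary that follows it keeps every duration attached to the note that actually starts at that onset.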

Code

The following code tracks the notes in a trumpet audio clip:

import librosa

# Load the audio file (librosa resamples to 22,050 Hz by default)
audio_file = '../trumpet.ogg'
y, sr = librosa.load(audio_file)

# Extract the chroma features and the onset frames
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
onset_frames = librosa.onset.onset_detect(y=y, sr=sr)

# Convert the onset frames to seconds and append the clip length,
# so that the last note ends where the audio ends
onset_times = librosa.frames_to_time(onset_frames, sr=sr)
boundaries = list(onset_times) + [len(y) / sr]

notes = []
for i, onset in enumerate(onset_frames):
    # Index (0-11) of the strongest pitch class at this onset
    note_pitch = chroma[:, onset].argmax()
    # A note lasts until the next onset (or the end of the clip)
    note_duration = boundaries[i + 1] - boundaries[i]
    notes.append((note_pitch, onset, note_duration))

print("Note pitch \t Onset frame \t Note duration")
for note_pitch, onset, note_duration in notes:
    print(note_pitch, '\t\t', onset, '\t\t', round(note_duration, 3))

We load the audio file with librosa.load(), which returns the samples y and the sample rate sr.
We extract the chroma features and onset frames from the audio. The onset_detect() function returns an array of frames where a new musical note starts.
We convert the onset frames to seconds using the librosa.frames_to_time() function and append the length of the clip, so that the last note ends where the audio ends.
We go through the onset frames and pick the index of the maximum chroma value at each frame. The note_pitch value is an index from 0 to 11 that identifies one of the 12 pitch classes.
We calculate each note's duration as the time between its onset and the next boundary, and we store it together with the pitch and the onset frame.

Output

In the output, we get a table with one row per tracked note. The first column is the note's pitch class as an index from 0 to 11; with librosa's default chroma settings, 0 corresponds to C. The second column is the onset frame, which tells us the frame at which the note starts. The third column is the duration of the note in seconds.
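The numeric pitch index can be made easier to read by mapping it to a note name. The sketch below assumes the default chroma layout, in which index 0 is C and the indices rise by semitones; the helper format_notes() and the sample tuples are hypothetical and only mimic the (pitch, onset frame, duration) layout of the notes list.

```python
# Pitch-class names in the order of librosa's default chroma bins (0 = C)
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def format_notes(notes):
    """Turn (pitch_index, onset_frame, duration) tuples into table rows."""
    lines = ["Note\tOnset frame\tDuration (s)"]
    for pitch_index, onset, duration in notes:
        name = PITCH_CLASSES[pitch_index]
        lines.append(f"{name}\t{onset}\t{duration:.3f}")
    return lines

# Made-up entries for illustration, not real output from the trumpet clip
notes = [(0, 3, 0.25), (7, 14, 0.5), (9, 36, 1.0)]
for line in format_notes(notes):
    print(line)
```

Chroma features discard the octave, so this mapping can name the pitch class of a note but cannot tell C3 from C4.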

Conclusion

In conclusion, librosa lets us build a basic note tracker in a few lines of code: chroma features give the dominant pitch class, onset detection gives the frame at which each note starts, and the differences between onset times give the note durations.

This approach can serve as a starting point for music transcription and music interpretation. Related techniques, such as onset detection and spectral features, are also used in speech analysis and acoustic event detection.
