This is a method I use for that kind of effect (if I understand you correctly). Just beware that I'm not a particularly experienced programmer, so there might be better ways of doing this.
Every frame, check whether the finger is past a certain z-value (depth). If it is, add the time that has passed since the last frame to a variable; if it isn't, reset that variable to zero. Then check whether the time variable exceeds your desired waiting time.
Here's some basic code (I'm writing in Java, but it shouldn't be much different in other languages):
if (finger.getZ() > threshold) { //Check if finger is past the threshold
    timeTapped += timeSinceLastUpdate; //Add time to the time variable
} else {
    timeTapped = 0; //If the finger isn't past the threshold, reset the time
}
if (timeTapped > TAP_DURATION) { //If the finger has been held for the right amount of time, do something.
    //Do something here
}
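If it helps, the same per-frame check could be wrapped in a small class so the timer state lives in one place. This is only a rough sketch: the class name HoldDetector, its constructor parameters, and the update method are made up for illustration and are not part of any tracking library. It assumes you pass in the finger's z-value and the time since the last frame (in seconds) yourself.

```java
/** Hypothetical helper: tracks how long a finger has stayed past a depth threshold. */
class HoldDetector {
    private final double threshold;    // depth the finger must exceed (assumed units: whatever your tracker reports)
    private final double tapDuration;  // required hold time, in seconds (assumption)
    private double timeTapped = 0.0;   // time accumulated while past the threshold

    HoldDetector(double threshold, double tapDuration) {
        this.threshold = threshold;
        this.tapDuration = tapDuration;
    }

    /**
     * Call once per frame.
     * Returns true while the finger has been held past the threshold
     * for longer than the required duration.
     */
    boolean update(double fingerZ, double timeSinceLastUpdate) {
        if (fingerZ > threshold) {
            timeTapped += timeSinceLastUpdate; // still past the threshold: keep counting
        } else {
            timeTapped = 0.0;                  // finger pulled back: start over
        }
        return timeTapped > tapDuration;
    }
}
```

In your update loop you would then call something like detector.update(finger.getZ(), timeSinceLastUpdate) and react when it returns true. Note that it keeps returning true for as long as the finger stays past the threshold, so if you only want one event per hold you would need to add your own flag.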
I hope you understand what I mean and that it can get you started.