One initial thing to understand is that the probability mass/density functions you get taught alongside the standard probability distributions (binomial, Normal, etc.) are functions of the data values: for some fixed parameter values, you put in a data value and the function outputs a probability (or probability density).
At first glance likelihood functions might look the same, but you have to think of them as functions of the parameters; it's the data that's fixed now (it's whatever you observed in your experiment). Once that's clear, the calculus starts to make sense -- using the derivative of the likelihood function w.r.t. the parameters to find points in parameter space that are local maxima (or directions that are uphill in parameter space, etc.).
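To make the two readings concrete, here's a minimal Python sketch using scipy and a made-up binomial example (10 trials, an observed count of 7 -- none of these numbers come from the discussion, they're just for illustration):

```python
from scipy.stats import binom

n = 10  # number of trials (made-up example)

# Reading 1: a function of the data, parameter fixed.
# Fix p = 0.3 and ask how probable each possible count k is.
probs_over_data = [binom.pmf(k, n, 0.3) for k in range(n + 1)]
print(sum(probs_over_data))  # sums to 1 across the data values

# Reading 2: a likelihood -- a function of the parameter, data fixed.
# Suppose we observed k = 7; the same formula is now evaluated at different p.
observed_k = 7
def likelihood(p):
    return binom.pmf(observed_k, n, p)

print(likelihood(0.3), likelihood(0.5), likelihood(0.7))  # p = 0.7 scores highest
```

Same `binom.pmf` formula both times; the only thing that changes is which argument you treat as the variable.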
So given a model with unknown parameters, the data set you observe gives rise to a particular likelihood function; in other words, the data set gives rise to a surface over your parameter space that you can explore for maxima. Regions of parameter space where the model assigns high probability to your observed data are the regions that your data suggests might describe how reality actually is. Of course, that's not taking into account your prior beliefs about which regions of parameter space are plausible, whether the model was a good choice in the first place, whether you've got enough data, etc.
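As a rough sketch of "exploring the surface for maxima", here's one way to climb a Normal log-likelihood over a two-dimensional (mu, sigma) parameter space. The data values are invented, and using scipy's general-purpose optimiser is just one convenient way to do the climbing:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Invented observations -- the data are fixed from here on.
data = np.array([4.8, 5.1, 5.4, 4.9, 5.3])

# Height of the negative log-likelihood surface at a point in parameter space.
# Optimising over log(sigma) keeps sigma positive without explicit constraints.
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)).sum()

# Walk uphill on the likelihood surface (i.e. downhill on its negative).
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # mu_hat lands near data.mean(); sigma_hat is the ML spread
```

The point of the sketch is only the shape of the calculation: data fixed, parameters varied, maximum located.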
An important point here is that the integral of the likelihood function over the parameter space is not constrained to be 1. This is why a likelihood is not a probability or a probability density, but its own thing. The confusing bit is that the likelihood formula is exactly the same as the formula of the original probability density function...
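Sticking with the same made-up binomial example, you can check this numerically: the pmf sums to 1 over the data values, but the corresponding likelihood does not integrate to 1 over the parameter.

```python
from scipy.integrate import quad
from scipy.stats import binom

# Made-up example again: n = 10 trials, k = 7 successes observed.
n, k = 10, 7

# As a pmf in k (p fixed at 0.3), the values sum to 1 ...
print(sum(binom.pmf(j, n, 0.3) for j in range(n + 1)))  # 1.0

# ... but as a likelihood in p (k fixed at 7), the area under the curve is not 1.
area, _ = quad(lambda p: binom.pmf(k, n, p), 0.0, 1.0)
print(area)  # 1 / (n + 1), about 0.09
```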
Splitting hairs, probably, but personally I'd put it the other way around: a likelihood is not a probability or a probability density, so there's no reason to expect it to integrate to 1.
The reason it's not a probability or probability density is that it's not defined to be one (in fact its definition involves a potentially different probability density for each point in parameter space).
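In symbols (generic notation, not anything specific from the posts above): the likelihood is defined by $$\mathcal{L}(\theta \mid x) = f(x \mid \theta),$$ read as a function of $\theta$ with $x$ held at the observed data. For each fixed $\theta$, $f(\cdot \mid \theta)$ is a probability mass/density function in $x$ (it sums or integrates to 1 over $x$), but nothing in the definition forces $\int \mathcal{L}(\theta \mid x)\,d\theta$ to equal 1.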
But I think I know what you're saying -- people need to understand that it's not a probability density in order to avoid making naive probabilistic statements about parameter estimates or confidence regions when their calculations haven't used a prior over the parameters.