Scrum proposes a set of practices which are tightly coupled. Some people believe you have to do all of them correctly to benefit at all. I reject that thinking and consider the parts of Scrum as tools that can be useful individually. You have to introduce Scrum to a team piece by piece anyway. Unfortunately, I find it hard to find documentation on how to pick a single technique from Scrum and understand it on its own. Which problems is it useful for? What is necessary? What is counter-productive? In this article I do that for "story points" without pulling in other Scrum techniques.
The actual goal is to predict how much value we can produce in how much time. One of the basic ideas is statistical process control, which comes from Walter A. Shewhart. The gist is that you use statistics to establish upper and lower boundaries for your output. If a measurement falls outside those boundaries, we assume a special cause (an outlier) for the variation and handle that case individually. If you manage to eliminate all special causes for a while, the system is considered "stable". Once you achieve this, you can make predictions. You can also try to tighten the boundaries, since you now have clear metrics to guide you. To affect the variation within the boundaries we have to change the underlying system instead of looking at individual measurements.
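A minimal sketch of Shewhart's idea in Python, using the common three-sigma rule; the weekly numbers are purely illustrative:

```python
from statistics import mean, stdev

def control_limits(samples, sigmas=3):
    """Shewhart-style control limits: points outside [lower, upper]
    are treated as special causes and investigated individually;
    points inside reflect common-cause variation of the system."""
    center = mean(samples)
    spread = stdev(samples)
    return center - sigmas * spread, center + sigmas * spread

# Hypothetical weekly output of a team during a stable period:
baseline = [12, 14, 11, 13, 15, 12, 13]
lower, upper = control_limits(baseline)

# A new measurement is checked against the established boundaries:
new_value = 30
is_special_cause = not (lower <= new_value <= upper)
```

With this baseline the limits come out around 9 to 17, so a week with 30 finished items would be flagged as a special cause rather than treated as normal variation.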
Can we apply this method to software development directly? One counter-argument is that work items (tickets, stories, bugs, features) differ a lot. This is not industrial manufacturing pumping out thousands of identical pieces a day. Instead, a software engineer might solve between one and ten work items per week, so we end up with a variation so large that it is useless. How could we make work items comparable? We can assign each one a weight such that dividing by that weight normalizes the values. We could call those weights "story points".
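To make the normalization concrete, here is a sketch with invented numbers: the raw cost per item varies wildly, but dividing by a well-chosen weight collapses that variation:

```python
# Hypothetical cost of finished work items (say, days spent) and the
# story points assigned to each one up front:
costs  = [2, 10, 4, 21, 6]
points = [1, 5, 2, 10, 3]

# If the weights are good, cost per story point is roughly constant,
# even though the raw costs range from 2 to 21:
per_point = [c / p for c, p in zip(costs, points)]
```

The normalized values are the quantity you would feed into the control limits above, because they are comparable across very different work items.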
This resolves the debate about whether Scrum story points are "effort" or "complexity" or "time": make them whatever you want to control for. Usually that will be something corresponding to "man hours".
It also explains why you should not care how much one story point is worth. It does not matter to which base you normalize; the only objective is to make stories comparable. If you want to compare two sprints where the team estimated or delivered a different number of story points, you just normalize again. There is no inherent meaning in an amount of story points; they only serve for comparison. A story with 20 story points is twice as much as a story with 10 story points, and that relation is the same if the stories have 4 and 2 story points.
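The point about the arbitrary base can be shown directly: rescaling two sprints that use different point scales to a common base makes the identical relative sizes visible. A small sketch, with made-up sprint numbers:

```python
def normalize(points, base=100):
    """Rescale story points so the sprint sums to a common base.
    Only the ratios between stories carry meaning; the absolute
    scale is arbitrary."""
    total = sum(points)
    return [p * base / total for p in points]

sprint_a = [20, 10, 10]   # estimated on one team's scale
sprint_b = [4, 2, 2]      # same relative sizes, different scale
```

`normalize(sprint_a)` and `normalize(sprint_b)` both yield `[50.0, 25.0, 25.0]`: the sprints are the same once the meaningless base is divided out.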
We also see shortcomings in Scrum compared to Shewhart's approach. A burndown chart simply assumes that the system is stable and that all points lie within the boundaries. Scrum does not care about the distinction between common and special causes. Scrum also does not care about the feedback loop: estimated story points are a prediction, and you should review predictions in hindsight to calibrate future ones. Scrum probably works in small teams because this learning and calibration happens implicitly. No wonder it fails for larger projects, where such learning must be formalized; otherwise the knowledge takes too long to reach everybody.
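One way that feedback loop could be formalized is a simple correction factor: compare estimated points against the actual cost measured in hindsight and apply the ratio to future estimates. A sketch with illustrative numbers, not a prescription:

```python
# Estimates made up front versus actual cost measured afterwards,
# in the same unit (both hypothetical):
estimated = [5, 8, 3, 13]
actual    = [6, 12, 3, 18]

# A factor above 1 indicates chronic underestimation:
factor = sum(actual) / sum(estimated)

def calibrate(estimate):
    """Adjust a new estimate by the historical correction factor."""
    return estimate * factor
```

In a small team this correction happens in people's heads; writing it down is what makes the learning transferable to a larger project.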