I think tearing down walls between SysOps and Developers is important. I don't like the "this is mine, stay out" attitude. That being said, I'm all in favor of dependable and repeatable environments, and when you give people carte blanche to do what they want on the boxes, that can be a recipe for disaster.
"Oh, I just upgraded this package, yeah, can you put that in our master image?" sigh
Do it right, and you can get the developers all the info they need without having access to the box itself. Not only that, but you make it easy to spool up new dev environments for new employees.
If access does need to be granted today, then thought should be put in to see if that can be avoided tomorrow. Need a tcpdump today? Great. Access granted. Tomorrow, I'll have a script for you that takes a tcpdump and puts the file where you can access it. Access revoked.
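For what it's worth, the "script instead of shell access" idea can be quite small. Below is a minimal sketch, not anyone's actual script: the drop directory, the default interface, and the packet cap are all made-up assumptions, and it presumes the script runs with enough privilege to capture. The only tcpdump flags used are -i (interface), -c (packet count), and -w (write to file).

```python
import subprocess
import time
from pathlib import Path

# Hypothetical shared location that developers can read; not from the thread.
DROP_DIR = Path("/srv/captures")


def build_capture_command(interface, packet_count, out_file, bpf_filter=""):
    """Build a bounded tcpdump invocation that writes a pcap file."""
    cmd = ["tcpdump", "-i", interface, "-c", str(packet_count), "-w", str(out_file)]
    if bpf_filter:
        # The filter expression is passed as trailing arguments, e.g. "port 443".
        cmd.extend(bpf_filter.split())
    return cmd


def capture(interface="eth0", packet_count=1000, bpf_filter="", drop_dir=DROP_DIR):
    """Run the capture and leave the file where developers can fetch it."""
    drop_dir.mkdir(parents=True, exist_ok=True)
    out_file = drop_dir / f"capture-{time.strftime('%Y%m%d-%H%M%S')}.pcap"
    subprocess.run(
        build_capture_command(interface, packet_count, out_file, bpf_filter),
        check=True,
    )
    out_file.chmod(0o644)  # readable by everyone, writable only by the owner
    return out_file
```

Capping the packet count keeps a forgotten capture from filling the disk, which is the kind of guard rail you lose when you just hand out a shell.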
Dependable and repeatable environments are essential.
At my company the issue is that dev/qa/stage/prod are all not ===. We end up with two options to fix something: make a fix and send it through the deployment process (over and over again), or sit in a sysadmin's lap and try to fix it. Both options are extremely frustrating, for both parties, and generate a lot of animosity.
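One cheap way to at least see how far the environments have drifted is to diff their package lists. This is only a sketch under the assumption that you can export a {package: version} map from each box somehow (the export step is not shown, and the package names in the usage below are invented):

```python
def drift(reference, other):
    """Compare two {package: version} maps and report every difference.

    Returns {package: (version_in_reference, version_in_other)}; a version of
    None means the package is missing on that side.
    """
    report = {}
    for name in sorted(set(reference) | set(other)):
        ref_version = reference.get(name)
        other_version = other.get(name)
        if ref_version != other_version:
            report[name] = (ref_version, other_version)
    return report
```

For example, drift({"nginx": "1.24"}, {"nginx": "1.25", "vim": "9.0"}) reports both the nginx version mismatch and the extra vim package.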
It's not easy, is it? Imagine being able to spool up a new environment (on one server as opposed to, say, 5 in production) and feed it a copy of production data, run tests, and then tear it down. And imagine knowing that this environment was identical to production in every way except for minor configuration changes to get it to run on one box. And when you're done with it? Throw it away. That's really powerful. A new employee comes on board? Here, here's your environment. Mess with it as much as you want, we can just rebuild it if you totally hose it. All QA except for performance testing, which would need to be done on real hardware anyway, could be done in these environments.
I worked at a company where we had a similar system set up. We had automated database snapshots of production taken every two hours, and stored locally and in our account at Rackspace's Cloud Files. Eventually, I got fed up with people asking me to update staging with a copy of the production database, so I wrote a simple Python script and stuck it in the repository. As long as you weren't running in a production environment, it would give you a fresh copy of the database in a matter of minutes.
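Roughly, the safety logic of a script like that might look like the sketch below. To be clear, this is a reconstruction for illustration, not the original script: the snapshot naming scheme, the APP_ENV variable, and the function names are all assumptions, and the actual download and restore steps are left out.

```python
import os
import re
from datetime import datetime

# Hypothetical snapshot naming scheme: prod-YYYYmmdd-HHMM.sql.gz
SNAPSHOT_RE = re.compile(r"^prod-(\d{8}-\d{4})\.sql\.gz$")


def latest_snapshot(names):
    """Return the newest snapshot file name, or None if none match."""
    dated = []
    for name in names:
        match = SNAPSHOT_RE.match(name)
        if match:
            taken = datetime.strptime(match.group(1), "%Y%m%d-%H%M")
            dated.append((taken, name))
    return max(dated)[1] if dated else None


def refuse_in_production(env_name):
    """Abort before touching anything if we are on a production box."""
    if env_name.strip().lower() == "production":
        raise RuntimeError("refusing to overwrite the production database")


def plan_refresh(names, env_name=None):
    """Pick the snapshot to restore; the restore itself is not shown."""
    if env_name is None:
        # APP_ENV is an assumed variable name, not something from the thread.
        env_name = os.environ.get("APP_ENV", "development")
    refuse_in_production(env_name)
    snapshot = latest_snapshot(names)
    if snapshot is None:
        raise LookupError("no production snapshots found")
    return snapshot
```

The production check runs first on purpose, so the script fails before it has looked at, downloaded, or dropped anything.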
We were (slowly) working towards a system where the entire app (gems and all) was self-contained, and we had a list of all the system packages that needed installation; once we reached that point, setting up a new server would have been a five-minute job - a git pull, a package update/install, and a database pull. Sadly, we never did get around to finishing it (actual work took priority), but it would have been nice to have.
If developers want access to production, they should respond to the call from the helpdesk when production goes down in the middle of the night because of the "simple little tweak" the developer decided to make on the server before leaving for the day.
I've worked at a few small startup-y type companies, and in cases where some of the developers had access to production systems, or were deeply entrenched in deployment details, those people also got paged by Pingdom whenever we were down, same as I did.
This made it incredibly convenient, because I'd wake up, grab my laptop, and log in, and then I'd IM the developer and tell them the 'what' of what was going on, and help them figure out the 'why'. We'd fix the problem and do another deploy, then go back to bed.
At several bigcorps where I've worked (>1000 in IT, billions of dollars, lots of regulation) this has been a very big deal and cause of never-ending friction.
1) If the admins don't do at least minimal code reviews then it's meaningless. Developers will find ways to circumvent the inefficiencies in the system.
2) Admins don't do even minimal code reviews.
3) If developers don't have read access to production (code and logs) then admins become the weak link. (Obviously SSL keys and such not included.)
4) If a system is so locked-down as to completely eliminate the possibility of anything going wrong it will also be so locked-down that nothing can be fixed or improved.
This is silly. The answer here is, "it depends". For example, if you have sensitive data in a production environment, maybe you don't want developers to have access to the environment regardless of the size of the organization. For smaller, less mature apps, having a full blown IT managed server might be overkill and get in the way. In my previous life as a consultant at Kaiser, we had both situations, which was entirely appropriate.
This is what I'm experiencing. I'm on an application team, supporting a reporting app whose users are CO or D. I have zero access to the production app, to the server, and to the data. The reason given was data sensitivity.
But realistically, the test data comes from production data, so what is the conclusion there?
I'm having a hard time making sure the data is processed correctly every month, since I have no access at all.
Note: the app was done by other people and was handed over to my team.
There are commercial solutions that can pull data from Oracle and massage it for full testing capabilities (as in: replace this user's name with a generated name, replace this address with a generated address, replace this CC with a generated, valid-format CC).
That might be an option if you can't get at any decent test data?
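If a commercial tool isn't an option, the basic masking idea is easy to prototype. The sketch below is an assumption-laden illustration, not how any particular product works: the row layout (name, address, cc keys) and the sample name lists are invented. "Valid-format" here means the generated number passes the Luhn checksum that card numbers use, not that it is a real card.

```python
import random

# Made-up sample values; a real tool would draw from much larger lists.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey"]
LAST_NAMES = ["Smith", "Lee", "Garcia", "Patel"]
STREETS = ["Oak St", "Main St", "Elm Ave"]


def luhn_check_digit(partial):
    """Return the digit that makes partial + digit pass the Luhn check."""
    total = 0
    # Walk right to left. The rightmost digit of `partial` ends up next to
    # the check digit, so it is the first one to be doubled.
    for i, ch in enumerate(reversed(partial)):
        digit = int(ch)
        if i % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def luhn_valid(number):
    """True if the last digit is the correct Luhn check digit for the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def fake_card(rng, prefix="4", length=16):
    """Generate a random number of the given length that passes Luhn."""
    filler = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randrange(10)) for _ in range(filler))
    return body + luhn_check_digit(body)


def mask_row(row, rng):
    """Copy a row, replacing the sensitive fields with generated values."""
    masked = dict(row)
    masked["name"] = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    masked["address"] = f"{rng.randrange(1, 9999)} {rng.choice(STREETS)}"
    masked["cc"] = fake_card(rng)
    return masked
```

Passing in a seeded random.Random makes the masked data set reproducible, which helps when a test failure needs to be replayed.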
I've worked for several "mid-sized" web companies (between 75-200 employees) and this has always been an issue. Here's why:
1) The development team interacts directly with a single business unit but the administrators serve the whole company. So say you're a developer that works in the SEM team. All the technical products and processes that manage the company's SEM campaigns, that's what you work on. You directly communicate with the product team and project management. If there's any bottleneck or issue on your projects as you work on them, the entire team is easily made aware of it. You tell your project manager, "Dan in analytics was supposed to get me the data and if I don't get it tomorrow then I'll be late," and your project manager talks to Dan or Dan's boss and you get your data... or you don't, but it's on Dan anyway.
However the administrators serve the whole company and they have entirely different sets of priorities and responsibilities that you have no visibility into. So you're ready to go to production and you submit your ticket with your release notes, which is pretty much just, "push these files from SVN and run this SQL on this DB." And it just sits there. A day passes. Then another day. Dan asks why your project's not available yet given that it's "done." You ask your project manager to ask what's going on, he says he's trying to find out but the sys admin manager hasn't been around. You try dropping by the admins yourself (usually sequestered in some remote location in your building, if they're even on site at all!) and ask about your project and they snarl and say, "load on the consumer site has been up 12% all week, we have bigger problems" and mumble a bunch of other things about permissions and server racks and subnets and all you know is that whatever to-do list they're working off of, you're all the way at the bottom.
And then...
2) The production and development environments just aren't in sync and this never gets addressed. The sys admins finally get to your ticket a week later. Finally, you think, this will go live. Then two hours later you get an e-mail from an admin named Stan that says, "Release failed, please fix the permissions on the directories your application creates and re-submit the ticket." And your ticket's closed. What the hell? Your application didn't even create any directories.
If you're lucky, Stan included a copy-paste of his shell with the commands he used to export your code and whatever barfing error he got. If not, you have to hunt Stan down and ask him what exactly the error was. Stan sighs, because they're just sooo swamped and sooo busy and the site load has been up 12% since last week, but he grudgingly does what you ask. Oh, yes, your application uses a directory which is owned by 'application1' in dev, but is owned by root in production. So you tell him to just chown the directory to application1, and he says submit a ticket. You blow up and say, "You're right there! Just type in the freaking chown command!" and he says he can't, you have to submit a ticket, and then submit another ticket for the release again. Then one of the other admins says, "Stan, foosball?" and Stan gets up to play some goddamn fucking foosball, while your project is now going on its second week of being late.
The next time you meet with your boss, you mention how much easier things would be if you could have access to production, and your boss sighs and says it's just not happening. So then you talk to him about how you need better integration with the sys admin team, and it's critical to ensure dev and production environments are identical, and your boss agrees to talk to Stan's boss, and ultimately nothing gets done and you just resent the lack of control over your own projects.
1) People trying to bump the priority of their own tickets by showing up in my cube. This is inevitably at the expense of other, more polite, users of the system. Prioritizing requests is one of the toughest parts of an operations job - if you think I'm doing it all wrong, you can talk to my manager.
2) People who think that "open a ticket" means "go to hell". It actually means "We need to document all changes done to production and we need to prioritize requests. Please help us do our job by using the system designed to do so. The request will be done within 5 minutes if it's urgent/important/really simple. But we still need a ticket to document it and follow up if it causes issues in the future."
On the other hand, if your boss is competent, he can raise the availability and priorities of admins in all kinds of management meetings, hopefully resulting in better priority for developer requests and maybe even hiring a few more admins.
I don't work in such an environment, so what is the proper response to a ticket that took a week to process and is closed with no useful detail? Open another ticket to ask for more detail on why the previous ticket was closed?
Make friends with some sysadmins. When the gears aren't spinning, ping them on IM to find out why. Usually there's a very good reason for that.
Sysadmins are your friends if you treat them well, like human beings, and you respect their work. If you treat them like a ticket processing machine, or worse, like an obstacle, you'll struggle.
I learned this from 4 years of deploying apps at UBS (one of the world's largest banks), which had exactly this kind of ticketing system (called GCMS, iirc). I was IRC-pals with 3 different sysadmins on 3 different continents (yes, there was an internal IRC system), and so whenever things got stuck in the pipes, I could ask them to have a look in the system and see what was going on. As for the other sysadmins, I always treated them courteously, used the ticket system, and got the necessary approvals whenever I could.
The one downside of this is I ended up being stuck managing lots of deployments, because I was good at it.
I think this goes both ways. I understand that sysadmins are human and get busy and/or make mistakes. Most of the time though, nagging developers could be pleased by sysadmins being a bit more proactive in communicating with people.
After all, when a ticket has been open for a week with no response, most people will start to get a bit frustrated and take it out on the sysadmin. A simple "Hey, know you've been waiting on this a while, but I have X, Y, and Z to take care of before I can get to it" will do wonders for sysadmin/developer relations.
It goes a bit deeper than that - a ticketing system by itself is just a tool, and isn't a process or a solution.
If Sysadmin just threw up a ticketing system and said "put your stuff in there and we'll get to it" - then they can't expect things to get much better than email. It's a start, but only a small one.
They need to put the proper process, SLAs (even if they are approximate) and review procedures around the system to make sure it's meeting the needs of the rest of the organisation.
At my last company we had exactly these problems with a central IT service. In the end we got so frustrated with the poor service that we moved to using a 3rd party for hosting and admin. We never looked back. We didn't have a ticketing system, we spoke to them directly in IRC. They also took the attitude that any request that might be repeated should be automated, so that we would either not have to bother in the future or we could easily do it ourselves. Generally this automation was put in place at the time of the request. It should be noted that we did have full access to the production environments and that we did our own QA. We found that by using a strict process we were able to keep quality very high; in the two years that we had this setup I can't recall a single occasion where we experienced a service failure. Of course there were bugs etc. but we were extremely happy with the low incidence rate of these.
On another note, I'm now in one of those startups without any sysadmins. We use Heroku with a handful of addons. This setup has quite frankly blown me away. You can have the best of everything, monitoring, backups, cron jobs, error tracking, memcached servers etc. with no admin required beyond switching the services on. Having used such a setup I see the requirement for sysadmins being far diminished.
Your post is both depressing and familiar. I think the reasonable long term solution is that each dev team have one admin as well. That way you have "your guy" who knows what's going on with the code being developed/deployed.
I'm not sure if there would be enough work for a full time admin / team so maybe one admin could be shared among two or more dev teams.
That's how it starts. The scenario goes something like this: Central IT is not responding in a "timely" manner. Bosses meet, but no common ground is reached. <INSERT BAD EVENT HERE> Developers complain to Boss that they are not to blame. Boss goes to his Division Boss. Higher level boss disagreement. Division Boss gets his own admin and servers. Central IT is no longer the only IT as Divisions now have IT. < TIME PASSES > New plan to centralize all IT. < CYCLE REPEATS >
One of the problems with mid-sized companies is that they're big enough to need separation of responsibilities and processes, but not quite big enough to have the time, resources or manpower to put processes and the tools that enforce those processes in place. And that's if they even recognize the issue in the first place.
By the way, the answer in the situation you've detailed is to include the administrators within the project and assign the installation task to them within the project plan. Even if it just seems like a "bookkeeping" change (and it is), it serves the purpose of highlighting who is responsible.
Also, if an install is critical enough, don't be afraid to escalate the issue as high as it needs to be up the management chain. If the sysadmin manager isn't available/responsive, then go to their manager (you might need to go up in parallel on your own management chain first). If you're hunting down the sysadmins on your own, it basically means that management isn't doing their jobs and/or the project really isn't a priority to them.
All of these issues could be fixed with better process, better people (that grumbling about 12% extra load sounds like bullshit, it's not like they are serving the pages themselves) and better alignment of the dev/ops team.
These problems are not intractable. What you've described is a giant management failure.
Never ever seen a company I've worked for admit a problem is because they've employed a bunch of twats; problems are always due to an insufficient number of processes. Solution? More processes.
I'm not sure how accurate or exaggerated the OP's examples are, but here is how I'd fix a number of them:
they snarl and say, "load on the consumer site has been up 12% all week, we have bigger problems" and mumble a bunch of other things about permissions and server racks and subnets and all you know is that whatever to-do list they're working off of, you're all the way at the bottom.
Have a regularly scheduled release train so everyone in the company knows when new releases can go out (i.e. every Friday night, etc.) and can plan ahead accordingly. If a release misses the train, then it can wait for the next one to leave the station. This allows all groups to plan ahead, make sure they have resources available, etc.
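To make the train idea concrete, here is a minimal sketch of working out when the next departure is. The Friday 22:00 slot is only an assumption standing in for "Friday night", and the function name is made up:

```python
from datetime import datetime, time, timedelta

FRIDAY = 4  # datetime.weekday() counts Monday as 0


def next_train(now, weekday=FRIDAY, departs=time(22, 0)):
    """Return the next scheduled release slot at or after `now`."""
    days_ahead = (weekday - now.weekday()) % 7
    slot = datetime.combine(now.date() + timedelta(days=days_ahead), departs)
    if slot < now:
        # This week's train has already left; catch next week's.
        slot += timedelta(days=7)
    return slot
```

A release that is ready on Wednesday gets that week's slot; one that is ready an hour after departure on Friday waits a full week, which is exactly the "it can wait for the next one" rule.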
The production and development environments just aren't in sync and this never gets addressed.
This is pretty inexcusable, it's a matter of laziness or unwillingness to spend the time to make the environment better. You can clearly see the effects of this bad practice when you have releases that fail in production and have to be re-tried several times. If the production environment is so complex that it can't easily be mirrored in development, at the very least the release should be tested in a staging environment (not QA) which does mirror production - so you can test the release process itself.
The fundamental problem that nhashem seems to be describing is admins that don't seem to care too much about whether or not new software releases are being released in a timely fashion to production. While I understand that the admins have a whole lot of other areas to also be responsible for, the entire reason why someone wants to release this software in the first place is that there is some business value in the release/new version/feature etc. Not allowing this release to reach customers as fast as possible reduces its value. If the company is not focused on getting value in front of the customer as quickly as possible, then why are they even spending the time developing the software/changes in the first place?
This is why it's a management failure - a failure to plan, get teams to work together, and cut things down to the bottom line of delivering value to the customer as rapidly as possible while still maintaining stability and other responsibilities. Companies that can't do these things will have their lunch eaten by competitors who can.
Developers can touch the code but not the data, production/operations can touch the data but not the code. That's the oldest quality control rule in IT to prevent fraud.
According to the regulations that govern my employer's domain, developers cannot have access to production. We have a sysOp representative who knows the software (but isn't a dev) who does have access to troubleshoot issues. In cases of serious bugs, we will work directly with the system administrator to troubleshoot.
The point regarding good relationships between engineering and systems is critical. I have a great relationship with our admin, and therefore, when problems arise, we can quickly work together to sort things out.
If your company is obligated to follow regulations like those set forth in something like PCI standards, then devs will certainly not have access to production.
Otherwise, good practice could be to have everything automated in the production env...whatever you're doing on production, make it as scripted/automated as possible. That way, you're forced to examine your processes ahead of time.
I've been at this startup long enough that it's no longer a startup. Originally, the 'developers' were expected to handle everything IT-wise. I had minimal skills in some areas, but it was enough to get the job done and keep the company working.
Now that we're bigger, I keep wishing that we'd divide responsibilities and have SysAdmins that are responsible for servers and their maintenance, and programmers that are responsible for creating code. I've seen so many projects get off-track because programmers are asked to do sysadmin work and end up being late on their official work. And of course, everyone gets pressured for being late then.
Startups need to be nimble and just get things done. That means that many jobs need to get done with only a few people. But once you're no longer a startup, you need to start treating everything like a big company.
I love working for a startup, but working for a company that tries to act like a startup is painful.
Yes. With a deployment tool that provides an audit trail, an easily searchable log with description of changes, locking, change reversion (not revision), and optional change management and code review. They should get read-only access to the webserver configuration and logs, but no actual write access in production. Just enough access that they can fuck up code in production, but enough checks and balances to make un-fucking the code easy for Operations. Obvious practices such as 'never deploy code Friday afternoon' and 'financial code requires change management approval' also help. Limited restarting of services allowed with big fat e-mail warnings to Operations and dev groups, with audit trail.
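The two "obvious practices" mentioned for a deployment tool (no Friday afternoon deploys, change management approval for financial code) are the sort of thing the tool can enforce itself. The sketch below is purely illustrative: the DeployRequest fields, the noon cutoff for "afternoon", and the log line format are all assumptions, and locking and reversion are not covered.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DeployRequest:
    author: str
    description: str
    touches_financial_code: bool = False
    change_approved: bool = False


def policy_violations(req, when):
    """Return the reasons this deploy should be blocked (empty list = allowed)."""
    problems = []
    if when.weekday() == 4 and when.hour >= 12:  # Friday, noon or later
        problems.append("no deploys on Friday afternoon")
    if req.touches_financial_code and not req.change_approved:
        problems.append("financial code requires change management approval")
    return problems


def audit_entry(req, when, problems):
    """One searchable log line per attempt, whether or not it was allowed."""
    status = "BLOCKED" if problems else "ALLOWED"
    return f"{when.isoformat()} {status} {req.author}: {req.description}"
```

Logging blocked attempts as well as allowed ones is deliberate: the audit trail then shows who tried to push what, not just what got through.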