The GitHub GraphQL API (githubengineering.com)
284 points by samber on Sept 14, 2016 | 66 comments



Neatly coinciding with GraphQL's announcement 'Leaving technical preview' [1].

In hindsight, sending a query, written in a query language, from client to server seems obvious. So obvious that I think I've seen it before...

  select login, bio, location, isBountyHunter
  from viewer
  where user = ?
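For comparison, a sketch of the equivalent query against GitHub's GraphQL API, using the same field names as the SQL above ('viewer' is the field GitHub exposes for the authenticated user):

  query {
    viewer {
      login
      bio
      location
      isBountyHunter
    }
  }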
It's ironic to me that it took Facebook reinventing SQL (or a graph-database equivalent thereof [2][3][4]) and GitHub embracing it to legitimize this practice, since if you were doing this before, you were judged in the eyes of your peers and clients for not being "RESTful" (the fake-REST kind [5]), as if everyone was just itching to PUT and DELETE blobs of JSON for your poorly mapped resources to quasi-hardcoded, templated [6][7] URLs.

What's old is new again, but this time I'll take it.

[1] http://graphql.org/blog/production-ready/ [2] https://neo4j.com/developer/cypher-query-language/#_about_cy... [3] http://tinkerpop.apache.org/ [4] https://www.w3.org/TR/sparql11-query/ [5] https://news.ycombinator.com/item?id=12479370#12480408 [6] http://swagger.io/specification/#pathTemplating [7] http://raml.org/developers/raml-200-tutorial#parameters


> Neatly coinciding with GraphQL's announcement 'Leaving technical preview'

We coordinated with Facebook to do that. ;)


No, what's old is old again. FQL was a SQL-like syntax and it was totally busted. GraphQL isn't just a structured query language, it's a query language on a unified hierarchical data structure, and clients can make use of this hierarchy and unification to issue queries local to data and batch them to optimize query size. This in turn lets you have both an effective cache and a minimal distance between query and output to UI.


Before GraphQL, the public Facebook APIs had FQL, which provided a SQL-like language that allowed doing server-side projections and joins:

    SELECT uid, name, pic_square
    FROM user
    WHERE uid = me() OR
      uid IN (SELECT uid2 FROM friend WHERE uid1 = me())


Also reminds me of CMIP and GDMO in old ITU network management standards in the mid 90s. They were RESTful, object-oriented, and you could make some pretty expressive queries with them. The standards failed probably because they were way ahead of their time: too many new concepts, too complicated compared to SNMP, and the documents were a very boring read.


Or CAML... hides


The main thing is making it secure and scalable. The old client-server infrastructure for database apps was not designed to be deployed over the Internet.


> secure and scalable

> over the Internet

What does this really mean? How does this substantially differ from the now-facetious term 'web scale' [1] ?

Modern RDBMSes offer encrypted communications between client and server [2][3][4], but nothing is stopping a deployer from putting a proxy between the database and the Internet-resident end user to translate between authentication mechanisms. In fact, this happens in every single API today, where the Basic or OAuth or Cookie-Auth request comes in the HTTP body, and gets made into a database query in a query language the DB understands -- SQL or something else.

GraphQL needs you to bring your own authn/z [5][6][7] much the same way that your run-of-the-mill HTTP API needs you to bring your own authn/z, so I can't accept 'secure' as an innovation over SQL. "Scalable?" How?

I genuinely think this tech is neat, but let's consider what it gives you. GraphQL packages together the query language known as GraphQL, the IDL/typespec language known as GraphQL, and the server spec known as GraphQL which a GraphQL-compliant server must implement [8], inside which you have to hook up your actual datastore to the GraphQL server [9][10][11][12]. It's a useful tool to build a client-server system where you're going to make a backend query in the end, but I wonder if its full potential will only be realized when there are GraphQL-capable datastores that skip the extra data-munging layer you currently have to implement yourself. But if SQL were passed on the TLS-secured wire instead of this bespoke format that no DB yet understands, would we collectively freak out? What's the difference?

[1] https://www.quora.com/What-is-an-explanation-of-the-punch-li... [2] https://technet.microsoft.com/en-us/library/ms191192.aspx [3] https://docs.oracle.com/cd/E11882_01/network.112/e40393/asoj... [4] https://www.postgresql.org/docs/9.5/static/ssl-tcp.html [5] https://medium.com/the-graphqlhub/graphql-and-authentication... [6] https://github.com/mostr/graphql-auth [7] http://stackoverflow.com/questions/34952792/how-do-i-structu... [8] https://facebook.github.io/graphql/#sec-Execution [9] https://www.reindex.io/blog/building-a-graphql-server-with-n... [10] https://www.compose.com/articles/using-graphql-with-mongodb/ [11] https://medium.com/apollo-stack/tutorial-building-a-graphql-... [12] http://stackoverflow.com/questions/35940528/how-to-connect-g...


I think it's secure in the sense that you can easily add your own authentication scheme that integrates with the schema, whereas this is basically impossible if you were accepting SQL statements (short of parsing them yourself, which seems more difficult than parsing GraphQL).


In a traditional RDBMS you would let the database enforce schema-level authn if that was a concern, so you wouldn't have to parse anything. You'd just connect to the database as the appropriate user and send the query.


I really don't think this is practical for most RDBMSes; I don't think database auth schemes are built to support millions-of-users scenarios. I'm also not sure what sort of support there is for row-level authorization.


To me this is sort of like WebGL versus GL. WebGL is a security risk, but I'd still trust browser vendors to lock it down better than I trust the authors of video card device drivers who probably weren't even thinking about the possibility of hostile API calls.

Similarly, I'd expect Facebook to be thinking about Internet attacks all the time due to real-world experience with hostile users, versus a database company where the typical use case is logins from authorized employees and DBAs. They may try to get security right, but it's not going to be top of mind, and they have other concerns.


Based on how the query executor currently works, I wouldn't call it scalable. Sure, it can scale, but "scalable" implies it's efficient, and in many cases it isn't. Basically, the more complex your GraphQL query gets, the more steps the GraphQL server has to take to resolve it, and figuring out how to combine those operations into something resembling a single SQL query isn't an easy thing to do. It scales so well for Facebook because their implementation only reads from a cache, so they don't really have to worry about it.

Aggregating various data points manually in a predefined way may be less flexible than GraphQL, but it's much easier to optimize when things get complex imo


Except they removed the physical knowledge and just went with logical associations.


SQL (at least the subset used for querying and modifying data) contains no physical knowledge either. In a SELECT query, a physical table and a logical view look and behave the same.

If you only give your users access to views instead of physical tables you get the same result as GraphQL, just with different syntax.


Special shout out to the open-source contributors and members of the community who helped us build this:

https://github.com/rmosolgo/graphql-ruby

https://github.com/shopify/graphql-batch

https://github.com/github/graphql-client

https://github.com/graphql

We <3 your work and are thrilled to have built this with you!

Please make sure to give us feedback during this alpha stage! https://platform.github.community/


Are you guys using Rails with graphql-ruby? Would love to see a blog post with more details about the backend implementation!


The post links https://github.com/github/github-graphql-rails-example - "a small app built with Rails that demonstrates how you might interact with our GraphQL schema."


This just shows that they've made use of a graphql Rails client; the GP asked about the backend which could be in a different language.


I'm impressed, but for other reasons. For one, I have no idea how to properly implement this. I mean, it really looks like a lot of trouble mapping this from GraphQL to... SQL? And what if the system is using some kind of NoSQL database which does not really have a very verbose query language, if any? Complexity just seems to explode. Somehow I feel there is also a risk of the client making quite a sub-optimal query, so probably some kind of policy should be implemented. All in all, there is a level of manageability that looks lost to me if GraphQL is implemented improperly, and to be honest, it looks easy to implement improperly. I'm really looking forward to some book or guide, since the implementation is puzzling to me.


Has anybody considered this problem at all? (Giving too much flexibility to the client and allowing non-optimal queries, like joining several big tables or data collections without proper index support.) It's so weird that all the materials I've seen about GraphQL hush up this question, which is essential for the future of this technology.

And it's so similar to the ORM issues the whole industry has experienced over the past 20 years. But perhaps more dangerous due to the public nature of many APIs.


Instead of thinking of it as parsing queries, think about it as nested RPC. It's quite reasonable for an implementor to set a time limit or call limit to keep algorithmic complexity attacks under control.
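A minimal sketch of such a limit, assuming a graphql-js style server (MAX_DEPTH is an arbitrary cutoff, and fragment spreads are ignored for brevity): parse the query and reject anything nested too deeply before resolving it.

    import { parse } from 'graphql';

    const MAX_DEPTH = 10; // arbitrary cutoff; tune for your schema

    // Recursively find the deepest selection in the parsed query AST.
    function depthOf(node, depth = 0) {
      if (!node.selectionSet) return depth;
      return Math.max(...node.selectionSet.selections.map(s => depthOf(s, depth + 1)));
    }

    // Run this before execution; a resolver-call counter or timeout works similarly.
    function assertQueryDepth(queryText) {
      const doc = parse(queryText);
      for (const def of doc.definitions) {
        if (depthOf(def) > MAX_DEPTH) throw new Error('query too deeply nested');
      }
    }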


OK, but a nested loop loses to other join algorithms (like merge join or hash join) in so many situations when you deal with large datasets, right? So again, it will be inefficient by default in so many cases.


If you're talking about GraphQL, the implementation is undefined by default. It's just a query language spec; there isn't a "GraphQL Server" product. You can resolve data from many sources: a NoSQL database, a SQL-backed Hadoop cluster, etc. It's very much just the language the client talks to the server in.

If the client requests exactly what it needs, that shouldn't be more stressful on the server-side than spamming REST requests for all the same resources. Plus, it's easier to optimize when you know what the client wants. If there's something expensive, you could, for example, cache/index something extra. If the client were doing it themselves with a series of REST calls, you wouldn't be able to understand the real use-case. Even if you did know what aggregation they really needed, you wouldn't be able to fix the problem without updates to both the service and the clients.

Either way, it's easier to set sane limits than craft un-DOSable APIs. There is always a cost to satisfying queries. If you're trying to run a free service, it's a much bigger concern. If you're paying the bill, you're incentivized to investigate expensive/slow calls.


Disclaimer: I'm just talking from a REST developer perspective.

The nice thing about REST calls in their current form is that they are just that: calls. With proper monitoring you can see which ones you get more or less of, and with which parameters. They can be optimized as well as possible, but separately. You are right that it takes more analytics to figure out a series of calls (based on some token?) and maybe bundle them up, introducing a new endpoint (thus not breaking old clients).

But then again, that is still just the one "query". With GraphQL it could be anything, and that's what bugs me. I find it challenging, in a good way.

Another thing I'm not sure about is the queries themselves, or rather, the number of different ways you can write a query. Multiple users can request the same data, or almost the same, with queries written in different ways. Backend developers should then guarantee that those queries will be executed in a similar way, with predictable performance, much as SQL query optimization does. I had the "joy" of working with a database that had hugely different performance with just trivial changes in the query (it was not relational; it is actually discontinued now, thankfully). It was a huge PITA. I wouldn't like to serve an API like that.


It is definitely a problem with GraphQL. Some implementations try to set a time limit for queries, or they try to estimate the complexity of the query. Both are quite hard to do reliably.

GraphQL essentially moves a lot of complexity that you usually have on the client side (determining what things you need, making several subqueries, combining results) to the server side. This is great for performance- or bandwidth-constrained clients, but it might impact the required server performance in a negative way.

I experimented a little bit with GraphQL as a query language for services on performance-constrained embedded systems instead of more RPC-based approaches, but went back to the latter as it's far more predictable, and I often didn't need the flexibility of arbitrary queries.


When you implement a GraphQL server, you don't map GraphQL to SQL. Instead, for each object type in your API, you define how to resolve each field, and you can use one of the various GraphQL server libraries to go from those object types to serving a whole API.

I would get more specific but it depends on which programming language you want to use. Check out the code examples & links to libraries in different languages on http://graphql.org/code/
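To make it concrete in one language anyway, here's a minimal sketch with a recent graphql-js, where the User type, the loadUser lookup, and the currentUserId context are all hypothetical placeholders for your own schema and datastore:

    import { graphql, buildSchema } from 'graphql';

    const schema = buildSchema(`
      type User { login: String, bio: String, location: String }
      type Query { viewer: User }
    `);

    // Hypothetical datastore call; swap in your SQL/NoSQL lookup.
    async function loadUser(id) {
      return { login: 'octocat', bio: 'hi', location: 'SF' };
    }

    // One resolver per top-level field; the library walks the query,
    // calls resolvers for the requested fields, and shapes the response.
    const root = {
      viewer: (args, context) => loadUser(context.currentUserId),
    };

    graphql({ schema, source: '{ viewer { login bio } }', rootValue: root, contextValue: { currentUserId: 1 } })
      .then(result => console.log(JSON.stringify(result.data)));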


I'd love to hear how Github is doing ACL here. We came up with a pretty neat solution on my team, which we have not yet open-sourced, for JS. But it was a lot of first-principles design work; there don't seem to be any good examples.

This was pretty much all the documentation we had, and it's more a design analysis of edge-vs-node authorization: https://medium.com/apollo-stack/auth-in-graphql-part-2-c6441...

Edit: Our eventual solution looked a lot like

    class SomeTypeOfResolver {
      @allowIfAny(rule1, rule2, rule3)
      someProperty;

      @allowIfAll(rule4, rule5)
      otherProperty = defineRetrieverFunction();
    }


Since the announcement of GraphQL I've been waiting for some 'real world' APIs. (Sure, the Star Wars GraphQL APIs are fun.)

Does anyone know any best practices if you want to adopt this in an existing application using a relational database (e.g. PostgreSQL)? I don't know how to implement this without causing N+1 queries (or worse).

For example:

    {
      Post {
        title,
        content,
        Author {
          name,
          avatar,
        },
        Comments(first: 10) {
          ..
        }
      }
    }

A naive implementation would cause a lot of queries: one query for each "edge".


Yup. That's something we can handle internally, under the hood, batching database requests into a single query for all edges. This problem is actually easier to solve in GraphQL than it is with a traditional REST API.


Exactly: now the onus to solve N+1 is on the server side, whereas in the old case the client would do N+1 requests...


You can use a helper library like the wonderful graphql-sequelize [https://github.com/mickhansen/graphql-sequelize], which helps you get from an ORM like Sequelize to GraphQL queries/results fairly easily.


You can use something like dataloader[0] or haxl[1] to help with this issue.

[0]: https://github.com/facebook/dataloader

[1]: https://github.com/facebook/Haxl
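A sketch of the dataloader pattern (db.query and the users table are hypothetical stand-ins for your own storage; the key property is that the batch function returns results in the same order as the keys):

    import DataLoader from 'dataloader';

    // One batched query for all ids collected during the current tick.
    // db is a hypothetical database client (e.g. node-postgres).
    async function batchLoadUsers(ids) {
      const rows = await db.query('SELECT * FROM users WHERE id = ANY($1)', [ids]);
      const byId = new Map(rows.map(r => [r.id, r]));
      return ids.map(id => byId.get(id)); // must match the order of ids
    }

    const userLoader = new DataLoader(batchLoadUsers);

    // Called once per comment in the resolver, but the N load() calls
    // coalesce into a single batchLoadUsers invocation.
    const resolveAuthor = comment => userLoader.load(comment.authorId);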


In the Rails/GraphQL combo it's resolved the same way you usually resolve N+1 queries: you make one request per type of edge (i.e. one request for posts, one request for authors, one request for comments).


Fascinating, this. But it glosses a bit over the cost of generating bespoke responses to every request. I wonder how it works if expensive queries are implied in the request. You also need smarter caching, I imagine.


The implication in this article and elsewhere is that the most expensive queries happened anyway, just as sequences and patterns of requests before. I imagine that what is lost in caching common requests is gained back by being able to pattern-analyze the bespoke requests as a whole, and prioritize and even quota them specifically by type, much more so than you could by just guessing the pattern of incoming REST URL hits.

Just as a relational database itself can sometimes generate better query plans from its query analyzer if you feed it what you are really after in one big, slow query that narrows to very specific rows, rather than lots of small queries that return lots of rows quickly. Amortized against the database's time (CPU, memory) and bandwidth, that slightly slower query is still sometimes a big win for overall performance.

(Given that most GraphQL services are typically backed by relational databases, it should probably not be a surprise the savings sometimes get passed right along.)


Yes, but if in the past you could blame your internal developer for writing a slow SQL query, now these rich possibilities to run really stressful queries against your DBMS are open to the public, right?

The main riddle to me is: in the case of an RDBMS on the backend, how can we guess in advance which indexes are needed, and how can we forbid/limit all "heavy" queries?


GraphQL has a notion of query complexity that you can use to forbid certain queries.

Also, where appropriate, you can still gain a lot with caching.

Simple example: cache the query response containing all fields of a table, and extract the fields required for the response.


Many people smarter than me think that caching [SQL] queries proved to be a really bad idea (regardless of DBMS). Why would caching API queries be any better?

Even if I did use query caching, I can't imagine how it would help me deal with really long-running queries (say, ones lacking proper indexes). Caching doesn't help when your query runs for 30 seconds; nobody will wait (and populate the cache for others) that long.


This definitely enables a lot of opportunity to do both smarter querying and smarter caching on the back-end.

While you can indeed perform larger, more complex requests, GraphQL by nature forces queries to explicitly ask for everything you want to get back. As a result, we're not wasting any capacity giving you back a bunch of data for an entire object that you don't need like we would in a normal REST API request.

The thing that I'm most excited about with all this is the fact that we're building new GitHub features internally on GraphQL as well. This means that unlike a traditional REST API, there will no longer be any lag time between features in GitHub and the GitHub API.

API is a first-class product now. API consumers get features as soon as everyone else!

Please make sure to give us feedback during this alpha stage! https://platform.github.community/


There are a lot of cases where a user needs just 20 rows out of billions, but getting them is a very hard problem. One such example is well known: the Twitter-like data model, where a user can "follow" thousands of others and you need to fetch the top N recent posts from all those thousands of streams.

In Postgres, the straightforward approach to querying such data is based on JOINs, and it's absolutely inefficient. This can be dramatically optimized with recursive CTEs, arrays, and the loose index scan approach, but by default GraphQL will do straightforward JOINs, right?

I hope GraphQL has (or soon will have) ways to override/redefine queries, but again, that leads us to the same "patch-driven development" problems that everyone hated in ORMs for decades. That's why I'm saying that GraphQL is "a new ORM", but more dangerous due to its openness and proximity to web users, and why it can bring even more dev and devops pain to the world than ORMs did over the decades.


Is it available client-side? (CORS/OAuth/etc.)


The dirty little secret of "RESTful" APIs is that everyone was pretty much generating bespoke responses to every request anyway out of data coming out of a datastore.

Nobody has static files sitting on a server anymore except for static auxiliary assets (CSS, scripts, fonts) or if you're actually running a website with static content, which is exceedingly rare. Everyone has some kind of request router that parses the URL and the body and figures out what to do next, makes a query to a backing database, then assembles and massages the response to make it look like the mediatype the client expects.


Using a CDN in front of an API is not that uncommon, from what I have seen.

And with clients basically creating their own queries, I imagine the performance implications will be less predictable than with a more rigid REST API.


Fair point.

To get better caching, one could:

- Canonicalize the query to identify queries that look different but are actually equivalent; use the hash of the canonicalized query as a cache key. You can do this on the edge cache, or on the backend. (A sketch follows after this list.)

- Cache more full-bodied resources, where additional fields are present, and perform the filter at evaluation time.
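A sketch of that first idea, leaning on graphql-js for the canonicalization (parse + print strips formatting differences; note it does not reorder fields, so a full solution would also sort selections):

    import { createHash } from 'crypto';
    import { parse, print } from 'graphql';

    // Queries that differ only in whitespace/formatting hash to the same key.
    function cacheKey(queryText) {
      const canonical = print(parse(queryText));
      return createHash('sha256').update(canonical).digest('hex');
    }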


One really great feature of GraphQL (which GitHub doesn't support yet in this alpha, but we plan to) is the ability to store queries for execution later. This lets us optimize and plan for the data and volume of requests being generated from a given query. Other soon-to-be-added features, like subscriptions, where we only return the data you need when it changes, help a lot on this front as well.


I've been looking into implementing something like this @ Gogobot as well. This eliminates the need for all `/v/` type API versioning. The client requests what it needs for this request.

Experimenting with this we often saw a 50-70% reduction in the payload being sent to the clients in some requests. If I only need the first name, last name, and avatar from the User object, there's no need for my response payload to suffer because other requests need 30 fields from the same object.

Implementing this without causing a lot of N+1 queries is the tricky part and that's where we're currently investing most of our time.

Awesome to see Github adopting this and releasing it to the public API.


In theory a GraphQL API can operate version-less, utilizing things like deprecation notices and field aliasing to smooth over any rough edges. Once we see calls on a certain thing reach zero and sustain that level, we can actually remove it and never have to bump a version anywhere.

That's the dream. We'll see how reality plays out.

For reference, we actually launched with some deprecated fields (see "databaseId" on the "Issue" type -- database IDs will be phased out for global relay IDs eventually) if you want to see what they look like.
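In schema terms, a deprecated field looks something like this (a sketch using graphql-js's schema language; the reason string here is made up):

    import { buildSchema } from 'graphql';

    // '@deprecated' lives in the schema itself, so clients and tooling
    // see the notice without any version bump.
    const schema = buildSchema(`
      type Issue {
        id: ID!
        databaseId: Int @deprecated(reason: "Being phased out for global relay ids.")
      }
      type Query { issue: Issue }
    `);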


That theory is Facebook's practice. Four years later and we're on GraphQL API version 1. We add things, which is safe; deprecate fields we want to remove; and delete them when hit rates drop to 0.

Granted, our clients are all Facebook engineers, so we have some pull in helping the migration away from deprecated fields, and GitHub will have to find the right process which works for a broader set of API consumers, but not only is this theory a good one, it's considered GraphQL best practice.


They mentioned using https://github.com/shopify/graphql-batch which was designed to solve this problem. This is similar to https://github.com/facebook/dataloader which solves the same problem for javascript.


Awesome to see GraphQL get some mainstream adoption, hopefully this leads to some more community tools for consuming it :) Relay is an awesome concept, but the learning curve is pretty steep.


Check out these other API consumption clients: http://graphql.org/code/#graphql-clients


Is this then the GitHub v4 API? Should we expect the REST API to be deprecated in the future?


GraphQL increases speed: "Using GraphQL on the frontend and backend eliminates the gap between what we release and what you can consume"


I'm intrigued by GraphQL, but I don't understand what separates it from passing "fields" and "embeds" parameters in a REST API. I don't see what about it would be inherently easier to implement either.

In my free time I've sparingly been working on a project that does exactly this with a REST API [1]. It's in an entirely unfinished state, but the linked documentation is a decent example of the types of queries possible.

[1] http://drowsy.readthedocs.io/en/latest/querying.html


For me, it's easiest to pretend that GraphQL is a DSL (a domain-specific language) that offers special syntax to make it easier to implement certain things.

- You can pretend that each GraphQL query is a JSON object (which it actually is)

- You can pretend that each GraphQL schema that you declare is actually a JSON-Schema document, which some people use to specify in a machine-readable way what your API's inputs and outputs will look like

- You can pretend that each GraphQL resolver, which is the piece of code you have to write (on the server) to actually dig up the result of a query, is a function that parses your incoming JSON, validates it against your JSON-Schema, and then reaches out to your datastore to produce a result. You'd then have to construct another JSON document which matches your response schema, stuff the data in it, and return that to the user. Except that in GraphQL, you only have to supply the resolver; the rest is handled by the framework.

You can of course do this by hand and many people do (most obviously when you see APIs that include arguments like "operator=eq" or "limit=100" or "page=25"), but GraphQL gives you the tools to do this with less effort, and end up with a cleaner API by passing everything in the query body. And the GraphQL server saves you from having to manually build up the JSON text of every single response.

Reading through your docs, I've seen this style of API in enterprise settings where there was a backing relational database and the designers were basically trying to expose the underlying database through HTTP. It can get the job done, but GraphQL gives you nicer abstractions, a cleaner way of passing parameters, and conveniences like a real type system (known on both the client and server side), and you only have to supply your resolver function implementations.


I totally see the appeal on the surface level of a nicer/cleaner looking query language and API.

Still, pretty much everything else you mentioned seems doable with REST. I'm using Marshmallow schemas in Python, which seem to act on a field-by-field basis in much the same way a GraphQL resolver does. I'm not sure what exactly I'd be getting outside of a slightly nicer/cleaner abstraction by moving to GraphQL, but maybe that's enough?


GraphQL API Explorer wants this permission:

> Public and private: This application will be able to read and write all public and private repository data. This includes the following: Code, Issues, Pull requests, Wikis, Settings, Webhooks and services, Deploy keys.

Why oh why?


Because GitHub's GraphQL API will let you do all of those things via the GraphQL API Explorer ;-)


I think it's because you can use the GraphQL mutations to do things like create comments, etc.


You don't trust GitHub with access to your GitHub repositories?


Slightly off topic, but GraphQL vs. OData?


They're similar; both technologies allow clients to specify the data they need. I would say that GraphQL is more flexible and has a strong type system. For example, the filter semantics are defined within OData, while in GraphQL, you define how your data can be filtered within your schema.

As a personal opinion, I also feel OData exposes an API that is too tightly coupled to the persistence layer. GraphQL objects and properties are all backed by arbitrary "resolver" functions, which means you can stitch together multiple/legacy backends to generate your response.


What method do you use to get data? You can't put a body in GET requests, so I assume with GraphQL you use POST to get data?


They use POST ( https://github.com/github/graphql-client/blob/9e1fa16cf88de4... ) which, as far as I know, is common for query APIs like this.


You can send a GraphQL query as part of the query in your HTTP GET request: http://graphql.org/learn/serving-over-http/#get-request.
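Either way it's the same query text on the wire; a sketch of both transports (TOKEN stands in for a real OAuth token, and the GET host is a made-up example, since GitHub's endpoint accepts only POST):

    // POST: the query travels in a JSON body.
    fetch('https://api.github.com/graphql', {
      method: 'POST',
      headers: { Authorization: 'bearer TOKEN' },
      body: JSON.stringify({ query: '{ viewer { login } }' }),
    }).then(r => r.json()).then(console.log);

    // GET: the same query, URL-encoded into the query string.
    fetch('https://graphql.example.com/graphql?query=' +
          encodeURIComponent('{ viewer { login } }'));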



