"How do I set the User-Agent string in Java?" - L. Page (1996) (guyro.typepad.com)
173 points by frisco on Oct 6, 2009 | 39 comments



Why didn't he just Googl... er... uh... never mind.


This is magnificent. I imagine decades from now, web archeology will be a true academic discipline.


You should read "A Deepness In The Sky" and "A Fire Upon The Deep"; the topic is actually covered fairly well.


Yes, but allow me to point out that Usenet is not exactly the Web.




Can't blame him for not googling it himself.


This reminds me of Doc Brown in 1955, before he invented his time machine (in Back to the Future).


I imagine his bot has undergone a couple of revisions since then.


I wonder if it still runs under Java 1.0.


It seems they got rid of Java pretty early, fortunately.


Why? Java is extremely well suited to crawling and parsing. Extremely well suited to backend tasks running on servers for months on end without crashing.

It's blisteringly fast, has low CPU usage, and would suit the core task of crawling websites extremely well. The async NIO libs are fantastic for network I/O.

What would you use and why? (And why would you not use Java?)


We've found Java's regular expression capabilities to be a bit frustrating at times. Although they're easy to use, they can be very slow for certain types of regexes. Does anyone know of a "fast regex" class?
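One common cause of slow regex code in Java, offered here only as a guess at what the comment above is running into, is recompiling the pattern on every call (`String.matches` and `String.replaceAll` do this), or using nested quantifiers that backtrack heavily. A minimal sketch of the compile-once approach follows; the class name `RegexReuse`, the `countLinks` helper, and the `href` pattern are hypothetical, chosen for a crawler-like illustration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexReuse {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    // The negated character class [^"]* avoids nested quantifiers, which are a
    // frequent source of heavy backtracking.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Hypothetical helper: counts href attributes across a list of HTML lines.
    static long countLinks(List<String> lines) {
        long count = 0;
        for (String line : lines) {
            Matcher m = HREF.matcher(line); // cheap: reuses the compiled pattern
            while (m.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "<a href=\"a.html\">A</a> <a href=\"b.html\">B</a>",
            "no links here");
        System.out.println(countLinks(lines));
    }
}
```

This does not replace the regex engine with a faster one; it only removes repeated compilation and one avoidable backtracking pattern.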


Perhaps you're talking about Java SE 6, not the Java runtime of 10 years ago?

> Extremely well suited to backend tasks running on servers for months on end without crashing.

Running for months without crashing - maybe. But by the end of that month (well, week really) it will be so slow (ironically, due to memory leaks) that your only option will be auto-restarting it every now and then...

AFAIK Larry and Sergey chose Perl at the beginning. Now it should be mostly Python and C.


I've been using Java for backend/net crawl tasks since about 2001. It definitely improved drastically with the addition of NIO, and there were some irritating segfault issues a few years ago, but nothing a rollback to an earlier JVM didn't fix (until Sun fixed it).

You can certainly run for months without issue (memory/crash/speed) as long as you don't have any leaks in your own code.

I'm pretty sure Java is still widely used at Google.

If I were writing the Google crawler from scratch today, I'd certainly start with Java, then probably use Perl/Python for less critical scripting glue, and maybe rewrite any CPU-intensive stuff in C/asm.


Would you really use Java, or just language X that runs on the JVM?


The Perl runtime is far more stable than Java's. I've never seen Perl crash, whereas Java simply gives up the ghost and goes belly up from time to time (not very often, though).


The internet has a neat ability to remember things.


I thought they were using Python back then? Has anyone ever talked about the structure of Google v1? Code made public?


This is probably the closest you'll find:

http://infolab.stanford.edu/~backrub/google.html


Interesting excerpt on their view of advertising back then:

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is "The Effect of Cellular Phone Use Upon Driver Attention", a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

Looks like they solved the problem by turning it on its head.


For me the first step in figuring out a solution is to clearly understand the problem. It always seems like the solution is far easier than that first part. Well... then there's implementation :)


I am reminded of the Feynman Algorithm:

Write down the problem.

Think real hard.

Write down the solution.

The Feynman algorithm was facetiously suggested by Murray Gell-Mann, a colleague of Feynman, in a New York Times interview.

The first step is usually the hardest.


With Eclipse you'll never have to worry about a question like this.


The hilarious bit is that they dropped Java for Python, probably due to the massive levels of frustration encountered when trying to do simple things like this.


In practice, almost all core code is C++/Java (the vast majority is C++, though it's slowly shifting), with Python relegated mainly to scripting glue.


Setting a user-agent isn't exactly a core language thing. HTTP is an add-on library, of which there are many to choose from.

I know you saw the title and thought "ahahaha easy time to bash Java again", but it's really ignorant to do so.


It is reasonable to recall that Java was being promoted in 1996 as the "net" language, yet lacked basic mature libraries for being so. Java only became a decent language with mature libs after many years of front running this position. Some of us that used Java early on, mainly because business drivers forced it on us (Sun/IBM/BEA wouldn't lie to corporate America?), don't have fond memories.


What was python being promoted as in 1996?


No idea. I'm not even sure that in 1996 I had heard of Python. Not sure why folks on this thread want to keep bringing Python into the mix, especially in a 1996 context. Others have stated that Python isn't and never has been the core of Google's search engine.


According to their 1998 paper (http://infolab.stanford.edu/~backrub/google.html):

"Most of Google is implemented in C or C++ for efficiency and can run in either Solaris or Linux."

and also

"Both the URLserver and the crawlers are implemented in Python."

So at some point they decided to try out Python for certain tasks.


It was simple. This is the code he had to use (assuming he used URLConnection):

connection.setRequestProperty("User-agent", "GoogleBot/0.01");

I had never done this before and found it out by looking at the javadoc. I don't see how switching to Python would have made this easier.
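For context, the one-liner above can be placed in a complete program along these lines. This is a minimal sketch assuming a modern JDK rather than the 1.0 runtime discussed in the thread; the class name `UserAgentExample`, the helper `openWithUserAgent`, and the `http://example.com/` URL are hypothetical. The header is only recorded on the connection object here, and no request is sent until the connection is actually used (for example via `getInputStream()`).

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

public class UserAgentExample {
    // Hypothetical helper: opens (but does not connect) a URLConnection
    // with the given User-Agent request header set.
    static URLConnection openWithUserAgent(String url, String userAgent) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        // Request properties must be set before the connection is established.
        connection.setRequestProperty("User-Agent", userAgent);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection =
            openWithUserAgent("http://example.com/", "GoogleBot/0.01");
        // Reads back the header value stored on the connection; no network I/O yet.
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```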


You did that in 1996?


Maybe older versions of Java didn't support this or didn't document it properly? Just saying.


What version was he running?

Wikipedia's history of Java versions says 1.0, as does the request string in the article. [http://en.wikipedia.org/wiki/Java_version_history]

Was URLConnection available back then?

According to the URLConnection docs it's been around since JDK 1.0. [http://java.sun.com/j2se/1.3/docs/api/java/net/URLConnection...]

Did URLConnection.setRequestProperty exist back in JDK 1.0?

The closest I could find were the docs for JDK 1.1.8 in a downloadable zip file, and yes, setRequestProperty existed back in JDK 1.1.8 at least.

Looking at the actual response and the JDK 1.1.8 docs, he would probably have been using HttpURLConnection (I could not find HttpClient anywhere in the JDK 1.1.8 docs), and even on the HttpURLConnection page in JDK 1.1.8 I could not find the string 'agent' anywhere.

So yea, if the settings were there they were buried and not readily accessible in the documentation of the time.



Yea, but there still isn't anything in those docs referring to the User-Agent string/property.


I was simply curious, thanks for digging in :)


Bottom line is that he saw potential in that setting...and what potential!



