Finds by @samoli

Ruby/Rails developer and designer in London.

Here are articles that I've read and found interesting, mostly from places like Hacker News and a collection of RSS feeds that I follow. It's mainly here for my own records but you're welcome to follow me.

I'm starring items I find in Google Reader (using Byline) and using ifttt to publish them to Tumblr. This way I can read and reblog articles on the tube (London underground) where there is no Internet connection.

You can also subscribe via email if you like

N.B. I'm investigating why the full contents of the post are sometimes posted. They are intended to be links but sometimes the whole post comes through. Also, I'd like a way to add a note to the post, but since Google removed most of the useful features in Reader, I'm not sure how I can do this.

— @samoli on Twitter.

Let’s make Rails on OS X easy again!

[shared via Hacker News]

Mar 29

Note: This post originally appeared in TechCrunch

Here’s the gist:

  • Rather than using conventional feedback loops, companies today are employing a new, stronger habit-forming mechanism to hook users—the desire engine.
  • At the heart of the desire engine is a variable schedule of rewards: a powerful hack that focuses attention, provides pleasure, and infatuates the mind.
  • Our search for variable rewards is about an endless desire for three types of rewards: those of the tribe, the hunt and the self.

[shared via Hacker News]

Mar 29

What happens when you take a monster 4.1 meter telescope in the southern hemisphere and point it at the same patch of sky for 55 hours?

This. Oh my, this:

[Click to embiggen.]

OK, I know. At first glance it doesn’t look like much, does it? Just a field of stars. However, here’s the important bit: I had to take the somewhat larger original image and reduce it in size to fit my 610-pixel-wide blog. So how much bigger is the original?

It’s 17,000 x 11,000 pixels! If you happen to be sitting on a T1 line, then you can grab this massive 250 Mb file. And I surely suggest you do.

Because yeah, the brightest objects you see in this are stars. Probably a few hundred of them. But you have to look at the bigger image! Why? Because what’s amazing, truly jaw-dropping and incredible is this:

There are over 200,000 galaxies filling this image!

Ye. Gads.

Here’s a zoom of the image, centered on what looked to me to be one of the biggest galaxies in the frame, a nice edge-on spiral.

With the exception of a handful of blue-looking stars, everything in this zoom is a galaxy, probably billions of light years away. Those tiny red dots are galaxies so far away they crush our minds to dust: we’re seeing them with light that left them shortly after the Universe itself formed.

This light is ancient. And it came a long, long way.

By the way, that picture of the spiral there is not even at full resolution! Just to give you an idea, I cropped out just that galaxy in the full-res image and inset it here. If you want to find it in the full frame, it’s about one-third of the way in from the left, and one-third of the way down from the top. Happy hunting.

[Edited to add: I forgot to add that this galaxy is warped! See how the disk flares up on the left and down on the right, just a bit? This is very common in disk galaxies, and our own Milky Way does it too (see #9 at that link). It’s usually caused when a nearby galaxy’s gravity torques on the stars in the disk.]

These images were taken with VISTA, the European Southern Observatory’s Visible and Infrared Survey Telescope for Astronomy, a 4.1 meter telescope in Chile. This huge image is actually composed of 6000 separate images, and is the single deepest infrared picture of the sky ever taken with this field of view. Hubble can get deeper, for example, but sees a much, much smaller part of the sky.

By looking in the infrared we can see farther into space. Because space is expanding, light from distant galaxies gets redshifted (like the Doppler effect on a cosmic scale). Young galaxies are generally furiously forming stars, and that makes them blast out ultraviolet light. But a young galaxy seen from very far away has that UV light redshifted into the infrared. So from here, billions of light years away, we see it pouring out IR light. Looking there means we can see these extraordinarily remote galaxies more easily. If you scan the full-size image, you’ll see lots of tiny, very red dots. Those are most likely the most distant objects in this picture, appearing redshifted, dimmed, and shrunken due to their terrible distance.
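
To put a rough number on that shift, here is a minimal sketch. It relies on the standard relation that cosmological redshift stretches a wavelength by a factor of (1 + z). The emitted wavelength and the redshift below are illustrative assumptions, not values from this image.

```python
def observed_wavelength_microns(emitted_microns: float, z: float) -> float:
    """Redshift z stretches the emitted wavelength by a factor of (1 + z)."""
    return emitted_microns * (1.0 + z)

# Assumed values for illustration: 0.15 microns is ultraviolet light,
# and z = 9 stands in for a very distant, very young galaxy.
shifted = observed_wavelength_microns(0.15, 9.0)
print(f"{shifted:.2f} microns")
```

With those assumed inputs the ultraviolet light arrives at about 1.5 microns, well past the roughly 0.7 micron limit of human vision and squarely in the infrared.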

Also, looking in the infrared makes stars that look red to our eyes appear blue in the image! Most of the stars in this image are weak, cool, dim red dwarfs. They look blue because this image is false color. It uses three filters to isolate different colors of infrared: 1.3 microns, 1.7 microns, and 2.2 microns (colored blue, green, and red in this image, respectively). The reddest light the human eye can typically see is about 0.7 microns, so these are well outside human range. Many red dwarfs put out light at 1.3 microns, but not nearly as much at 1.7 and 2.2. Since the 1.3 micron light is colored blue in the image, that makes the stars look blue, even though to you they’d look red!

And what a view! Here’s another interesting bit I happened to stumble on while just scanning this monster image:

Isn’t that interesting? There’s a long jet of material apparently coming from that bloated galaxy on the left (I increased the brightness and contrast of this picture to make it more obvious; it was subtle in the original image but I have a lot of practice picking out things like this). Big galaxies have supermassive black holes in their cores, and these sometimes accelerate huge beams of matter and energy that blast out. But wait! The stream goes right through that smaller galaxy on the right. Is that a coincidence — the jet is coming from the big one and happens to pass in front of a more distant galaxy? Or is that smaller one the source of the jet, and actually has two jets coming out of either side? That’s actually a more common occurrence. Beats me. I could argue either way. We’d need spectra of the galaxies to know for sure.

And funny: I went back to the original image to see where I cut that galaxy out, and now I can’t find it. Holy crap. I mean, seriously, I couldn’t find it. That’s how big this image is.

Of course, you can find a dozen galaxies just like it. I also found several gorgeous spirals (look all the way on the left; one is cut off on the edge of the frame and it’s really something). Some were edge-on like the one above, others face-on. There are countless blobby ones, and even more that are just dots, so far away we see them as dimensionless points.

I’ve spent years studying all this, and it still sometimes gets to me: just how flipping BIG the Universe is! And this picture is still just a tiny piece of it: it’s 1.2 x 1.5 degrees in size, which means it’s only 0.004% of the sky! And it’s not even complete: more observations of this region are planned, allowing astronomers to see even deeper yet.
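
The 0.004% figure is easy to check. This is a small sketch that treats the field as a flat 1.2 x 1.5 degree rectangle (a reasonable approximation for a patch this small) and compares it with the whole sky, which covers 4π steradians.

```python
import math

# Whole sky in square degrees: 4*pi steradians times (180/pi)^2 square
# degrees per steradian, which comes to roughly 41,253.
WHOLE_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2

def sky_fraction_percent(width_deg: float, height_deg: float) -> float:
    """Percentage of the sky covered by a small rectangular field."""
    return 100 * (width_deg * height_deg) / WHOLE_SKY_SQ_DEG

print(f"{sky_fraction_percent(1.2, 1.5):.4f}%")
```

The field works out to 1.8 square degrees out of about 41,253, which rounds to the 0.004% quoted above.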

Science is wonderful. Building on the knowledge developed before us, our tools improve and our ability to explore expands. Piece by piece, photon by photon, galaxy by galaxy, we’re examining this Universe we live in and understanding it better every day.

Image credit: ESO/UltraVISTA team. Acknowledgement: TERAPIX/CNRS/INSU/CASU


Related Posts:

- Another record breaker: ultra-deep image reveals ultra-distant galaxy
- The Helix screams in infrared
- The Milky Way’s buried treasures
- Spectacular VISTA of the Tarantula

[shared via Hacker News]

Mar 27

If you started out building a dating site and instead ended up building a video sharing site (YouTube) that handles 4 billion views a day, then it’s just possible you learned something along the way. And indeed, Mike Solomon, one of the original engineers at YouTube, did learn a lot and he has given a talk about it at PyCon: Scalability at YouTube.

This isn’t an architecture driven talk where we are led through a description of how a lot of boxes connect to each other. Mike could give that sort of talk. He has worked on building YouTube’s servlet infrastructure, video indexing feature, video transcoding system, their full text search, a CDN, and much more. But instead, he’s taken a step back, taken a long look around at what time has wrought, and shared some deep lessons, obviously hard won from experience.

The key takeaway of the talk for me was doing a lot with really simple tools. While many teams are moving on to more complex ecosystems, YouTube really does keep it simple. They program primarily in Python, use MySQL as their database, they’ve stuck with Apache, and even new features for such a massive site start as a very simple Python program.

That doesn’t mean YouTube doesn’t do cool stuff, they do, but what makes everything work together is more a philosophy or a way of doing things than technological hocus pocus. What made YouTube into one of the world’s largest websites? Read on and see…

Stats

  • 4 billion Views a day
  • 60 hours of video is uploaded every minute
  • 350+ million devices are YouTube enabled
  • Revenue doubled in 2010
  • The number of videos has gone up 9 orders of magnitude and the number of developers has only gone up two orders of magnitude.
  • 1 million lines of Python code

Stack

  • Python - most of the lines of code for YouTube are still in Python. Every time you watch a YouTube video you are executing a bunch of Python code.
  • Apache - when you think you need to get rid of it, you don’t. Apache is a real rockstar technology at YouTube because they keep it simple. Every request goes through Apache.
  • Linux - the benefit of Linux is there’s always a way to get in and see how your system is behaving. No matter how bad your app is behaving, you can take a look at it with Linux tools like strace and tcpdump.
  • MySQL - is used a lot. When you watch a video you are getting data from MySQL. Sometimes it’s used as a relational database, sometimes as a blob store. It’s about tuning and making choices about how you organize your data.
  • Vitess - a new project released by YouTube, written in Go, it’s a frontend to MySQL. It does a lot of optimization on the fly, it rewrites queries and acts as a proxy. Currently it serves every YouTube database request. It’s RPC based.
  • Zookeeper - a distributed lock server. It’s used for configuration. Really interesting piece of technology. Hard to use correctly so read the manual.
  • Wiseguy - a CGI servlet container.
  • Spitfire - a templating system. It has an abstract syntax tree that lets them do transformations to make things go faster.
  • Serialization formats - no matter which one you use, they are all expensive. Measure. Don’t use pickle. Not a good choice. Found protocol buffers slow. They wrote their own BSON implementation which is 10-15 times faster than the one you can download.
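
The "Measure" advice for serialization can be tried with nothing but the standard library. This is a minimal sketch, not YouTube's tooling: the helper name `time_serializer` and both payloads are made up for illustration, and pickle appears only as something to measure. The timings will differ from machine to machine.

```python
import json
import pickle
import timeit

def time_serializer(dumps, payload, repeats=200):
    """Return the average number of seconds one dumps(payload) call takes."""
    return timeit.timeit(lambda: dumps(payload), number=repeats) / repeats

# Two hypothetical payloads with very different shapes.
payloads = {
    "small ints": list(range(100)),
    "large text": {"body": "x" * 100_000},
}
serializers = {"json": json.dumps, "pickle": pickle.dumps}

for payload_name, payload in payloads.items():
    for name, dumps in serializers.items():
        seconds = time_serializer(dumps, payload)
        print(f"{payload_name:>10} {name:>6}: {seconds * 1e6:8.1f} us per call")
```

Running it on your own data shapes is the point: which format is cheapest tends to depend on what you feed it.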

General Lessons

  • Tao of YouTube: choose the simplest solution possible with the loosest guarantees that are practical. The reason you want all these things is you need flexibility to solve problems. The minute you over specify something you paint yourself into a corner. You aren’t going to make those guarantees. Your problem becomes automatically more complex when you try and make all those guarantees. You leave yourself no way out.
  • That whole process is what scalability is about. A scalable system is one that’s not in your way. That you are unaware of. It’s not buzz words. It’s a general problem solving ethos.
  • Hallmark of big system design: Every system is tailored to its specific requirements. Everything depends on the specifics of what you are building.
  • YouTube is not asynchronous, everything is blocking.
  • Believes more in philosophy than doctrine. Make it simple. What does that mean? You’ll know when you see it. If you do a code review that changes thousands of lines of code and many files then there was probably a simpler way. Your first demo should be simple, then iterate.
  • To solve a problem: One word - simple. Look for the most simple thing that will address the problem space. There are lots of complex problems, but the first solution doesn’t need to be complicated. The complexity will come naturally over time.
  • A lot of YouTube systems start as one Python file and become large ecosystems after many many years. All their prototypes were written in Python and survived for a surprising amount of time.
  • In a design review:
    • What’s the first solution?
    • How are you going to iterate?
    • What do we know about how this data is going to be used?
  • Things change over time. How YouTube started out has no bearing on what happens later. YouTube started out as a dating site. If they had designed for that they would have had a different conversation. Stay flexible.
  • YouTube CDN. Originally contracted it out. Was very expensive so they did it themselves. You can build a pretty good video CDN if you have a good hardware dude. You build a very large rack, stick machines in, then take lighttpd, and then override the 404 handler to find the video that you didn’t find. That took two weeks and on its first day served 60 gigabits. You can do a lot with really simple tools.
  • You have to measure. Vitess swapped out one of its protocols for an HTTP implementation. Even though it was in C it was slow. So they ripped out HTTP and did a direct socket call using Python and that was 8% cheaper on global CPU. The enveloping for HTTP is really expensive.
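
The CDN trick above, overriding the 404 handler so a miss goes and finds the video, can be pictured as a pull-through cache. What follows is one reading of that idea as a plain Python sketch, not lighttpd configuration and not YouTube's code. The `EdgeNode` class, the `fetch_from_origin` callable and the choice to keep a local copy after a miss are all assumptions for illustration.

```python
class EdgeNode:
    """Hypothetical edge server: serves local copies, pulls misses from an origin."""

    def __init__(self, fetch_from_origin):
        self._local = {}  # stand-in for the videos stored on this machine
        self._fetch_from_origin = fetch_from_origin
        self.origin_fetches = 0

    def serve(self, video_id):
        if video_id in self._local:
            return self._local[video_id]
        return self._handle_404(video_id)

    def _handle_404(self, video_id):
        # The "404 handler": instead of failing, go and find the video.
        self.origin_fetches += 1
        data = self._fetch_from_origin(video_id)
        self._local[video_id] = data  # assumed: keep it for the next request
        return data

# A dict plays the role of the origin store in this sketch.
origin = {"abc123": b"video bytes"}
node = EdgeNode(origin.__getitem__)
node.serve("abc123")  # miss: fetched from the origin
node.serve("abc123")  # hit: served from the local copy
print(node.origin_fetches)
```

Only the first request reaches the origin, so the counter ends at 1.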

Scalability Techniques

  • These are not new ideas, but it’s amazing how a few core ideas can apply in a lot of different dimensions.
  • Divide and Conquer - The Scalability Technique
    • This is the scalability technique. Everything is about partitioning out work and deciding how to execute it. It applies to many things. Take the web tier: you have a lot of web servers that are more or less identical and independent, and you grow them horizontally. That’s divide and conquer.
    • This is the crux of database sharding. How do you partition things out and communicate between the parts that you’ve subdivided? These are things you want to figure out early on because they influence how you grow.
    • Simple and loose connections are really valuable.
    • The dynamic nature of Python is a win here. No matter how bad your API is you can stub or modify or decorate your way out of a lot of problems.
  • Approximate Correctness - Cheat a Little
    • Another favorite technique. The state of the system is that which it is reported to be. If a user can’t tell a part of the system is skewing and inconsistent, then it’s not.
    • A real world example. If you write a comment and someone loads the page at the same time, they might not get it for 300-400ms, the user who is reading won’t care. The writer of the comment will care, so you make sure the user who wrote the comment will see it. So you cheat a little bit. Your system doesn’t have to have globally consistent transactions. That would be super expensive and overkill. Not every comment is a financial transaction. So know when you can cheat.
  • Expert Knob Twiddling
    • Ask, what do you know about your consistency model? For comments is eventual consistency good enough? Renting a movie is different. When renting there’s money so we’ll do the best we can to never lose that. Different consistency models are needed depending on the data.
  • Jitter - Add Entropy Back into Your System
    • Hot word in their group all of the time. If your system doesn’t jitter then you get thundering herds. Distributed applications are really weather systems. Debugging them is as deterministic as predicting the weather. Jitter introduces more randomness because surprisingly, things tend to stack up.
    • For example, cache expirations. For a popular video they cache things as best they can. The most popular video they might cache for 24 hours. If everything expires at one time then every machine will calculate the expiration at the same time. This creates a thundering herd.
    • By jittering you are saying randomly expire between 18-30 hours. That prevents things from stacking up. They use this all over the place. Systems have a tendency to self synchronize as operations line up and try to destroy themselves. Fascinating to watch. You get a slow disk system on one machine and everybody is waiting on a request so all of a sudden all these other requests on all these other machines are completely synchronized. This happens when you have many machines and you have many events. Each one actually removes entropy from the system so you have to add some back in.
  • Cheating - Know How to Fake Data
    • Awesome technique. The fastest function call is the one that doesn’t happen. When you have a monotonically increasing counter, like movie view counts or profile view counts, you could do a transaction every update. Or you could do a transaction every once in a while and update by a random amount and as long as it changes from odd to even people would probably believe it’s real. Know how to fake data.
  • Scalable Components - Make Your own Luck
    • You can look at an API and get a good feel. Are the inputs well defined? Do you know what you are getting out? A lot of this ends up being about data. Having a tight specification of what data comes out of every function and how it flows actually helps you understand the application without documentation. You can tell what’s happening before and after a function is called.
    • In Python things tend to move towards RPCs. The structure of your code is based on the discipline of your programmers. So establish good conventions, when all else fails there’s an RPC wall so you know what goes in and what comes out.
    • Your components will not be perfect. A component might last a month or six months, who knows. By drawing these lines you are making some of your own luck. When things go south you can swap it out and do something different. Sometimes that means rewriting something in Python and C and sometimes that means getting rid of it entirely. You don’t know until you are able to observe.
    • With so many people on a team nobody can know the whole system, so you need to define components. This is video transcode; it’s distinct from video search. You want well defined subcomponents. It’s good software design. These things end up talking to each other so having a good data specification is helpful. The greatest sin he committed was making the communication between the servlet layer and the template layer a dictionary. Very bad idea. Should have added a WatchPage and said a watch page had a video and some comments and some related videos. This caused a lot of problems because the dictionary can have a few hundred attributes. They don’t always make the right choice.
  • Efficiency - Traded Off for Scalability
    • Efficiency is traded off for scalability. The most efficient thing is to write it in C and cram it into one process, but that’s not scalable.
    • Focus on the macro level, your components, and how they break out. Does it make sense to do this as an RPC or do it inline? Break it into a subpackage, and someday this may be different.
    • Focus on algorithms. In Python the effort to implement a good algorithm is low. There’s the bisect module, for example, where you can take a list, do something meaningful, and serialize it to disk and read it back again. There’s a penalty versus C, but it’s very easy.
    • Measurement. In Python measurement is like reading tea leaves. There are a lot of things in Python that are counterintuitive, like the cost of garbage collection. Most chunks of their apps spend their time serializing. Profiling serialization is very dependent on what you are putting in. Serializing ints is very different than serializing big blobs.
  • Efficiency in Python - Knowing What Not to Do
    • More about knowing what not to do. How dynamic you make things correlates to how expensive it is to run your Python app.
    • Dumber code is easier to grep for and easier to maintain. The more magical the code is the harder it is to figure out how it works.
    • They don’t do a lot of OO. They use a lot of namespaces. Use classes to organize data, but rarely for OO.
    • What is your code tree going to look like? He wants these words to describe it: simple, pragmatic, elegant, orthogonal, composable. This is an ideal, reality is a bit different.
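
The jitter technique described above (a 24 hour cache lifetime that randomly expires somewhere between 18 and 30 hours) fits in a few lines. This is a minimal sketch under stated assumptions: the constant and function names are hypothetical, and a uniform distribution is just one reasonable way to spread the expirations.

```python
import random

BASE_TTL_HOURS = 24
JITTER_HOURS = 6  # 24 +/- 6 gives the 18-30 hour window from the talk

def jittered_ttl_seconds(rng=random):
    """Pick a cache lifetime uniformly between 18 and 30 hours, in seconds."""
    hours = rng.uniform(BASE_TTL_HOURS - JITTER_HOURS,
                        BASE_TTL_HOURS + JITTER_HOURS)
    return hours * 3600

# Five cache entries written at the same moment now expire at different times.
print(sorted(round(jittered_ttl_seconds() / 3600, 1) for _ in range(5)))
```

Because every entry gets its own lifetime, machines that filled their caches at the same moment no longer all recompute at the same moment, which is what breaks up the thundering herd.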

[shared via Hacker News]

Mar 27

Survivor. When mice with human tumors received doses of anti-CD47, which sets the immune system against tumor cells, the cancers shrank and disappeared.

Credit: Fotosearch

A single drug can shrink or cure human breast, ovary, colon, bladder, brain, liver, and prostate tumors that have been transplanted into mice, researchers have found. The treatment, an antibody that blocks a “do not eat” signal normally displayed on tumor cells, coaxes the immune system to destroy the cancer cells.

A decade ago, biologist Irving Weissman of the Stanford University School of Medicine in Palo Alto, California, discovered that leukemia cells produce higher levels of a protein called CD47 than do healthy cells. CD47, he and other scientists found, is also displayed on healthy blood cells; it’s a marker that blocks the immune system from destroying them as they circulate. Cancers take advantage of this flag to trick the immune system into ignoring them. In the past few years, Weissman’s lab showed that blocking CD47 with an antibody cured some cases of lymphomas and leukemias in mice by stimulating the immune system to recognize the cancer cells as invaders. Now, he and colleagues have shown that the CD47-blocking antibody may have a far wider impact than just blood cancers.

“What we’ve shown is that CD47 isn’t just important on leukemias and lymphomas,” says Weissman. “It’s on every single human primary tumor that we tested.” Moreover, Weissman’s lab found that cancer cells always had higher levels of CD47 than did healthy cells. How much CD47 a tumor made could predict the survival odds of a patient.

To determine whether blocking CD47 was beneficial, the scientists exposed tumor cells to macrophages, a type of immune cell, and anti-CD47 molecules in petri dishes. Without the drug, the macrophages ignored the cancerous cells. But when anti-CD47 was present, the macrophages engulfed and destroyed cancer cells from all tumor types.

Next, the team transplanted human tumors into the feet of mice, where tumors can be easily monitored. When they treated the rodents with anti-CD47, the tumors shrank and did not spread to the rest of the body. In mice given human bladder cancer tumors, for example, 10 of 10 untreated mice had cancer that spread to their lymph nodes. Only one of 10 mice treated with anti-CD47 had a lymph node with signs of cancer. Moreover, the implanted tumor often got smaller after treatment — colon cancers transplanted into the mice shrank to less than one-third of their original size, on average. And in five mice with breast cancer tumors, anti-CD47 eliminated all signs of the cancer cells, and the animals remained cancer-free 4 months after the treatment stopped.

“We showed that even after the tumor has taken hold, the antibody can either cure the tumor or slow its growth and prevent metastasis,” says Weissman.

Although macrophages also attacked blood cells expressing CD47 when mice were given the antibody, the researchers found that the decrease in blood cells was short-lived; the animals turned up production of new blood cells to replace those they lost from the treatment, the team reports online today in the Proceedings of the National Academy of Sciences.

Cancer researcher Tyler Jacks of the Massachusetts Institute of Technology in Cambridge says that although the new study is promising, more research is needed to see whether the results hold true in humans. “The microenvironment of a real tumor is quite a bit more complicated than the microenvironment of a transplanted tumor,” he notes, “and it’s possible that a real tumor has additional immune suppressing effects.”

Another important question, Jacks says, is how CD47 antibodies would complement existing treatments. “In what ways might they work together and in what ways might they be antagonistic?” Using anti-CD47 in addition to chemotherapy, for example, could be counterproductive if the stress from chemotherapy causes normal cells to produce more CD47 than usual.

Weissman’s team has received a $20 million grant from the California Institute for Regenerative Medicine to move the findings from mouse studies to human safety tests. “We have enough data already,” says Weissman, “that I can say I’m confident that this will move to phase I human trials.”

[shared via Hacker News]

Mar 27

UI responsiveness: OSX vs. Windows, iOS vs. Android

[shared via Hacker News]

What are the Windows A: and B: drives used for?

Posted on Friday March 30th 2012 at 01:46pm. Its tags are listed below.

[shared via Hacker News]

Puma (Mongrel/WEBrick Alternative for Ruby by EngineYard) Hits 1.0

[shared via Hacker News]

Prince of Persia creator finds lost source code 23 years later

[shared via Hacker News]

Kickstarter: rails.app

Posted on Thursday March 29th 2012 at 12:15pm. Its tags are listed below.

[shared via Hacker News]

Want To Hook Users? Drive Them Crazy.

Posted on Thursday March 29th 2012 at 12:15pm. Its tags are listed below.

[shared via Hacker News]

Building the worst Linux PC ever: 6 hours to boot Ubuntu

[shared via Hacker News]

Patents Threaten To Silence A Little Girl, Literally

[shared via Hacker News]

10,000 People Sign Petition to Honor Alan Turing by Putting Him on the £10 Note

The universe is big. This big.

Posted on Tuesday March 27th 2012 at 11:46am. Its tags are listed below.

The universe is big. This big.

What happens when you take a monster 4.1 meter telescope in the southern hemisphere and point it at the same patch of sky for 55 hours?

This. Oh my, this:

[Click to embiggen.]

OK, I know. At first glance it doesn’t look like much, does it? Just a field of stars. However, here’s the important bit: I had to take the somewhat larger original image and reduce it in size to fit my 610-pixel-wide blog. So how much bigger is the original?

It’s 17,000 x 11,000 pixels! If you happen to be sitting on a T1 line, then you can grab this massive 250 Mb file. And I surely suggest you do.

Because yeah, the brightest objects you see in this are stars. Probably a few hundred of them. But you have to look at the bigger image ! Why? Because what’s amazing, truly jaw-dropping and incredible is this:

There are over 200,000 galaxies filling this image!

Ye. Gads.

Here’s a zoom of the image, centered on what looked to me to be one of the biggest galaxies in the frame, a nice edge-on spiral.

With the exception of a handful of blue-looking stars, everything in this zoom is a galaxy, probably billions of light years away. Those tiny red dots are galaxies so far away they crush our minds to dust: we’re seeing them with light that left them shortly after the Universe itself formed.

This light is ancient. And it came a long, long way.

By the way, that picture of the spiral there is not even at full resolution! Just to give you an idea, I cropped out just that galaxy in the full-res image and inset it here. If you want to find it in the full frame, it’s about one-third of the way in from the left, and one-third of the way down from the top. Happy hunting.

[Edited to add: I forgot to add that this galaxy is warped! See how the disk flares up on the left and down on the right, just a bit? This is very common in disk galaxies, and our own Milky Way does it too (see #9 at that link). It’s usually caused when a nearby galaxy’s gravity torques on the stars in the disk.]

These images were taken with VISTA, the European Southern Observatory’s Visible and Infrared Survey Telescope for Astronomy (VISTA), a 4.1 meter telescope in Chile. This huge image is actually composed of 6000 separate images, and is the single deepest infrared picture of the sky ever taken with this field of view. Hubble can get deeper, for example, but sees a much, much smaller part of the sky.

By looking in the infrared we can see farther into space. Because space is expanding, light from distant galaxies gets red-shifted (like the Doppler effect on a cosmic scale). Young galaxies are generally furiously forming stars, and that makes them blast out ultraviolet light. But a young galaxy seen from very far away has that UV light red shifted into the infrared. So to us, billions of light years distant, we see it pouring out IR light. Looking there means we can see these extraordinarily remote galaxies more easily. If you scan the full-size image, you’ll see lots of tiny, very red dots. Those are most likely the most distant objects in this picture, appearing redshifted, dimmed, and shrunken due to their terrible distance.

Also, looking in the infrared makes stars that look red to our eyes appear blue in the image! Most of the stars in this image are weak, cool, dim red dwarfs. They look blue because this image is false color. It uses three filters to isolate different colors of infrared: 1.3 microns, 1.7 microns, and 2.2 microns (colored blue, green, and red in this image, respectively). The reddest light the human eye can typically see is about 0.7 microns, so these are well outside human range. Many red dwarfs put out light at 1.3 microns, but not nearly as much at 1.7 and 2.2. Since the 1.3 micron light is colored blue in the image, that makes the stars look blue, even though to you they’d look red!

And what a view! Here’s another interesting bit I happened to stumble on while just scanning this monster image:

Isn’t that interesting? There’s a long jet of material apparently coming from that bloated galaxy on the left (I increased the brightness and contrast of this picture to make it more obvious; it was subtle in the original image but I have a lot of practice picking out things like this). Big galaxies have supermassive black holes in their cores, and these sometimes accelerate huge beams of matter and energy that blast out. But wait! The stream goes right through that smaller galaxy on the right. Is that a coincidence — the jet is coming from the big one and happens to pass in front of a more distant galaxy? Or is that smaller one the source of the jet, and actually has two jets coming out of either side? That’s actually a more common occurrence. Beats me. I could argue either way. We’d need spectra of the galaxies to know for sure.

And funny: I went back to the original image to see where I cut that galaxy out, and now I can’t find it. Holy crap. I mean, seriously, I couldn’t find it. That’s how big this image is.

Of course, you can find a dozen galaxies just like it. I also found several gorgeous spirals (look all the way on the left; one is cut off on the edge of the frame and it’s really something). Some were edge-on like the one above, others face-on. There are countless blobby ones, and even more that are just dots, so far away we see them as dimensionless points.

I’ve spent years studying all this, and it still sometimes gets to me: just how flipping BIG the Universe is! And this picture is still just a tiny piece of it: it’s 1.2 x 1.5 degrees in size, which means it’s only 0.004% of the sky! And it’s not even complete: more observations of this region are planned, allowing astronomers to see even deeper yet.
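The 0.004% figure above is easy to check. Here is a rough sketch in Python, assuming the field can be treated as a flat 1.2 x 1.5 degree rectangle (reasonable for a patch this small) and taking the whole sky as 4π steradians. The function name is made up for illustration.

```python
import math

# Whole sky: 4*pi steradians, converted to square degrees (about 41,253).
FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2


def sky_fraction_percent(width_deg, height_deg):
    """Percentage of the whole sky covered by a small rectangular field."""
    return 100 * (width_deg * height_deg) / FULL_SKY_SQ_DEG


percent = sky_fraction_percent(1.2, 1.5)
print(f"{percent:.4f}% of the sky")
```

The result is a little over four thousandths of a percent, which matches the figure quoted above.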

Science is wonderful. Building on the knowledge developed before us, our tools improve and our ability to explore expands. Piece by piece, photon by photon, galaxy by galaxy, we’re examining this Universe we live in and understanding it better every day.

Image credit: ESO/UltraVISTA team. Acknowledgement: TERAPIX/CNRS/INSU/CASU


Related Posts:

- Another record breaker: ultra-deep image reveals ultra-distant galaxy
- The Helix screams in infrared
- The Milky Way’s buried treasures
- Spectacular VISTA of the Tarantula

[shared via Hacker News]

7 Years Of YouTube Scalability Lessons In 30 Minutes

If you started out building a dating site and instead ended up building a video sharing site (YouTube) that handles 4 billion views a day, then it’s just possible you learned something along the way. And indeed, Mike Solomon, one of the original engineers at YouTube, did learn a lot and he has given a talk about it at PyCon: Scalability at YouTube.

This isn’t an architecture driven talk where we are led through a description of how a lot of boxes connect to each other. Mike could give that sort of talk. He has worked on building YouTube’s servlet infrastructure, video indexing feature, video transcoding system, their full text search, a CDN, and much more. But instead, he’s taken a step back, taken a long look around at what time has wrought, and shared some deep lessons, obviously hard won from experience.

The key takeaway of the talk for me was doing a lot with really simple tools. While many teams are moving on to more complex ecosystems, YouTube really does keep it simple. They program primarily in Python, use MySQL as their database, they’ve stuck with Apache, and even new features for such a massive site start as a very simple Python program.

That doesn’t mean YouTube doesn’t do cool stuff, they do, but what makes everything work together is more a philosophy or a way of doing things than technological hocus pocus. What made YouTube into one of the world’s largest websites? Read on and see…

Stats

  • 4 billion views a day
  • 60 hours of video is uploaded every minute
  • 350+ million devices are YouTube enabled
  • Revenue doubled in 2010
  • The number of videos has gone up 9 orders of magnitude and the number of developers has only gone up two orders of magnitude.
  • 1 million lines of Python code

Stack

  • Python - most of the lines of code for YouTube are still in Python. Every time you watch a YouTube video you are executing a bunch of Python code.
  • Apache - when you think you need to get rid of it, you don’t. Apache is a real rockstar technology at YouTube because they keep it simple. Every request goes through Apache.
  • Linux - the benefit of Linux is there’s always a way to get in and see how your system is behaving. No matter how bad your app is behaving, you can take a look at it with Linux tools like strace and tcpdump.
  • MySQL - is used a lot. When you watch a video you are getting data from MySQL. Sometimes it’s used as a relational database, sometimes as a blob store. It’s about tuning and making choices about how you organize your data.
  • Vitess - a new project released by YouTube and written in Go. It’s a frontend to MySQL. It does a lot of optimization on the fly, it rewrites queries and acts as a proxy. Currently it serves every YouTube database request. It’s RPC based.
  • Zookeeper - a distributed lock server. It’s used for configuration. Really interesting piece of technology. Hard to use correctly, so read the manual.
  • Wiseguy - a CGI servlet container.
  • Spitfire - a templating system. It has an abstract syntax tree that lets them do transformations to make things go faster.
  • Serialization formats - no matter which one you use, they are all expensive. Measure. Don’t use pickle. Not a good choice. Found protocol buffers slow. They wrote their own BSON implementation which is 10-15 times faster than the one you can download.
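The advice to measure serialization rather than guess can be tried with nothing but the standard library. This is a minimal sketch, not YouTube’s benchmark: `json` and `pickle` are stand-ins chosen because they ship with Python, and the two payloads and the `time_serializer` helper are hypothetical. Timings will vary by machine and by payload, which is rather the point.

```python
import json
import pickle
import timeit


def time_serializer(dumps, payload, repeats=50):
    """Seconds taken to serialize `payload` `repeats` times."""
    return timeit.timeit(lambda: dumps(payload), number=repeats)


# Two hypothetical payloads: many small ints vs. one large text blob.
payloads = {
    "ints": list(range(10_000)),
    "blob": {"body": "x" * 1_000_000},
}

# Stand-in serializers from the standard library.
serializers = {
    "json": json.dumps,
    "pickle": pickle.dumps,
}

results = {
    (s_name, p_name): time_serializer(dumps, payload)
    for s_name, dumps in serializers.items()
    for p_name, payload in payloads.items()
}

for (s_name, p_name), seconds in sorted(results.items()):
    print(f"{s_name:>6} on {p_name}: {seconds:.4f}s")
```

Swapping in your own serializer only needs another entry in `serializers`, so the same harness can compare candidates on the data you actually ship.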

General Lessons

  • Tao of YouTube: choose the simplest solution possible with the loosest guarantees that are practical. The reason you want all these things is you need flexibility to solve problems. The minute you over specify something you paint yourself into a corner. You aren’t going to make those guarantees. Your problem becomes automatically more complex when you try and make all those guarantees. You leave yourself no way out.
  • That whole process is what scalability is about. A scalable system is one that’s not in your way. That you are unaware of. It’s not buzz words. It’s a general problem solving ethos.
  • Hallmark of big system design: Every system is tailored to its specific requirements. Everything depends on the specifics of what you are building.
  • YouTube is not asynchronous, everything is blocking.
  • Believes more in philosophy than doctrine. Make it simple. What does that mean? You’ll know when you see it. If you do a code review that changes thousands of lines of code and many files then there was probably a simpler way. Your first demo should be simple, then iterate.
  • To solve a problem: One word - simple. Look for the most simple thing that will address the problem space. There are lots of complex problems, but the first solution doesn’t need to be complicated. The complexity will come naturally over time.
  • A lot of YouTube systems start as one Python file and become large ecosystems after many many years. All their prototypes were written in Python and survived for a surprising amount of time.
  • In a design review:
    • What’s the first solution?
    • How are you going to iterate?
    • What do we know about how this data is going to be used?
  • Things change over time. How YouTube started out has no bearing on what happens later. YouTube started out as a dating site. If they had designed for that they would have a different conversation. Stay flexible.
  • YouTube CDN. Originally contracted it out. Was very expensive so they did it themselves. You can build a pretty good video CDN if you have a good hardware dude. You build a very large rack, stick machines in, then take lighttpd, and then override the 404 handler to find the video that you didn’t find. That took two weeks and on its first day it served 60 gigabits. You can do a lot with really simple tools.
  • You have to measure. Vitess swapped out one of its protocols for an HTTP implementation. Even though it was in C it was slow. So they ripped out HTTP and did a direct socket call using Python and that was 8% cheaper on global CPU. The enveloping for HTTP is really expensive.
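The CDN trick in the list above, overriding the 404 handler so a miss goes looking for the video, can be illustrated in a few lines. This is a toy sketch in Python rather than a lighttpd configuration, and every name in it (`EdgeNode`, `origin`, `handle_not_found`) is hypothetical. It also assumes the node keeps a copy after a miss, which the talk summary does not actually say.

```python
class EdgeNode:
    """Toy video cache: serve locally, and on a miss fall back to an origin."""

    def __init__(self, origin):
        self.origin = origin  # hypothetical mapping standing in for upstream storage
        self.local = {}       # videos this node already holds

    def serve(self, video_id):
        if video_id in self.local:
            return self.local[video_id]
        return self.handle_not_found(video_id)

    def handle_not_found(self, video_id):
        """The '404 handler': go find the video instead of returning an error."""
        data = self.origin.get(video_id)
        if data is None:
            raise KeyError(video_id)  # a genuine 404: nobody has it
        self.local[video_id] = data   # assumption: keep a copy for next time
        return data


origin = {"abc123": b"video bytes"}
node = EdgeNode(origin)
first = node.serve("abc123")   # miss: fetched from the origin, then stored
second = node.serve("abc123")  # hit: served from the local copy
```

The appeal of the design is that the common path stays a plain file lookup and all the cleverness lives in the one handler that only runs on a miss.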

Scalability Techniques

  • These are not new ideas, but it’s amazing how a few core ideas can apply in a lot of different dimensions.
  • Divide and Conquer - The Scalability Technique
    • This is the scalability technique. Everything is about partitioning out work and deciding how to execute it. It applies to many things. Take the web tier: you have a lot of web servers that are more or less identical and independent, and you grow them horizontally. That’s divide and conquer.
    • This is the crux of database sharding. How do you partition things out and communicate between the parts that you’ve subdivided? These are things you want to figure out early on because they influence how you grow.
    • Simple and loose connections are really valuable.
    • The dynamic nature of Python is a win here. No matter how bad your API is you can stub or modify or decorate your way out of a lot of problems.
  • Approximate Correctness - Cheat a Little
    • Another favorite technique. The state of the system is that which it is reported to be. If a user can’t tell a part of the system is skewing and inconsistent, then it’s not.
    • A real world example. If you write a comment and someone loads the page at the same time, they might not get it for 300-400ms. The user who is reading won’t care. The writer of the comment will care, so you make sure the user who wrote the comment will see it. So you cheat a little bit. Your system doesn’t have to have globally consistent transactions. That would be super expensive and overkill. Not every comment is a financial transaction. So know when you can cheat.
  • Expert Knob Twiddling
    • Ask, what do you know about your consistency model? For comments, is eventual consistency good enough? Renting a movie is different. When renting there’s money so we’ll do the best we can to never lose that. Different consistency models are needed depending on the data.
  • Jitter - Add Entropy Back into Your System
    • Hot word in their group all of the time. If your system doesn’t jitter then you get thundering herds. Distributed applications are really weather systems. Debugging them is as deterministic as predicting the weather. Jitter introduces more randomness because surprisingly, things tend to stack up.
    • For example, cache expirations. For a popular video they cache things as best they can. The most popular video they might cache for 24 hours. If everything expires at one time then every machine will calculate the expiration at the same time. This creates a thundering herd.
    • By jittering you are saying randomly expire between 18-30 hours. That prevents things from stacking up. They use this all over the place. Systems have a tendency to self synchronize as operations line up and try to destroy themselves. Fascinating to watch. You get a slow disk system on one machine and everybody is waiting on a request so all of a sudden all these other requests on all these other machines are completely synchronized. This happens when you have many machines and you have many events. Each one actually removes entropy from the system so you have to add some back in.
  • Cheating - Know How to Fake Data
    • Awesome technique. The fastest function call is the one that doesn’t happen. When you have a monotonically increasing counter, like movie view counts or profile view counts, you could do a transaction every update. Or you could do a transaction every once in a while and update by a random amount and as long as it changes from odd to even people would probably believe it’s real. Know how to fake data.
  • Scalable Components - Make Your own Luck
    • You can look at an API and get a good feel. Are the inputs well defined? Do you know what you are getting out? A lot of this ends up being about data. Having a tight specification of what data comes out of every function and how it flows actually helps you understand the application without documentation. You can tell what’s happening before and after a function is called.
    • In Python things tend to move towards RPCs. The structure of your code is based on the discipline of your programmers. So establish good conventions. When all else fails there’s an RPC wall so you know what goes in and what comes out.
    • Your components will not be perfect. A component might last a month or six months, who knows. By drawing these lines you are making some of your own luck. When things go south you can swap it out and do something different. Sometimes that means rewriting something in Python and C and sometimes that means getting rid of it entirely. You don’t know until you are able to observe.
    • With so many people on a team nobody can know the whole system, so you need to define components. This is video transcode; it’s distinct from video search. You want well defined subcomponents. It’s good software design. These things end up talking to each other so having a good data specification is helpful. The greatest sin he committed was letting the communication between the servlet layer and the template layer be a dictionary. Very bad idea. He should have added a WatchPage and said a watch page had a video and some comments and some related videos. This caused a lot of problems because the dictionary can have a few hundred attributes. They don’t always make the right choice.
  • Efficiency - Traded Off for Scalability
    • Efficiency is traded off for scalability. The most efficient thing is to write it in C and cram it into one process, but that’s not scalable.
    • Focus on the macro level, your components, and how they break out. Does it make sense to do this as an RPC or do it inline? Break it into a subpackage and just know that someday this may be different.
    • Focus on algorithms. In Python the effort to implement a good algorithm is low. There’s the bisect module, for example, where you can take a list, do something meaningful, and serialize it to disk and read it back again. There’s a penalty versus C, but it’s very easy.
    • Measurement. In Python measurement is like reading tea leaves. There are a lot of things in Python that are counterintuitive, like the cost of garbage collection. Most chunks of their apps spend their time serializing. Profiling serialization is very dependent on what you are putting in. Serializing ints is very different than serializing big blobs.
  • Efficiency in Python - Knowing What Not to Do
    • More about knowing what not to do. How dynamic you make things correlates to how expensive it is to run your Python app.
    • Dumber code is easier to grep for and easier to maintain. The more magical the code is the harder it is to figure out how it works.
    • They don’t do a lot of OO. They use a lot of namespaces. Use classes to organize data, but rarely for OO.
    • What is your code tree going to look like? He wants these words to describe it: simple, pragmatic, elegant, orthogonal, composable. This is an ideal, reality is a bit different.
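The jittered cache expiry described under Jitter above is small enough to sketch directly. This is an illustration under stated assumptions, not YouTube’s code: the constants and the `jittered_ttl_seconds` helper are made up, and the 1,000 machines are simulated with a seeded random generator so the run is repeatable.

```python
import random

BASE_TTL_HOURS = 24
JITTER_HOURS = 6  # gives the 18-30 hour window mentioned in the talk


def jittered_ttl_seconds(rng=random):
    """Pick a cache lifetime uniformly between 18 and 30 hours."""
    hours = rng.uniform(BASE_TTL_HOURS - JITTER_HOURS, BASE_TTL_HOURS + JITTER_HOURS)
    return hours * 3600


# Hypothetical: 1,000 machines caching the same popular video at the same moment.
rng = random.Random(42)
ttls = [jittered_ttl_seconds(rng) for _ in range(1000)]

print(f"earliest expiry: {min(ttls) / 3600:.1f}h, latest: {max(ttls) / 3600:.1f}h")
```

With a fixed 24 hour lifetime every one of those entries would expire in the same instant. With jitter the expirations are spread across a 12 hour window, so the recomputation load arrives as a trickle instead of a herd.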

[shared via Hacker News]

One Drug to Shrink All Tumors

Posted on Tuesday March 27th 2012 at 11:45am.

Survivor. When mice with human tumors received doses of anti-CD47, which sets the immune system against tumor cells, the cancers shrank and disappeared.

Credit: Fotosearch

A single drug can shrink or cure human breast, ovary, colon, bladder, brain, liver, and prostate tumors that have been transplanted into mice, researchers have found. The treatment, an antibody that blocks a “do not eat” signal normally displayed on tumor cells, coaxes the immune system to destroy the cancer cells.

A decade ago, biologist Irving Weissman of the Stanford University School of Medicine in Palo Alto, California, discovered that leukemia cells produce higher levels of a protein called CD47 than do healthy cells. CD47, he and other scientists found, is also displayed on healthy blood cells; it’s a marker that blocks the immune system from destroying them as they circulate. Cancers take advantage of this flag to trick the immune system into ignoring them. In the past few years, Weissman’s lab showed that blocking CD47 with an antibody cured some cases of lymphomas and leukemias in mice by stimulating the immune system to recognize the cancer cells as invaders. Now, he and colleagues have shown that the CD47-blocking antibody may have a far wider impact than just blood cancers.

“What we’ve shown is that CD47 isn’t just important on leukemias and lymphomas,” says Weissman. “It’s on every single human primary tumor that we tested.” Moreover, Weissman’s lab found that cancer cells always had higher levels of CD47 than did healthy cells. How much CD47 a tumor made could predict the survival odds of a patient.

To determine whether blocking CD47 was beneficial, the scientists exposed tumor cells to macrophages, a type of immune cell, and anti-CD47 molecules in petri dishes. Without the drug, the macrophages ignored the cancerous cells. But when the anti-CD47 was present, the macrophages engulfed and destroyed cancer cells from all tumor types.

Next, the team transplanted human tumors into the feet of mice, where tumors can be easily monitored. When they treated the rodents with anti-CD47, the tumors shrank and did not spread to the rest of the body. In mice given human bladder cancer tumors, for example, 10 of 10 untreated mice had cancer that spread to their lymph nodes. Only one of 10 mice treated with anti-CD47 had a lymph node with signs of cancer. Moreover, the implanted tumor often got smaller after treatment — colon cancers transplanted into the mice shrank to less than one-third of their original size, on average. And in five mice with breast cancer tumors, anti-CD47 eliminated all signs of the cancer cells, and the animals remained cancer-free 4 months after the treatment stopped.

“We showed that even after the tumor has taken hold, the antibody can either cure the tumor or slow its growth and prevent metastasis,” says Weissman.

Although macrophages also attacked blood cells expressing CD47 when mice were given the antibody, the researchers found that the decrease in blood cells was short-lived; the animals turned up production of new blood cells to replace those they lost from the treatment, the team reports online today in the Proceedings of the National Academy of Sciences.

Cancer researcher Tyler Jacks of the Massachusetts Institute of Technology in Cambridge says that although the new study is promising, more research is needed to see whether the results hold true in humans. “The microenvironment of a real tumor is quite a bit more complicated than the microenvironment of a transplanted tumor,” he notes, “and it’s possible that a real tumor has additional immune suppressing effects.”

Another important question, Jacks says, is how CD47 antibodies would complement existing treatments. “In what ways might they work together and in what ways might they be antagonistic?” Using anti-CD47 in addition to chemotherapy, for example, could be counterproductive if the stress from chemotherapy causes normal cells to produce more CD47 than usual.

Weissman’s team has received a $20 million grant from the California Institute for Regenerative Medicine to move the findings from mouse studies to human safety tests. “We have enough data already,” says Weissman, “that I can say I’m confident that this will move to phase I human trials.”

[shared via Hacker News]
