So, good afternoon, everyone. My name is Elena Stepanitis. I'm a program executive in the Chief Science Data Office at NASA Headquarters, and I am co-chairing the Science Mission Directorate's Data and Computing Architecture Study with Mike Little, who is also here in this meeting. I'm really excited that we have two speakers from the Microsoft Planetary Computer here to speak with us today, but first I just want to hand it over to Kevin Murphy, NASA's Chief Science Data Officer, for a quick introduction to what we're doing in this study and with open science at NASA.

Hey, everybody. First, can you hear me, Elena? Just give me a thumbs up. Okay, good. I do apologize for being late, so thanks for bearing with me, but I really did want to come and kick this meeting off, because I think what we're trying to do is going to be strongly informed by how we work with partners like Microsoft, and you are doing some really wonderful work with the Planetary Computer. So let's kick it off. Next slide, please.
We are conducting a data and computing architecture study because we recognize that the NASA systems we have traditionally used need to be upgraded to really take advantage of the high data volumes and the new technologies and capabilities that exist in the commercial world. We're doing that in alignment with our open source science activities, which we kicked off a couple of years ago, and now is the critical time for us to evaluate, at a fundamental level, how we are supporting open source science with our data and computing infrastructure. Next slide, please.

The scope of this study is really everything we do within the Science Mission Directorate for science. It includes scientific modeling and simulation activities; data processing, primarily from satellites and other non-Earth-based systems, including Mars and everything else; how we take data from level zero to higher levels; how we do analytics on that information; and how we integrate new processing and analysis techniques like AI and ML to support those really large data volumes. The capabilities we're considering include the commercial cloud environments we support, our high-end computing capabilities out at Ames in the Bay Area, and our scientific computing capability at the Goddard Space Flight Center. Next slide, please.
One of the things we have with open source science is this concept of open meetings, or at least hybrid open meetings, so that people can participate in how we develop our systems, which positions them to collaborate with us better in the future. Because we hold hybrid open meetings, we do have a code of conduct. It includes being respectful and considerate of how people work; valuing diversity; communicating openly and with respect for others, which includes critiquing ideas rather than individuals and avoiding personal attacks; being mindful of your surroundings and your fellow participants; and alerting us if you have any issues. We do not accept harassment, intimidation, or discrimination in any form, nor verbal abuse, and we list a number of examples of unacceptable behavior. Unfortunately we have run into disruptions in the past, and sometimes we have to deal with them. Thank you for listening to this part; I know it can feel a little cumbersome to go through, but I think it's really important that we have these discussions up front, especially as we hold more of these meetings. Next slide.

At this point I'm going to hand it over and let somebody else talk, but before I send it over to Peter, I'd really like to say thanks again for being patient with me as I came over, and I'm really looking forward to hearing and discussing what you all are up to.
Thank you, Kevin. I'm Peter Williams, and I'm going to help moderate the conversation, fielding questions and passing them to Bruno and Tom, our guests today. If you're interested in asking a question, or you want to see the questions already in the queue, the best way to do that is to submit it through NASA's I/O tool. Your questions go in anonymously, and you can upvote questions submitted by yourself and by others. I'll try to focus on the more popular questions, but I'll also do what I can to link questions that seem to be in a similar vein. You can use the QR code to get there, and I've also put the link in the chat. If you prefer, you can fall back to the WebEx chat; I'll do what I can to track those, and either Hannah or I will grab them from the chat and get them over to the I/O tool so they can be part of the same upvoting process. Next slide, please.
It's my pleasure to welcome Tom Augspurger and Bruno Sánchez-Andrade Nuño, who are going to be presenting today. They are both involved with the Microsoft Planetary Computer: Bruno is the director of Microsoft's Planetary Computer, and Tom is a geospatial software engineer on the project. They're going to provide us with an overview of the project seen through somewhat of an open source science lens. Hopefully we'll hear about user needs as well as business needs that will likely be relevant to NASA SMD's transition to open source science, picking up and building on work that has been happening within SMD and elsewhere in NASA, but also really leveraging the insights and experience of folks outside of NASA, in this case Microsoft. As part of that, we're also looking to hear about best practices that might be identifiable through their work, and insights regarding data and computing architecture, which ultimately is the primary focus of this study: we're trying to put together proposals for a design of the open source science data and computing architecture that will build on the good work happening to date and also set the stage for needed work in the future.

As I mentioned, Bruno is the director of Microsoft's Planetary Computer. He has a PhD in astrophysics and a rocket-science postdoc, led big data innovation at the World Bank Innovation Labs, and served as vice president for social impact at the satellite company Satellogic and as chief scientist at Mapbox. He has been a science policy fellow of the U.S. National Academies of Sciences and a Young Global Leader of the World Economic Forum. Welcome, Bruno. Joining Bruno will be Tom Augspurger. Tom, as I mentioned, is a geospatial software engineer working at Microsoft on the Planetary Computer. He's a member of the Pangeo steering council and a maintainer of several open source libraries in the scientific Python ecosystem, including pandas and Dask. With that, it is my pleasure to welcome both Tom and Bruno, and to turn it over to you.

There you go. Thank you, Kevin and Peter, and everyone else. It is such a pleasure to be here.
We've got some time, so hopefully we can answer all of your questions. First, I'll speak more on why we are doing this. Being director of the program means that most of my time is spent not on coding but on Outlook and PowerPoint, making sure that Tom and the rest of the team can actually deliver on our promise. Tom has the really cool title of geospatial architect, and I understand that a lot of the questions are going to go to him. As Peter was saying, I'm going to share my presentation, and while I bring that up: I understand most of you are at NASA Headquarters in DC. I did my postdoc at the Naval Research Laboratory with NASA funding, working on sounding rockets for the solar chromosphere, so it's kind of nice to be back in a way. I'm so glad to be here. The structure is going to be this: first, we'll tell you why we are doing this, which I think is important because it hopefully shows that one of our key values is being completely transparent about how we are building it. I'll give that presentation, maybe half an hour or so, then we can have some questions on the why and a little bit on the how we are building it. Then Tom will run a more demo-and-workshop session and hopefully answer all of your questions, as technical as they might be. So let's get to it.
Just from the start: when I was preparing for this talk, I saw that the goal for the Science Mission Directorate is to coordinate cloud-based, high-end computing to capture the directorate's computing needs. You could almost copy and paste that, and it would be the why, and the when, of our proposal to build the Planetary Computer. In a sense, the point of what I'm saying here is that today the Planetary Computer already looks very similar to what you want to do, so you are likely to choose the same technologies we chose, and not because we chose them: we did not invent any language, we did not imagine a new architecture. The entire Planetary Computer is open source, and the choices we made about the tools we use were made because the community decided, or we understood that the community had decided, that that was the way to go. The standards we chose and the computing environments we chose are all based on that. Part of the reason we built it that way is to ensure the least amount of friction between knowledge creation on the academic side and the application of that knowledge to operational dependencies. Because at the end of the day, I think we all agree that a lot of what you are doing, and what we are doing, is extremely critical for facing issues like climate change or biodiversity collapse. But it's not just saying what's happening, it's not even just putting numbers into peer-reviewed articles; it's then figuring out, okay, how does the government use this, how do commercial clients, non-profits, NGOs, all of civil society use this? That's why we're building this completely in the open and trying to minimize those frictions. I know some of the key questions you sent me, including some about business needs; I've already spoken to some of those, but I really want to go through the summary of why we're doing this and the environment this comes from.
Our needs, or the business needs, are that we see pretty much every single company, every single NGO, and every single government looking at sustainability and at issues like extreme events, driven by climate change, causing a lot of damage, or at ESG and related frameworks. There is a lot of need to understand what's happening, and that in many ways means an extremely demanding computational environment that is hard to scale. So the user need is to make it as simple as possible, but at the same time as technologically advanced as possible, so you are working with the latest available data and the latest available frameworks. That's why we chose the architecture you're going to see in a second, and we can come back to these questions. But as I said, at the end of the day what we want is to build this queryable Earth, and when I say "we" I don't mean Microsoft: we as a society, we as a planet, should be able to figure out how to build this queryable Earth, how to ask questions of what is where, how much is there, how much is it changing, what could be there, what should be there. It is in our collective interest to figure out how to do it, and in a way that is as open as possible. The intent is not that we build this and no one else does; it is the opposite: the intent is that everyone knows how to build these things. Of course, when it comes to large scales, planetary scale, it is going to be hard for most entities to serve or create these repositories, so they are going to come back to institutions, like in Europe, or to cloud providers like us, and we can coordinate and build similar things that are beneficial for everyone.
These few slides are a little more on the why. I don't think this audience, which by the way is almost hitting 100 people (thank you, everyone, for being here on a Friday; hopefully we hit that 100 mark), really needs the case made for why this matters. What I would mention, though, before we get into the Planetary Computer, is what we see in environmental sustainability, and I would argue you could say the same for science itself; a lot of sustainability work is underpinned heavily by science, so that is not surprising, and when I say environmental sustainability you can read science as well. We are seeing that it is increasingly complicated, that it is increasingly recognized as a dependency and an opportunity for more stakeholders, from governments to commercial entities, and that it is all interrelated: you cannot think about sustainability or science and not think about nature, people, and livelihoods. On the first of those three points, it is increasingly complicated because we have way more data and it is way more complex; we have way more tools that are also more complex; and hopefully we also have more questions, and the questions are complex. All of this means that it is really hard to find someone, or an institution, who knows all of those things: who knows about AI, and also about geospatial, and also about sustainability, and also about preservation. It is getting harder, and that's why focusing on making it as easy as possible, putting together the people who are experts in these fields, and combining all of those needs into the same platform, is critical.
The second point is that it is increasingly recognized as a critical dependency. I'm sure this audience knows the Sustainable Development Goals; there are the ones that are obviously related to sustainability, but you could argue that pretty much every other one is also related, from food to energy to infrastructure and innovation. If you prefer to talk about money: the World Economic Forum has identified the top risks by likelihood and by impact, and half of them are also related to environmental sustainability. So this is not only a moral call; it is also about the risks to our socio-economic way of doing things. You can also see it in the amount of attention, interest, and funds this is attracting; there is a quote from The Economist to the effect that everything in climate and sustainability is hard, everything except raising capital. More and more funds are willing to invest in these issues, which is great, because we need all of it. The third point is that it is interconnected. This is basically the top line of the joint report between the top experts on climate change, the IPCC, and the top experts on biodiversity, IPBES, and again, it's probably not a point that needs making for this audience: it is very much interrelated. It's not only about nature; it's also about people. For example, if you protect a section of the ocean, the spillover of species outside the protected area yields more fish, more catch, than if there were no protection in place. So nature-based solutions and protecting the environment are not only the right thing to do; they are also an opportunity for business and for others.
So why is Microsoft doing this? I'm going through this quickly because I know you are all interested in how we built the Planetary Computer, but I think it's important to know the why. Microsoft cares about its sustainability commitments; it has four top commitments for 2030: carbon negative, zero waste, water positive, and building a planetary computer. It's that important for us. It's not that we want to build the Planetary Computer as a product for Microsoft like other products; it's that we believe the technologies for addressing all of these issues should exist, and that is why we built the Planetary Computer. So let's get into the meat of it. Ideally you go from data to decisions: you have some data in storage, you process it, you do some analytics, you have applications, and you get results. That's a very simplistic view, because in the end there are so many types of data, so many types of storage, decisions, locations, formats, and analyses, and you end up with an incredibly scattered, disconnected array of data, storage, formats, and everything else. You could try to put everything together, standardize it, put it all in one place, and that should allow a more frictionless process from data to knowledge. That is the Planetary Computer; that's why we built it.
So we're building a foundational architecture for a queryable Earth, as I said before, basically a digital twin; that is increasingly becoming the namesake for what a lot of people are offering as a service, and basically a planetary-scale, cloud-native environment. I should note that I had this same conversation this morning with another public entity that is a data provider, and I'm also having this kind of conversation with the European Space Agency as they are thinking about Destination Earth, DestinE. And this is the meat of it: as I've said a few times already, our hope is that we can share all of this. We've built the Planetary Computer completely open source, with open data, and by sharing what we're doing we can coordinate and build things together, because it would be extremely beneficial for everyone if, for example, we could share formats and data and minimize the duplicated work that everyone is doing.
So, the Planetary Computer is four things, and you can think of them as abstraction layers. We have a data catalog, which is basically a pile of files. We try to keep the files as untouched as possible, but we also try to have them in cloud-native formats, which allows range requests: for example, if you have a big file, you don't need to download the whole file to get a small section; you can just request that section. For rasters, this means Cloud Optimized GeoTIFFs; for vectors and polygons it's a bit more of a conversation we should have. That's the idea: put all the files in the same data center and have this big pile of files.
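As an illustrative sketch (not from the talk), a windowed read from a Cloud Optimized GeoTIFF over HTTPS might look roughly like this with rasterio; the URL is hypothetical, and Planetary Computer assets additionally need a signed token, which comes up later:

```python
# A minimal sketch of the "range request" idea: read one small window from a
# large Cloud Optimized GeoTIFF over HTTPS without downloading the whole file.
# The URL below is hypothetical; any publicly readable COG works the same way.
import rasterio
from rasterio.windows import Window

cog_url = "https://example.blob.core.windows.net/data/scene_B04.tif"  # hypothetical

with rasterio.open(cog_url) as src:
    # Only the header and the bytes covering this 512x512 window are fetched.
    window = Window(col_off=1024, row_off=1024, width=512, height=512)
    chunk = src.read(1, window=window)

print(chunk.shape, chunk.dtype)
```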
We then ingest all of those files into one metadata database. We have some 52 petabytes of data stored in this one data center, that's the big pile of files, and every single byte of those petabytes is indexed with metadata: what is it from, when is it from, what is the cloudiness, or whatever other characteristics it has. The core of the Planetary Computer is that metadata API, so you can say "give me data for Washington, DC, when it's not cloudy, in 2024," be it Landsat or Sentinel or whatever it is. Then, architecturally separate from that but in the same data center, we have a compute environment. If you know Pangeo or JupyterLab, that's it; there is nothing special we do to it. It's a deployed Pangeo instance that we make available to everyone who wants to use it, but we also have recommendations, which I think Tom is going to cover, for deploying your own instance so you have full control of the compute. There's no extra cost beyond the running cost of the thing if you deploy it yourself, and if you use our hub environment there is no cost for that. I forgot to mention that the metadata specification is STAC, the SpatioTemporal Asset Catalog; it's an open standard, and we basically have a pgstac database with all of that.
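To make the "give me data for Washington, DC, when it's not cloudy" idea concrete, a rough sketch of such a metadata query over plain HTTP is below; the search endpoint is the Planetary Computer's public STAC API, while the collection name, bounding box, and cloud-cover filter are illustrative assumptions:

```python
# A rough sketch (not from the talk) of a STAC metadata-API search over HTTP.
import requests

search = {
    "collections": ["landsat-c2-l2"],          # assumed collection id
    "bbox": [-77.12, 38.80, -76.90, 39.00],    # rough Washington, DC bounding box
    "datetime": "2021-01-01/2021-12-31",
    "query": {"eo:cloud_cover": {"lt": 10}},   # "when it's not cloudy"
    "limit": 5,
}

resp = requests.post(
    "https://planetarycomputer.microsoft.com/api/stac/v1/search", json=search
)
resp.raise_for_status()
for item in resp.json()["features"]:
    print(item["id"], item["properties"].get("eo:cloud_cover"))
```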
Then the last layer is the applications; this is more or less what we, and also users and customers, do with that data. They can connect directly to the files, to the metadata API, or to the compute environment; it's modular in that sense, by design. Here's a bit of an architecture diagram that hopefully helps you understand what we're doing. Right now, if I'm not mistaken (Tom, please correct me later if that's not the case), we've done the painful work of going to 84 sources, some of them NASA or USGS or NOAA or Sentinel from Europe, and asking: what open data do you have? For now we are focusing on open data. We ingest it, convert it to a cloud-native format if it isn't already in one, and put it in the same data center. When we do the ingestion, we use the STAC specification to create the metadata database. The green arrows on the right are the ways a user can consume the data. If you go through the authentication API, it basically just gives you a token to access the blob storage; anyone can get an account for this. So you can go directly to the files if you know which file you want and everything about it. If you want to use the metadata API, you can consume it by asking questions like "give me data for Iowa," or whatever it is. We also have the data catalog, which is actually a website that consumes the metadata API, presents it nicely, and gives you documentation and examples and things like that. We also have a tiler, which is mostly for raster data; it's basically TiTiler, for those who know it. It gets the data from blob storage, the metadata API, and the files, and streams it to your browser, to the internet, so you can consume it quickly.
I'm happy to do a demo right now, but the idea is that you get the same experience as when you go to Google Maps or Bing Maps or any other rendered map: you just select the data, the characteristics of the source, the times, and the metadata, and on the fly, most of the time in less than half a second, you get a map that looks like one of those online maps, except it is rendered directly from blob storage through this tiler architecture. That's the data API, and that's the PC Explorer, which is an application of the Planetary Computer; if you go to planetarycomputer.microsoft.com you'll see all of those things. And then in the same data center, but separated, as I said before, we have the compute environment, which is the Pangeo deployment. To give you a sense of what we're doing right now: it's about 52 petabytes of data stored, roughly one and a half petabytes of egress per month, and about 2 billion calls to the metadata API. That last number is key: it means our users are not downloading whole files, which is great; they are only downloading the little bytes of information they need for their request. So they get their outputs faster to do their work, we pay for less egress, and everyone's happy. That's the trick of this metadata API, which I believe is at the core of this whole architecture.
We also track the compute environment we manage, in the CPU hours it serves, and that is only the one we manage; a lot of clients are deploying their own compute environments, and the STAC specification is key for that. There is a recent blog post from Planet where they go through this and find that we are already the largest index of STAC metadata assets, which to me is, okay, great, but at the same time it makes me think: most of the assets all of these organizations have in their archives are the same. Everyone is indexing their own Landsat, everyone is indexing their own Sentinel, and I wish we could coordinate better and share, maybe through a federated STAC environment or network or whatever you want to call it, so that you not only know where the data comes from (one of the things we try to put in the metadata API is the source of the data, what has changed, the provenance of the data) but we do much better than that. I'm hopeful we can talk about this later on the call. This is a little bit of the catalog and the Explorer; I think at this point it's better if I just quickly do a little demo. While I switch over, Peter, I don't know if there are questions so far. I'm just going to the browser, Microsoft Edge, there you go. Peter, you haven't said anything; are there any questions?

There are some questions, but I'll kind of leave it up to you.

Okay, let me do the demo. There we go.
As I said before: pile of files, metadata API, compute environment. The data catalog is how you see this pile of files. There are 86 sources, and of course Landsat is one of them. Every single collection has an overview and then the assets we are hosting, and each of them has an example notebook you can start from so you can get working quickly, plus the providers, the license, a related paper if there is one, and the spectral bands. All of this is actually coming from the metadata API itself; it's just rendered nicely in the browser. And as you see here, we have this "Launch in Explorer," which you can also get to from here; that's what I talked about before, which kind of combines everything, and this is what I'm going to show you quickly. You can go to Pakistan; as you know, there have been floods there recently, and I also know that radar is really good for detecting flooded areas. We process Sentinel-1 GRD into radiometrically terrain corrected data; it's one of the few datasets we also process ourselves to create a product, which is released openly. And this is the latest one; as you can see, it is already rendered on the fly. These are the results of the most recent Sentinel-1 passes since the flood happened a couple of weeks ago. We can go back and select an end date, and these are the results; these are the assets that correspond to what I described, and I can click on each of them and get the metadata for them. I get the assets to do whatever I want with; I get code to copy and paste into whatever analysis I'm doing; but I also get the code of the call itself, saying "hey, what do you have for this region, these date ranges, this metadata specification," and you can copy and paste that and use it in Pangeo and JupyterLab.
And what you see here is what I mentioned before: this is rendered on the fly as I move the mouse around, in less than, you know, half a second. I can also say I want to see the comparison between now and before, which is four clicks, or just changing one line of code, and then I can see that this is indeed the situation, and I can toggle in and out of the massive flood that happened there. You can do exactly the same thing for any of our sources, and you can also share this link with anyone so they can see exactly the same view; these days you can very quickly share a snapshot of the situation like that. I know there are others; I know NASA has invested a lot of time in similar tools to this Explorer. What I really like about this one is the frictionlessness: it helps you navigate from exploration to analysis with these little tidbits of code. Then there's the documentation: as I said before, everything is documented, including how to read from STAC; every dataset has its own example notebooks, and there are also tutorials, which I think Tom is going to cover in a second. But this is it; this is the Planetary Computer. It's a geospatial platform built modularly on a pile of files, a metadata API, and a compute environment, and it is ready for you to clone if you want, though you probably don't want to clone 50-some petabytes of data and maintain the metadata, and that's why it becomes a service from a commercial company like ours at Microsoft. I'm happy to answer questions; the more technical part starts afterwards, when Tom goes over it, but if there are questions about why we're doing this, or the strategy behind it, I'm happy to get into those.

Yeah, wonderful, I think that would be great.
We do have a couple of questions that really fit the "why," as you said. Thinking about either existing or anticipated user needs and business needs, which is that first question you showed and is getting a lot of attention today on the list, which ones would you lift up and really highlight for the purposes of this study, as potentially guiding NASA's Science Mission Directorate in its pursuit of data and computing infrastructure, especially infrastructure to support open source science?

I thought a lot about that.
What I see more and more, hopefully or thankfully, is that there is more attention to climate change, and specifically to extreme events, because they are the biggest driver of economic losses, and assessing the risks of climate change and these weather events is critical and is something very much connected to science. Because if it's just consumption of geospatial data, and correct me if I'm wrong, I don't think it's core to the NASA Science Mission Directorate simply to be the provider of that data: you're producing it, people are using it, fantastic. But we are not there yet, because we need better science and better dissemination. I think climate services and climate risk assessment are right there, and specifically: we have a lot of science already, but it seems really hard to go from academic settings and papers to, okay, I'm company X and I want to assess my climate risk, whether it's floods or droughts or whatever it is, so how do I convert that scientific knowledge into my operations? That's why I like the idea that if you build an output, say flood risk, like we did with one of our datasets on global flood risk due to tidal storm surges and sea-level rise, then people are going to have questions: it's great to have the dataset, but they will ask how you did it, and what if you did it slightly differently. So the idea that you could have the exact environment that produced that output, ready to be deployed on the customer side, or by the government or the city doing the work, and then adjusted to their own needs, because they have different assets or a different source of data, is very powerful. Technologies like Binder (Tom, I don't know whether you're going to talk about it or not) capture this: the idea that you can deploy an environment, a complete workbench, to do that, not only the code but also the infrastructure, infrastructure as code. To me that is extremely powerful, because it would serve the need of anyone who is thinking strategically, and the need of everyone who is going to have to respond to regulations and disclose these climate risks, and this is an area where I would say the market is very immature. Carbon offsets are the same thing: I like to say there is a lack of meaning. What do these carbon offsets mean, what is the additionality of carbon offsets, how are all of these things measured? I think the Science Mission Directorate is core to helping the world answer what they mean and what they measure, and then to maturing those markets. If I had to pick one, it would be that one, Peter.
Great, thank you. Maybe one more, and then I want to leave plenty of time for Tom. I'm going to skip a question here that I think Tom is probably going to speak to, since there are some questions that are a little more on the technical side. But Bruno, before you close out this piece, there's a question around issues that a successful data and computing infrastructure may need to anticipate, and related to those issues, maybe opportunities. Are there any particular issues you would suggest NASA really pay attention to in thinking about data and computing infrastructure?
I think the FAIR principles are at the core of what I know NASA is already doing. In the open science work I've seen many mentions both of FAIR and also of CARE with respect to indigenous communities. Right now, data is often not findable, for example, and it's not interoperable; on all the letters of that acronym, FAIR, we are not there yet. If you cannot find the data easily, or if you cannot connect from whatever environment you're in because you would need to adapt to a closed environment, or to a format that is not an open standard, then it is not interoperable. So what I would say is: please, everything you do, make it open, make it based on open standards, and make sure it is findable. I would argue that discoverability of metadata goes a very long way.

Very good. Thank you, Bruno, and thank you for that presentation.

Well, thank you, and I forgot to say that a colleague from the team is also here; I think she's on the participant list, but I just wanted to give her a shout-out that she has also joined.

Wonderful, welcome.
And with that, perhaps we can turn it over to Tom? Sure, take it away.

Let me do a few things in the chat, which hopefully will go out to everyone. I just put in a link, aka.ms/pc-nasa, which will take you to a JupyterHub that I set up for this, and I'm going to share my screen. If you want, you're more than welcome to go ahead and go there. It's going to ask you for a username and a password; I've forgotten the password, but I think it is NASA; let me just double-check. Yeah, so it's going to ask you for a username, so put whatever you want there, and then I'm pretty sure the password NASA is going to get you in. Yeah, that should do it. So use a unique username, and the password, again, is NASA, which I'll put in the chat as well once I find that screen; WebEx has rearranged my windows, sorry. Where did it go? Okay, here we go, chat: the password is NASA.
While you're doing that, thank you, I'm going to show just a couple of things as your servers are spinning up. I should mention briefly: this is a JupyterHub that I deployed for this session, running on Azure Kubernetes Service, so it's like any other Kubernetes service out there, and the idea is we're going to go through some data-analysis-type workloads, fetching some data from the Planetary Computer and then making some pretty pictures with it. A couple of things to mention. Bruno covered the components of the Planetary Computer; we'll be mainly interacting with the STAC API, which is actually the same thing this HTML page is generated from, but we'll peek at the data catalog briefly to understand what data is available. I'll also mention a bit about the setup we have here while your servers spin up. The main idea with our compute is that we don't really care how you do the compute, as long as it's on Azure; that is essentially the big thing, and it's less for a make-money reason than because it is flat-out the most efficient way to get to the bytes. The bytes are all in a storage container (it's like an S3 bucket, but in Azure Blob Storage), all in a single data center, and if you want the fastest, most efficient access to the data, you're going to want to put your compute in that exact same data center, which happens to be the West Europe Azure region. So that's the setup we're connecting to: I'm here in my local browser, you all are in your local browsers on your own home networks or whatever network you're on, and then running inside Azure, in the same region as the data, we have this JupyterHub that we're going to connect to. When we're doing compute, when we're pulling data from the blob storage containers, we'll have a nice high-bandwidth connection for those large datasets, and the results we bring back to our local clients, a summary statistic, an image, a plot, are going to be much, much smaller, so those are fine to send over the public internet. We also have Dask here.
Depending on time (I don't actually know exactly how much time we have, so I'll wait for you to interrupt me once I go on too long), what we have Dask here for is scalable computing. It's just one of many, many ways to do scalable computing on Azure; I happen to be most familiar with it.
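As a hedged sketch of what that looks like in practice: on a hub that has Dask Gateway configured (as the Planetary Computer Hub does), spinning up a cluster is roughly the following; if your hub uses a different Dask deployment, the cluster class will differ.

```python
# A minimal sketch of scaling out with Dask from a hub configured with Dask Gateway.
from dask_gateway import GatewayCluster

cluster = GatewayCluster()        # picks up the hub's gateway settings
client = cluster.get_client()     # Dask client bound to this cluster
cluster.scale(4)                  # request 4 workers; adaptive scaling also works

print(cluster.dashboard_link)     # watch tasks run in the Dask dashboard
```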
Hub uh we're actually using a separate Hub just because uh I didn't want to all of you to have to
deal with like um uh signing up for an account if you are interested uh we'll share the link at
the end but um you can sign up for an account and we can get you all approved but for now I just
set up a temporary one that you can all log into um okay I think that's it for introductory
stuff hopefully uh everything's going okay um I actually don't know if if you all can chat
but if you are having issues then uh somehow no uh alert me that uh stuff is breaking uh and I'll try
and fix stuff but I'll assume uh some people have successfully logged on um the just to mention
like this setup here you know one of the nice things about the cloud is the kind of pay as you
go Computing uh scale down to zero so uh it might take like a few minutes um I had it completely
scaled I had this completely scaled down but as you request a uh request a a pod a notebook
server here it's gonna automatically start up virtual machines and then uh then your thing will
come in okay we got some time great thanks Peter um excellent so you should be seeing I
Excellent. So you should be seeing (I apologize for the unrendered markdown) a page that looks like this. We're going to go through an example here, starting with this one about using the STAC API. If you all could click on the folder icon here: by default you're in the Planetary Computer examples repo, then go to "quickstarts," and then (I'm going to make this just a tiny bit smaller) you want "reading STAC." There's also a "reading STAC - R" example, but we'll be using Python, since that's the environment I have selected here, although STAC, as we'll see, is a cross-language standard. Okay, hopefully things are working for people; I'll assume they are. Great.
So, like Bruno mentioned, we have all this data in Azure Blob Storage, which is great; it's a lot of work getting the data there, with great help from our partners, but just having the files sitting in blob storage is not enough, we think. It's still too difficult to use that data. Just think of the simple example of "give me all the Landsat images over Washington for 2020": you'd have to be very familiar with USGS's particular naming scheme, how it encodes the WRS paths and rows, the various levels and processing modes and so on, to figure that out. So to avoid that pain, what we use instead is STAC, the SpatioTemporal Asset Catalog. It's an open specification for cataloging spatiotemporal data. Previewing a question I saw about whether the Planetary Computer is focused on just Earth data or whether it can be used for other bodies, like the Moon or Mars: people have adapted STAC to make it work for other bodies. I don't know exactly how that works; I think they have to do some hacky things with coordinate reference systems, so in principle it could, but currently all of our data is Earth-focused. Okay.
So we'll go ahead and go through this: the Planetary Computer STAC API. We're going to be using pystac-client. STAC is a whole standard for how to catalog spatiotemporal data, but by far the most useful thing it does is let you search that data, actually query it. That lets you do things like, in this case, look at an area around Microsoft's campus in Redmond, Washington; we have that bounding box there, and we're interested in scenes from December 2020. The query I was posing earlier becomes pretty straightforward to write, in this case in Python code. So we make that search, and we get back the eight items that match our query for Landsat scenes over Redmond, Washington, in December 2020.
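The notebook's exact code isn't reproduced in this transcript, but a search along those lines looks roughly like the sketch below; the bounding box, dates, and collection id here are illustrative assumptions:

```python
# A rough sketch of the search described above (the exact notebook code may differ).
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)

search = catalog.search(
    collections=["landsat-c2-l2"],               # Landsat Collection 2 Level-2
    bbox=[-122.16, 47.63, -122.10, 47.69],       # assumed box around Redmond, WA
    datetime="2020-12-01/2020-12-31",
)

items = list(search.items())
print(len(items), "items found")
```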
Yeah, if you do have any questions, throw them in the chat and I'll try to answer them as we go. So we got those eight items matching, and you saw it returned really quickly, in less than a second or two; we haven't actually loaded any data. What we've done is use the STAC API to query the metadata for scenes that match our query. If we look at those items, they're GeoJSON features, so they have things like a geometry and all the other stuff you'd expect from a GeoJSON feature, and then they're extended with a whole bunch of other information: the platform it was captured on, the datetime or date range it was captured over, projection information, all the useful things you need to work with this data, including the data provider, in this case USGS, and an estimate of cloudiness for each of these scenes. So you can very quickly filter out cloudy images and select the least cloudy one.
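Continuing the sketch above, the "pick the least cloudy scene" step can look like this; it assumes the items carry the standard eo:cloud_cover property, as Landsat items on the Planetary Computer do.

```python
# A small sketch of filtering and selecting by cloudiness, continuing from the
# earlier search sketch (the `items` list).
clear_items = [i for i in items if i.properties["eo:cloud_cover"] < 20]
least_cloudy = min(items, key=lambda i: i.properties["eo:cloud_cover"])
print(least_cloudy.id, least_cloudy.properties["eo:cloud_cover"])
```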
But STAC, I should mention, is a metadata standard; it's all about linking out to the actual data. The items we got back have one or more assets on each item, and each asset is an individual file. In this case, Landsat Collection 2 Level-2, most of the interesting data assets are the Cloud Optimized GeoTIFFs that we can access really efficiently, and you also see things like the metadata files linked there as well. One of the assets we can use here is the rendered preview asset; you can see it's a link, and actually a link to a data API. This is the same thing powering the Explorer that Bruno used: the Explorer uses the STAC API to query (it looks at where your window is, over Pakistan say, gets the bounding box in latitude and longitude, and uses that to make the queries), and then the data API is responsible for returning the actual images that match your search. So in this case, here's our least cloudy item over Redmond, Washington.
We can also access the data, after we do one thing: we need to sign it. The actual assets themselves are in private storage containers, just to keep an eye on egress, but we do allow anonymous signing of tokens. You all didn't sign up for Planetary Computer accounts, and we didn't provide any kind of API key here; all we need to do is make a request to the Planetary Computer's SAS API, and that gives us a token we can use to read the actual data. So we'll sign the item, which makes an HTTP request in the background to the SAS API, and it gets us back a signed href: a URL that has the typical pieces for Azure Blob Storage, a storage account, a container, the path to the Cloud Optimized GeoTIFF, and then everything else is a read-only token. Now you can pass this URL off to anything that can read data over HTTP. In this case we're using rioxarray to read it into an xarray DataArray; we can also use things like QGIS or rasterio, and R uses GDAL, or something built on top of GDAL, I think it's stars, I can't keep up with the R community. But anything that can speak HTTP can now access the data.
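As a hedged sketch of that sign-then-read flow, continuing from the earlier search sketch; the asset key used here is an assumption about the collection's asset names:

```python
# A sketch of anonymous signing followed by a read over HTTP. The asset key
# ("red") is assumed; check the item's assets for the real names.
import planetary_computer
import rioxarray

signed_item = planetary_computer.sign(least_cloudy)   # SAS token added to hrefs
signed_href = signed_item.assets["red"].href          # URL with read-only token

red = rioxarray.open_rasterio(signed_href, overview_level=2)  # small overview read
print(red.shape)
```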
And the last thing worth mentioning is that you can really efficiently make data cubes out of these STAC items. The items themselves have enough metadata that we can mosaic them together in space and stack them through time, with all our bands, to very quickly create a data cube, which I think is just great. We've gone from thinking about low-level details, like the exact naming scheme USGS uses for these files, how to lay them out in space and time, and how to read them and understand their spatial extents, to not worrying about any of that. Instead we use higher-level concepts like searching by space and time and cloudiness, get back a bunch of rich metadata that describes the assets, and based on that metadata alone we get these really nice, convenient, high-level data structures, like a dataset that we can work with to actually analyze the data.
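One way that data-cube step can look, as a sketch: stackstac is one library that builds xarray cubes from STAC items (odc-stac is another); the asset names and resolution here are illustrative.

```python
# A sketch of building a space/time/band data cube from signed STAC items.
import planetary_computer
import stackstac

signed_items = [planetary_computer.sign(item) for item in items]

cube = stackstac.stack(
    signed_items,
    assets=["red", "green", "blue"],   # assumed asset names
    resolution=100,                    # meters, in the items' projected CRS
)
print(cube.dims, cube.shape)           # e.g. ('time', 'band', 'y', 'x')
```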
I'm going to skip through most of the rest, just for time, because I do want to get back to questions, but you can search on additional fields like cloud cover; this varies from dataset to dataset, and we'll see an example of it in the next notebook. I'll skip through all of this STAC metadata material, but I briefly wanted to show that STAC isn't specific to Cloud Optimized GeoTIFFs or raster data. STAC only cares about files; it's all about linking to assets. In this case we're using Daymet data, daily North America Daymet, and if we look at the link here, it's a link to a file (well, actually a directory) in Azure Blob Storage that is a Zarr store, and we can go ahead and load that up very similarly to how we did the other example. So STAC is a very flexible metadata standard; for the most part it's really just focused on spatiotemporal data. If you have data with a spatial footprint and a temporal timestamp or range, then STAC is a great way to catalog it.
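A hedged sketch of opening a Zarr-backed collection like Daymet with xarray is below; the collection id and asset key are assumptions about the catalog's naming, so check the collection page for the real ones.

```python
# A sketch of loading a Zarr-backed dataset (like Daymet) referenced from STAC.
import fsspec
import planetary_computer
import pystac_client
import xarray as xr

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
collection = catalog.get_collection("daymet-daily-na")             # assumed id
asset = planetary_computer.sign(collection.assets["zarr-https"])   # assumed key

ds = xr.open_zarr(fsspec.get_mapper(asset.href), consolidated=True)
print(ds)
```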
Okay, I'll pause there if there are any questions I can answer now, and then we'll jump into a more fun example that will take 10 or 15 minutes.

Very good, and there are some questions. Folks have been pretty interested in some of the details you've been covering, but let me take a step back, actually. There's a question here that's a little broader that I think you'd be able to speak to really nicely, and it has to do with best practices: are there particular best practices you would highlight for an open source science architecture, for data and computing infrastructure for open source science, or for open source science operations?

Yeah, that's awesome; let's see.
I would say, and my background is in open source, which is, you know, woefully under-maintained, with maintainers who are always super stressed out and burned out, that as much as possible, and I understand people are busy, you should be involved with the open source libraries you're building on. It can be hard to justify, especially in the short term, spending time on open source, not necessarily even working on features, but just being involved in the discussions there. I think it can be really valuable, both for you, because you understand where the projects are going, and for the community, because your feedback as a user applying this in practice at scale, as NASA is doing, can be super valuable. The other thing is that the hardest part about open source is that you can do anything, but the most successful open source stories I've seen have always been about individuals who bring groups together and can coordinate: groups that might have different priorities but have enough overlap in those priorities that they would all benefit from coming together. So, yes: be involved, and as much as possible be especially involved on the coordination side of things.
Great. So, building on that, let me shift to existing approaches to open source science and the data and computing infrastructure to support it. Are there any particular approaches you would suggest, or really want to flag, for the NASA work here?

Yeah, I think Pangeo is the go-to example: a group of people who were just trying to do geoscience on the cloud hit upon this idea of a JupyterHub deployment in the cloud, using Kubernetes, that scales with Dask. I think the initial version was hacked together in a weekend by a few people at some conference or workshop, I can't even remember which one, but that idea has gone a long way toward the work they've been able to do, and it's what we do here with the Planetary Computer Hub that we provide; lots of people are also deploying their own hubs to customize the software environment or various things around it. So I think that is a good go-to example. The benefits are that you don't need to expose every individual to, say, the challenges around cloud subscriptions, getting the billing details right, or deciding which service to use; you save them from that and present them with a simple login and an easy way to scale their compute. That said, there are tons and tons of services that go beyond the core interactive computing environment that JupyterHub handles so well. That is both its strength and its weakness: there are fewer options around things like job scheduling and batch-workflow-type work within JupyterHub itself, and how you complement those with other open source or cloud technologies is kind of a different can of worms. So Pangeo, I think, is a good place to start.

Great, thank you.
Maybe one more question, in a similar vein to the previous two. You've worked on open source science, and reproducibility is a key aspect of open source science. How is reproducibility of results, and by extension decisions, supported by the Microsoft Planetary Computer?

Yeah, there are a couple of answers. At a very surface level we can say it's ideal: you can have a notebook, you can share a link to it, and, like Bruno showed, we have those examples where you can click a button, launch it in the Hub, and be off and running. I don't want to undersell that; it's not nothing, it's a good accomplishment and a good first step. But it is just a first step. If you want to fully lock down the software environment, and the services running in the background that are potentially being used but aren't necessarily encapsulated in that shareable link; and then there's the data: what happens if USGS decides to reprocess some scenes? What happens to the data? Do we update it to follow USGS, or do we make a new version, and how do we do all of that? So I think there are tons of questions around some of the trickier problems of reproducibility that I don't think we, or anyone else, have a good answer for yet. If you're interested in this, I'll bring it up here: there's an interesting discussion and working group forming around this on the Pangeo Discourse; let's see, yep, I'm going to post this in the chat. I think it's a good summary of where things are at. This one is maybe a bit more focused on education, but I think that's a prime example of reproducibility.
Great, thank you. I know you have a couple more slides you wanted to get to, but let me ask one more question here that is really kind of fascinating; thank you to the folks who put it in. If we were to share NASA data and software on the Planetary Computer, how would this broaden accessibility to communities who ordinarily would not be able to use NASA data and tools?

Yeah, I think there are a few things there. First of all, there's just the bandwidth question: if you have the data on some server, an FTP or HTTP server or whatever, even if it's publicly accessible, bandwidth can be a challenge, especially at scale. If you have data in the cloud, then there's at least the potential for anybody to use it, because there's the option of locating the compute with the data. That shifts the question a bit to how those people get access to compute, and with the Planetary Computer you just sign up for it, similar to lots of other services. So I think that's a good first step toward broadening that access. I think Bruno might have something to add as well; happy to have you chime in here.
I'm actually at my mom's place right now, and it's almost a dial-up connection. If I didn't have the Planetary Computer, I would need to go to one of the providers, download the file, open QGIS or R, and it would have taken me, I don't know, three hours, if I could do it at all, versus the ten seconds it actually took, because most of the work happens in the cloud itself, not in a closed sense but in a helpful way. So people with slow bandwidth, or who are far from the cloud, can still access this. The other side is that it then becomes Microsoft's own interest: now that we host the data, we have a strong interest in making sure people use it, and if you think of all the clients our company has, it's a tremendous platform for making sure the datasets we host actually get used, because our incentive is now for them to be used. There are a lot of datasets we could bring on board, and the criterion we have so far is whether they are useful; and because we don't really know how to measure whether they are useful, our proxy is whether they are used. So our metric of success is to make sure these datasets are used, and our field teams, there are tons of people, are now trying to figure out, hey, who can use this data? So, a long way of saying: it becomes our incentive to make sure this data is used.
Bruno I think that additional perspective is is really helpful um and and I love the example
of your mom's house right now I mean those are those are uh really important comments and we
have a couple folks in the in the chat who are agreeing with that um Tom I know you had a couple
of other things you wanted to to cover why don't we turn it back to you for that sure and I'll try
and be quick just to get to some more examples uh or sorry more questions um if you jump back up a
level, so you were in Quickstarts; if you go back up a level we'll go to Tutorials, and then there's a fun, well, pretty Hurricane Florence animation example. So that's again under Tutorials, this Hurricane Florence animation; you can check out what we'll be making, but we'll actually do that live here. Yeah, so the idea behind this one is based off an example from Pytroll, if you've used that library: it's loading some mesoscale data from GOES to visualize Hurricane Florence. So first of all you kind of have to figure out where the storm was and when, and that's what this call is for,
which, hopefully, this one's kind of taking a while, hopefully we get the dataset downloaded. We do not yet have this best track dataset in the Planetary Computer, so we have to hit NOAA's servers for it, which maybe is kind of a demonstration of why having all the data in the same place is a good idea. If this fails I do have the latitude and longitude stored in another notebook, so I might bring that up; if this fails entirely then at least I'll be able to do it and you all can copy the latitude and longitude, but I've got to set up another thing for that. I'm going to assume that this failed and interrupt it, maybe that's a bad idea, yeah, it's just downloading the data. Okay, well, wait a sec while this comes up, and then we can avoid hitting those servers. That's the other nice thing about Azure Blob Storage, any blob storage service really: they're built to scale, built to handle many concurrent requests, so you don't have to worry so much about a single user or a few users knocking the service over, like we appear to have done to NOAA, oops. Give me one sec while this comes up. I can actually show this over here, here's
my other JupyterHub, the real one, where I have an example from NOAA's EDMW workshop, and I'm just going to copy-paste this over to this window. Okay, so you all can skip this example where you download this stuff; instead, skip cell two, I guess it was, and skip this one as well, and apologies for this, I should have planned ahead. We're going to skip all that stuff and just skip to getting the imagery, perfect. And y'all will need, I'm going to post it in the chat here, we'll see how badly this gets formatted, seems to be okay, hopefully the quotes are all real quotes, if you want to follow along; otherwise I'll just go through it pretty quick, just for
time, sorry about that. But we somehow have magically discovered the bounding box and the datetime for where and when this storm was. So we're going to go ahead, now that we know where it's at, and query the Planetary Computer STAC API; we don't have to know anything about how GOES organizes its data, its file names, or things like that. All we need to do is query the GOES-CMI collection, the GOES Cloud Moisture Imagery collection, for assets within this bounding box over this date range, and we're interested in just the mesoscale images: GOES is capturing CONUS and full-disk images at about the same time, and we only want the mesoscale images from when it was zoomed in on Hurricane Florence. I don't think I timed that, but you can see it's already finished, so it's a couple of seconds and we've got back these items that match our query. If we very quickly check and make sure we're in the right spot, let me make this a bit smaller so we can see it, you can see that we're in the right spot.
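For readers following along in the notebook, a rough sketch of that query step might look like the following. This is an assumption-laden reconstruction, not the exact demo code: the collection id, the bounding box, the dates, and the mesoscale property filter are stand-ins for the values used live.

```python
import planetary_computer
import pystac_client

# Open the Planetary Computer STAC API, signing asset URLs as we go.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

# Hypothetical bounding box and date range for Hurricane Florence (Sept 2018).
search = catalog.search(
    collections=["goes-cmi"],                 # assumed collection id
    bbox=[-81.0, 24.0, -75.0, 32.0],
    datetime="2018-09-11/2018-09-13",
    query={"goes:image-type": {"eq": "MESOSCALE"}},  # assumed property name
)

items = search.item_collection()
print(f"{len(items)} mesoscale items found")
```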
Okay, let's see: GOES does not have a green band, I think, so we're going to do a bit of xarray work to make a synthetic green band out of the near-infrared, red, and blue bands; we're going to do that here, and then a bit of work to make the picture look pretty, I don't know how scientifically accurate this is, but some kind of gamma correction, to get a time series of RGB arrays that we can then plot. I'm going to very briefly show this.
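As a hedged sketch of what that band math could look like in xarray: the weights below are one common linear combination for approximating a green band from GOES ABI channels, not necessarily the ones used in the demo, and the input DataArrays are assumed to have already been loaded from the signed asset hrefs.

```python
import xarray as xr

def synthetic_green(red: xr.DataArray, blue: xr.DataArray, nir: xr.DataArray) -> xr.DataArray:
    """Approximate a green band as a weighted sum of red, blue, and near-IR."""
    return 0.45 * red + 0.45 * blue + 0.10 * nir

def to_rgb(red: xr.DataArray, green: xr.DataArray, blue: xr.DataArray,
           gamma: float = 2.2) -> xr.DataArray:
    """Stack bands into an RGB cube and apply a simple gamma correction."""
    rgb = xr.concat([red, green, blue], dim="band").clip(0, 1)
    return rgb ** (1 / gamma)
```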
If you copy-paste this URL into the Dask dashboard, this is an example of computing on the data in parallel. We're computing in parallel on a single machine using, I think, four threads or processes with Dask; the setup that we have here is a Dask Gateway, so you can easily scale out on a cluster of machines. So it's kind of working through this computation here: reading data from Blob Storage, doing the linear combination to make that green band, doing the stacking, things like that, and then we have a bit of matplotlib, a lot of matplotlib stuff here, to make the animation and then embed it in the notebook.
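For orientation, a minimal sketch of that parallel setup, assuming a local machine rather than the Hub's Dask Gateway; the `frames` object is a hypothetical stand-in for the lazy, Dask-backed stack of RGB arrays built above.

```python
from dask.distributed import Client, LocalCluster

# Four local workers gives the single-machine parallelism described in the demo.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
print(client.dashboard_link)  # paste this into the Dask dashboard to watch progress

# On the Hub you could instead use the Dask Gateway (an assumption about setup):
# from dask_gateway import GatewayCluster
# cluster = GatewayCluster(); cluster.scale(8); client = cluster.get_client()

# frames is assumed to be lazy; .compute() runs the Blob Storage reads and the
# band math in parallel across the workers.
# rgb_frames = frames.compute()
```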
I'll stop there, and we'll go back to the original animation up top and play that. So this is, it's actually a bit longer than what we were making there, but, well, it looks really amazing, I think; hopefully it's also scientifically useful, I don't know, you all can tell me whether or not it's scientifically useful, but I think it's pretty cool. Okay, again, sorry for the issues there with the NOAA servers; I'll have to send them an apology note afterwards and get that dataset onboarded. So with that I think we'll jump back to questions, if there are any. Yes, there still are some, but thank you
for that demonstration; it certainly does bring up the speed that you're capable of doing that at, and also, in a sense, the "aha, wow" at the end is quite helpful. In a sense going back to that: you've got this amazing tool that you can work with, so how would you suggest going about designing a data and computing infrastructure that's capable of supporting the principles of open source science? NASA has mission needs around open source science: transparency, accessibility, inclusivity, reproducibility, which we talked about earlier, but this question is really about how you would go about designing data and computing infrastructure to support those mission needs. Yeah, so there's definitely the kind
of low-level things like co-locating data with compute and, you know, cloud, or wait, I shouldn't assume cloud, but efficient access to data, which are important. But I think even more important than that is really well-structured, standardized metadata. For the Planetary Computer we're using STAC; I know NASA's CMR also has some STAC things, I don't know the full details, but there are people at NASA who are familiar with STAC and involved with it, so that's great. Having that metadata makes the data actually searchable, queryable, discoverable by your users; that's been extremely important for us, and I think it's important for any kind of collection of datasets. And then I think maybe even more important than that is the educational material, the educational side of things. So with the Planetary Computer, and I think similarly for NASA, we're in kind of an interesting spot where there's a tension: I have this tutorial for making this animation of Hurricane Florence; would that be better suited for going in, say, the matplotlib gallery, or xarray's docs? We have all of these pieces, these open source components that we're building on, that we're bringing together for this specific use case, and it can be hard to know how you balance improving the documentation and examples for those libraries, those components that you're building off of, versus building your own thing. So that's a tension that we're facing, and I think NASA would face too, as I'm guessing you all have a bunch of documentation that's specific to your computing and data analysis platforms, which may or may not be, I think a lot of it is, open source. So how do you balance that versus improving the documentation of the upstream libraries? And then there's absolutely that need for cross-cutting, high-level examples that use all of these things, and so where does that go? So that's a thing that we've been thinking about and, I think, not completely solved yet. Wonderful, and yeah, it's really helpful to
know the kinds of things that you're running into and saying, hey, we haven't solved them yet; that's always good for all of us to keep our eyes on. Are there particular advantages, and I apologize if this seems like a loaded question, but particular advantages that you see when you think of the Planetary Computer over Google Earth Engine, for example? Yeah,
yeah, so advantages and disadvantages for sure. A lot of the advantages, and given the forum I think it's fair to say open source is of interest to me personally, to the Planetary Computer team, and to you all presumably, since that's in the title of this session. So the Planetary Computer is built on open source components and it is open source itself: all of our STAC API and metadata generation, all those things, the Hub deployment, if you really want to look at that AKS deployment, that's all open source. And so that's, I think, an important component which gives you all the flexibility: this example is using Python and xarray, but if you want to use R and sits and all these other libraries for doing your analysis, then absolutely go for it. And like I mentioned, there are also disadvantages; Google Earth Engine, I'm not going to throw any shade, it's a really amazing product, they do a ton of things really well. So yeah, Bruno, I think I'm
guessing you want to say something here too. Yes; first of all, Google Earth Engine is a fantastic product that has helped advance tremendously what we can do with remote sensing for, what, ten years, so not throwing any shade at them, it's a great product. As Tom said, there are some differences, or disadvantages, whatever you want to call them. I like to think that if they had to build it today, they would probably build something very close to what the Planetary Computer is today; I don't know, maybe they have a different answer, but many of the things that we are using now did not exist back when they had to build it. Something we also like to highlight, which I think answers some of the questions, is that if there is something we are not doing that you want to do, you can do it, because it's open and it's modular: hey, there's this dataset you don't have and I really need it, put it in the same data center in your own tenant, ingest it with STAC, and it's going to be one hundred percent the same as if it was ingested by us. That also covers some questions I saw on the list about whether you can use it for Mars or the Moon; you will need to hack the specification a little bit, I think, and I saw a talk at FOSS4G, the conference that we talked about, with people using the STAC spec for other planets. It's doable, and again, because it's open source, if we don't do it you can do it, and it's exactly the same: you know what's going on, there's no black box here. Very good. Bruno, one of the questions here
may be something that you want to speak to very directly: NASA has a Space Act Agreement with Microsoft; can you comment on a possible way forward for collaboration between NASA and Microsoft on the Planetary Computer? We already have some conversations with some of your colleagues to figure out how to leverage that coordination into hosting or doing projects together, on pilots. I would say, if you have something specific in mind, we're happy to take it on and start another thread, but we are already doing some of that. Wonderful, thank you. There's another question,
again it's kind of the basics: how does Microsoft fund and sustain the activity? And that's the golden question, and I think it's also a golden question for NASA, and I think it also gets into: does NASA want to be in the business of disseminating all of these data products for everyone? I think the answer is probably not. If there are commercial customers who want to depend on a dataset, there's probably an opportunity for a company like ours, or other cloud providers, to say we will host it and we'll provide it for you; of course we depend on you, because you are the providers, but we will then cover the elasticity and the one-to-many needs. And that's a little bit how we think of it: when you use the PC Hub, which we just talked about, we think of it as a reference implementation, we think of it as academic use, we think of it as NGO use, but if you are a commercial company using the Planetary Computer, we would really encourage you to deploy your own Planetary Computer, your own Pangeo, and that means that you will pay for that. There's no extra cost for using Pangeo because it's open source, but you will generate consumption, so it becomes part of the offering, just like you can deploy it on a Linux machine in Azure; it's part of the business model of the cloud to actually have the majority of the VMs running Linux, and it's a business model built on that. That's also why I made the comment before that it becomes our incentive to disseminate and make this data useful, because if it's not, well, we are not in the business of archival for archival's sake, we can't be, right? We got into the business of figuring out how the resources we're putting in to pay for this stuff are leveraged into more revenue for us, and I genuinely believe that that is the case, otherwise solutions like this one, or Google Earth Engine and others, wouldn't exist. Very good. There are several folks
who are interested in collaborating on specific topics; we've heard about looking at other planets, Moon, Mars, etc. "I'm interested in putting space weather data on the Planetary Computer," one person says, "do you have any thoughts on how the capability would be useful?" So, again, looking up or out, as opposed to looking more down. We're going to need to change the name to, you know, Cosmic Computer or something. Yeah, no, the answer is that the metadata, I think, is going to be a bit tricky; I don't think it's impossible, I think what happens is that the majority of developers are looking down, satellite-style, right, but I see no reason not to. When I was doing solar physics during my PhD we also used some of the tooling that was meant for Earth for mapping the surface of the Sun. I would say let's do it, and if you have questions, put them on the discussions, we'd love to see that; that's the power of being open, we haven't thought of that use case, just go through it. And if you cannot use the STAC specification, that's fine, use any other specification; that's also the beauty of being modular: if you put it in the same data center and then you want to use the PC Hub, do it, it is meant for these use cases, and we'd love to figure out what crazy, hacky things people do with this. Very good, so it sounds like it might be
an interesting opportunity to explore. Very good, so we're getting close to the end of our time, and I want to make sure we close things out nicely here. It's three minutes left, please stop us, oh my god; let's pause for a second: have we answered the questions you asked us? If not, happy to try again. I think, as I've been listening, that you've done a really nice job of speaking to those in a really spot-on, thoughtful, and also succinct way, which is always a challenge, you guys have a lot going on, so I really appreciate it myself. And I would just offer an invitation to anyone, whether it's on the panel or in the audience: if there are some follow-up questions we'd be happy. Oh, I'm seeing we still have an hour, do we? Let me just ask openly, Hannah, am I wrong, I thought we were closing at the bottom of the hour; how much time do we have to hang out? Yeah, today was a longer discussion, so we have until 4:30 today if we want, to allow for facilitated discussions, my apologies, we have plenty of time. Yeah, so anyone who was thinking of a follow-up question, now is a great opportunity, because we're not going to let Bruno and Tom go just yet. Tom, I think there was that question about compute on Dask. Yeah, I was just going to say that, but let's be candid and let's be open, we have plenty of time. So I've got a question, please. This is Kevin. So, you know, we see a lot of, not a lot,
like one: you have a pretty impressive system, right, and I think that there are a lot of potential datasets to address these types of questions related to earth science or applications or climate. So my question is, how do you knit that together? Right, like the whole conversation here is how can NASA be internally better, but I think part of being internally better is making our systems work better with other systems in an interoperable sort of way, right? I don't know if everybody's ever going to be able to put all their data in one spot and do all the analysis in one location, so as we work with ESA, we work with NSF, we work with NOAA, USGS, all these people, with you, with Google, how do we make that a little bit better? It's a really good question, Kevin, and maybe it's good to have the most-used data for the most people, kind of like a CDN cache, in one place, but
there's going to be data that we don't have, and that's when that idea of federated STAC, like a ring of STAC endpoints, could be helpful, because you might be searching for something we don't have, but if we have the metadata it already gives you the lead on where to go. Maybe the answer is not the bytes; maybe the answer is "email Kevin," or go to some other page, not the data itself but somewhere you can find it, right? I dream that, for example, if NASA were to provide the data along with the STAC specification, like a static file, it would probably make our lives much easier, right, Tom, so we don't have to make the schema ourselves. If you also have an API with the data, you might not even have it connected to the data itself, but then if we get a query, imagine we are asked for something that we don't find, we might redirect to your API, and maybe the answer is that they have it, it's not online, but they have it somewhere else. That kind of coordination among data providers and cloud companies is probably beneficial for everyone: we get the bulk of the demand for requests, and then you only get the ones that are specific to the more niche applications or the datasets that are harder for us to host.
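As a hedged illustration of that "ring of STAC endpoints" idea, a client could simply query several catalogs in turn and report where matching items actually live. The endpoint URLs below are assumptions for illustration (the Planetary Computer API and a NASA CMR-STAC root), not a verified federation setup.

```python
import pystac_client

# Hypothetical ring of STAC endpoints; swap in whatever catalogs you actually use.
ENDPOINTS = [
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    "https://cmr.earthdata.nasa.gov/stac",   # assumed CMR-STAC root
]

def federated_search(bbox, datetime, collections=None, max_items=10):
    """Query each catalog in turn; yield where each matching item's assets live."""
    for url in ENDPOINTS:
        catalog = pystac_client.Client.open(url)
        search = catalog.search(bbox=bbox, datetime=datetime,
                                collections=collections, max_items=max_items)
        for item in search.items():
            # Asset hrefs tell you which provider actually holds the bytes.
            yield url, item.id, [asset.href for asset in item.assets.values()]
```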
So, no, I'm probably going to speak out of turn because I too do PowerPoints and not technical stuff anymore, but I do think that the API reference for CMR has a STAC description in there, so you might want to take a look at that. Yeah, that is correct; it is a bit out of date, I don't know, not quite up to STAC 1.0, but I'm guessing people are working on it. Yeah, it's great and it's fantastic to have.
But I think the bigger question that I have is, okay, so we've got STAC catalogs and this type of thing and that type of thing, but I think how we coordinate that strategy across the organizations is an important point, right? Like, we're talking to you, but who's talking with you and me and NOAA, you know what I mean? It's that coordination activity to say, hey, look, maybe this is the way we should structure this ecosystem a little bit, not necessarily being prescriptive, but giving some options would be something helpful. One thing we do is be opinionated; we don't shy away from being opinionated, and we are Microsoft, you are NASA: if NASA is opinionated, it helps lean the weight in one particular direction. We decided that STAC was a good standard that the community was using, and now it's not only the community, it's also Microsoft that is putting that together, which probably helps push the specification even farther; if NASA then also embraces it, it goes in that direction. So I think maybe if we are a little bit more opinionated, at the risk of some other datasets being harder to put in STAC, it might help increase usability, but as I said, it's a trade-off; it's hard, and this is what becomes a little bit tricky, it's hard to cover everyone's needs. I do know that cloud-optimized GeoTIFFs are great for some things and not good for others, or that GeoParquet is great for some things and not others, so sometimes we have really good discussions and sometimes uncomfortable discussions about choosing a winner on these open source standards, but I think it's still worth it. I was wrong about the STAC thing, by the way; it's not out of date, it's been recently updated, apparently, so, fantastic. Very good, thank you. We still have a couple of
updated apparently so fantastic very good thank you um we still have a couple of
questions that have flowed into the to the chat here um one that's kind of a follow-up to to
uh one that came up Tom while you were talking um it talks about NASA's earth science data
to correctly use NASA's earth science data for research it's really important that
researchers are familiar with the product documentation and that users are aware of
the Quality fields and values as well as the product metadata product version that kind
of thing what approaches are you using to make this info information easily findable
and accessible for the users by the users um yeah uh super important um and as we like uh
add these datasets we become pretty familiar with all of them, and I'm consistently amazed at how complicated and intricate each of these datasets is. Anyway, I think really our only answer is to do tons and tons of linking back up to the upstream providers, both in whatever prose narrative we write in our example notebooks, and also in structured ways in STAC: STAC has a structured place to put the scientific citations back to the original papers, places to put the links and licensing and all of that. So I think at a minimum that's necessary and that's what we're doing, and then if you all have suggestions on how to better surface that critical information, I am all ears to hear how to do that better. So, the thing that we are not doing yet, as far
as I know, correct me if I'm wrong, Tom, is provenance, where the data exactly comes from, and I had this idea, I don't know if people like this idea, to add a metadata tag that provides the MD5 hash of the file at the source, so you have a kind of chain of integrity: you can have the MD5 of our file, but you also have the MD5 hash of the source. Yep, yep, so we are planning to add that at some point; that's one of our work items that I think will help a lot, especially when the upstream providers have STAC metadata, because there's again a structured place to put that information about the files themselves, the MD5 hashes, all sorts of things about the files themselves. And in our STAC catalog we do this for Landsat 8, because USGS has STAC metadata: we have a link, a way to indicate that this is the upstream provider's STAC item, so you follow that, and then that STAC item has links to their assets on the USGS server, so you can track it back perfectly. Again, there are issues around, well, what if the data changes, what if they update the data, and there are again STAC extensions for versioning, so it's infinite complexity, but there is, I think, a path forward.
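A minimal sketch of the provenance idea Bruno describes, using pystac: record a checksum of the source file on the item and link back to the upstream provider's item. The property name ("source:md5"), the link relation used here, and the upstream URL are illustrative assumptions; the Planetary Computer's actual implementation may use the STAC file-extension checksum fields instead.

```python
import hashlib
import pystac

def add_provenance(item: pystac.Item, source_path: str, upstream_item_url: str) -> pystac.Item:
    """Attach a source checksum and an upstream link to a STAC item (sketch)."""
    with open(source_path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()

    # Hypothetical property name; real catalogs may prefer the file extension's fields.
    item.properties["source:md5"] = md5

    # Point back at the upstream provider's STAC item so users can track it.
    item.add_link(pystac.Link(rel="derived_from", target=upstream_item_url,
                              media_type="application/json"))
    return item
```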
Tom, I see one of the questions, and sorry for coming in there, Peter, but it's a question that everyone seems to ask and I haven't been able to answer; it's great to have you, Tom, here live: it's about kerchunk. Oh yeah, actually let me jump over to that screen; kerchunk, okay. Yeah, so, archival data: there's so much of
it out there, and you don't necessarily want to convert all of it to cloud-optimized formats just because of the cost of doing that, and there are so many existing processes built on top of those files in their current formats. So, for those not familiar with kerchunk, and I heard about a similar project, DMR++ I think, some effort at NASA that's very similar to kerchunk: you scan these files, these, let's not call them legacy files, but these not-cloud-optimized files, and figure out where the assets within them are. So take a single netCDF file with many groups, many variables, that's chunked up: where does this temperature variable start, where does precipitation start, in the file, in the byte stream? This is so useful because the performance of a file system like Azure Blob Storage is very different from a local file system. With a local file system you can open up the files and seek all over, and it's not going to take that long, but with a remote file system like Azure Blob Storage it takes a long time to figure out where in the file these different pieces are. So, jumping back to kerchunk and DMR++, and thank you for that, the
idea behind these is to have a pre-processing step where you scan the data, scan each asset, and then write out a sidecar file that has the locations of each variable, each chunk, within that netCDF or GRIB file's byte stream. You end up with a JSON file that's basically a URL, an offset, and a length, and we combine that with the thing Bruno mentioned earlier about HTTP range requests: once you have all of those, you can make range requests and fetch just that data. So you have all the metadata that you need to build your data cubes, you know exactly where in these netCDF files each chunk is, and then you can get cloud-optimized data access to these netCDF or GRIB2 files, files that don't necessarily work well in the cloud. So that's the idea, and as for the reality, kerchunk is very new; it's kind of a project that some folks from Anaconda and the Pangeo community are just hacking on, and we think it's a very, very promising way forward. It needs a bit of work and we're working on that; if you all want to become involved, then definitely do, with examples and fixing bugs and things like that, but it seems like such a promising way to get this, again, not legacy, this non-cloud-optimized data exposed in a cloud-optimized way.
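For readers who haven't seen the pattern, a rough sketch of how a kerchunk reference file is typically consumed with fsspec and xarray; the "references.json" sidecar and the remote protocol are assumptions standing in for whatever a real onboarding step would generate.

```python
import fsspec
import xarray as xr

# "references.json" is the hypothetical kerchunk-generated index mapping each
# variable chunk to a (url, offset, length) triple.
fs = fsspec.filesystem(
    "reference",
    fo="references.json",
    remote_protocol="https",   # where the original netCDF/GRIB bytes live
)

# Expose the references as a zarr-like store; reads now issue HTTP range
# requests for just the chunks you touch.
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)
print(ds)
```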
I just want to share this quote from Paul Ramsey, who invented PostGIS; I don't want to mangle it, but it essentially mentions how there's all this focus on cloud-optimized formats, which is necessary, but the really important thing, and the really challenging thing, is getting clients, cloud-optimized clients, to make efficient use of those well-organized bytes. And that's really what kerchunk is taking to the extreme: what if the bytes aren't well organized, what if they're just as-is, as they were written twenty years ago or whatever, but what if we have a super sophisticated client that's able to do these cloud-optimized requests on the fly? That's the basic idea behind kerchunk and this cloud-
optimized access pattern. And the status on the Planetary Computer is that we are looking closely at those technologies; we are playing with them, and last I remember, Tom, you were saying that we're still trying to figure it out, we have not yet onboarded anything with kerchunk or others, but we are actively seeking feedback from the community. So there, again, as I was saying before about being opinionated: if we, or others, say hey, let's just use kerchunk, it would favor the odds of that standard, but we want to choose something that the community thinks is the right one, right? Yeah, so we have what we call experimental reference files for one dataset, NASA's NEX-GDDP-CMIP6 dataset, where we made reference files for some of those; again, the software's young, there are bugs and errors if we try to do it for every single one, like projections into the future with datetimes past some invalid range, so there's a lot of work to be done, but I think it's really, really promising. Great, thank you. Tom, we wanted to follow up
on some of your experience with STAC: STAC assets and xarray, Dask, matplotlib are still a rather low-level user interface compared to some of the object models of GEE and openEO; do you see a need for a higher-level interface, something with more powerful abstractions that might serve more low-code users? Yeah, maybe this is something
I struggle with, it's like a blind spot of mine, because I like coding; I think it's a very powerful, expressive way to do this kind of analysis, assuming you know how to code in Python or whatever. So I absolutely do think, and I should say there's the other end of it, the Explorer, where there's essentially no code: you're manipulating the UI to generate the queries and then it returns the results, which even for me is extremely useful for very quickly visually debugging things, and it can get you started on a path by showing you the code it used essentially behind the scenes. So I guess I'm not quite sure; I do think there are absolutely needs within the Python community for sure, but other communities as well, for better ways to work with this type of raster data. How do I say this, anyway, I don't want to get too into the details, but xarray Datasets and DataArrays are very focused around the idea of a regularly structured grid, a kind of rectangular, rectilinear data cube, which is very nice for many datasets, but it doesn't accurately capture, say, the path of Sentinel or Landsat as it goes over the Earth. How do you represent that? Is it more like a data frame, or a fancy list, or a tree of data? So anyway, there's lots there; I completely failed to answer your question, because you asked about low code and I'm talking about other very complex code things, so those are kind of my thoughts there. Great, well, it sounds like it's a challenge
on the horizon, yes, absolutely, something that may need some more attention. I apologize, I'm jumping back and forth, I'm moving some URLs over, making sure folks, we had a question come in about when the video might be available and what else might be available. Just real quickly: we are stocking these on our project website, and I just provided the URL to that. We hope to have a video available within a few days, it kind of depends on how quickly we can turn it around, and then that video and the transcript are helping us put together a kind of high-level summary as well, of some of the Q&A and some of the key points that were made, and that will be available a few days later, again trying to tie as much of the conversation and discussion back to some of these main framing questions, but also really trying to pick up the questions that have been coming in from the audience that are kind of above and beyond those framing questions. So with that in mind, let me turn to a couple of those that are above and beyond. We've got a question around support for non-Python users: how do you support that, or will you, if you don't at the moment? Yeah, so there's the Explorer, if you're
not familiar with coding, which is a great way to visually inspect and understand some of the datasets, and hopefully non-raster datasets in the future. Then, as far as discovery of what datasets are available, all of our catalogs are in STAC; STAC is an open specification and there are clients in lots and lots of different libraries, so you can equally well use STAC from any language, really any language that can do HTTP. And we've worked with the developers of rstac from Brazil, from INPE, their space agency, to make sure that rstac, their client library for R, works well with the Planetary Computer. So that's the STAC side of things, and then going up an additional level, for the Hub specifically, for compute, we do have R, the programming language, profiles as well, so you can start up R kernels and use all those libraries, that geospatial analysis tool chain. We also, and maybe this somewhat better answers your last question about lower code, we also have a QGIS profile where, if you want, you can go into that: it starts a QGIS server in Azure, you're still accessing it from your browser locally, but the QGIS compute happens in Azure, close to the data. So if you want that graphical user interface to the data, we also have a QGIS profile you can start up. So those are kind of the non-Python-centric options that users have today, and then again, it's open source: if you can work with the STAC metadata you can use whatever tool chain you want. I'll also add that, again, it is very
modular: we have commercial customers that just use the files and don't touch anything else, and that's fine; we have customers that use the metadata API through the Planetary Computer Hub, that's great, in the Python kernel or the R kernel or QGIS on the server, and if you have QGIS locally you can also connect to that, totally fine either way; and we also have commercial customers who are using the HTTP requests from their own virtual machines, doing whatever they want. That's the beauty of being modular in that sense. And we have the PC Explorer, which not only is great for people who do not want to code or don't know how to code, it is also fantastic for people who do code, to quickly grab that snippet of code that I showed, the lines of code for the region, and you can change it from Python to other things. The idea is again to provide a bridge from the no code and the low code to actually doing the work. This is a conversation I constantly have with a colleague, Matt, who is the one developing the PC Explorer: there is such a danger for the PC Explorer to scope-creep, because there are so many things you could do, and I think you have to put a limit on those, otherwise are you trying to make a QGIS or ArcGIS in the browser? Users are always going to need to then do something in whatever tooling they use, which is one of the questions Katie just passed on. And I would say, to your question, Katie, about academic users: roughly one-third of our users are academia, one-third are commercial, and the rest are mixed, so we're not just a research tool; if anything we may be more enterprise, and that's also what Microsoft likes, an enterprise-level platform that is also fantastic for academic and research use by design, again so as to minimize the gap from research to operations. Thank you. Yeah, it looks like
your response is really resonating with Katie, and I would imagine others as well. Kind of a detailed question, I mean a very specific question, but probably important here: does the Planetary Computer handle elevation and altitude data as well? Yep, so it's worth mentioning that we're using STAC for our cataloging, so it doesn't really matter what the data within it is; it certainly works well for raster data, but it works very well for essentially any type of spatiotemporal data. So if you have data stored in Zarr or netCDF that has multiple levels, that's totally doable, and you would search it normally through the STAC API and access it again normally through xarray or whatever n-dimensional array library you're using to work with that data. And then the thing that maybe is less figured out is how you visualize that in something like the Explorer, which is currently mostly focused around a single spectral band at a single altitude, but you could easily imagine ways to have a slider to adjust the elevation or the altitude based on your selection there. So yep, it works, it's totally doable, it's just that, depending on the exact nature of the dataset, some things might not work. Very good.
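As a hedged sketch of what that access pattern looks like once a multi-level dataset is cataloged: the store URL, variable name, and vertical dimension name below are all hypothetical; discovery would normally go through the STAC API, and selecting one altitude or pressure level is then just an xarray selection.

```python
import xarray as xr

# Hypothetical Zarr store for a dataset with a vertical coordinate.
store_url = "https://example.blob.core.windows.net/data/reanalysis.zarr"
ds = xr.open_zarr(store_url, consolidated=True)

# Pick a single level (e.g., 850 hPa); everything else stays lazy until computed.
level_850 = ds["temperature"].sel(level=850)
```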
Similarly, NASA has some datasets, specifically they've got open population raster data, SEDAC: what's the best way to get data added to the archive so it's available to all? Yeah, let's see, two answers. So first of all, we recognize that right now the
Planetary Computer team is responsible for maintaining relationships with all the data providers and, depending on the dataset, potentially doing the cloud-optimization conversion, potentially doing the STAC metadata creation, and then ingestion into our database. So there are a lot of things that we become the bottleneck for, and we're hoping to improve a lot of the tooling around creating STAC metadata and getting it ingested, and all the stuff around the metadata side of things and the sharing side of things, to make it easier for any group to share their data through STAC on Azure Blob Storage. So that is a thing that we're working on; for now the process is to reach out to us, I'll put a link in the chat in a second, we have a dataset request page that you can fill out and we can take it from there. It's just that all these datasets are unique, and they take a good amount of effort to get cataloged correctly so that they're usable by as many people as possible. And just to add to that: the best way to get your dataset on board, because again it's just blocked by us, it's just a very long tail of datasets, is to help us by making sure the data is in a format that is as cloud-optimized as possible, if that is possible, and that the metadata fields are clear, ideally with STAC GeoJSON or schema fields, things like that; that makes our lives easier. And then, most importantly, who would use this for what? Because if this dataset is interesting, that's great, but if it's a dataset where you have identified who would be using it and for what, it really helps us prioritize it. If it's on the order of petabytes, then the conversation might be a little bit harder; if it's less than that, the volume is not really that much of a problem. Also, whether it updates periodically or is a one-off: some datasets we have are updated every year or just once, some update every few minutes, so that's also part of the criteria we have for prioritizing. As I said, it's not really that there's a secret channel or something, it really is a matter of prioritizing until we develop easier ingestion. Very good, thank you. Let me see if there are any
particular questions that might be coming from panelists who would like to come online and maybe ask a question; we've worked through all the questions that are in the queue. Thank you very much, I mean, we've worked through a long set of questions and really appreciate the discussion. We do have time, I just wanted to see if there's anyone among the panelists who might want to come on. I will try to reciprocate and ask everyone: if you can, please share with us directly on that email any feedback you have, what you like the most, what you like the least, anything really helps. We're building the Planetary Computer and we are building it very openly, because we want to be as useful to you as possible, to make a change and to make the world a better place; that's literally what we're trying to do, and we have an amazing opportunity to shape it, so please be candid and reach out with feedback. Very good, that's a wonderful
offer, and, as you can see in the chat, folks really appreciate the time you've spent with us today, and really appreciate the two complementary perspectives that you brought to the discussion and to the conversation: the high-level, big-picture look at the whole platform, and also the technical details of the how and the why and the what. Just a fascinating and very thoughtful combination; thank you to you both for bringing those perspectives and working them back and forth so elegantly, it was very nice. I'm not seeing any additional questions, so, Hannah, can you bring us to the closing couple of slides? Sorry, putting her right on the spot; thank you, Hannah, I see it happening. Well, so, building on this series that we
have, we have some upcoming sessions scheduled: we're looking at the San Diego Supercomputer Center towards the end of September, and NVIDIA will be here early in October; Sandia's Center for Computing Research, that's a new one on our schedule, is coming September 22nd; Pittsburgh Supercomputing Center September 23rd; and then Esri coming in at the end of the month in October, October 21st; and we also have the Texas Advanced Computing Center coming in on the 5th of October. So we have a number of sessions in the next week or two, and then we'll go to Esri towards the end of next month. All of these are building on this series of questions that we've been asking, and that Tom and Bruno have been so kind to really wrestle with, with us and with each other, on how to think through some questions that are particularly challenging and timely for the SMD data and computing infrastructure project and study, and the whole move towards open source science: learning lessons from folks who've been doing this for quite a while, both inside NASA and elsewhere, but also thinking down the road about what we should be anticipating, what's coming at us. So thank you, Bruno and Tom, for helping be a part of this,
the study and this really important conversation. With that, yeah, we have a couple of nice thank-yous coming in in the chat; I think we're all set. Elena, would you like to come in? Yes, well, I just want to say thank you so much for speaking with us today, we really enjoyed your talk and all of your insights, so, yeah, thank you very much. Thanks. All right, with that, thank you everyone for your time on this Friday; we came close to 100, we topped out just shy of 90, but not bad for a Friday, and we covered so much of the waterfront, coast to coast literally. So thank you everyone, have a great weekend, and we hope to see you at a future session.