So, good afternoon, everyone. My name is Elena Stepanitis. I'm a program executive in the Chief Science Data Office at NASA Headquarters, and I am co-chairing the Science Mission Directorate's Data and Computing Architecture Study with Mike Little, who is also here in this meeting. I'm really excited that we have two speakers from the Microsoft Planetary Computer here to speak with us today, but first I just want to hand it over to Kevin Murphy, NASA's Chief Science Data Officer, for a quick introduction to what we're doing in this study and with open science at NASA.

Hey, everybody. First, can you hear me, Elena? Just give me a thumbs up. Okay, good. I do apologize for being late, so thanks for bearing with me, but I really did want to come and kick this meeting off, because I think what we're trying to do is going to be strongly informed by how we work with partners like Microsoft, and you are doing some really wonderful work with the Planetary Computer. So let's kick it off. Next slide, please.
We are conducting a data and computing architecture study because we recognize that the NASA systems we have traditionally used need to be upgraded to really take advantage of the high data volumes and the new technologies and capabilities that exist in the commercial world. We're doing that in alignment with our open source science activities, which we kicked off a couple of years ago, and now is the critical time for us to evaluate, at a fundamental level, how we are supporting open source science with our data and computing infrastructure. Next slide, please.

The scope of this study is really everything we do within the Science Mission Directorate for science. It includes scientific modeling and simulation activities; data processing, primarily from satellites and other non-Earth-based systems, including Mars and everything else; how we take data from level zero to higher levels; how we do analytics on that information; and how we integrate new processing and analysis techniques like AI and ML to support those really large data volumes. The capabilities we're considering include the commercial cloud environments we support, our high-end computing capabilities out at Ames in the Bay Area, and our scientific computing capability at the Goddard Space Flight Center. Next slide, please.
One of the things we have with open source science is this concept of open meetings, or at least hybrid open meetings, so that people can participate in how we develop our systems, which positions them to collaborate with us better in the future. Because we hold hybrid open meetings, we do have a code of conduct. It includes being respectful and considerate of how people work; valuing diversity; communicating openly and with respect for others, which includes critiquing ideas rather than individuals and avoiding personal attacks; being mindful of your surroundings and your fellow participants; and alerting us if you have any issues. We do not accept harassment, intimidation, or discrimination in any form, nor verbal abuse, and we list a number of examples of unacceptable behavior. Unfortunately we have run into disruptions in the past, and sometimes we have to deal with them. Thank you for listening to this part; I know it can feel a little cumbersome to go through, but I think it's really important that we have these discussions up front, especially as we hold more of these meetings. Next slide.

At this point I'm going to hand it over and let somebody else talk, but before I send it over to Peter, I'd really like to say thanks again for being patient with me as I came over, and I'm really looking forward to hearing and discussing what you all are up to.
Thank you, Kevin. I'm Peter Williams, and I'm going to help moderate the conversation, fielding questions and passing them to Bruno and Tom, our guests today. If you're interested in asking a question, or you want to see the questions already in the queue, the best way to do that is to submit it through NASA's I/O tool. Your questions go in anonymously, and you can upvote questions submitted by yourself and by others. I'll try to focus on the more popular questions, but I'll also do what I can to link questions that seem to be in a similar vein. You can use the QR code to get there, and I've also put the link in the chat. If you prefer, you can fall back to the WebEx chat; I'll do what I can to track those, and either Hannah or I will grab them from the chat and get them over to the I/O tool so they can be part of the same upvoting process. Next slide, please.
It's my pleasure to welcome Tom Augspurger and Bruno Sánchez-Andrade Nuño, who are going to be presenting today. They are both involved with the Microsoft Planetary Computer: Bruno is the director of Microsoft's Planetary Computer, and Tom is a geospatial software engineer on the project. They're going to provide us with an overview of the project seen through somewhat of an open source science lens. Hopefully we'll hear about user needs as well as business needs that will likely be relevant to NASA SMD's transition to open source science, picking up and building on work that has been happening within SMD and elsewhere in NASA, but also really leveraging the insights and experience of folks outside of NASA, in this case Microsoft. As part of that, we're also looking to hear about best practices that might be identifiable through their work, and insights regarding data and computing architecture, which ultimately is the primary focus of this study: we're trying to put together proposals for a design of the open source science data and computing architecture that will build on the good work happening to date and also set the stage for needed work in the future.

As I mentioned, Bruno is the director of Microsoft's Planetary Computer. He has a PhD in astrophysics and a rocket-science postdoc, led big data innovation at the World Bank Innovation Labs, and served as vice president for social impact at the satellite company Satellogic and as chief scientist at Mapbox. He has been a science policy fellow of the U.S. National Academies of Sciences and a Young Global Leader of the World Economic Forum. Welcome, Bruno. Joining Bruno will be Tom Augspurger. Tom, as I mentioned, is a geospatial software engineer working at Microsoft on the Planetary Computer. He's a member of the Pangeo steering council and a maintainer of several open source libraries in the scientific Python ecosystem, including pandas and Dask. With that, it is my pleasure to welcome both Tom and Bruno, and to turn it over to you.

There you go. Thank you, Kevin and Peter, and everyone else. It is such a pleasure to be here.
We've got some time, so hopefully we can answer all of your questions. First, I'll speak more on why we are doing this. Being director of the program means that most of my time is spent not on coding but on Outlook and PowerPoint, making sure that Tom and the rest of the team can actually deliver on our promise. Tom has the really cool title of geospatial architect, and I understand that a lot of the questions are going to go to him. As Peter was saying, I'm going to share my presentation, and while I bring that up: I understand most of you are at NASA Headquarters in DC. I did my postdoc at the Naval Research Laboratory with NASA funding, working on sounding rockets for the solar chromosphere, so it's kind of nice to be back in a way. I'm so glad to be here. The structure is going to be this: first, we'll tell you why we are doing this, which I think is important because it hopefully shows that one of our key values is being completely transparent about how we are building it. I'll give that presentation, maybe half an hour or so, then we can have some questions on the why and a little bit on the how we are building it. Then Tom will run a more demo-and-workshop session and hopefully answer all of your questions, as technical as they might be. So let's get to it.
Just from the start: when I was preparing for this talk, I saw that the goal for the Science Mission Directorate is to coordinate cloud-based, high-end computing to capture the directorate's computing needs. You could almost copy and paste that, and it would be the why, and the when, of our proposal to build the Planetary Computer. In a sense, the point of what I'm saying here is that today the Planetary Computer already looks very similar to what you want to do, so you are likely to choose the same technologies we chose, and not because we chose them: we did not invent any language, we did not imagine a new architecture. The entire Planetary Computer is open source, and the choices we made about the tools we use were made because the community decided, or we understood that the community had decided, that that was the way to go. The standards we chose and the computing environments we chose are all based on that. Part of the reason we built it that way is to ensure the least amount of friction between knowledge creation on the academic side and the application of that knowledge to operational dependencies. Because at the end of the day, I think we all agree that a lot of what you are doing, and what we are doing, is extremely critical for facing issues like climate change or biodiversity collapse. But it's not just saying what's happening, it's not even just putting numbers into peer-reviewed articles; it's then figuring out, okay, how does the government use this, how do commercial clients, non-profits, NGOs, all of civil society use this? That's why we're building this completely in the open and trying to minimize those frictions. I know some of the key questions you sent me, including some about business needs; I've already spoken to some of those, but I really want to go through the summary of why we're doing this and the environment this comes from.
Our needs, or the business needs, are that we see pretty much every single company, every single NGO, and every single government looking at sustainability and at issues like extreme events, driven by climate change, causing a lot of damage, or at ESG and related frameworks. There is a lot of need to understand what's happening, and that in many ways means an extremely demanding computational environment that is hard to scale. So the user need is to make it as simple as possible, but at the same time as technologically advanced as possible, so you are working with the latest available data and the latest available frameworks. That's why we chose the architecture you're going to see in a second, and we can come back to these questions. But as I said, at the end of the day what we want is to build this queryable Earth, and when I say "we" I don't mean Microsoft: we as a society, we as a planet, should be able to figure out how to build this queryable Earth, how to ask questions of what is where, how much is there, how much is it changing, what could be there, what should be there. It is in our collective interest to figure out how to do it, and in a way that is as open as possible. The intent is not that we build this and no one else does; it is the opposite: the intent is that everyone knows how to build these things. Of course, when it comes to large scales, planetary scale, it is going to be hard for most entities to serve or create these repositories, so they are going to come back to institutions, like in Europe, or to cloud providers like us, and we can coordinate and build similar things that are beneficial for everyone.
These few slides are a little more on the why. I don't think this audience, which by the way is almost hitting 100 people (thank you, everyone, for being here on a Friday; hopefully we hit that 100 mark), really needs the case made for why this matters. What I would mention, though, before we get into the Planetary Computer, is what we see in environmental sustainability, and I would argue you could say the same for science itself; a lot of sustainability work is underpinned heavily by science, so that is not surprising, and when I say environmental sustainability you can read science as well. We are seeing that it is increasingly complicated, that it is increasingly recognized as a dependency and an opportunity for more stakeholders, from governments to commercial entities, and that it is all interrelated: you cannot think about sustainability or science and not think about nature, people, and livelihoods. On the first of those three points, it is increasingly complicated because we have way more data and it is way more complex; we have way more tools that are also more complex; and hopefully we also have more questions, and the questions are complex. All of this means that it is really hard to find someone, or an institution, who knows all of those things: who knows about AI, and also about geospatial, and also about sustainability, and also about preservation. It is getting harder, and that's why focusing on making it as easy as possible, putting together the people who are experts in these fields, and combining all of those needs into the same platform, is critical.
The second point is that it is increasingly recognized as a critical dependency. I'm sure this audience knows the Sustainable Development Goals; there are the ones that are obviously related to sustainability, but you could argue that pretty much every other one is also related, from food to energy to infrastructure and innovation. If you prefer to talk about money: the World Economic Forum has identified the top risks by likelihood and by impact, and half of them are also related to environmental sustainability. So this is not only a moral call; it is also about the risks to our socio-economic way of doing things. You can also see it in the amount of attention, interest, and funds this is attracting; there is a quote from The Economist to the effect that everything in climate and sustainability is hard, everything except raising capital. More and more funds are willing to invest in these issues, which is great, because we need all of it. The third point is that it is interconnected. This is basically the top line of the joint report between the top experts on climate change, the IPCC, and the top experts on biodiversity, IPBES, and again, it's probably not a point that needs making for this audience: it is very much interrelated. It's not only about nature; it's also about people. For example, if you protect a section of the ocean, the spillover of species outside the protected area yields more fish, more catch, than if there were no protection in place. So nature-based solutions and protecting the environment are not only the right thing to do; they are also an opportunity for business and for others.
So why is Microsoft doing this? I'm going through this quickly because I know you are all interested in how we built the Planetary Computer, but I think it's important to know the why. Microsoft cares about its sustainability commitments; it has four top commitments for 2030: carbon negative, zero waste, water positive, and building a planetary computer. It's that important for us. It's not that we want to build the Planetary Computer as a product for Microsoft like other products; it's that we believe the technologies for addressing all of these issues should exist, and that is why we built the Planetary Computer. So let's get into the meat of it. Ideally you go from data to decisions: you have some data in storage, you process it, you do some analytics, you have applications, and you get results. That's a very simplistic view, because in the end there are so many types of data, so many types of storage, decisions, locations, formats, and analyses, and you end up with an incredibly scattered, disconnected array of data, storage, formats, and everything else. You could try to put everything together, standardize it, put it all in one place, and that should allow a more frictionless process from data to knowledge. That is the Planetary Computer; that's why we built it.
So we're building a foundational architecture for a queryable Earth, as I said before, basically a digital twin; that is increasingly becoming the namesake for what a lot of people are offering as a service, and basically a planetary-scale, cloud-native environment. I should note that I had this same conversation this morning with another public entity that is a data provider, and I'm also having this kind of conversation with the European Space Agency as they are thinking about Destination Earth, DestinE. And this is the meat of it: as I've said a few times already, our hope is that we can share all of this. We've built the Planetary Computer completely open source, with open data, and by sharing what we're doing we can coordinate and build things together, because it would be extremely beneficial for everyone if, for example, we could share formats and data and minimize the duplicated work that everyone is doing.
So, the Planetary Computer is four things, and you can think of them as abstraction layers. We have a data catalog, which is basically a pile of files. We try to keep the files as untouched as possible, but we also try to have them in cloud-native formats, which allows range requests: for example, if you have a big file, you don't need to download the whole file to get a small section; you can just request that section. For rasters, this means Cloud Optimized GeoTIFFs; for vectors and polygons it's a bit more of a conversation we should have. That's the idea: put all the files in the same data center and have this big pile of files.
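As an illustrative sketch (not from the talk), a windowed read from a Cloud Optimized GeoTIFF over HTTPS might look roughly like this with rasterio; the URL is hypothetical, and Planetary Computer assets additionally need a signed token, which comes up later:

```python
# A minimal sketch of the "range request" idea: read one small window from a
# large Cloud Optimized GeoTIFF over HTTPS without downloading the whole file.
# The URL below is hypothetical; any publicly readable COG works the same way.
import rasterio
from rasterio.windows import Window

cog_url = "https://example.blob.core.windows.net/data/scene_B04.tif"  # hypothetical

with rasterio.open(cog_url) as src:
    # Only the header and the bytes covering this 512x512 window are fetched.
    window = Window(col_off=1024, row_off=1024, width=512, height=512)
    chunk = src.read(1, window=window)

print(chunk.shape, chunk.dtype)
```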
We then ingest all of those files into one metadata database. We have some 52 petabytes of data stored in this one data center, that's the big pile of files, and every single byte of those petabytes is indexed with metadata: what is it from, when is it from, what is the cloudiness, or whatever other characteristics it has. The core of the Planetary Computer is that metadata API, so you can say "give me data for Washington, DC, when it's not cloudy, in 2024," be it Landsat or Sentinel or whatever it is. Then, architecturally separate from that but in the same data center, we have a compute environment. If you know Pangeo or JupyterLab, that's it; there is nothing special we do to it. It's a deployed Pangeo instance that we make available to everyone who wants to use it, but we also have recommendations, which I think Tom is going to cover, for deploying your own instance so you have full control of the compute. There's no extra cost beyond the running cost of the thing if you deploy it yourself, and if you use our hub environment there is no cost for that. I forgot to mention that the metadata specification is STAC, the SpatioTemporal Asset Catalog; it's an open standard, and we basically have a pgstac database with all of that.
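To make the "give me data for Washington, DC, when it's not cloudy" idea concrete, a rough sketch of such a metadata query over plain HTTP is below; the search endpoint is the Planetary Computer's public STAC API, while the collection name, bounding box, and cloud-cover filter are illustrative assumptions:

```python
# A rough sketch (not from the talk) of a STAC metadata-API search over HTTP.
import requests

search = {
    "collections": ["landsat-c2-l2"],          # assumed collection id
    "bbox": [-77.12, 38.80, -76.90, 39.00],    # rough Washington, DC bounding box
    "datetime": "2021-01-01/2021-12-31",
    "query": {"eo:cloud_cover": {"lt": 10}},   # "when it's not cloudy"
    "limit": 5,
}

resp = requests.post(
    "https://planetarycomputer.microsoft.com/api/stac/v1/search", json=search
)
resp.raise_for_status()
for item in resp.json()["features"]:
    print(item["id"], item["properties"].get("eo:cloud_cover"))
```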
Then the last layer is the applications; this is more or less what we, and also users and customers, do with that data. They can connect directly to the files, to the metadata API, or to the compute environment; it's modular in that sense, by design. Here's a bit of an architecture diagram that hopefully helps you understand what we're doing. Right now, if I'm not mistaken (Tom, please correct me later if that's not the case), we've done the painful work of going to 84 sources, some of them NASA or USGS or NOAA or Sentinel from Europe, and asking: what open data do you have? For now we are focusing on open data. We ingest it, convert it to a cloud-native format if it isn't already in one, and put it in the same data center. When we do the ingestion, we use the STAC specification to create the metadata database. The green arrows on the right are the ways a user can consume the data. If you go through the authentication API, it basically just gives you a token to access the blob storage; anyone can get an account for this. So you can go directly to the files if you know which file you want and everything about it. If you want to use the metadata API, you can consume it by asking questions like "give me data for Iowa," or whatever it is. We also have the data catalog, which is actually a website that consumes the metadata API, presents it nicely, and gives you documentation and examples and things like that. We also have a tiler, which is mostly for raster data; it's basically TiTiler, for those who know it. It gets the data from blob storage, the metadata API, and the files, and streams it to your browser, to the internet, so you can consume it quickly.
I'm happy to do a demo right now, but the idea is that you get the same experience as when you go to Google Maps or Bing Maps or any other rendered map: you just select the data, the characteristics of the source, the times, and the metadata, and on the fly, most of the time in less than half a second, you get a map that looks like one of those online maps, except it is rendered directly from blob storage through this tiler architecture. That's the data API, and that's the PC Explorer, which is an application of the Planetary Computer; if you go to planetarycomputer.microsoft.com you'll see all of those things. And then in the same data center, but separated, as I said before, we have the compute environment, which is the Pangeo deployment. To give you a sense of what we're doing right now: it's about 52 petabytes of data stored, roughly one and a half petabytes of egress per month, and about 2 billion calls to the metadata API. That last number is key: it means our users are not downloading whole files, which is great; they are only downloading the little bytes of information they need for their request. So they get their outputs faster to do their work, we pay for less egress, and everyone's happy. That's the trick of this metadata API, which I believe is at the core of this whole architecture.
We also track the compute environment we manage, in the CPU hours it serves, and that is only the one we manage; a lot of clients are deploying their own compute environments, and the STAC specification is key for that. There is a recent blog post from Planet where they go through this and find that we are already the largest index of STAC metadata assets, which to me is, okay, great, but at the same time it makes me think: most of the assets all of these organizations have in their archives are the same. Everyone is indexing their own Landsat, everyone is indexing their own Sentinel, and I wish we could coordinate better and share, maybe through a federated STAC environment or network or whatever you want to call it, so that you not only know where the data comes from (one of the things we try to put in the metadata API is the source of the data, what has changed, the provenance of the data) but we do much better than that. I'm hopeful we can talk about this later on the call. This is a little bit of the catalog and the Explorer; I think at this point it's better if I just quickly do a little demo. While I switch over, Peter, I don't know if there are questions so far. I'm just going to the browser, Microsoft Edge, there you go. Peter, you haven't said anything; are there any questions?

There are some questions, but I'll kind of leave it up to you.

Okay, let me do the demo. There we go.
As I said before: pile of files, metadata API, compute environment. The data catalog is how you see this pile of files. There are 86 sources, and of course Landsat is one of them. Every single collection has an overview and then the assets we are hosting, and each of them has an example notebook you can start from so you can get working quickly, plus the providers, the license, a related paper if there is one, and the spectral bands. All of this is actually coming from the metadata API itself; it's just rendered nicely in the browser. And as you see here, we have this "Launch in Explorer," which you can also get to from here; that's what I talked about before, which kind of combines everything, and this is what I'm going to show you quickly. You can go to Pakistan; as you know, there have been floods there recently, and I also know that radar is really good for detecting flooded areas. We process Sentinel-1 GRD into radiometrically terrain corrected data; it's one of the few datasets we also process ourselves to create a product, which is released openly. And this is the latest one; as you can see, it is already rendered on the fly. These are the results of the most recent Sentinel-1 passes since the flood happened a couple of weeks ago. We can go back and select an end date, and these are the results; these are the assets that correspond to what I described, and I can click on each of them and get the metadata for them. I get the assets to do whatever I want with; I get code to copy and paste into whatever analysis I'm doing; but I also get the code of the call itself, saying "hey, what do you have for this region, these date ranges, this metadata specification," and you can copy and paste that and use it in Pangeo and JupyterLab.
And what you see here is what I mentioned before: this is rendered on the fly as I move the mouse around, in less than, you know, half a second. I can also say I want to see the comparison between now and before, which is four clicks, or just changing one line of code, and then I can see that this is indeed the situation, and I can toggle in and out of the massive flood that happened there. You can do exactly the same thing for any of our sources, and you can also share this link with anyone so they can see exactly the same view; these days you can very quickly share a snapshot of the situation like that. I know there are others; I know NASA has invested a lot of time in similar tools to this Explorer. What I really like about this one is the frictionlessness: it helps you navigate from exploration to analysis with these little tidbits of code. Then there's the documentation: as I said before, everything is documented, including how to read from STAC; every dataset has its own example notebooks, and there are also tutorials, which I think Tom is going to cover in a second. But this is it; this is the Planetary Computer. It's a geospatial platform built modularly on a pile of files, a metadata API, and a compute environment, and it is ready for you to clone if you want, though you probably don't want to clone 50-some petabytes of data and maintain the metadata, and that's why it becomes a service from a commercial company like ours at Microsoft. I'm happy to answer questions; the more technical part starts afterwards, when Tom goes over it, but if there are questions about why we're doing this, or the strategy behind it, I'm happy to get into those.

Yeah, wonderful, I think that would be great.
We do have a couple of questions that really fit the "why," as you said. Thinking about either existing or anticipated user needs and business needs, which is that first question you showed and is getting a lot of attention today on the list, which ones would you lift up and really highlight for the purposes of this study, as potentially guiding NASA's Science Mission Directorate in its pursuit of data and computing infrastructure, especially infrastructure to support open source science?

I thought a lot about that.
What I see more and more, hopefully or thankfully, is that there is more attention to climate change, and specifically to extreme events, because they are the biggest driver of economic losses, and assessing the risks of climate change and these weather events is critical and is something very much connected to science. Because if it's just consumption of geospatial data, and correct me if I'm wrong, I don't think it's core to the NASA Science Mission Directorate simply to be the provider of that data: you're producing it, people are using it, fantastic. But we are not there yet, because we need better science and better dissemination. I think climate services and climate risk assessment are right there, and specifically: we have a lot of science already, but it seems really hard to go from academic settings and papers to, okay, I'm company X and I want to assess my climate risk, whether it's floods or droughts or whatever it is, so how do I convert that scientific knowledge into my operations? That's why I like the idea that if you build an output, say flood risk, like we did with one of our datasets on global flood risk due to tidal storm surges and sea-level rise, then people are going to have questions: it's great to have the dataset, but they will ask how you did it, and what if you did it slightly differently. So the idea that you could have the exact environment that produced that output, ready to be deployed on the customer side, or by the government or the city doing the work, and then adjusted to their own needs, because they have different assets or a different source of data, is very powerful. Technologies like Binder (Tom, I don't know whether you're going to talk about it or not) capture this: the idea that you can deploy an environment, a complete workbench, to do that, not only the code but also the infrastructure, infrastructure as code. To me that is extremely powerful, because it would serve the need of anyone who is thinking strategically, and the need of everyone who is going to have to respond to regulations and disclose these climate risks, and this is an area where I would say the market is very immature. Carbon offsets are the same thing: I like to say there is a lack of meaning. What do these carbon offsets mean, what is the additionality of carbon offsets, how are all of these things measured? I think the Science Mission Directorate is core to helping the world answer what they mean and what they measure, and then to maturing those markets. If I had to pick one, it would be that one, Peter.
Great, thank you. Maybe one more, and then I want to leave plenty of time for Tom. I'm going to skip a question here that I think Tom is probably going to speak to, since there are some questions that are a little more on the technical side. But Bruno, before you close out this piece, there's a question around issues that a successful data and computing infrastructure may need to anticipate, and related to those issues, maybe opportunities. Are there any particular issues you would suggest NASA really pay attention to in thinking about data and computing infrastructure?
I think the FAIR principles are at the core of what I know NASA is already doing. In the open science work I've seen many mentions both of FAIR and also of CARE with respect to indigenous communities. Right now, data is often not findable, for example, and it's not interoperable; on all the letters of that acronym, FAIR, we are not there yet. If you cannot find the data easily, or if you cannot connect from whatever environment you're in because you would need to adapt to a closed environment, or to a format that is not an open standard, then it is not interoperable. So what I would say is: please, everything you do, make it open, make it based on open standards, and make sure it is findable. I would argue that discoverability of metadata goes a very long way.

Very good. Thank you, Bruno, and thank you for that presentation.

Well, thank you, and I forgot to say that a colleague from the team is also here; I think she's on the participant list, but I just wanted to give her a shout-out that she has also joined.

Wonderful, welcome.
And with that, perhaps we can turn it over to Tom? Sure, take it away.

Let me do a few things in the chat, which hopefully will go out to everyone. I just put in a link, aka.ms/pc-nasa, which will take you to a JupyterHub that I set up for this, and I'm going to share my screen. If you want, you're more than welcome to go ahead and go there. It's going to ask you for a username and a password; I've forgotten the password, but I think it is NASA; let me just double-check. Yeah, so it's going to ask you for a username, so put whatever you want there, and then I'm pretty sure the password NASA is going to get you in. Yeah, that should do it. So use a unique username, and the password, again, is NASA, which I'll put in the chat as well once I find that screen; WebEx has rearranged my windows, sorry. Where did it go? Okay, here we go, chat: the password is NASA.
While you're doing that, thank you, I'm going to show just a couple of things as your servers are spinning up. I should mention briefly: this is a JupyterHub that I deployed for this session, running on Azure Kubernetes Service, so it's like any other Kubernetes service out there, and the idea is we're going to go through some data-analysis-type workloads, fetching some data from the Planetary Computer and then making some pretty pictures with it. A couple of things to mention. Bruno covered the components of the Planetary Computer; we'll be mainly interacting with the STAC API, which is actually the same thing this HTML page is generated from, but we'll peek at the data catalog briefly to understand what data is available. I'll also mention a bit about the setup we have here while your servers spin up. The main idea with our compute is that we don't really care how you do the compute, as long as it's on Azure; that is essentially the big thing, and it's less for a make-money reason than because it is flat-out the most efficient way to get to the bytes. The bytes are all in a storage container (it's like an S3 bucket, but in Azure Blob Storage), all in a single data center, and if you want the fastest, most efficient access to the data, you're going to want to put your compute in that exact same data center, which happens to be the West Europe Azure region. So that's the setup we're connecting to: I'm here in my local browser, you all are in your local browsers on your own home networks or whatever network you're on, and then running inside Azure, in the same region as the data, we have this JupyterHub that we're going to connect to. When we're doing compute, when we're pulling data from the blob storage containers, we'll have a nice high-bandwidth connection for those large datasets, and the results we bring back to our local clients, a summary statistic, an image, a plot, are going to be much, much smaller, so those are fine to send over the public internet. We also have Dask here.
Depending on time (I don't actually know exactly how much time we have, so I'll wait for you to interrupt me once I go on too long), what we have Dask here for is scalable computing. It's just one of many, many ways to do scalable computing on Azure; I happen to be most familiar with it.
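As a hedged sketch of what that looks like in practice: on a hub that has Dask Gateway configured (as the Planetary Computer Hub does), spinning up a cluster is roughly the following; if your hub uses a different Dask deployment, the cluster class will differ.

```python
# A minimal sketch of scaling out with Dask from a hub configured with Dask Gateway.
from dask_gateway import GatewayCluster

cluster = GatewayCluster()        # picks up the hub's gateway settings
client = cluster.get_client()     # Dask client bound to this cluster
cluster.scale(4)                  # request 4 workers; adaptive scaling also works

print(cluster.dashboard_link)     # watch tasks run in the Dask dashboard
```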
Hub uh we're actually using a separate Hub just because uh I didn't want to all of you to have to
deal with like um uh signing up for an account if you are interested uh we'll share the link at
the end but um you can sign up for an account and we can get you all approved but for now I just
set up a temporary one that you can all log into um okay I think that's it for introductory
stuff hopefully uh everything's going okay um I actually don't know if if you all can chat
but if you are having issues then uh somehow no uh alert me that uh stuff is breaking uh and I'll try
and fix stuff but I'll assume uh some people have successfully logged on um the just to mention
like this setup here you know one of the nice things about the cloud is the kind of pay as you
go Computing uh scale down to zero so uh it might take like a few minutes um I had it completely
scaled I had this completely scaled down but as you request a uh request a a pod a notebook
server here it's gonna automatically start up virtual machines and then uh then your thing will
come in okay we got some time great thanks Peter um excellent so you should be seeing I
Excellent. So you should be seeing (I apologize for the unrendered markdown) a page that looks like this. We're going to go through an example here, starting with this one about using the STAC API. If you all could click on the folder icon here: by default you're in the Planetary Computer examples repo, then go to "quickstarts," and then (I'm going to make this just a tiny bit smaller) you want "reading STAC." There's also a "reading STAC - R" example, but we'll be using Python, since that's the environment I have selected here, although STAC, as we'll see, is a cross-language standard. Okay, hopefully things are working for people; I'll assume they are. Great.
So, like Bruno mentioned, we have all this data in Azure Blob Storage, which is great; it's a lot of work getting the data there, with great help from our partners, but just having the files sitting in blob storage is not enough, we think. It's still too difficult to use that data. Just think of the simple example of "give me all the Landsat images over Washington for 2020": you'd have to be very familiar with USGS's particular naming scheme, how it encodes the WRS paths and rows, the various levels and processing modes and so on, to figure that out. So to avoid that pain, what we use instead is STAC, the SpatioTemporal Asset Catalog. It's an open specification for cataloging spatiotemporal data. Previewing a question I saw about whether the Planetary Computer is focused on just Earth data or whether it can be used for other bodies, like the Moon or Mars: people have adapted STAC to make it work for other bodies. I don't know exactly how that works; I think they have to do some hacky things with coordinate reference systems, so in principle it could, but currently all of our data is Earth-focused. Okay.
So we'll go ahead and go through this: the Planetary Computer STAC API. We're going to be using pystac-client. STAC is a whole standard for how to catalog spatiotemporal data, but by far the most useful thing it does is let you search that data, actually query it. That lets you do things like, in this case, look at an area around Microsoft's campus in Redmond, Washington; we have that bounding box there, and we're interested in scenes from December 2020. The query I was posing earlier becomes pretty straightforward to write, in this case in Python code. So we make that search, and we get back the eight items that match our query for Landsat scenes over Redmond, Washington, in December 2020.
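The notebook's exact code isn't reproduced in this transcript, but a search along those lines looks roughly like the sketch below; the bounding box, dates, and collection id here are illustrative assumptions:

```python
# A rough sketch of the search described above (the exact notebook code may differ).
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)

search = catalog.search(
    collections=["landsat-c2-l2"],               # Landsat Collection 2 Level-2
    bbox=[-122.16, 47.63, -122.10, 47.69],       # assumed box around Redmond, WA
    datetime="2020-12-01/2020-12-31",
)

items = list(search.items())
print(len(items), "items found")
```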
Yeah, if you do have any questions, throw them in the chat and I'll try to answer them as we go. So we got those eight items matching, and you saw it returned really quickly, in less than a second or two; we haven't actually loaded any data. What we've done is use the STAC API to query the metadata for scenes that match our query. If we look at those items, they're GeoJSON features, so they have things like a geometry and all the other stuff you'd expect from a GeoJSON feature, and then they're extended with a whole bunch of other information: the platform it was captured on, the datetime or date range it was captured over, projection information, all the useful things you need to work with this data, including the data provider, in this case USGS, and an estimate of cloudiness for each of these scenes. So you can very quickly filter out cloudy images and select the least cloudy one.
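Continuing the sketch above, the "pick the least cloudy scene" step can look like this; it assumes the items carry the standard eo:cloud_cover property, as Landsat items on the Planetary Computer do.

```python
# A small sketch of filtering and selecting by cloudiness, continuing from the
# earlier search sketch (the `items` list).
clear_items = [i for i in items if i.properties["eo:cloud_cover"] < 20]
least_cloudy = min(items, key=lambda i: i.properties["eo:cloud_cover"])
print(least_cloudy.id, least_cloudy.properties["eo:cloud_cover"])
```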
But STAC, I should mention, is a metadata standard; it's all about linking out to the actual data. The items we got back have one or more assets on each item, and each asset is an individual file. In this case, Landsat Collection 2 Level-2, most of the interesting data assets are the Cloud Optimized GeoTIFFs that we can access really efficiently, and you also see things like the metadata files linked there as well. One of the assets we can use here is the rendered preview asset; you can see it's a link, and actually a link to a data API. This is the same thing powering the Explorer that Bruno used: the Explorer uses the STAC API to query (it looks at where your window is, over Pakistan say, gets the bounding box in latitude and longitude, and uses that to make the queries), and then the data API is responsible for returning the actual images that match your search. So in this case, here's our least cloudy item over Redmond, Washington.
We can also access the data, after we do one thing: we need to sign it. The actual assets themselves are in private storage containers, just to keep an eye on egress, but we do allow anonymous signing of tokens. You all didn't sign up for Planetary Computer accounts, and we didn't provide any kind of API key here; all we need to do is make a request to the Planetary Computer's SAS API, and that gives us a token we can use to read the actual data. So we'll sign the item, which makes an HTTP request in the background to the SAS API, and it gets us back a signed href: a URL that has the typical pieces for Azure Blob Storage, a storage account, a container, the path to the Cloud Optimized GeoTIFF, and then everything else is a read-only token. Now you can pass this URL off to anything that can read data over HTTP. In this case we're using rioxarray to read it into an xarray DataArray; we can also use things like QGIS or rasterio, and R uses GDAL, or something built on top of GDAL, I think it's stars, I can't keep up with the R community. But anything that can speak HTTP can now access the data.
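As a hedged sketch of that sign-then-read flow, continuing from the earlier search sketch; the asset key used here is an assumption about the collection's asset names:

```python
# A sketch of anonymous signing followed by a read over HTTP. The asset key
# ("red") is assumed; check the item's assets for the real names.
import planetary_computer
import rioxarray

signed_item = planetary_computer.sign(least_cloudy)   # SAS token added to hrefs
signed_href = signed_item.assets["red"].href          # URL with read-only token

red = rioxarray.open_rasterio(signed_href, overview_level=2)  # small overview read
print(red.shape)
```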
And the last thing worth mentioning is that you can really efficiently make data cubes out of these STAC items. The items themselves have enough metadata that we can mosaic them together in space and stack them through time, with all our bands, to very quickly create a data cube, which I think is just great. We've gone from thinking about low-level details, like the exact naming scheme USGS uses for these files, how to lay them out in space and time, and how to read them and understand their spatial extents, to not worrying about any of that. Instead we use higher-level concepts like searching by space and time and cloudiness, get back a bunch of rich metadata that describes the assets, and based on that metadata alone we get these really nice, convenient, high-level data structures, like a dataset that we can work with to actually analyze the data.
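One way that data-cube step can look, as a sketch: stackstac is one library that builds xarray cubes from STAC items (odc-stac is another); the asset names and resolution here are illustrative.

```python
# A sketch of building a space/time/band data cube from signed STAC items.
import planetary_computer
import stackstac

signed_items = [planetary_computer.sign(item) for item in items]

cube = stackstac.stack(
    signed_items,
    assets=["red", "green", "blue"],   # assumed asset names
    resolution=100,                    # meters, in the items' projected CRS
)
print(cube.dims, cube.shape)           # e.g. ('time', 'band', 'y', 'x')
```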
I'm going to skip through most of the rest, just for time, because I do want to get back to questions, but you can search on additional fields like cloud cover; this varies from dataset to dataset, and we'll see an example of it in the next notebook. I'll skip through all of this STAC metadata material, but I briefly wanted to show that STAC isn't specific to Cloud Optimized GeoTIFFs or raster data. STAC only cares about files; it's all about linking to assets. In this case we're using Daymet data, daily North America Daymet, and if we look at the link here, it's a link to a file (well, actually a directory) in Azure Blob Storage that is a Zarr store, and we can go ahead and load that up very similarly to how we did the other example. So STAC is a very flexible metadata standard; for the most part it's really just focused on spatiotemporal data. If you have data with a spatial footprint and a temporal timestamp or range, then STAC is a great way to catalog it.
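A hedged sketch of opening a Zarr-backed collection like Daymet with xarray is below; the collection id and asset key are assumptions about the catalog's naming, so check the collection page for the real ones.

```python
# A sketch of loading a Zarr-backed dataset (like Daymet) referenced from STAC.
import fsspec
import planetary_computer
import pystac_client
import xarray as xr

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
collection = catalog.get_collection("daymet-daily-na")             # assumed id
asset = planetary_computer.sign(collection.assets["zarr-https"])   # assumed key

ds = xr.open_zarr(fsspec.get_mapper(asset.href), consolidated=True)
print(ds)
```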
Okay, I'll pause there if there are any questions I can answer now, and then we'll jump into a more fun example that will take 10 or 15 minutes.

Very good, and there are some questions. Folks have been pretty interested in some of the details you've been covering, but let me take a step back, actually. There's a question here that's a little broader that I think you'd be able to speak to really nicely, and it has to do with best practices: are there particular best practices you would highlight for an open source science architecture, for data and computing infrastructure for open source science, or for open source science operations?

Yeah, that's awesome; let's see.
I would say, and my background is in open source, which is, you know, woefully under-maintained, with maintainers who are always super stressed out and burned out, that as much as possible, and I understand people are busy, you should be involved with the open source libraries you're building on. It can be hard to justify, especially in the short term, spending time on open source, not necessarily even working on features, but just being involved in the discussions there. I think it can be really valuable, both for you, because you understand where the projects are going, and for the community, because your feedback as a user applying this in practice at scale, as NASA is doing, can be super valuable. The other thing is that the hardest part about open source is that you can do anything, but the most successful open source stories I've seen have always been about individuals who bring groups together and can coordinate: groups that might have different priorities but have enough overlap in those priorities that they would all benefit from coming together. So, yes: be involved, and as much as possible be especially involved on the coordination side of things.
Great. So, building on that, let me shift to existing approaches to open source science and the data and computing infrastructure to support it. Are there any particular approaches you would suggest, or really want to flag, for the NASA work here?

Yeah, I think Pangeo is the go-to example: a group of people who were just trying to do geoscience on the cloud hit upon this idea of a JupyterHub deployment in the cloud, using Kubernetes, that scales with Dask. I think the initial version was hacked together in a weekend by a few people at some conference or workshop, I can't even remember which one, but that idea has gone a long way toward the work they've been able to do, and it's what we do here with the Planetary Computer Hub that we provide; lots of people are also deploying their own hubs to customize the software environment or various things around it. So I think that is a good go-to example. The benefits are that you don't need to expose every individual to, say, the challenges around cloud subscriptions, getting the billing details right, or deciding which service to use; you save them from that and present them with a simple login and an easy way to scale their compute. That said, there are tons and tons of services that go beyond the core interactive computing environment that JupyterHub handles so well. That is both its strength and its weakness: there are fewer options around things like job scheduling and batch-workflow-type work within JupyterHub itself, and how you complement those with other open source or cloud technologies is kind of a different can of worms. So Pangeo, I think, is a good place to start.

Great, thank you.
Maybe one more question, in a similar vein to the previous two. You've worked on open source science, and reproducibility is a key aspect of open source science. How is reproducibility of results, and by extension decisions, supported by the Microsoft Planetary Computer?

Yeah, there are a couple of answers. At a very surface level we can say it's ideal: you can have a notebook, you can share a link to it, and, like Bruno showed, we have those examples where you can click a button, launch it in the Hub, and be off and running. I don't want to undersell that; it's not nothing, it's a good accomplishment and a good first step. But it is just a first step. If you want to fully lock down the software environment, and the services running in the background that are potentially being used but aren't necessarily encapsulated in that shareable link; and then there's the data: what happens if USGS decides to reprocess some scenes? What happens to the data? Do we update it to follow USGS, or do we make a new version, and how do we do all of that? So I think there are tons of questions around some of the trickier problems of reproducibility that I don't think we, or anyone else, have a good answer for yet. If you're interested in this, I'll bring it up here: there's an interesting discussion and working group forming around this on the Pangeo Discourse; let's see, yep, I'm going to post this in the chat. I think it's a good summary of where things are at. This one is maybe a bit more focused on education, but I think that's a prime example of reproducibility.
Great, thank you. I know you have a couple more slides you wanted to get to, but let me ask one more question here that is really kind of fascinating; thank you to the folks who put it in. If we were to share NASA data and software on the Planetary Computer, how would this broaden accessibility to communities who ordinarily would not be able to use NASA data and tools?

Yeah, I think there are a few things there. First of all, there's just the bandwidth question: if you have the data on some server, an FTP or HTTP server or whatever, even if it's publicly accessible, bandwidth can be a challenge, especially at scale. If you have data in the cloud, then there's at least the potential for anybody to use it, because there's the option of locating the compute with the data. That shifts the question a bit to how those people get access to compute, and with the Planetary Computer you just sign up for it, similar to lots of other services. So I think that's a good first step toward broadening that access. I think Bruno might have something to add as well; happy to have you chime in here.
I'm actually at my mom's place right now, and it's almost a dial-up connection. If I didn't have the Planetary Computer, I would need to go to one of the providers, download the file, open QGIS or R, and it would have taken me, I don't know, three hours, if I could do it at all, versus the ten seconds it actually took, because most of the work happens in the cloud itself, not in a closed sense but in a helpful way. So people with slow bandwidth, or who are far from the cloud, can still access this. The other side is that it then becomes Microsoft's own interest: now that we host the data, we have a strong interest in making sure people use it, and if you think of all the clients our company has, it's a tremendous platform for making sure the datasets we host actually get used, because our incentive is now for them to be used. There are a lot of datasets we could bring on board, and the criterion we have so far is whether they are useful; and because we don't really know how to measure whether they are useful, our proxy is whether they are used. So our metric of success is to make sure these datasets are used, and our field teams, there are tons of people, are now trying to figure out, hey, who can use this data? So, a long way of saying: it becomes our incentive to make sure this data is used.
Bruno I think that additional perspective is is really helpful um and and I love the example
of your mom's house right now I mean those are those are uh really important comments and we
have a couple folks in the in the chat who are agreeing with that um Tom I know you had a couple
of other things you wanted to to cover why don't we turn it back to you for that sure and I'll try
and be quick just to get to some more examples uh or sorry more questions um if you jump back up a
level, so you were in Quickstarts; if you go back up a level we'll go to Tutorials, and then there's a fun, well, pretty Hurricane Florence animation example. So that's again under Tutorials, this Hurricane Florence animation; you can check out what we'll be making, but we'll actually do that live here. Yeah, so the idea behind this one is based off an example from Pytroll, if you've used that library: it's loading some mesoscale data from GOES to visualize Hurricane Florence. So first of all you kind of have to figure out where the storm was and when, and that's what this call is for,
which, hopefully, this one's kind of taking a while, hopefully we get the dataset downloaded. We do not yet have this best track dataset in the Planetary Computer, so we have to hit NOAA's servers for it, which maybe is kind of a demonstration of why having all the data in the same place is a good idea. If this fails I do have the latitude and longitude stored in another notebook, so I might bring that up; if this fails entirely then at least I'll be able to do it and you all can copy the latitude and longitude, but I've got to set up another thing for that. I'm going to assume that this failed and interrupt it, maybe that's a bad idea, yeah, it's just downloading the data. Okay, well, wait a sec while this comes up, and then we can avoid hitting those servers. That's the other nice thing about Azure Blob Storage, any blob storage service really: they're built to scale, built to handle many concurrent requests, so you don't have to worry so much about a single user or a few users knocking the service over, like we appear to have done to NOAA, oops. Give me one sec while this comes up. I can actually show this over here, here's
my other JupyterHub, the real one, where I have an example from NOAA's EDMW workshop, and I'm just going to copy-paste this over to this window. Okay, so you all can skip this example where you download this stuff; instead, skip cell two, I guess it was, and skip this one as well, and apologies for this, I should have planned ahead. We're going to skip all that stuff and just skip to getting the imagery, perfect. And y'all will need, I'm going to post it in the chat here, we'll see how badly this gets formatted, seems to be okay, hopefully the quotes are all real quotes, if you want to follow along; otherwise I'll just go through it pretty quick, just for
time, sorry about that. But we somehow have magically discovered the bounding box and the datetime for where and when this storm was. So we're going to go ahead, now that we know where it's at, and query the Planetary Computer STAC API; we don't have to know anything about how GOES organizes its data, its file names, or things like that. All we need to do is query the GOES-CMI collection, the GOES Cloud Moisture Imagery collection, for assets within this bounding box over this date range, and we're interested in just the mesoscale images: GOES is capturing CONUS and full-disk images at about the same time, and we only want the mesoscale images from when it was zoomed in on Hurricane Florence. I don't think I timed that, but you can see it's already finished, so it's a couple of seconds and we've got back these items that match our query. If we very quickly check and make sure we're in the right spot, let me make this a bit smaller so we can see it, you can see that we're in the right spot.
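For readers following along in the notebook, a rough sketch of that query step might look like the following. This is an assumption-laden reconstruction, not the exact demo code: the collection id, the bounding box, the dates, and the mesoscale property filter are stand-ins for the values used live.

```python
import planetary_computer
import pystac_client

# Open the Planetary Computer STAC API, signing asset URLs as we go.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

# Hypothetical bounding box and date range for Hurricane Florence (Sept 2018).
search = catalog.search(
    collections=["goes-cmi"],                 # assumed collection id
    bbox=[-81.0, 24.0, -75.0, 32.0],
    datetime="2018-09-11/2018-09-13",
    query={"goes:image-type": {"eq": "MESOSCALE"}},  # assumed property name
)

items = search.item_collection()
print(f"{len(items)} mesoscale items found")
```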
Okay, let's see: GOES does not have a green band, I think, so we're going to do a bit of xarray work to make a synthetic green band out of the near-infrared, red, and blue bands; we're going to do that here, and then a bit of work to make the picture look pretty, I don't know how scientifically accurate this is, but some kind of gamma correction, to get a time series of RGB arrays that we can then plot. I'm going to very briefly show this.
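As a hedged sketch of what that band math could look like in xarray: the weights below are one common linear combination for approximating a green band from GOES ABI channels, not necessarily the ones used in the demo, and the input DataArrays are assumed to have already been loaded from the signed asset hrefs.

```python
import xarray as xr

def synthetic_green(red: xr.DataArray, blue: xr.DataArray, nir: xr.DataArray) -> xr.DataArray:
    """Approximate a green band as a weighted sum of red, blue, and near-IR."""
    return 0.45 * red + 0.45 * blue + 0.10 * nir

def to_rgb(red: xr.DataArray, green: xr.DataArray, blue: xr.DataArray,
           gamma: float = 2.2) -> xr.DataArray:
    """Stack bands into an RGB cube and apply a simple gamma correction."""
    rgb = xr.concat([red, green, blue], dim="band").clip(0, 1)
    return rgb ** (1 / gamma)
```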
If you copy-paste this URL into the Dask dashboard, this is an example of computing on the data in parallel. We're computing in parallel on a single machine using, I think, four threads or processes with Dask; the setup that we have here is a Dask Gateway, so you can easily scale out on a cluster of machines. So it's kind of working through this computation here: reading data from Blob Storage, doing the linear combination to make that green band, doing the stacking, things like that, and then we have a bit of matplotlib, a lot of matplotlib stuff here, to make the animation and then embed it in the notebook.
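For orientation, a minimal sketch of that parallel setup, assuming a local machine rather than the Hub's Dask Gateway; the `frames` object is a hypothetical stand-in for the lazy, Dask-backed stack of RGB arrays built above.

```python
from dask.distributed import Client, LocalCluster

# Four local workers gives the single-machine parallelism described in the demo.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
print(client.dashboard_link)  # paste this into the Dask dashboard to watch progress

# On the Hub you could instead use the Dask Gateway (an assumption about setup):
# from dask_gateway import GatewayCluster
# cluster = GatewayCluster(); cluster.scale(8); client = cluster.get_client()

# frames is assumed to be lazy; .compute() runs the Blob Storage reads and the
# band math in parallel across the workers.
# rgb_frames = frames.compute()
```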
I'll stop there, and we'll go back to the original animation up top and play that. So this is, it's actually a bit longer than what we were making there, but, well, it looks really amazing, I think; hopefully it's also scientifically useful, I don't know, you all can tell me whether or not it's scientifically useful, but I think it's pretty cool. Okay, again, sorry for the issues there with the NOAA servers; I'll have to send them an apology note afterwards and get that dataset onboarded. So with that I think we'll jump back to questions, if there are any. Yes, there still are some, but thank you
for that demonstration; it certainly does bring up the speed that you're capable of doing that at, and also, in a sense, the "aha, wow" at the end is quite helpful. In a sense going back to that: you've got this amazing tool that you can work with, so how would you suggest going about designing a data and computing infrastructure that's capable of supporting the principles of open source science? NASA has mission needs around open source science: transparency, accessibility, inclusivity, reproducibility, which we talked about earlier, but this question is really about how you would go about designing data and computing infrastructure to support those mission needs. Yeah, so there's definitely the kind
of low-level things like co-locating data with compute and, you know, cloud, or wait, I shouldn't assume cloud, but efficient access to data, which are important. But I think even more important than that is really well-structured, standardized metadata. For the Planetary Computer we're using STAC; I know NASA's CMR also has some STAC things, I don't know the full details, but there are people at NASA who are familiar with STAC and involved with it, so that's great. Having that metadata makes the data actually searchable, queryable, discoverable by your users; that's been extremely important for us, and I think it's important for any kind of collection of datasets. And then I think maybe even more important than that is the educational material, the educational side of things. So with the Planetary Computer, and I think similarly for NASA, we're in kind of an interesting spot where there's a tension: I have this tutorial for making this animation of Hurricane Florence; would that be better suited for going in, say, the matplotlib gallery, or xarray's docs? We have all of these pieces, these open source components that we're building on, that we're bringing together for this specific use case, and it can be hard to know how you balance improving the documentation and examples for those libraries, those components that you're building off of, versus building your own thing. So that's a tension that we're facing, and I think NASA would face too, as I'm guessing you all have a bunch of documentation that's specific to your computing and data analysis platforms, which may or may not be, I think a lot of it is, open source. So how do you balance that versus improving the documentation of the upstream libraries? And then there's absolutely that need for cross-cutting, high-level examples that use all of these things, and so where does that go? So that's a thing that we've been thinking about and, I think, not completely solved yet. Wonderful, and yeah, it's really helpful to
know the kinds of things that you're running into and saying, hey, we haven't solved them yet; that's always good for all of us to keep our eyes on. Are there particular advantages, and I apologize if this seems like a loaded question, but particular advantages that you see when you think of the Planetary Computer over Google Earth Engine, for example? Yeah,
yeah, so advantages and disadvantages for sure. A lot of the advantages, and given the forum I think it's fair to say open source is of interest to me personally, to the Planetary Computer team, and to you all presumably, since that's in the title of this session. So the Planetary Computer is built on open source components and it is open source itself: all of our STAC API and metadata generation, all those things, the Hub deployment, if you really want to look at that AKS deployment, that's all open source. And so that's, I think, an important component which gives you all the flexibility: this example is using Python and xarray, but if you want to use R and sits and all these other libraries for doing your analysis, then absolutely go for it. And like I mentioned, there are also disadvantages; Google Earth Engine, I'm not going to throw any shade, it's a really amazing product, they do a ton of things really well. So yeah, Bruno, I think I'm
guessing you want to say something here too. Yes; first of all, Google Earth Engine is a fantastic product that has helped advance tremendously what we can do with remote sensing for, what, ten years, so not throwing any shade at them, it's a great product. As Tom said, there are some differences, or disadvantages, whatever you want to call them. I like to think that if they had to build it today, they would probably build something very close to what the Planetary Computer is today; I don't know, maybe they have a different answer, but many of the things that we are using now did not exist back when they had to build it. Something we also like to highlight, which I think answers some of the questions, is that if there is something we are not doing that you want to do, you can do it, because it's open and it's modular: hey, there's this dataset you don't have and I really need it, put it in the same data center in your own tenant, ingest it with STAC, and it's going to be one hundred percent the same as if it was ingested by us. That also covers some questions I saw on the list about whether you can use it for Mars or the Moon; you will need to hack the specification a little bit, I think, and I saw a talk at FOSS4G, the conference that we talked about, with people using the STAC spec for other planets. It's doable, and again, because it's open source, if we don't do it you can do it, and it's exactly the same: you know what's going on, there's no black box here. Very good. Bruno, one of the questions here
may be something that you want to speak to very directly: NASA has a Space Act Agreement with Microsoft; can you comment on a possible way forward for collaboration between NASA and Microsoft on the Planetary Computer? We already have some conversations with some of your colleagues to figure out how to leverage that coordination into hosting or doing projects together, on pilots. I would say, if you have something specific in mind, we're happy to take it on and start another thread, but we are already doing some of that. Wonderful, thank you. There's another question,
again it's kind of the basics: how does Microsoft fund and sustain the activity? And that's the golden question, and I think it's also a golden question for NASA, and I think it also gets into: does NASA want to be in the business of disseminating all of these data products for everyone? I think the answer is probably not. If there are commercial customers who want to depend on a dataset, there's probably an opportunity for a company like ours, or other cloud providers, to say we will host it and we'll provide it for you; of course we depend on you, because you are the providers, but we will then cover the elasticity and the one-to-many needs. And that's a little bit how we think of it: when you use the PC Hub, which we just talked about, we think of it as a reference implementation, we think of it as academic use, we think of it as NGO use, but if you are a commercial company using the Planetary Computer, we would really encourage you to deploy your own Planetary Computer, your own Pangeo, and that means that you will pay for that. There's no extra cost for using Pangeo because it's open source, but you will generate consumption, so it becomes part of the offering, just like you can deploy it on a Linux machine in Azure; it's part of the business model of the cloud to actually have the majority of the VMs running Linux, and it's a business model built on that. That's also why I made the comment before that it becomes our incentive to disseminate and make this data useful, because if it's not, well, we are not in the business of archival for archival's sake, we can't be, right? We got into the business of figuring out how the resources we're putting in to pay for this stuff are leveraged into more revenue for us, and I genuinely believe that that is the case, otherwise solutions like this one, or Google Earth Engine and others, wouldn't exist. Very good. There are several folks
who are interested in collaborating on specific topics; we've heard about looking at other planets, Moon, Mars, etc. "I'm interested in putting space weather data on the Planetary Computer," one person says, "do you have any thoughts on how the capability would be useful?" So, again, looking up or out, as opposed to looking more down. We're going to need to change the name to, you know, Cosmic Computer or something. Yeah, no, the answer is that the metadata, I think, is going to be a bit tricky; I don't think it's impossible, I think what happens is that the majority of developers are looking down, satellite-style, right, but I see no reason not to. When I was doing solar physics during my PhD we also used some of the tooling that was meant for Earth for mapping the surface of the Sun. I would say let's do it, and if you have questions, put them on the discussions, we'd love to see that; that's the power of being open, we haven't thought of that use case, just go through it. And if you cannot use the STAC specification, that's fine, use any other specification; that's also the beauty of being modular: if you put it in the same data center and then you want to use the PC Hub, do it, it is meant for these use cases, and we'd love to figure out what crazy, hacky things people do with this. Very good, so it sounds like it might be
an interesting opportunity to explore. Very good, so we're getting close to the end of our time, and I want to make sure we close things out nicely here. It's three minutes left, please stop us, oh my god; let's pause for a second: have we answered the questions you asked us? If not, happy to try again. I think, as I've been listening, that you've done a really nice job of speaking to those in a really spot-on, thoughtful, and also succinct way, which is always a challenge, you guys have a lot going on, so I really appreciate it myself. And I would just offer an invitation to anyone, whether it's on the panel or in the audience: if there are some follow-up questions we'd be happy. Oh, I'm seeing we still have an hour, do we? Let me just ask openly, Hannah, am I wrong, I thought we were closing at the bottom of the hour; how much time do we have to hang out? Yeah, today was a longer discussion, so we have until 4:30 today if we want, to allow for facilitated discussions, my apologies, we have plenty of time. Yeah, so anyone who was thinking of a follow-up question, now is a great opportunity, because we're not going to let Bruno and Tom go just yet. Tom, I think there was that question about compute on Dask. Yeah, I was just going to say that, but let's be candid and let's be open, we have plenty of time. So I've got a question, please. This is Kevin. So, you know, we see a lot of, not a lot,
like one: you have a pretty impressive system, right, and I think that there are a lot of potential datasets to address these types of questions related to earth science or applications or climate. So my question is, how do you knit that together? Right, like the whole conversation here is how can NASA be internally better, but I think part of being internally better is making our systems work better with other systems in an interoperable sort of way, right? I don't know if everybody's ever going to be able to put all their data in one spot and do all the analysis in one location, so as we work with ESA, we work with NSF, we work with NOAA, USGS, all these people, with you, with Google, how do we make that a little bit better? It's a really good question, Kevin, and maybe it's good to have the most-used data for the most people, kind of like a CDN cache, in one place, but
there's going to be data that we don't have, and that's when that idea of federated STAC, like a ring of STAC endpoints, could be helpful, because you might be searching for something we don't have, but if we have the metadata it already gives you the lead on where to go. Maybe the answer is not the bytes; maybe the answer is "email Kevin," or go to some other page, not the data itself but somewhere you can find it, right? I dream that, for example, if NASA were to provide the data along with the STAC specification, like a static file, it would probably make our lives much easier, right, Tom, so we don't have to make the schema ourselves. If you also have an API with the data, you might not even have it connected to the data itself, but then if we get a query, imagine we are asked for something that we don't find, we might redirect to your API, and maybe the answer is that they have it, it's not online, but they have it somewhere else. That kind of coordination among data providers and cloud companies is probably beneficial for everyone: we get the bulk of the demand for requests, and then you only get the ones that are specific to the more niche applications or the datasets that are harder for us to host.
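As a hedged illustration of that "ring of STAC endpoints" idea, a client could simply query several catalogs in turn and report where matching items actually live. The endpoint URLs below are assumptions for illustration (the Planetary Computer API and a NASA CMR-STAC root), not a verified federation setup.

```python
import pystac_client

# Hypothetical ring of STAC endpoints; swap in whatever catalogs you actually use.
ENDPOINTS = [
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    "https://cmr.earthdata.nasa.gov/stac",   # assumed CMR-STAC root
]

def federated_search(bbox, datetime, collections=None, max_items=10):
    """Query each catalog in turn; yield where each matching item's assets live."""
    for url in ENDPOINTS:
        catalog = pystac_client.Client.open(url)
        search = catalog.search(bbox=bbox, datetime=datetime,
                                collections=collections, max_items=max_items)
        for item in search.items():
            # Asset hrefs tell you which provider actually holds the bytes.
            yield url, item.id, [asset.href for asset in item.assets.values()]
```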
So, no, I'm probably going to speak out of turn because I too do PowerPoints and not technical stuff anymore, but I do think that the API reference for CMR has a STAC description in there, so you might want to take a look at that. Yeah, that is correct; it is a bit out of date, I don't know, not quite up to STAC 1.0, but I'm guessing people are working on it. Yeah, it's great and it's fantastic to have.
But I think the bigger question that I have is, okay, so we've got STAC catalogs and this type of thing and that type of thing, but I think how we coordinate that strategy across the organizations is an important point, right? Like, we're talking to you, but who's talking with you and me and NOAA, you know what I mean? It's that coordination activity to say, hey, look, maybe this is the way we should structure this ecosystem a little bit, not necessarily being prescriptive, but giving some options would be something helpful. One thing we do is be opinionated; we don't shy away from being opinionated, and we are Microsoft, you are NASA: if NASA is opinionated, it helps lean the weight in one particular direction. We decided that STAC was a good standard that the community was using, and now it's not only the community, it's also Microsoft that is putting that together, which probably helps push the specification even farther; if NASA then also embraces it, it goes in that direction. So I think maybe if we are a little bit more opinionated, at the risk of some other datasets being harder to put in STAC, it might help increase usability, but as I said, it's a trade-off; it's hard, and this is what becomes a little bit tricky, it's hard to cover everyone's needs. I do know that cloud-optimized GeoTIFFs are great for some things and not good for others, or that GeoParquet is great for some things and not others, so sometimes we have really good discussions and sometimes uncomfortable discussions about choosing a winner on these open source standards, but I think it's still worth it. I was wrong about the STAC thing, by the way; it's not out of date, it's been recently updated, apparently, so, fantastic. Very good, thank you. We still have a couple of
updated apparently so fantastic very good thank you um we still have a couple of
questions that have flowed into the to the chat here um one that's kind of a follow-up to to
uh one that came up Tom while you were talking um it talks about NASA's earth science data
to correctly use NASA's earth science data for research it's really important that
researchers are familiar with the product documentation and that users are aware of
the Quality fields and values as well as the product metadata product version that kind
of thing what approaches are you using to make this info information easily findable
and accessible for the users by the users um yeah uh super important um and as we like uh
add these datasets we become pretty familiar with all of them, and I'm consistently amazed at how complicated and intricate each of these datasets is. Anyway, I think really our only answer is to do tons and tons of linking back up to the upstream providers, both in whatever prose narrative we write in our example notebooks, and also in structured ways in STAC: STAC has a structured place to put the scientific citations back to the original papers, places to put the links and licensing and all of that. So I think at a minimum that's necessary and that's what we're doing, and then if you all have suggestions on how to better surface that critical information, I am all ears to hear how to do that better. So, the thing that we are not doing yet, as far
as I know, correct me if I'm wrong, Tom, is provenance, where the data exactly comes from, and I had this idea, I don't know if people like this idea, to add a metadata tag that provides the MD5 hash of the file at the source, so you have a kind of chain of integrity: you can have the MD5 of our file, but you also have the MD5 hash of the source. Yep, yep, so we are planning to add that at some point; that's one of our work items that I think will help a lot, especially when the upstream providers have STAC metadata, because there's again a structured place to put that information about the files themselves, the MD5 hashes, all sorts of things about the files themselves. And in our STAC catalog we do this for Landsat 8, because USGS has STAC metadata: we have a link, a way to indicate that this is the upstream provider's STAC item, so you follow that, and then that STAC item has links to their assets on the USGS server, so you can track it back perfectly. Again, there are issues around, well, what if the data changes, what if they update the data, and there are again STAC extensions for versioning, so it's infinite complexity, but there is, I think, a path forward.
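A minimal sketch of the provenance idea Bruno describes, using pystac: record a checksum of the source file on the item and link back to the upstream provider's item. The property name ("source:md5"), the link relation used here, and the upstream URL are illustrative assumptions; the Planetary Computer's actual implementation may use the STAC file-extension checksum fields instead.

```python
import hashlib
import pystac

def add_provenance(item: pystac.Item, source_path: str, upstream_item_url: str) -> pystac.Item:
    """Attach a source checksum and an upstream link to a STAC item (sketch)."""
    with open(source_path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()

    # Hypothetical property name; real catalogs may prefer the file extension's fields.
    item.properties["source:md5"] = md5

    # Point back at the upstream provider's STAC item so users can track it.
    item.add_link(pystac.Link(rel="derived_from", target=upstream_item_url,
                              media_type="application/json"))
    return item
```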
Tom, I see one of the questions, and sorry for coming in there, Peter, but it's a question that everyone seems to ask and I haven't been able to answer; it's great to have you, Tom, here live: it's about kerchunk. Oh yeah, actually let me jump over to that screen; kerchunk, okay. Yeah, so, archival data: there's so much of
it out there, and you don't necessarily want to convert all of it to cloud-optimized formats just because of the cost of doing that, and there are so many existing processes built on top of those files in their current formats. So, for those not familiar with kerchunk, and I heard about a similar project, DMR++ I think, some effort at NASA that's very similar to kerchunk: you scan these files, these, let's not call them legacy files, but these not-cloud-optimized files, and figure out where the assets within them are. So take a single netCDF file with many groups, many variables, that's chunked up: where does this temperature variable start, where does precipitation start, in the file, in the byte stream? This is so useful because the performance of a file system like Azure Blob Storage is very different from a local file system. With a local file system you can open up the files and seek all over, and it's not going to take that long, but with a remote file system like Azure Blob Storage it takes a long time to figure out where in the file these different pieces are. So, jumping back to kerchunk and DMR++, and thank you for that, the
idea behind these is to have a pre-processing step where you scan the data, scan each asset, and then write out a sidecar file that has the locations of each variable, each chunk, within that netCDF or GRIB file's byte stream. You end up with a JSON file that's basically a URL, an offset, and a length, and we combine that with the thing Bruno mentioned earlier about HTTP range requests: once you have all of those, you can make range requests and fetch just that data. So you have all the metadata that you need to build your data cubes, you know exactly where in these netCDF files each chunk is, and then you can get cloud-optimized data access to these netCDF or GRIB2 files, files that don't necessarily work well in the cloud. So that's the idea, and as for the reality, kerchunk is very new; it's kind of a project that some folks from Anaconda and the Pangeo community are just hacking on, and we think it's a very, very promising way forward. It needs a bit of work and we're working on that; if you all want to become involved, then definitely do, with examples and fixing bugs and things like that, but it seems like such a promising way to get this, again, not legacy, this non-cloud-optimized data exposed in a cloud-optimized way.
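For readers who haven't seen the pattern, a rough sketch of how a kerchunk reference file is typically consumed with fsspec and xarray; the "references.json" sidecar and the remote protocol are assumptions standing in for whatever a real onboarding step would generate.

```python
import fsspec
import xarray as xr

# "references.json" is the hypothetical kerchunk-generated index mapping each
# variable chunk to a (url, offset, length) triple.
fs = fsspec.filesystem(
    "reference",
    fo="references.json",
    remote_protocol="https",   # where the original netCDF/GRIB bytes live
)

# Expose the references as a zarr-like store; reads now issue HTTP range
# requests for just the chunks you touch.
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)
print(ds)
```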
I just want to share this quote from Paul Ramsey, who invented PostGIS; I don't want to mangle it, but it essentially mentions how there's all this focus on cloud-optimized formats, which is necessary, but the really important thing, and the really challenging thing, is getting clients, cloud-optimized clients, to make efficient use of those well-organized bytes. And that's really what kerchunk is taking to the extreme: what if the bytes aren't well organized, what if they're just as-is, as they were written twenty years ago or whatever, but what if we have a super sophisticated client that's able to do these cloud-optimized requests on the fly? That's the basic idea behind kerchunk and this cloud-
optimized access pattern. And the status on the Planetary Computer is that we are looking closely at those technologies; we are playing with them, and last I remember, Tom, you were saying that we're still trying to figure it out, we have not yet onboarded anything with kerchunk or others, but we are actively seeking feedback from the community. So there, again, as I was saying before about being opinionated: if we, or others, say hey, let's just use kerchunk, it would favor the odds of that standard, but we want to choose something that the community thinks is the right one, right? Yeah, so we have what we call experimental reference files for one dataset, NASA's NEX-GDDP-CMIP6 dataset, where we made reference files for some of those; again, the software's young, there are bugs and errors if we try to do it for every single one, like projections into the future with datetimes past some invalid range, so there's a lot of work to be done, but I think it's really, really promising. Great, thank you. Tom, we wanted to follow up
on some of your experience with STAC: STAC assets and xarray, Dask, matplotlib are still a rather low-level user interface compared to some of the object models of GEE and openEO; do you see a need for a higher-level interface, something with more powerful abstractions that might serve more low-code users? Yeah, maybe this is something
I struggle with, it's like a blind spot of mine, because I like coding; I think it's a very powerful, expressive way to do this kind of analysis, assuming you know how to code in Python or whatever. So I absolutely do think, and I should say there's the other end of it, the Explorer, where there's essentially no code: you're manipulating the UI to generate the queries and then it returns the results, which even for me is extremely useful for very quickly visually debugging things, and it can get you started on a path by showing you the code it used essentially behind the scenes. So I guess I'm not quite sure; I do think there are absolutely needs within the Python community for sure, but other communities as well, for better ways to work with this type of raster data. How do I say this, anyway, I don't want to get too into the details, but xarray Datasets and DataArrays are very focused around the idea of a regularly structured grid, a kind of rectangular, rectilinear data cube, which is very nice for many datasets, but it doesn't accurately capture, say, the path of Sentinel or Landsat as it goes over the Earth. How do you represent that? Is it more like a data frame, or a fancy list, or a tree of data? So anyway, there's lots there; I completely failed to answer your question, because you asked about low code and I'm talking about other very complex code things, so those are kind of my thoughts there. Great, well, it sounds like it's a challenge
on the horizon, yes, absolutely, something that may need some more attention. I apologize, I'm jumping back and forth, I'm moving some URLs over, making sure folks, we had a question come in about when the video might be available and what else might be available. Just real quickly: we are stocking these on our project website, and I just provided the URL to that. We hope to have a video available within a few days, it kind of depends on how quickly we can turn it around, and then that video and the transcript are helping us put together a kind of high-level summary as well, of some of the Q&A and some of the key points that were made, and that will be available a few days later, again trying to tie as much of the conversation and discussion back to some of these main framing questions, but also really trying to pick up the questions that have been coming in from the audience that are kind of above and beyond those framing questions. So with that in mind, let me turn to a couple of those that are above and beyond. We've got a question around support for non-Python users: how do you support that, or will you, if you don't at the moment? Yeah, so there's the Explorer, if you're
not familiar with coding, which is a great way to visually inspect and understand some of the datasets, and hopefully non-raster datasets in the future. Then, as far as discovery of what datasets are available, all of our catalogs are in STAC; STAC is an open specification and there are clients in lots and lots of different libraries, so you can equally well use STAC from any language, really any language that can do HTTP. And we've worked with the developers of rstac from Brazil, from INPE, their space agency, to make sure that rstac, their client library for R, works well with the Planetary Computer. So that's the STAC side of things, and then going up an additional level, for the Hub specifically, for compute, we do have R, the programming language, profiles as well, so you can start up R kernels and use all those libraries, that geospatial analysis tool chain. We also, and maybe this somewhat better answers your last question about lower code, we also have a QGIS profile where, if you want, you can go into that: it starts a QGIS server in Azure, you're still accessing it from your browser locally, but the QGIS compute happens in Azure, close to the data. So if you want that graphical user interface to the data, we also have a QGIS profile you can start up. So those are kind of the non-Python-centric options that users have today, and then again, it's open source: if you can work with the STAC metadata you can use whatever tool chain you want. I'll also add that, again, it is very
modular: we have commercial customers that just use the files and don't touch anything else, and that's fine; we have customers that use the metadata API through the Planetary Computer Hub, that's great, in the Python kernel or the R kernel or QGIS on the server, and if you have QGIS locally you can also connect to that, totally fine either way; and we also have commercial customers who are using the HTTP requests from their own virtual machines, doing whatever they want. That's the beauty of being modular in that sense. And we have the PC Explorer, which not only is great for people who do not want to code or don't know how to code, it is also fantastic for people who do code, to quickly grab that snippet of code that I showed, the lines of code for the region, and you can change it from Python to other things. The idea is again to provide a bridge from the no code and the low code to actually doing the work. This is a conversation I constantly have with a colleague, Matt, who is the one developing the PC Explorer: there is such a danger for the PC Explorer to scope-creep, because there are so many things you could do, and I think you have to put a limit on those, otherwise are you trying to make a QGIS or ArcGIS in the browser? Users are always going to need to then do something in whatever tooling they use, which is one of the questions Katie just passed on. And I would say, to your question, Katie, about academic users: roughly one-third of our users are academia, one-third are commercial, and the rest are mixed, so we're not just a research tool; if anything we may be more enterprise, and that's also what Microsoft likes, an enterprise-level platform that is also fantastic for academic and research use by design, again so as to minimize the gap from research to operations. Thank you. Yeah, it looks like
your response is really resonating with Katie, and I would imagine others as well. Kind of a detailed question, I mean a very specific question, but probably important here: does the Planetary Computer handle elevation and altitude data as well? Yep, so it's worth mentioning that we're using STAC for our cataloging, so it doesn't really matter what the data within it is; it certainly works well for raster data, but it works very well for essentially any type of spatiotemporal data. So if you have data stored in Zarr or netCDF that has multiple levels, that's totally doable, and you would search it normally through the STAC API and access it again normally through xarray or whatever n-dimensional array library you're using to work with that data. And then the thing that maybe is less figured out is how you visualize that in something like the Explorer, which is currently mostly focused around a single spectral band at a single altitude, but you could easily imagine ways to have a slider to adjust the elevation or the altitude based on your selection there. So yep, it works, it's totally doable, it's just that, depending on the exact nature of the dataset, some things might not work. Very good.
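As a hedged sketch of what that access pattern looks like once a multi-level dataset is cataloged: the store URL, variable name, and vertical dimension name below are all hypothetical; discovery would normally go through the STAC API, and selecting one altitude or pressure level is then just an xarray selection.

```python
import xarray as xr

# Hypothetical Zarr store for a dataset with a vertical coordinate.
store_url = "https://example.blob.core.windows.net/data/reanalysis.zarr"
ds = xr.open_zarr(store_url, consolidated=True)

# Pick a single level (e.g., 850 hPa); everything else stays lazy until computed.
level_850 = ds["temperature"].sel(level=850)
```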
Similarly, NASA has some datasets, specifically they've got open population raster data, SEDAC: what's the best way to get data added to the archive so it's available to all? Yeah, let's see, two answers. So first of all, we recognize that right now the
Planetary Computer team is responsible for maintaining relationships with all the data providers and, depending on the dataset, potentially doing the cloud-optimization conversion, potentially doing the STAC metadata creation, and then ingestion into our database. So there are a lot of things that we become the bottleneck for, and we're hoping to improve a lot of the tooling around creating STAC metadata and getting it ingested, and all the stuff around the metadata side of things and the sharing side of things, to make it easier for any group to share their data through STAC on Azure Blob Storage. So that is a thing that we're working on; for now the process is to reach out to us, I'll put a link in the chat in a second, we have a dataset request page that you can fill out and we can take it from there. It's just that all these datasets are unique, and they take a good amount of effort to get cataloged correctly so that they're usable by as many people as possible. And just to add to that: the best way to get your dataset on board, because again it's just blocked by us, it's just a very long tail of datasets, is to help us by making sure the data is in a format that is as cloud-optimized as possible, if that is possible, and that the metadata fields are clear, ideally with STAC GeoJSON or schema fields, things like that; that makes our lives easier. And then, most importantly, who would use this for what? Because if this dataset is interesting, that's great, but if it's a dataset where you have identified who would be using it and for what, it really helps us prioritize it. If it's on the order of petabytes, then the conversation might be a little bit harder; if it's less than that, the volume is not really that much of a problem. Also, whether it updates periodically or is a one-off: some datasets we have are updated every year or just once, some update every few minutes, so that's also part of the criteria we have for prioritizing. As I said, it's not really that there's a secret channel or something, it really is a matter of prioritizing until we develop easier ingestion. Very good, thank you. Let me see if there are any
particular questions that might be coming from panelists who would like to come online and maybe ask a question; we've worked through all the questions that are in the queue. Thank you very much, I mean, we've worked through a long set of questions and really appreciate the discussion. We do have time, I just wanted to see if there's anyone among the panelists who might want to come on. I will try to reciprocate and ask everyone: if you can, please share with us directly on that email any feedback you have, what you like the most, what you like the least, anything really helps. We're building the Planetary Computer and we are building it very openly, because we want to be as useful to you as possible, to make a change and to make the world a better place; that's literally what we're trying to do, and we have an amazing opportunity to shape it, so please be candid and reach out with feedback. Very good, that's a wonderful
offer, and, as you can see in the chat, folks really appreciate the time you've spent with us today, and really appreciate the two complementary perspectives that you brought to the discussion and to the conversation: the high-level, big-picture look at the whole platform, and also the technical details of the how and the why and the what. Just a fascinating and very thoughtful combination; thank you to you both for bringing those perspectives and working them back and forth so elegantly, it was very nice. I'm not seeing any additional questions, so, Hannah, can you bring us to the closing couple of slides? Sorry, putting her right on the spot; thank you, Hannah, I see it happening. Well, so, building on this series that we
have, we have some upcoming sessions scheduled: we're looking at the San Diego Supercomputer Center towards the end of September, and NVIDIA will be here early in October; Sandia's Center for Computing Research, that's a new one on our schedule, is coming September 22nd; Pittsburgh Supercomputing Center September 23rd; and then Esri coming in at the end of the month in October, October 21st; and we also have the Texas Advanced Computing Center coming in on the 5th of October. So we have a number of sessions in the next week or two, and then we'll go to Esri towards the end of next month. All of these are building on this series of questions that we've been asking, and that Tom and Bruno have been so kind to really wrestle with, with us and with each other, on how to think through some questions that are particularly challenging and timely for the SMD data and computing infrastructure project and study, and the whole move towards open source science: learning lessons from folks who've been doing this for quite a while, both inside NASA and elsewhere, but also thinking down the road about what we should be anticipating, what's coming at us. So thank you, Bruno and Tom, for helping be a part of this,
the study and this really important conversation. With that, yeah, we have a couple of nice thank-yous coming in in the chat; I think we're all set. Elena, would you like to come in? Yes, well, I just want to say thank you so much for speaking with us today, we really enjoyed your talk and all of your insights, so, yeah, thank you very much. Thanks. All right, with that, thank you everyone for your time on this Friday; we came close to 100, we topped out just shy of 90, but not bad for a Friday, and we covered so much of the waterfront, coast to coast literally. So thank you everyone, have a great weekend, and we hope to see you at a future session.