System News GPUs now available in Bianca <p>UPPMAX is happy to announce that Bianca now provides GPUs for all its users! GPUs are a type of accelerator that may significantly increase performance for certain workloads, for instance, workloads that contain large amounts of identical operations on independent pieces of data (so-called data parallelism).</p> Thu, 09 Sep 2021 12:12:00 GMT 2021-09-09T12:12:00Z Issue with 'interactive' and creating slurm.out files Tue, 27 Mar 2018 14:07:00 GMT 2018-03-27T14:07:00Z Problem with Slurm on Rackham and Milou <p>There is currently a problem with the Slurm master node which affects users on Rackham and Milou. We are investigating.</p> Mon, 26 Mar 2018 09:46:00 GMT 2018-03-26T09:46:00Z Configuration problems on Milou and Irma - SOLVED Thu, 22 Mar 2018 16:16:00 GMT 2018-03-22T16:16:00Z March maintenance day -- UPDATED Thursday 07:00 <p>On Wednesday 7th of March, UPPMAX began its monthly service window. Systems and services may become unreachable during the day.</p> Wed, 07 Mar 2018 06:27:00 GMT 2018-03-07T06:27:00Z Slow home directories <p>Home directories have occasionally been extremely slow today. Nothing seems broken but the system is under a lot of pressure from time to time.</p> Wed, 28 Feb 2018 15:25:00 GMT 2018-02-28T15:25:00Z Rackham login issues -- SOLVED <p>We are currently seeing and receiving reports of login issues on Rackham.</p> Mon, 26 Feb 2018 07:40:00 GMT 2018-02-26T07:40:00Z Files and directories may be hidden on Bianca -- SOLVED Wednesday <p>We have received reports of missing files and directories inside the /proj and /proj/nobackup directories on Bianca. Upon inspection the files are actually there, but are not shown by the &quot;ls&quot; command. 
If you are working on Bianca, you should be aware of this, because jobs of the type &ldquo;process all files in directory X and compile the result&rdquo; might finish without errors but produce false results due to missing input, leading to incorrect conclusions.</p> <p>A workaround was implemented on Wednesday 2018-03-08 that mitigates this issue.</p> Wed, 21 Feb 2018 13:43:00 GMT 2018-02-21T13:43:00Z The fat (256GB) Rackham nodes are currently unavailable -- SOLVED <p>The fat (256GB) Rackham nodes are currently unavailable due to an issue with Slurm. We are investigating this issue.</p> Mon, 19 Feb 2018 16:15:00 GMT 2018-02-19T16:15:00Z Rackham's storage system -- MONDAY: Queues released <p>Due to an issue with the storage system Crex, the Slurm queue on Rackham is currently on hold. This is a summary of the problem.</p> Mon, 12 Feb 2018 14:16:00 GMT 2018-02-12T14:16:00Z No new jobs on Rackham 2018-02-09 11:15 <p>We are experiencing problems with crex, the file system on Rackham. In order to not put more strain on the file system, we will not allow new jobs to start at the moment. 
If you submit jobs, they will be held in the queue.</p> <p></p> Fri, 09 Feb 2018 10:16:00 GMT 2018-02-09T10:16:00Z Fysast1 online <p>The fysast1 cluster is back online</p> Fri, 09 Feb 2018 08:39:00 GMT 2018-02-09T08:39:00Z Milou online <p>The Milou cluster is back online<br /> &nbsp;</p> Fri, 09 Feb 2018 08:24:00 GMT 2018-02-09T08:24:00Z Rackham online <p>The Rackham cluster is back online</p> Fri, 09 Feb 2018 07:54:00 GMT 2018-02-09T07:54:00Z Bianca online again <p>The Bianca cluster is back online following our service window.</p> Thu, 08 Feb 2018 15:10:00 GMT 2018-02-08T15:10:00Z Maintenance window Wednesday 2018-02-07 -- CLOSED <p>For the February service we will install our new UPS, update Slurm on all clusters, extend the capacity of the storage system Lupus (for Irma), and of course perform the standard kernel and security updates.</p> <p></p> <p></p> <p></p> <p></p> Tue, 06 Feb 2018 12:39:00 GMT 2018-02-06T12:39:00Z The UPPMAX Cloud region will be unavailable Thursday 17:00-20:00 CET <p>A central switch will be restarted tomorrow Thursday 2018-01-31. The cloud will become temporarily unavailable from the outside, i.e. the Internet.</p> Wed, 31 Jan 2018 13:37:00 GMT 2018-01-31T13:37:00Z Problems with the 'interactive' and Slurm commands on Rackham <p>The Slurm master on Rackham is currently overloaded and you may experience sluggish Slurm behavior or timeout issues when running commands such as interactive, jobinfo and squeue. We are investigating this issue.</p> Mon, 29 Jan 2018 14:16:00 GMT 2018-01-29T14:16:00Z Some project volumes on pica are slow <p>Some project volumes on pica are slow; this may also affect home directories.</p> Wed, 24 Jan 2018 08:24:00 GMT 2018-01-24T08:24:00Z Login issue for new Bianca projects -- FIXED <p>A network problem has been detected on Bianca causing logins to fail for a few of the most recent Bianca projects. 
We are working on fixing the problem, and expect to have Bianca fully working again very soon.<br /> &nbsp;</p> Tue, 16 Jan 2018 08:23:00 GMT 2018-01-16T08:23:00Z Maintenance window -- COMPLETED <p>Wednesday 2018-01-10</p> <p></p> <p>Monthly maintenance window begins at 0900 hours on the first Wednesday of the month. (That is&nbsp; today.)</p> <p>This time we will:</p> <ul> <li>Upgrade Slurm, Linux kernel and other system software on Bianca, Dis, Fysast1, Irma, Milou, and Rackham.</li> <li>Upgrade Linux kernel and other system software on Castor and Grus.</li> </ul> <p>Bianca and Grus will be unavailable while we service them.</p> <p>We will restart all login nodes of Fysast1, Irma, Milou and Rackham, probably only once.</p> <p>Slurm queues on Fysast1, Irma, Milou and Rackham will be stopped, but access to Slurm commands will mostly work&nbsp;during the day.</p> <p>Slurm queues on Bianca will be stopped and, most of the day, logins to Bianca will not be possible.</p> <p>We plan to keep you informed about our progress with the maintenance with updates here.</p> <p>UPDATE 2018-01-10, 16:00</p> <p>Irma is up and running. Bianca, Milou, Rackham and Fysast1 are still down. We will continue security upgrades tomorrow (Thursday) morning.</p> <p>UPDATE 2018-01-11, 15:15</p> <p>Irma, Milou, Rackham and Fysast1&nbsp;are&nbsp;up and running. Bianca is still being tested. Hopefully Bianca will be back today.</p> <p>UPDATE 2018-01-11, 16:00</p> <p>Bianca is now up; however, graphical login is not working right now. Text login works fine (</p> <p>We're still working on Dis and expect it to be up by tomorrow.</p> <p></p> <p></p> Wed, 10 Jan 2018 08:41:00 GMT 2018-01-10T08:41:00Z Extension of lupus <p>The vendor visited us last week and did the physical installation of the lupus extension. 
Unfortunately, some parts were not correct and we're currently waiting for replacements that are expected to arrive this week.</p> Mon, 08 Jan 2018 14:24:00 GMT 2018-01-08T14:24:00Z UPPMAX staff back after the holidays <p>We hope 2018 has been good to you so far! UPPMAX staff is back after the holidays and we're focusing on support tickets that have built up over the holidays.</p> Mon, 08 Jan 2018 10:14:00 GMT 2018-01-08T10:14:00Z Reduced staff availability over the coming holidays combined with lots of tickets <div> <p>Most of our staff is on vacation over the coming holidays. You can contact us using regular channels, but response times for support questions might be longer than normal. We are sorry for the inconvenience.</p> <p>Most of us will be back in the first week of January.</p> <p>We also have a lot of tickets about transfer from Milou to Rackham/Bianca and we think there might be hundreds of last-minute requests in January. Be prepared that the process of getting a transfer project takes some time.</p> <p>If you want to continue your Milou project, make sure you have applied for a storage project and compute project on Rackham (for non-sensitive data), or a project on Bianca (for sensitive data).</p> </div> Mon, 18 Dec 2017 08:54:00 GMT 2017-12-18T08:54:00Z Creation of new Bianca projects currently on hold -- FIXED <p>The creation of new Bianca projects is currently on hold. If your project is scheduled to start today you will be unable to log in.<br /> &nbsp;</p> Thu, 14 Dec 2017 12:57:00 GMT 2017-12-14T12:57:00Z milou2 rebooted on Friday 2017-12-08 at 03:52 <p>milou2 rebooted on Friday 2017-12-08 at 03:52</p> Fri, 08 Dec 2017 07:35:00 GMT 2017-12-08T07:35:00Z milou2 rebooted on Wednesday 2017-12-06 at 03:58 <p>milou2 rebooted on Wednesday 2017-12-06 at 03:58</p> Wed, 06 Dec 2017 07:30:00 GMT 2017-12-06T07:30:00Z Updates from SUPR are temporarily disabled <p>We are performing a change in our infrastructure today starting at 13:00. 
This change will temporarily stop updates from SUPR reaching UPPMAX. If you have, for example, recently joined or added a member to a project, you will have to wait before the change becomes visible at UPPMAX.<br /> &nbsp;</p> Mon, 04 Dec 2017 12:42:00 GMT 2017-12-04T12:42:00Z Fix for broken SSH-connections to the UPPMAX Cloud <p>If you regularly end up with broken SSH-connections (&quot;broken pipe&quot;) to your virtual machine in the UPPMAX region, please use the SSH option ServerAliveInterval. See below for an example.</p> Mon, 04 Dec 2017 09:42:00 GMT 2017-12-04T09:42:00Z Issue with the Intel License server <p>At this moment there is an issue with the Intel license server. You will be unable to use the icc compiler and Intel tools until this issue is resolved.<br /> &nbsp;</p> Mon, 27 Nov 2017 07:09:00 GMT 2017-11-27T07:09:00Z UPPMAX support low on staff Monday 20/11 <p>The UPPMAX support will be low on staff on Monday 2017-11-20 due to a conference.</p> Fri, 17 Nov 2017 15:36:00 GMT 2017-11-17T15:36:00Z How to get a high job priority on Bianca Thu, 16 Nov 2017 15:31:00 GMT 2017-11-16T15:31:00Z Support ticket system temporarily down -- FIXED <p>Our support email address was down for a couple of hours, but is back in service again.</p> <p><br /> &nbsp;</p> Wed, 15 Nov 2017 12:28:00 GMT 2017-11-15T12:28:00Z Logging in to Bianca without Rackham <p>Bianca users outside of SUNET will be unable to login using We have created a temporary workaround.</p> Fri, 10 Nov 2017 09:54:00 GMT 2017-11-10T09:54:00Z Rackham unavailable -- SOLVED: Rackham available <p>2017-11-17 09:35 Rackham is now back in regular service.</p> <p>Login nodes are now open on Rackham, and jobs are expected to run as usual on Friday morning.</p> <p>It was decided to temporarily close down the Rackham cluster last Thursday when several disks on Crex reported themselves broken. 
The problems now seem solved, and we're awaiting results from the last tests before Rackham is fully back in service.<br /> &nbsp;</p> Thu, 09 Nov 2017 07:59:00 GMT 2017-11-09T07:59:00Z UPPMAX power outage -- FIXED <p>UPPMAX experienced a power outage in the server hall on Tuesday.</p> Tue, 07 Nov 2017 14:23:00 GMT 2017-11-07T14:23:00Z Problems with /sw on Bianca (now fixed) <p>The /sw part of Bianca was lost around 07:30 this morning due to an issue with the storage system. This may have caused failed jobs. The system was fixed at 08:40.</p> Fri, 03 Nov 2017 08:27:00 GMT 2017-11-03T08:27:00Z Quick upgrade of Slurm 2017-11-02 -- COMPLETED Thu, 02 Nov 2017 15:06:00 GMT 2017-11-02T15:06:00Z Maintenance window Wednesday 2017-11-01 -- COMPLETED <p>Monthly maintenance window begins at 09:00 hours on the first Wednesday of the month. (That is&nbsp; today.)</p> Wed, 01 Nov 2017 06:59:00 GMT 2017-11-01T06:59:00Z Issues with /sw/data during the weekend <p>/sw/data from pica may have been unavailable for some jobs during the weekend and some jobs may have failed because of this.</p> Mon, 23 Oct 2017 10:16:00 GMT 2017-10-23T10:16:00Z UPPMAX support system is down -- SOLVED <p>RT, the support system UPPMAX and all the rest of SNIC is using, is down.<br /> It is located at NSC at&nbsp;&nbsp;Link&ouml;ping University and the whole university has network problems.</p> <p>This will make all email to and from;delayed until the network problem is fixed. So answers to your support tickets will be delayed.</p> <p>UPDATE: 11:45</p> <p>We now have contact with our support system and emails to are reaching us again.</p> Fri, 20 Oct 2017 08:12:00 GMT 2017-10-20T08:12:00Z Slow home directories <p>Someone seems to be running something very I/O-heavy from the home directories. 
We are looking for these jobs and will terminate them if found, but it's less than certain that we'll find them.</p> <p>UPDATE: 14:45&nbsp;<br /> We found the offending jobs, are terminating them, and have notified the user not to do that again.</p> Fri, 06 Oct 2017 12:13:00 GMT 2017-10-06T12:13:00Z Accident on Irma caused jobs to fail with status NODE_FAIL <p>We are sorry to inform you that today at 17:02:37 a human error caused the compute nodes on Irma to reboot. The running jobs were canceled and will show up with status NODE_FAIL. The accident occurred while investigating an issue with the storage network. We are very sorry about this.</p> Wed, 04 Oct 2017 15:42:00 GMT 2017-10-04T15:42:00Z UPPMAX shutdown due to cooling failure -- FIXED <p></p> <p></p> Tue, 26 Sep 2017 18:32:00 GMT 2017-09-26T18:32:00Z lupus failover issue -- FIXED Mon, 25 Sep 2017 10:44:00 GMT 2017-09-25T10:44:00Z Maintenance indication in output from command jobinfo <p>UPPMAX made a small change in &quot;jobinfo&quot; output.</p> <p>In the REASON column for waiting jobs, &quot;(Maintenance)&quot; is shown for jobs that cannot start before the next maintenance reservation.</p> <p>Please note that maintenance reservations are often moved forward to the next month before the actual maintenance window.</p> Mon, 25 Sep 2017 10:35:00 GMT 2017-09-25T10:35:00Z Many Irma compute nodes lost electric power -- FIXED <p>Three racks of Irma's compute nodes lost power because an automatic fuse shut down.</p> <p>Some jobs were lost due to this.&nbsp; We are very sorry about that. Please rerun those jobs that were affected.</p> <p>It looks like nodes i[167-250] were affected.<br /> &nbsp;</p> <p>So what was the reason? It looks like an ethernet switch died, possibly short-circuited, so the automatic fuse shut down, taking more switches and the compute nodes down with it.</p> <p>We have reported the error to our support vendor. 
Until the bad ethernet switch has been repaired or replaced, Irma runs with fewer compute nodes.</p> <h3>Update at 0950 hours</h3> <p>Now only nodes i[179-226] are down.</p> Mon, 25 Sep 2017 07:05:00 GMT 2017-09-25T07:05:00Z Maintenance window Wednesday 2017-09-06 -- FINISHED Wed, 06 Sep 2017 06:09:00 GMT 2017-09-06T06:09:00Z milou2 rebooted August 28 <p>milou2 rebooted Monday 2017-08-28 at 19:51.</p> Tue, 29 Aug 2017 07:22:00 GMT 2017-08-29T07:22:00Z Replacing (nearly) all disks on Irma's compute nodes -- DONE Wed, 23 Aug 2017 05:44:00 GMT 2017-08-23T05:44:00Z Restarting irma-q <p>We're restarting irma-q for technical reasons. The Slurm queue system may be unavailable for submitting jobs and checking job status for a few minutes.</p> Tue, 22 Aug 2017 06:12:00 GMT 2017-08-22T06:12:00Z milou2 rebooted August 19 <p>milou2 rebooted on Saturday 2017-08-19.</p> Mon, 21 Aug 2017 07:13:00 GMT 2017-08-21T07:13:00Z Bianca's storage system Castor had a hiccup yesterday Thursday -- FIXED Fri, 04 Aug 2017 08:14:00 GMT 2017-08-04T08:14:00Z Maintenance window Wednesday 2017-08-02 -- FINISHED Wed, 02 Aug 2017 06:37:00 GMT 2017-08-02T06:37:00Z Unexpected reboot of Pica on Monday morning. 
Mon, 24 Jul 2017 08:06:00 GMT 2017-07-24T08:06:00Z Restart of two Milou login servers today Thursday Thu, 20 Jul 2017 14:01:00 GMT 2017-07-20T14:01:00Z Lower service level during UPPMAX holidays Thu, 13 Jul 2017 08:54:00 GMT 2017-07-13T08:54:00Z Part of storage system Pica is still very slow Wed, 12 Jul 2017 18:00:00 GMT 2017-07-12T18:00:00Z Pica was partly restarted just now, please look for problems in your job output <p>UPPMAX had to restart part of storage system Pica, because it worked too slowly with nearly no read/write traffic.</p> <p>The restart was done a little after 1300 hours.</p> <p>For Rackham users, this meant that you might have had problems with reading and writing to your home directory.</p> <p>For Milou users, this meant that you also might have had problems with reading and writing to your home directory. In addition, for Milou users, reading from /sw (where the modules live) and reading and writing to some project directories were affected.</p> <p>Please take an extra look for problems in your job output, for jobs running at this time.</p> <p>We are sorry for the inconvenience.<br /> &nbsp;</p> Tue, 11 Jul 2017 11:14:00 GMT 2017-07-11T11:14:00Z On Milou and Rackham, very difficult to log in or otherwise use /home directories -- FIXED <p>UPPMAX has a problem with extremely slow access to /sw (where e.g. modules live) and home directories on&nbsp; Milou, and to home directories on Rackham.</p> <p>Because of that, it is very difficult to log in to Milou and Rackham.</p> <p>We will investigate the source of this problem, and will report any success as updates here.</p> <h3>Update at 1310 hours</h3> <p>We restarted part of Pica, and&nbsp; that solved the problem.</p> <p>Hopefully your jobs will continue without problems, but please be careful and take an extra look for errors in your job output.</p> Tue, 11 Jul 2017 09:26:00 GMT 2017-07-11T09:26:00Z SUPR and C3SE websites down <p>SUPR and C3SE websites are down at the moment. 
This prevents you from using SUPR at the moment. Please try again later.</p> Tue, 11 Jul 2017 06:47:00 GMT 2017-07-11T06:47:00Z No maintenance planned for today's maintenance window <p>First (non-holiday) Wednesday of each month is UPPMAX's normal, planned maintenance window.</p> <p>But today we will do no maintenance.</p> <p>Happy computing!</p> <p>Next maintenance window is 2nd of August.</p> Wed, 05 Jul 2017 06:57:00 GMT 2017-07-05T06:57:00Z Restart of login server milou-f Tuesday morning -- FINISHED <p>File system mounts of Pica volumes were not working correctly.</p> <p>This was fixed by a restart of the server. Now it works much better.</p> <p>We are sorry about any inconvenience for you due to this.</p> Tue, 04 Jul 2017 08:55:00 GMT 2017-07-04T08:55:00Z Lost contact with Milou nodes m[1-48] for an hour this morning -- FIXED <p>From approximately 0800 hours to 0910 hours this morning, an ethernet switch in Milou lost power, making 48 nodes unavailable.</p> <p>Two jobs got NODE_FAIL when trying to start, and interactive work on these nodes was denied. Otherwise, we seem to have had no problems with the temporary network loss.</p> Mon, 03 Jul 2017 07:35:00 GMT 2017-07-03T07:35:00Z Singularity is available <p>The container engine&nbsp;<a href="">Singularity</a> is available on UPPMAX systems now. See the&nbsp;<a href="" title="Singularity user guide">Singularity user guide</a> for more information.</p> Tue, 27 Jun 2017 06:11:00 GMT 2017-06-27T06:11:00Z Urgent kernel upgrade -- FINISHED <p>Today we are performing an urgent kernel upgrade on Milou, Fysast1, Rackham, Irma, and Bianca. Login nodes will be restarted during the day. No running jobs or queues will be stopped. We will update on the progress here in System News during the day.</p> <p>UPDATE 16:00 - Update completed.</p> Wed, 21 Jun 2017 07:01:00 GMT 2017-06-21T07:01:00Z Bianca graphical login now working <p>Uses Thinlinc Web Access. 
Not X-forwarding.</p> Mon, 12 Jun 2017 09:33:00 GMT 2017-06-12T09:33:00Z Bianca's storage system Castor has problems -- FIXED Sun, 11 Jun 2017 17:47:00 GMT 2017-06-11T17:47:00Z Maintenance window Wednesday 2017-06-07 -- FINISHED Wed, 07 Jun 2017 06:06:00 GMT 2017-06-07T06:06:00Z